Refine and Represent: Region-to-Object Representation Learning



Akash Gokul*1   Konstantinos Kallidromitis*2   Shufan Li*1  
Yusuke Kato2   Kazuki Kozuka2   Trevor Darrell1   Colorado J Reed1
* Equal Contribution
1Berkeley AI Research  2Panasonic

[Paper]
[GitHub]



Abstract

Recent works in self-supervised learning have demonstrated strong performance on scene-level dense prediction tasks using either object-centric or region-based correspondence objectives. In this paper, we present Region-to-Object Representation Learning (R2O), which unifies region-based and object-centric pretraining. R2O operates by training an encoder to dynamically refine region-based segments into object-centric masks and then jointly learning representations of the contents within each mask. R2O uses a region refinement module that groups small image regions, generated using a region-level prior, into larger regions that are semantically similar to objects by clustering region-level features. As pretraining progresses, R2O follows a region-to-object curriculum which encourages learning region-level features early on and gradually transitions to training object-centric representations. Representations learned using R2O lead to state-of-the-art performance in semantic segmentation on PASCAL VOC (+0.7 mIoU) and Cityscapes (+0.4 mIoU) and instance segmentation on MS COCO (+0.3 APmk). Further, after pretraining on ImageNet, our encoders surpass the existing state-of-the-art in unsupervised object segmentation on the Caltech-UCSD Birds 200-2011 dataset (+2.9 mIoU) without any further training.




Region-to-Object Mask Refinement

Our goal is to find semantically meaningful segmentations of images while jointly learning good representations of the underlying semantics. We formulate this goal as a bilevel optimization problem: we iteratively alternate between a mask-refinement step, which refines region-level priors into object-centric masks, and a representation learning step, which optimizes representational invariance of mask-level features. Throughout pretraining, the number of segments is gradually reduced and the model's receptive field evolves from small neighborhoods to object-centric masks.
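The mask-refinement step can be sketched as clustering per-region feature vectors and merging the masks of regions that land in the same cluster. The snippet below is a minimal illustrative stand-in, not the paper's exact procedure: the function name, the use of plain k-means with cosine-style similarity, and all parameters are assumptions for illustration; in R2O, `num_objects` would shrink over pretraining to implement the region-to-object curriculum.

```python
import numpy as np

def refine_regions(region_feats, region_masks, num_objects, n_iters=10, seed=0):
    """Group small regions into larger, object-like masks by clustering
    their features (an illustrative sketch of a refinement step, not
    the paper's exact algorithm).

    region_feats: (R, D) array, one pooled feature per initial region.
    region_masks: (R, H, W) boolean array of the initial region priors.
    num_objects:  K, the number of merged masks to produce.
    """
    rng = np.random.default_rng(seed)
    R, _ = region_feats.shape
    # L2-normalize features so dot products act like cosine similarity.
    feats = region_feats / (np.linalg.norm(region_feats, axis=1, keepdims=True) + 1e-8)
    # Plain k-means over region features, initialized from random regions.
    centers = feats[rng.choice(R, size=num_objects, replace=False)]
    for _ in range(n_iters):
        assign = np.argmax(feats @ centers.T, axis=1)
        for k in range(num_objects):
            members = feats[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centers[k] = c / (np.linalg.norm(c) + 1e-8)
    # Merge the masks of all regions assigned to the same cluster.
    refined = np.stack([region_masks[assign == k].any(axis=0)
                        for k in range(num_objects)])
    return refined, assign
```

Decreasing `num_objects` across calls mimics the curriculum: many small segments early in pretraining, a few object-centric masks later.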





Visualization

We visualize the masks generated by R2O's mask-refinement step during pretraining on ImageNet-1K after 100, 200, and 300 epochs. Early masks consist of largely arbitrary image segments that gradually become object-centric. This qualitative analysis supports our assumption.





Related Works

If you found our work interesting, please also consider looking into closely related works such as DetCon and Odin.