Self-Supervised Visual Feature Learning with Deep Neural Networks: A Survey

Tamanna Mehta
Nov 24, 2021

This post summarizes the survey paper "Self-Supervised Visual Feature Learning with Deep Neural Networks: A Survey" [1].

It discusses the need for self-supervised learning, the pretext and downstream tasks involved, and image and video visual feature learning methods in detail.

Introduction

Large-scale labeled datasets are required to train neural networks to obtain good performance on computer vision tasks. However, collecting and annotating large-scale data is very time-consuming and expensive. For example, the Kinetics dataset used to train ConvNets for human action recognition contains 500,000 videos belonging to 600 categories, with each video lasting around 10 seconds. Collecting and annotating data at this scale took many Amazon Mechanical Turk workers a great deal of time.

To avoid time-consuming and expensive annotation, self-supervised learning (a subset of unsupervised learning) has been proposed to learn visual features from large-scale unlabeled image and video data.

Before diving deeper into the topic, it helps to understand some terms related to self-supervised learning:

Self-supervised Learning: A learning method in which ConvNets are explicitly trained with supervisory signals which are generated from the data itself (self-supervision) by leveraging its structure.

Human-annotated labels: Data labels that are manually annotated by human workers.

Pretext Task: Pre-designed tasks for networks to solve, and visual features are learned by learning objective functions of pretext tasks. The pretext tasks can be predictive tasks, generative tasks, contrasting tasks, or a combination of them.

Pseudo label: Labels used in pretext tasks, generated automatically based on the structure of the data.

Downstream Task: Computer vision applications used to evaluate the quality of features learned by self-supervised learning. These applications can greatly benefit from pre-trained models when training data is scarce.

As shown in Figure 1, the idea in self-supervised learning is to propose a pretext task for the network to solve, with pseudo labels generated automatically from attributes of the data. The network (a ConvNet) is trained on the objective function of the pretext task and, in the process, learns visual features. Once self-supervised training is done, the learned features can be transferred to various downstream tasks such as image classification, object detection, and human action recognition. Figure 1 shows this pipeline.

Deep Network Architectures Used for Self-Supervised Learning:

Regardless of the category of learning technique, the methods share similar network architectures.

For image feature learning, 2D ConvNets such as AlexNet, VGG, GoogLeNet, ResNet, and DenseNet are used.

For video feature learning, 2D ConvNet-based, 3D ConvNet-based, and LSTM-based methods are used to extract both spatial and temporal features from videos. The 2D ConvNet-based methods apply a 2D ConvNet to every single frame and use the image features of multiple frames as the video feature. The 3D ConvNet-based methods use 3D convolutions to simultaneously extract spatial and temporal features from multiple frames. The LSTM-based methods employ an LSTM to model long-term dynamics within a video.
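As an illustration of the 3D ConvNet idea, here is a minimal PyTorch sketch of a network that convolves jointly over time and space; the layer sizes are illustrative assumptions, not taken from the survey.

```python
# Minimal sketch of a 3D ConvNet for spatio-temporal features (PyTorch).
# Layer sizes are illustrative, not from the survey.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # convolve over (time, H, W)
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool only spatially at first
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                      # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip):                              # clip: (batch, 3, frames, H, W)
        x = self.features(clip).flatten(1)
        return self.classifier(x)

clip = torch.randn(2, 3, 16, 112, 112)                    # 2 clips of 16 RGB frames
logits = Tiny3DConvNet()(clip)                            # -> (2, 10)
```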

In Figure 2, the pseudo labels P for the pretext task are automatically generated without human annotation. The ConvNet is optimized by minimizing the error between its prediction O and the pseudo labels P. After training on the pretext task is finished, we obtain ConvNet models that can capture visual features of images or videos.
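The optimization above can be written as a generic training loop. The following is a minimal sketch, assuming a classification-style pretext task; `convnet`, `unlabeled_loader`, and `make_pseudo_labels` are placeholders for the model, the unlabeled data, and whatever rule generates pseudo labels from the data itself.

```python
# Generic self-supervised training loop: minimize the error between the
# ConvNet prediction O and the automatically generated pseudo label P.
# `convnet`, `unlabeled_loader`, and `make_pseudo_labels` are placeholders.
import torch

def train_pretext(convnet, unlabeled_loader, make_pseudo_labels, epochs=10):
    criterion = torch.nn.CrossEntropyLoss()              # loss depends on the pretext task
    optimizer = torch.optim.SGD(convnet.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        for images in unlabeled_loader:
            inputs, pseudo_labels = make_pseudo_labels(images)  # P derived from the data itself
            predictions = convnet(inputs)                        # O
            loss = criterion(predictions, pseudo_labels)         # error between O and P
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return convnet   # backbone weights can now be transferred to downstream tasks
```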

Datasets used:

Datasets gathered for supervised learning can be used for self-supervised learning without their human-annotated labels. The quality of the learned features is evaluated by fine-tuning on high-level tasks with relatively small datasets, such as object detection and semantic segmentation.

Image datasets include ImageNet, Places, Places365, SUNCG, MNIST, SVHN, CIFAR10, STL-10, and PASCAL VOC.

Video datasets include YFCC100M, SceneNet RGB-D, Moments in Time, Kinetics, UCF101, and HMDB51.

Audio datasets include AudioSet, ESC-50, and DCASE.

3D object datasets include ShapeNet and ModelNet40.

Commonly used Pretext tasks and Downstream tasks:

Visual Feature Learning using Pretext Tasks: To relieve the burden of large-scale data annotation, many pretext tasks have been designed for self-supervised learning, including object segmentation, image inpainting, image colorization, and temporal order verification. For example, in image colorization the task is to convert a gray-scale image into a colored image. To generate plausible colors, the network must learn the structure and context information of images. In this pretext task, the data X is a gray-scale image generated by applying a linear transformation to an RGB image, and the pseudo label P is the RGB image itself.
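A minimal sketch of how such a training pair could be generated, assuming images are float arrays in [0, 1]; the luma coefficients are a standard choice and not prescribed by the survey.

```python
# Pseudo-label generation for the colorization pretext task: the input X is a
# gray-scale image obtained by a linear transformation of the RGB image, and
# the pseudo label P is the RGB image itself.
import numpy as np

def make_colorization_pair(rgb):                      # rgb: (H, W, 3) float array in [0, 1]
    weights = np.array([0.299, 0.587, 0.114])         # standard luma coefficients
    gray = rgb @ weights                              # linear transformation -> (H, W)
    x = gray[..., None]                               # network input X
    p = rgb                                           # pseudo label P
    return x, p
```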

Commonly used Downstream Tasks: To evaluate the quality of the image or video features learned by pre-trained models, fine-tuning is done on downstream tasks such as image classification, semantic segmentation, object detection, and action recognition. The performance of transfer learning on these high-level vision tasks demonstrates the generalizability of the learned features. If the ConvNet learns general features, the pre-trained models serve as a good starting point for downstream tasks that require similar image or video features.

Image Feature Learning:

Pretext tasks for self-supervised image feature learning are divided into four categories:

  1. Generation Based Methods
  2. Context Based Methods
  3. Free Semantic Label Based Methods
  4. Cross Modal Based Methods

Generation Based Image Feature Learning:

This category involves image generation with GANs (generating fake images), super-resolution (generating high-resolution images), image inpainting (restoring missing image regions), and image colorization. For these tasks, the pseudo labels are the images themselves, and no human-annotated labels are used during training.

Image Generation with GAN:

A GAN (Generative Adversarial Network) (Fig. 4) consists of two networks: a generator, which generates fake images from latent vectors, and a discriminator, which distinguishes whether an image comes from the real data distribution (real image) or from the latent vector space via the generator (fake image). Through this two-player game, the discriminator forces the generator to produce realistic images, while the generator forces the discriminator to improve its ability to tell them apart.

During training, the two networks compete against each other, making each other stronger. To accomplish its task, the discriminator must capture the semantic features of images, so its parameters can serve as a pre-trained model for other computer vision tasks.
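A minimal sketch of one adversarial training step, assuming `generator` and `discriminator` are already-defined networks and that the discriminator outputs a single real/fake logit per image; this illustrates the game, not a specific published architecture.

```python
# One adversarial training step (sketch). `generator` maps latent vectors to
# images; `discriminator` outputs a real/fake logit of shape (batch, 1). After
# training, the discriminator's layers can be reused as a pre-trained backbone.
import torch

bce = torch.nn.BCEWithLogitsLoss()

def gan_step(generator, discriminator, real_images, opt_g, opt_d, latent_dim=100):
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z)

    # Discriminator update: real -> 1, fake -> 0
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to fool the discriminator into predicting "real"
    g_loss = bce(discriminator(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```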

Image Generation with Inpainting:

In this approach (Fig. 5), there are two networks: a generator, trained to fill in the missing region with a pixel-wise reconstruction loss, and a discriminator, trained to distinguish whether the input image is real with an adversarial loss. With the adversarial loss, the network is able to generate sharper and more realistic hypotheses for the missing region. The generator, which is fully convolutional, has two parts: an encoder and a decoder. The encoder takes the image to be inpainted as input, and this context encoder learns the semantic features of the image; the decoder predicts the missing regions based on these features. Both networks learn semantic features from images that can be transferred to other computer vision tasks.
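A sketch of the combined generator objective, assuming a binary mask that is 1 inside the missing region; the loss weight and mask convention are illustrative assumptions, not the exact published formulation.

```python
# Context-encoder style generator loss (sketch): pixel-wise reconstruction of
# the missing region plus an adversarial term. `generator`, `discriminator`,
# and the mask convention are assumptions for illustration.
import torch
import torch.nn.functional as F

def inpainting_generator_loss(generator, discriminator, images, mask, adv_weight=0.001):
    # mask is 1 inside the missing region, 0 elsewhere
    corrupted = images * (1 - mask)
    predicted = generator(corrupted)
    recon = F.mse_loss(predicted * mask, images * mask)           # pixel-wise loss on the hole
    adv = F.binary_cross_entropy_with_logits(
        discriminator(predicted), torch.ones(images.size(0), 1))  # encourage realistic output
    return recon + adv_weight * adv
```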

Image Generation with Super Resolution:

For super-resolution, SRGAN (Super-Resolution Generative Adversarial Network) is used, in which a generator enhances the resolution of a low-resolution input image and a discriminator distinguishes whether an input image comes from the generator. The approach uses a perceptual loss that consists of an adversarial loss and a content loss. The generator loss is the pixel-wise L2 loss plus the content loss, which measures the similarity between the features of the predicted high-resolution image and those of the original high-resolution image, while the discriminator loss is a binary classification loss. With the perceptual loss, SRGAN recovers photo-realistic textures from heavily downsampled images and shows significant gains in perceptual quality.
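A sketch of such a perceptual loss, using a pretrained VGG network as the feature extractor; the VGG layer cut-off, normalization details, and loss weight are illustrative assumptions rather than the exact SRGAN recipe.

```python
# Perceptual loss sketch for super-resolution: the content loss compares deep
# features of the predicted and original high-resolution images, and an
# adversarial term pushes outputs toward realistic textures.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg_features = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad = False   # the feature extractor is fixed

def sr_generator_loss(sr_image, hr_image, discriminator, adv_weight=1e-3):
    content = F.mse_loss(vgg_features(sr_image), vgg_features(hr_image))   # feature similarity
    adversarial = F.binary_cross_entropy_with_logits(
        discriminator(sr_image), torch.ones(sr_image.size(0), 1))
    return content + adv_weight * adversarial
```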

Image Generation with Colorization:

Image colorization involves predicting a plausible color version of a given gray-scale image. To correctly color each pixel, the network needs to recognize objects and group pixels belonging to the same part together. A fully convolutional neural network (Fig. 6), consisting of an encoder for feature extraction and a decoder for color hallucination, is used for colorization.
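A minimal fully convolutional encoder-decoder for this task might look like the sketch below; the channel sizes and depth are illustrative assumptions.

```python
# Minimal fully convolutional encoder-decoder for colorization (sketch):
# the encoder extracts features from the gray-scale input and the decoder
# hallucinates the three color channels. Channel sizes are illustrative.
import torch.nn as nn

colorization_net = nn.Sequential(
    # encoder: gray-scale input, downsample while extracting features
    nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    # decoder: upsample back to the input resolution and predict 3 color channels
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
)
```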

Context Based Image Feature Learning:

These methods employ context similarity, spatial structure, or temporal structure as the supervision signal.

Learning with Context Similarity:

There are two ways to use context similarity as a supervision signal: a predictive task and a contrastive task. In both, the data is first clustered into groups under the assumption that data within the same group has high context similarity while data in different groups has low context similarity. After clustering, several clusters are obtained in which images within one cluster have a small distance in feature space and images from different clusters have a large distance.

The smaller the distance in feature space, the more similar the images appear in RGB space. A ConvNet is then trained to classify the data using the cluster assignments as pseudo labels. To accomplish this, the ConvNet needs to learn the invariance within a class and the variance among different classes, and thus learns the semantic meaning of images.
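A minimal sketch of this clustering-then-classification idea, in the spirit of DeepCluster; `backbone` and `images` are placeholders, and the number of clusters is an illustrative choice.

```python
# Clustering-based pseudo labels (sketch): cluster features of unlabeled
# images and use the cluster assignments as pseudo labels for a
# classification pretext task.
import torch
from sklearn.cluster import KMeans

def cluster_pseudo_labels(backbone, images, k=100):
    with torch.no_grad():
        feats = backbone(images).flatten(1).cpu().numpy()    # (N, D) feature vectors
    assignments = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    return torch.as_tensor(assignments, dtype=torch.long)    # pseudo labels for a classifier
```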

Learning with Spatial Context Structures:

Images contain rich spatial context information, such as the relative positions of different patches within an image, which can be used to design pretext tasks for self-supervised learning. Typical pretext tasks predict the relative position of two patches from the same image or recognize the order of a shuffled sequence of patches from the same image. To accomplish these tasks, ConvNets must learn spatial context information such as the shapes of objects and the relative positions of their parts.

In Figure 7, given 9 image patches there are 9! possible permutations, and the network is unlikely to be able to distinguish all of them. To limit the number of permutations, Hamming distance is employed: only a subset of permutations with large pairwise Hamming distance is chosen among all permutations. Only the selected permutations are then used to train the ConvNet to recognize the permutation of the shuffled image patches.
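A sketch of how such a subset can be selected greedily by maximal Hamming distance; the subset size of 64 is an illustrative choice, not prescribed by the survey.

```python
# Selecting a subset of patch permutations with large pairwise Hamming
# distance, in the spirit of the jigsaw pretext task (NumPy sketch).
import itertools
import numpy as np

def select_permutations(num_patches=9, subset_size=64, seed=0):
    perms = np.array(list(itertools.permutations(range(num_patches))))  # (9!, 9)
    rng = np.random.default_rng(seed)
    selected = [rng.integers(len(perms))]                # start from a random permutation
    # running minimum Hamming distance of every candidate to the selected set
    min_dist = (perms != perms[selected[0]]).sum(axis=1)
    for _ in range(subset_size - 1):
        nxt = int(np.argmax(min_dist))                   # farthest candidate so far
        selected.append(nxt)
        min_dist = np.minimum(min_dist, (perms != perms[nxt]).sum(axis=1))
    return perms[selected]                               # pseudo-label classes for the ConvNet
```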

Free Semantic Label Based Image Feature Learning

Free semantic labels, such as depth images and optical flow, are obtained without human annotation. Since these semantic labels are generated automatically, methods that use synthetic datasets, or combine them with large unlabeled image or video datasets, are considered self-supervised learning methods.

Learning with Labels Generated by Game Engines and Hard-Coded Programs

Game engines can generate realistic images and provide pixel-level labels. Game engines such as AirSim and CARLA are used to generate large-scale synthetic datasets with high-level semantic labels including depth, optical flow, and surface normals (Fig. 10).

Due to the domain gap between synthetic and real-world images, a ConvNet trained purely on synthetic data cannot be directly applied to real-world images. To use synthetic datasets for self-supervised learning, this domain gap must be bridged.

Applying hard-coded programs is another way to generate semantic labels. This method involves two steps:

  1. Generate labels by applying hard-coded programs to images or videos.
  2. Train a ConvNet with the generated labels.
Figure 9: The architecture for utilizing synthetic and real-world images for self-supervised feature learning

Pathak et al. [2] proposed to learn features by training a ConvNet to segment the foreground objects in each frame of a video, where the label is a mask of the moving objects.
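As a toy stand-in for such a "hard-coded program," the sketch below produces a rough moving-object mask by thresholded frame differencing. Published methods use considerably stronger motion segmentation; this only illustrates the two-step recipe (generate labels automatically, then train on them).

```python
# Toy hard-coded label generator: a rough moving-object mask from thresholded
# frame differencing. Illustration only; not the published segmentation method.
import numpy as np

def motion_mask(frame_t, frame_t1, threshold=0.1):
    # frames: (H, W, 3) float arrays in [0, 1]
    diff = np.abs(frame_t1 - frame_t).mean(axis=2)   # per-pixel change between frames
    return (diff > threshold).astype(np.float32)     # pseudo ground-truth segmentation mask
```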

Video Feature Learning:

Self-supervised video feature learning methods are categorized as:

  1. Generation based
  2. Context based
  3. Free Semantic label based
  4. Cross Modal based

Generation Based Video Feature Learning:

These methods learn visual features through the process of video generation without using any human annotations. They include video generation with GANs, video colorization, and video prediction. The pseudo labels in all these pretext tasks are the videos themselves.

Learning from Video Generation:

Figure 10: The architecture of the generator in VideoGAN for video generation with GAN

The VideoGAN architecture is shown in Figure 10. To model the motion of objects in videos, a two-stream generator is proposed for video generation, where one stream models the static background region and another stream models the moving objects in the foreground. Videos are generated by combining the foreground and background streams. After training the video generation task on large-scale unlabeled data, the discriminator parameters can be transferred to other downstream tasks.
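A minimal sketch of how the two streams might be combined, assuming a foreground stream, a mask stream, and a static background stream with the tensor shapes noted in the comments; this is an illustration of the blending idea, not the exact VideoGAN formulation.

```python
# Combining the two streams of a VideoGAN-style generator (sketch):
# video = mask * foreground + (1 - mask) * static background.
import torch

def compose_video(foreground, mask, background):
    # foreground: (B, 3, T, H, W); mask: (B, 1, T, H, W); background: (B, 3, 1, H, W)
    mask = torch.sigmoid(mask)                          # soft foreground/background mask
    background = background.expand_as(foreground)       # repeat static background over time
    return mask * foreground + (1 - mask) * background  # generated video
```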

Learning from Video Colorization:

Temporal coherence means that consecutive frames within a short time window have a similar, coherent appearance. This color coherence can be used to design pretext tasks for self-supervised learning, and video colorization is one way to exploit it.

Tran et al. [3] proposed a U-shaped convolutional network, an encoder-decoder architecture based on a 3D ConvNet. The input to the network is a gray-scale video clip, while the output is a colorful video clip. The encoder is a stack of 3D convolution layers that extract features, and the decoder is a stack of 3D deconvolution layers that generate the colorful video clip from the extracted features.

The color coherence in videos is a strong supervision signal.

Temporal Context Based Learning:

Videos consist of frame sequences of varying lengths that carry rich spatial and temporal information. This temporal information can be used as a supervision signal for self-supervised learning. Pretext tasks include temporal order verification and temporal order recognition.

Temporal order verification is to verify whether a sequence of input frames is in correct temporal order, while temporal order recognition is to recognize the order of a sequence of input frames.

For temporal order verification, Misra et al. [4] proposed a pretext task to learn image features from videos with a 2D ConvNet. The steps are:

  1. Frames with significant motion are sampled from videos according to optical flow magnitude.
  2. The sampled frames are shuffled and fed to the network, which is trained to verify whether the input frames are in the correct temporal order (see the sketch after this list). To do so, the network must capture subtle differences between frames, such as the movement of a person.
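The sketch below constructs one training sample in the spirit of this shuffle-and-learn setup, assuming three in-order frames are already sampled; the channel-wise stacking and 50/50 positive/negative split are illustrative assumptions, and the optical-flow-based frame selection is omitted.

```python
# Temporal-order-verification sample (sketch): keep three frames in order
# (positive) or shuffle them (negative), and train a 2D ConvNet on the tuple.
import random
import torch

def make_order_sample(frames):                # frames: list of (3, H, W) tensors, in order
    a, b, c = frames[0], frames[1], frames[2]
    if random.random() < 0.5:
        tuple_frames, label = (a, b, c), 1    # correct temporal order
    else:
        tuple_frames, label = (b, a, c), 0    # shuffled order
    return torch.cat(tuple_frames, dim=0), torch.tensor(label)   # stacked along channels
```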

Cross Modal Based Learning

These methods usually learn features from the correspondence between multiple data streams, such as RGB frame sequences, optical flow sequences, audio, and camera pose. Optical flow sequences, which indicate motion in videos, can be generated automatically.

Based on the type of data used, these methods fall into three categories.

  1. Methods that learn features by using the RGB and optical flow correspondence
  2. Methods that learn features by utilizing the correspondence between video and audio
  3. Methods that learn features by utilizing the correspondence between egocentric video and ego-motor sensor signals

Usually, the network is trained to recognize whether the two kinds of input data correspond to each other.

Learning from Visual-Audio Correspondence

This type of method jointly learns video and audio features with heterogeneous networks. The general framework of this type of pretext task is shown in Fig. 11.

Figure 11: Architecture of video and audio correspondence verification

There are two subnetworks: a vision subnetwork and an audio subnetwork. The input to the vision subnetwork is a single frame or a stack of image frames, from which it learns to capture visual features. Positive data are sampled by extracting video frames and audio from the same time in one video, while negative data are generated by extracting video frames and audio from different videos or from different times in one video. The networks are thus trained to discover the correlation between video and audio data, typically by verifying with a cross-entropy loss whether the input visual and audio data correspond.
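A minimal sketch of such a correspondence-verification model; the two subnetworks, the embedding size, and the fusion head are placeholders and illustrative assumptions, not a specific published architecture.

```python
# Audio-visual correspondence verification (sketch): a vision subnetwork and
# an audio subnetwork produce embeddings that are fused and classified as
# corresponding (same video and time) or not.
import torch
import torch.nn as nn

class AVCorrespondenceNet(nn.Module):
    def __init__(self, vision_net, audio_net, embed_dim=128):
        super().__init__()
        self.vision_net = vision_net        # frame(s) -> (B, embed_dim)
        self.audio_net = audio_net          # audio features -> (B, embed_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, frames, audio):
        v = self.vision_net(frames)
        a = self.audio_net(audio)
        return self.head(torch.cat([v, a], dim=1))   # 2-way logits: correspond / not

# Training uses cross-entropy: positives pair frames and audio from the same
# time of one video; negatives pair them from different videos or times.
```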

References:-

[1] Self-Supervised Visual Feature Learning with Deep Neural Networks: A Survey, https://arxiv.org/pdf/1902.06162.pdf

[2] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544.

[3] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Deep End2End Voxel2Voxel prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2016, pp. 17–24.

[4] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 527–544.
