Image/Video Deep Anomaly Detection: A Survey

Anomaly Detection (AD) is the process of identifying unusual samples that seldom appear in, or are entirely absent from, the training dataset. These samples do not conform to the expected behavior and are thus called outliers. Anomalies occur very rarely in the data. AD is strongly connected to computer vision and image-processing applications such as security, image/video surveillance, health and medical diagnosis, and financial surveillance.

This post summarizes the deep-learning-based image/video anomaly detection survey paper, Image/Video Deep Anomaly Detection: A Survey, and discusses its detailed investigation, current challenges, and future research directions.


Generally, a large number of data instances follow the target-class distribution, i.e., normal data. Anomalous samples, which belong to an out-of-class distribution, are absent or rare and can only be collected at high cost. Deriving abnormal data leads to a very complicated learning process. So, researchers have tried to train models that are capable of distinguishing anomalous data from normal data. AD algorithms suffer from various weaknesses:

The points above show that AD tasks face several challenges that need to be addressed adequately and effectively. Inspired by the success of Deep Neural Networks across research fields, many deep-learning-based solutions have been proposed in this area. The most thoroughly investigated and widely used approaches share the concept of modeling normal data as a distribution or a reference model (Figure 1).

Problem Formulation:

There are ‘U’ unlabeled images or video frames, denoted Xn (most, if not all, assumed normal), that follow the normal data distribution pN, i.e., (x ∈ Xn) ~ pN. AD is the task of deciding whether a test sample ‘y’ follows pN; if not, it is detected as an anomaly.

where D is a metric used to compute the distance between a given instance and the normal data distribution, and F is a feature extractor that maps raw data to a set of discriminative features. Fitting a distribution pN to the training dataset and utilizing the metric D is not straightforward due to the high dimensionality and diversity of the data. Previous methods used measures such as the Mahalanobis distance or probability. Recently proposed techniques learn both the reference model and the detection measure using deep neural networks such as GANs and encoder-decoder networks. Based on the available number of normal (N), abnormal (A), and unlabeled (U) samples, the proposed techniques have been classified into 3 categories:
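The classical setup above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the survey's implementation): features F(x) of normal samples are modeled by a Gaussian, and D is the Mahalanobis distance of a test sample from that fit.

```python
import numpy as np

# Hypothetical sketch: score a test sample by its Mahalanobis distance
# D(F(y), pN) to a Gaussian fitted on normal-class feature vectors.

def fit_normal_distribution(features):
    """Estimate the mean and (regularized) inverse covariance of pN."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mu, np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def anomaly_score(y, mu, cov_inv):
    """Mahalanobis distance of y from the fitted normal distribution."""
    d = y - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 4))   # features of normal samples
mu, cov_inv = fit_normal_distribution(normal)

in_dist = anomaly_score(np.zeros(4), mu, cov_inv)
outlier = anomaly_score(np.full(4, 8.0), mu, cov_inv)
assert outlier > in_dist  # far-away samples score higher and get flagged
```

Thresholding this score gives a normal/anomalous decision; the survey's point is that such hand-picked D and F struggle with high-dimensional image data, which motivates the learned approaches below.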

Figure 2: Anomaly Detection techniques

Of these, unsupervised techniques are considered the most general and realistic for AD tasks.

Deep Image/Video Representation:

Traditional Features:

Initially, proposed methods relied on handcrafted, trajectory-based, or low-level features such as the Motion Boundary Histogram (MBH) and the Histogram of Oriented Gradients (HOG). These approaches suffer from high false-positive rates, low performance, high computational cost, and inadequate discrimination between normal and abnormal samples. Note, however, that both the spatial and temporal features of a video play a significant role in AD tasks; Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and 3D CNNs capture temporal features for video AD.

Deep Features:

The resounding success of deep representation learning on computer vision tasks has led researchers to use learned features in place of hand-crafted ones.

Two approaches to this technique are:

Deep Networks for Anomaly Detection:

The previously discussed approaches lack joint learning of the discriminative features, the distance metric D, and the distribution pN. Researchers have used Self-Supervised Learning and Generative Networks to learn end-to-end deep networks that aim to correctly detect out-of-distribution (anomalous) data.

Self Supervised Learning:

In real-world applications, the model has access only to normal data or data with minimal abnormalities. Researchers have tried to learn D and pN implicitly in order to train a model end to end on such data: a DNN is trained under specific constraints so that it learns pN. A test sample X that does not satisfy the desired constraint is assumed not to follow pN and can thus be considered an anomaly. Minimizing the reconstruction error and forcing the latent representation to be sparse are well-known self-supervised tasks for learning the distribution pN.

The parameters of encoder-decoder-based methods are learned to reconstruct normal data instances. The networks are trained by minimizing the equation:

Here D(E(X)) is an encoder-decoder network that learns the normal distribution pN and not the anomalous data. The parameters of D(E(X)) are optimized to reconstruct normal instances, so a high Reconstruction Error (RE) on a sample indicates an anomaly. Using RE alone for anomaly detection, however, leads to high false-positive rates. Encoder-decoder models are also used for AD in medical imaging.

Due to the high dimensionality of image and video data, learning two networks (encoder and decoder) to map the input to a latent space and reconstruct the original input has a high computational cost. To resolve this issue, researchers proposed a CNN that learns on normal training data and detects anomalous data based on the network's different responses to different inputs. The decision about whether the data is normal is made by a separately defined metric D, not by the DNN itself; since the network alone cannot predict the type of data, this approach is not an end-to-end DNN.
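The reconstruction-error idea can be illustrated with a minimal sketch. This is not the survey's architecture: instead of a deep network, a linear encoder-decoder (equivalent to PCA) is fitted on normal data only, and ||X − D(E(X))||² serves as the anomaly score.

```python
import numpy as np

# Minimal sketch (a linear stand-in for a deep encoder-decoder): fit E and D
# on normal data only; anomalies reconstruct poorly, so RE flags them.

rng = np.random.default_rng(1)
# Normal data lives near a 2-D subspace of a 10-D input space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(300, 2)) @ basis + 0.01 * rng.normal(size=(300, 10))

mu = normal.mean(axis=0)
# E: project onto the top principal components; D: map back (linear AE).
_, _, vt = np.linalg.svd(normal - mu, full_matrices=False)
E = vt[:2]                      # encoder weights (latent dimension = 2)

def reconstruction_error(x):
    z = (x - mu) @ E.T          # encode to the latent space
    x_hat = z @ E + mu          # decode back to the input space
    return float(np.sum((x - x_hat) ** 2))

normal_err = reconstruction_error(rng.normal(size=2) @ basis)   # in-subspace
anomaly_err = reconstruction_error(rng.normal(size=10) * 5)     # off-subspace
assert anomaly_err > normal_err   # high RE indicates an anomaly
```

A deep encoder-decoder replaces the linear maps with neural networks, but the scoring logic, reconstruct and compare, is the same, along with the same weakness: some anomalies reconstruct well, producing false negatives.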

Generative Networks:

The above-mentioned techniques, i.e., pre-trained networks and self-supervised learning, are not end-to-end deep neural networks. The lack of outlier data is a major obstacle to learning end-to-end deep networks, and GANs are very useful for resolving this issue. They consist of two CNNs: a Generator (G) and a Discriminator (D).

Figure 3: Generative Adversarial Networks (GANs)

Initially, G and D have learned nothing, so they make random decisions. The Generator G takes a latent distribution (random noise) as input, while the Discriminator D receives both real images and the Generator's output.

G aims to generate samples from the normal data distribution in order to fool the Discriminator into classifying G(X) as real data, whereas D tries to distinguish real images from data generated by G. These CNNs are trained adversarially with the objective function above (Equation 3).
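The adversarial objective can be made concrete with a toy sketch. This is an assumption-laden illustration, not the survey's code: the standard minimax value V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))] is computed directly from D's outputs, with D reduced to a scoring function rather than a CNN.

```python
import numpy as np

# Sketch of the GAN minimax objective: D maximizes V, G minimizes it.
# Here D's outputs are given as arrays so the value can be computed directly.

def gan_value(d_real, d_fake):
    """Empirical estimate of V(D, G) from D's outputs on real/fake batches."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A well-trained D: high scores on real data, low scores on generated data.
good_d = gan_value(d_real=np.array([0.9, 0.95]), d_fake=np.array([0.1, 0.05]))
# A fooled D: it cannot tell the two apart (outputs ~0.5 everywhere).
fooled_d = gan_value(d_real=np.array([0.5, 0.5]), d_fake=np.array([0.5, 0.5]))
assert good_d > fooled_d   # V is larger when D discriminates well
```

Training alternates between gradient ascent on V for D and descent for G, driving D's outputs toward the fooled state as G's samples improve.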

Figure 4: Anomaly Detection Using GAN

During training, G tends to generate fake (anomalous) data for D, while D, a binary classifier, is trained to separate normal and abnormal image/video frames. Eventually, D becomes capable of acting as a one-class classifier, and G acts as an encoder-decoder that reconstructs normal data instances. In recent research, G not only reconstructs normal frames but is also used for pre-processing to improve the performance of D as an end-to-end anomaly detector.
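At test time, using the trained D as a one-class classifier amounts to thresholding its output. The sketch below is hypothetical: `discriminator` is a toy stand-in that assigns high "normality" to data near the origin, where a real system would call the trained CNN.

```python
import numpy as np

# Sketch: after adversarial training, D alone serves as a one-class
# classifier. `discriminator` is a hypothetical stand-in returning the
# probability that an input is normal; real work uses the trained CNN.

def discriminator(x):
    # Toy stand-in: normal data is assumed to cluster near the origin.
    return float(np.exp(-np.sum(x ** 2)))

def is_anomaly(x, tau=0.5):
    """Flag x as anomalous when D assigns it low normality."""
    return discriminator(x) < tau

assert not is_anomaly(np.zeros(2))        # near the normal cluster
assert is_anomaly(np.array([3.0, 3.0]))   # far from normal data
```

The threshold tau trades false positives against false negatives and is typically chosen on a validation set.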

GAN-based solutions have achieved great performance, but they suffer from problems such as expensive training and instability, making them impractical for real-world problems.

Anomaly Generation:

The problem of AD can be converted into a binary classification problem in which GANs generate abnormal data rather than requiring it directly. This idea was presented by Masoud Pourreza: a Wasserstein GAN is trained on normal instances, and G is exploited before complete convergence. Such a G generates irregular data alongside normal data from the training set, which can then be used for AD tasks.
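The anomaly-generation idea can be sketched schematically. Everything here is a stand-in, not the cited method: an "under-trained generator" is simulated as normal data plus residual noise, and the resulting pseudo-anomalies turn AD into ordinary binary classification.

```python
import numpy as np

# Schematic sketch: samples from a generator stopped before convergence act
# as pseudo-anomalies, so AD reduces to binary classification. The
# "generator" is simulated as normal data plus residual noise.

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, size=(200, 3))               # real normal data
pseudo_anoms = rng.normal(0.0, 1.0, size=(200, 3)) + \
    rng.normal(0.0, 3.0, size=(200, 3))                    # under-trained G

X = np.vstack([normal, pseudo_anoms])
y = np.array([0] * 200 + [1] * 200)                        # 1 = anomalous

# Any off-the-shelf binary classifier can now be trained on (X, y); here a
# trivial distance-to-origin threshold stands in for one.
threshold = np.median(np.linalg.norm(X, axis=1))
preds = (np.linalg.norm(X, axis=1) > threshold).astype(int)
accuracy = float((preds == y).mean())
assert accuracy > 0.6   # the pseudo-anomalies are largely separable
```

The appeal of the approach is that no real abnormal data is ever needed; its quality hinges on how well the half-trained G's outputs resemble true anomalies.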

Datasets Used For Image/Video Anomaly Detection:

The most popular and widely used Image and video datasets for AD are-

Image Datasets: The MNIST dataset contains 28 × 28 grayscale images of handwritten digits 0–9 (10 classes). CIFAR-10 and CIFAR-100 consist of 32 × 32 images with 10 and 100 classes, respectively. Caltech-256 includes 30,607 images across 256 object categories, each containing at least 80 images.
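Since these are classification datasets, a common protocol (an assumption here, not spelled out in the survey) adapts them to AD: one class is designated "normal" for training, and at test time all other classes count as anomalies. A minimal sketch, with arrays standing in for real dataset loading:

```python
import numpy as np

# One-class protocol sketch: train on a single "normal" class; every other
# class is treated as anomalous at test time. Random arrays stand in for
# actual MNIST/CIFAR loading.

def one_class_split(images, labels, normal_class):
    """Return (normal-only training set, test images, binary anomaly labels)."""
    train = images[labels == normal_class]
    test_labels = (labels != normal_class).astype(int)   # 1 = anomaly
    return train, images, test_labels

rng = np.random.default_rng(3)
images = rng.random((100, 28, 28))          # placeholder for MNIST images
labels = rng.integers(0, 10, size=100)      # placeholder digit labels

train, test, y = one_class_split(images, labels, normal_class=0)
assert train.shape[0] == (labels == 0).sum()
assert set(np.unique(y)) <= {0, 1}
```

Repeating the split once per class and averaging the detection scores is the usual way results on these datasets are reported.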

Video Datasets: UMN (crowd anomalies during panic) contains normal events (people wandering around) and abnormal events (running). The UCSD Pedestrian 1 and Pedestrian 2 datasets contain frames of 158 × 234 and 240 × 360 pixels; pedestrians are the normal objects, while cars, bicycles, and skateboarders are anomalies. Other datasets include CUHK Avenue, containing 47 abnormalities, and the UCF-Crime dataset.

Challenges & Future Directions:

Some noteworthy aspects that have been overlooked in image/video AD include:





Reference paper-