Anomaly Detection (AD) is the process of identifying unusual samples that seldom appear, or do not appear at all, in the training dataset. Because these samples do not conform to the expected behavior, they are called outliers, and they occur very rarely in the data. AD is strongly connected to computer vision and image-processing applications such as security and video surveillance, health and medical diagnosis, and financial fraud detection.
This post summarizes the deep-learning-based image/video anomaly detection survey paper "Image/Video Deep Anomaly Detection: A Survey" and discusses its detailed investigation, current challenges, and future research directions.
Generally, a large number of data instances follow the target-class distribution, i.e. normal data. Anomalous samples belong to an out-of-distribution class and are absent or rare, and gathering them comes at a high cost. This scarcity of abnormal data makes the learning process very complicated, so researchers have tried to train models that can separate anomalous data from normal data. AD algorithms suffer from several weaknesses-
- High False Positive rates-In most AD applications, detecting abnormal events is considered more critical than identifying normal data. In surveillance systems, for example, classifying an abnormal event as normal raises reliability and safety concerns. At the same time, high false-alarm rates make a system fickle and ineffective.
- High Computational Cost
- Unrealistic Datasets-Available datasets are far from the conditions encountered in practice.
The above list shows that AD tasks face several challenges which need to be addressed adequately and effectively. Inspired by the success of deep neural networks in various research fields, many deep-learning-based solutions have been proposed in this area. The most thoroughly investigated and widely used approaches share the concept of modeling normal data as a distribution or a reference model (figure 1).
Suppose there are ‘U’ unlabeled images or video frames, denoted Xn, the majority (if not all) of which follow the normal data distribution pN, i.e. x belongs to Xn implies x ~ pN. AD is the task of deciding whether a test sample ‘y’ follows pN; otherwise it is detected as an anomaly:

D(F(y), pN) > τ ⇒ y is an anomaly,

where D is a metric used to compute the distance between a given instance and the normal data distribution, F is a feature extractor that maps raw data to a set of discriminative features, and τ is a decision threshold. Fitting a distribution pN to the training dataset and choosing the metric D are not straightforward due to the high dimensionality and diversity of the data. Earlier methods used measures such as the Mahalanobis distance or probability. Recently proposed techniques learn both the reference model and the detection measure with deep neural networks such as GANs and encoder-decoder networks. Based on the available numbers of normal (N), abnormal (A), and unlabeled (U) samples, the proposed techniques fall into 3 categories-
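As a concrete illustration of computing the distance D between a test sample and the normal distribution pN, here is a minimal Mahalanobis-distance scorer over already-extracted features. This is only a sketch of the classical measure mentioned above; the Gaussian fit, feature dimensionality, and function name are illustrative assumptions, not the survey's method.

```python
import numpy as np

def mahalanobis_score(y, X_normal):
    """Anomaly score D(F(y), pN): Mahalanobis distance from feature vector y
    to the Gaussian fitted on normal training features X_normal."""
    mu = X_normal.mean(axis=0)
    cov = np.cov(X_normal, rowvar=False)
    cov_inv = np.linalg.pinv(cov)            # pseudo-inverse for numerical stability
    d = y - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 3))      # stand-in for F(x) with x ~ pN
normal_score = mahalanobis_score(np.zeros(3), X)
outlier_score = mahalanobis_score(np.full(3, 6.0), X)
```

A sample far from the normal cloud receives a much larger score, so thresholding the score with some τ implements the decision rule.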
- Supervised (N+A): For some applications, such as accident detection, abnormality is well defined, so gathering normal (N) and anomalous (A) data is easy. A CNN is trained on (N+A) samples in a supervised way to make the distinction accurately. Although supervised modeling produces high accuracy, it does not generalize well. Performance is also suboptimal because the dataset is imbalanced (abnormal instances are far fewer than normal ones). Moreover, the diversity of anomalies disturbs the training procedure, often making this setting practically infeasible.
- Semi-supervised (N+A+U): Huge numbers of unlabeled (U) samples are available in AD tasks, while collecting labeled data is tedious and expensive because anomalies are diverse and rare. Some researchers have proposed methods that learn from numerous unlabeled samples together with a few normal and anomalous instances (N+A < U). Still, having access to both normal and abnormal events is practically impossible in most AD applications.
- Unsupervised (U): Here the model is trained only on unlabeled data samples, and outliers are detected from the intrinsic properties of the data. This category assumes that, just as in realistic situations, abnormal events occur only occasionally among the unlabeled samples. Unsupervised methods are also referred to as One-Class Classification (OCC): a model is fitted on the “normal” data and predicts whether new data is normal or an outlier/anomaly.
Of the three, unsupervised techniques are considered the most general and realistic for AD tasks.
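As a toy illustration of the OCC setting, the one-class scorer below is fitted only on (mostly) normal samples and scores new points by their mean distance to the k nearest training samples. This is a simple stand-in for real one-class models; the value of k and the threshold (which in practice would come from a validation split) are assumptions for the example.

```python
import numpy as np

def occ_knn_score(x, X_train, k=5):
    """One-class score: mean distance from x to its k nearest neighbors
    in the (mostly normal) unlabeled training set. Large score -> anomaly."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return float(np.sort(dists)[:k].mean())

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 2))          # unlabeled samples, assumed mostly normal
threshold = 2.0                              # illustrative; tune on held-out data
score_in = occ_knn_score(np.array([0.1, -0.2]), X_train)   # near the normal cloud
score_out = occ_knn_score(np.array([8.0, 8.0]), X_train)   # far from the cloud
```

A test point inside the training cloud scores below the threshold, while a distant point scores well above it, matching the OCC decision rule.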
Deep Image/Video Representation:
Early methods relied on handcrafted, trajectory-based, or low-level features such as the Motion Boundary Histogram (MBH) and the Histogram of Oriented Gradients (HOG). These approaches suffer from high false-positive rates, low performance, high computational cost, and limited ability to discriminate normal from abnormal samples. It is worth noting that both the spatial and the temporal features of a video play a significant role in AD tasks; Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and 3D CNNs capture temporal features for video AD.
The resounding success of deep representation learning in computer vision has led researchers to use learned features in place of hand-crafted ones.
Two approaches to this technique include-
- Feature Learning: This line of work modifies traditional solutions by replacing handcrafted features with features learned by autoencoders. Since processing an entire image or video in one pass is computationally expensive, images are divided into sets of patches and patch-based algorithms are applied. The features of each patch are learned with an autoencoder; the resulting features are more discriminative because the encoder learns to represent all the patches.
- Pre-trained Networks: In AD, transfer learning means reusing models pre-trained on large datasets and fine-tuning them (and their hyperparameters) for the task.
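The patch-splitting step described under Feature Learning can be sketched as follows; each flattened patch row would then be fed to a patch-level autoencoder. The patch size and stride here are illustrative assumptions.

```python
import numpy as np

def extract_patches(img, patch=8, stride=8):
    """Split a grayscale image into patches, flattened so that each row
    can be fed to a patch-level autoencoder (patch-based processing)."""
    H, W = img.shape
    rows = []
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            rows.append(img[i:i + patch, j:j + patch].ravel())
    return np.stack(rows)

img = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
patches = extract_patches(img)   # 4x4 grid of non-overlapping 8x8 patches
```

With stride equal to the patch size the patches are non-overlapping; a smaller stride yields overlapping patches at higher cost.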
Deep Networks for Anomaly Detection:
The previously discussed approaches lack joint learning of the discriminative features, the distance metric D, and the distribution pN. Researchers have used self-supervised learning and generative networks to learn end-to-end deep networks that aim to correctly detect out-of-distribution (anomalous) data.
Self Supervised Learning:
In real-world applications, the model has access only to normal data or to data with minimal abnormalities. To train a model end to end on such data, researchers have tried to learn D and pN implicitly: a DNN is trained under specific constraints to learn pN, and a test sample X that does not satisfy the desired constraint does not follow pN and can thus be considered an anomaly. Minimizing reconstruction error and forcing the latent representation to be sparse are well-known self-supervised tasks for learning the distribution pN.
The parameters of encoder-decoder-based methods are learned to rebuild normal data instances. The networks are trained by minimizing the reconstruction loss

min over E, D of ||X − D(E(X))||²
Here D(E(X)) is an encoder-decoder network that learns the normal distribution pN and not the anomalous data. The parameters of D(E(X)) are optimized to reconstruct normal instances, so a high Reconstruction Error (RE) on a sample indicates an anomaly. Using RE alone for anomaly detection, however, leads to high false-positive rates. Encoder-decoder models are also used for AD in medical imaging. Because of the high dimensionality of image and video data, learning two networks (encoder and decoder) to map the input to a latent space and reconstruct the original input is computationally expensive. To resolve this, researchers proposed a CNN that learns on normal training data and detects anomalous data from the different responses of the network to different inputs. The decision about whether the data is normal is made by a separately defined metric D, not by the DNN itself; since the network alone cannot predict the type of data, this approach is not an end-to-end DNN.
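The reconstruction-error idea can be made concrete with a linear (PCA-based) stand-in for the encoder-decoder: project onto a low-dimensional "latent" subspace fitted on normal data and project back, then score samples by how badly they reconstruct. This is only a sketch of the principle under a linear-subspace assumption, not the deep models discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic normal training data lying near a 2-D subspace of R^10.
basis = rng.normal(size=(2, 10))
X = rng.normal(size=(400, 2)) @ basis + 0.01 * rng.normal(size=(400, 10))

# "Encoder": project onto the top-2 principal components; "decoder": project back.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:2].T                                 # (10, 2) latent basis

def reconstruction_error(x):
    z = (x - mu) @ V                         # encode to latent code z
    x_hat = mu + z @ V.T                     # decode back to input space
    return float(np.linalg.norm(x - x_hat))  # RE = ||x - D(E(x))||

re_normal = reconstruction_error(X[0])                    # near-perfect rebuild
re_anomaly = reconstruction_error(5.0 * rng.normal(size=10))  # off-subspace sample
```

Normal samples reconstruct almost perfectly, while a sample off the learned subspace yields a large RE; thresholding RE gives the (false-positive-prone) detector described above.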
The above-mentioned techniques, i.e. pre-trained networks and self-supervised learning, are not end-to-end deep neural networks. The lack of outlier data is a major obstacle to learning end-to-end deep networks, and GANs are very useful for resolving it. They consist of two CNNs: a Generator (G) and a Discriminator (D).
Initially, G and D are untrained and take random decisions. The generator G receives a latent distribution (random noise) as input, while the discriminator D receives both real images and the generator's output.
G aims to generate samples from the normal data distribution in order to fool the discriminator into classifying G(z) as real data, whereas D tries to distinguish real images from the data generated by G. The two CNNs are trained adversarially with the objective function

min over G, max over D of E[log D(x)] + E[log(1 − D(G(z)))], where x ~ pN and z is latent noise.
During training, G tends to generate fake (anomalous-looking) data for D, and D is trained as a binary classifier to separate normal from abnormal image/video frames; eventually D becomes capable of acting as a one-class classifier. G acts as an encoder-decoder, since it recreates normal data instances. In recent research, G not only recreates normal frames but is also used as a pre-processing step to improve the performance of D as an end-to-end anomaly detector.
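The adversarial loop can be illustrated on toy 1-D data with an affine generator and discriminator and hand-derived gradients. This sketches the minimax objective only; the learning rate, step count, and toy distributions are assumptions, and it is nothing like a practical image/video GAN.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real "normal" data ~ N(3, 0.5); G(z) = a*z + b; D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0            # generator parameters
w, c = 0.1, 0.0            # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    x_real = rng.normal(3.0, 0.5, batch)
    z = rng.normal(size=batch)
    x_fake = a * z + b

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    p_real = sigmoid(w * x_real + c)
    p_fake = sigmoid(w * x_fake + c)
    w -= lr * (np.mean((p_real - 1) * x_real) + np.mean(p_fake * x_fake))
    c -= lr * (np.mean(p_real - 1) + np.mean(p_fake))

    # Generator step: ascend log D(fake) (non-saturating loss).
    p_fake = sigmoid(w * x_fake + c)
    dx = (p_fake - 1) * w   # gradient of G's loss w.r.t. the fake samples
    a -= lr * np.mean(dx * z)
    b -= lr * np.mean(dx)

fake = a * rng.normal(size=1000) + b   # samples G has learned to produce
```

After training, G's samples drift toward the real distribution while D, having learned to score realness, behaves like the one-class classifier described above.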
GAN-based solutions have achieved great performance, but they suffer from problems such as expensive training and instability, making them impractical for many real-world problems.
The AD problem can also be converted into a binary classification problem by using GANs to generate abnormal data rather than detecting it directly. This idea was presented by Masoud Pourreza: a Wasserstein GAN is trained on normal instances and G is exploited before complete convergence, so that it generates irregular data alongside the normal training data, which can then be used for AD tasks.
Datasets Used For Image/Video Anomaly Detection:
The most popular and widely used Image and video datasets for AD are-
Image Datasets: The MNIST dataset contains 28×28 grayscale images of handwritten digits 0–9 (10 classes). CIFAR-10 and CIFAR-100 consist of 32×32 images with 10 and 100 classes, respectively. Caltech-256 includes 30,607 images spanning 256 object categories, with at least 80 images per category.
Video Datasets-UMN (crowd anomalies during panic) contains normal events (people wandering around) and abnormal events (running). The UCSD Pedestrian 1 and Pedestrian 2 datasets contain frames of 158×234 and 240×360 resolution; pedestrians are normal objects, while cars, bicycles, and skateboarders are anomalies. Other datasets include CUHK Avenue, containing 47 abnormalities, and the UCF-Crime dataset.
Challenges & Future Directions:
Some noteworthy issues that have been ignored in image/video AD include-
- False-positive rate: Some AD solutions detect outliers with great accuracy but come with very high false-positive rates. Ideally, a solution should combine high accuracy with a low false-positive rate.
- Fairness: Skewed datasets and limited features are responsible for unfairness in AD, primarily because anomalous data is insufficiently available.
- Safety: A minor manipulation of an input sample can confuse a DNN into misclassifying it; DNNs are thus prone to adversarial attacks.
- Realistic Datasets: Datasets used for AD tasks are far from realistic situations.
- Early Detection: Proposed solutions typically detect an anomaly only when it is over or nearly over. Late detection of an anomaly in video is unacceptable, so early detection of such events is highly critical: a well-timed alarm can minimize or prevent the loss caused by anomalous events.