Chapter 13: Anomaly Detection Part 1: Density Estimation

Anomaly Detection 

An algorithm for detecting abnormal examples, used extensively in quality assurance to check for sub-standard products. For example, given a data set of aircraft engines, we want to identify which are the normal engines and which are the anomalous ones.

Density Estimation

Suppose each example contains only one feature, so each example x is a real number. Given a test example xtest, we want to check whether xtest is an anomaly. Our model would be as follows:

If probability p(xtest) < epsilon (ε) -> then label as anomaly

Gaussian Distribution

Also known as the normal distribution. Terms associated with the bell curve:

– mean (μ), which is the average of the values

μ = (1/m) Σ_{i=1..m} xi

– variance (σ²), which is the average of the squared differences from the mean

σ² = (1/m) Σ_{i=1..m} (xi – μ)²

– standard deviation (σ), which is (σ²)^(1/2)

– area under the bell curve = 1

– p(x; μ, σ²) -> probability of x parameterized by μ and σ²

p(x; μ, σ²) = (1 / (√(2π) σ)) · exp( – (x – μ)² / (2σ²) )
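The parameter estimates and the density formula above can be sketched in Python with NumPy (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def estimate_gaussian(x):
    """Fit mean and variance to a 1-D array of examples (1/m normalization)."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2): the normal density evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```

For a standard normal (μ = 0, σ² = 1), the density at the mean is 1/√(2π) ≈ 0.399, the peak of the bell curve.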

 

Algorithm for Anomaly Detection

p(x) = p(x1; μ1, σ1²) · p(x2; μ2, σ2²) · … · p(xn; μn, σn²)

= Π_{j=1..n} p(xj; μj, σj²)   -> the product of p(x1) · p(x2) · … · p(xn), where x is an n-dimensional vector

Estimating p(x) is also known as the problem of density estimation.

Putting it all together:

1. Choose features xj which might be indicative of anomalous examples

2. Fit parameters μ1, …, μn and σ1², …, σn²

3. Given a new example x, compute p(x)

p(x) = Π_{j=1..n} p(xj; μj, σj²)

Flag as an anomaly if p(x) < ε
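The three steps above can be sketched as a minimal NumPy implementation (a sketch, assuming each row of X is one training example and the names are illustrative):

```python
import numpy as np

def fit_parameters(X):
    """Step 2: fit mu_j and sigma_j^2 per feature; X has shape (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # 1/m normalization, matching the formulas above
    return mu, sigma2

def p(x, mu, sigma2):
    """Step 3: product over features of the per-feature Gaussian densities."""
    densities = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod()

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return p(x, mu, sigma2) < epsilon
```

An example far from the training data gets a tiny p(x) and is flagged, while an example near the mean is not.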

 

Developing and Evaluating an Anomaly Detection System

Previously we learned that having a single real-number evaluation metric is helpful when developing a supervised learning algorithm; the same applies to anomaly detection. We should also split our data set into training, CV and test sets.

The training set contains the normal examples and is unlabeled. Both the CV and test sets are labeled (y = 1 if x is anomalous and y = 0 otherwise), with different anomalous examples inserted into the CV and test sets. Ideally the CV and test sets should always contain different examples.

Evaluation

Fit the parameters (mean μ, variance σ², etc.) of the model p(x) on the training set.

On the CV set, given x, predict y = 0 if p(x) ≥ ε, or y = 1 if p(x) < ε.

Since the classes may be skewed (there are many more normal examples than anomalies), plain accuracy is misleading; we need to consider precision and recall, computed from the numbers of true positives, false positives, etc. Therefore the F1 score is a good evaluation metric.

We can choose the value of ε by evaluating a list of candidate values on the CV set, picking the value that works best there, and then testing that value on the test set. If we want to add or remove features from our model, we should likewise evaluate the change on the CV set.
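Putting the last two paragraphs together, a minimal sketch of choosing ε by F1 on the CV set (assuming p_cv holds the precomputed densities p(x) for the CV examples and y_cv their labels):

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from true positives, false positives, false negatives (y = 1 marks anomalies)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv):
    """Try a range of thresholds over the CV densities; keep the one with best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)  # predict y = 1 (anomaly) when p(x) < epsilon
        f1 = f1_score(y_cv, pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

The selected ε is then frozen and the final F1 is reported on the test set, never tuned on it.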

 

Anomaly Detection vs. Supervised Learning

When should we use anomaly detection vs. supervised learning?

If we have only a very small number of anomalous examples alongside a large number of normal (negative) examples, we should use anomaly detection: with so few positive examples, it is very hard for a supervised learning algorithm (a neural network, logistic regression, etc.) to learn what anomalies look like.


 

Choosing What Features to Use

If one of our features, say x1, does not look Gaussian (does not have the shape of the bell curve), we can try transforming x1 using log, square root, cube root functions, etc.
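A minimal sketch of such a transform, using sample skewness as one rough, illustrative way to check whether a feature has become more symmetric (bell-like); the data here is synthetic:

```python
import numpy as np

def skewness(x):
    """Sample skewness; values near 0 suggest a roughly symmetric, Gaussian-like shape."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

np.random.seed(0)                                   # reproducible synthetic data
x1 = np.random.exponential(scale=2.0, size=10_000)  # heavily right-skewed feature
x1_log = np.log(x1 + 1)                             # log transform pulls in the long tail
```

After the log transform, |skewness(x1_log)| is much smaller than |skewness(x1)|; plotting histograms of both is the usual visual check.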

So, back to the question: how do we choose our features for anomaly detection?

We can use an error analysis procedure similar to the one for supervised learning: run the algorithm on the CV set, look at the examples where it was wrong, and determine whether new features would improve performance.

The most common problem in anomaly detection is that p(x) is comparable (say, large) for both normal and anomalous examples, making it hard for the model to separate them with the current set of features. We really need to look at the individual mislabeled examples and see whether we can create new features that push p(x) below ε for the anomalies.

We can also use combination features such as x1/x2 (e.g. x1 = CPU load, x2 = network traffic; if x1 is high while x2 is low, then a high x1/x2 ratio might be a good indication that a particular machine is not operating normally).

 
