Chapter 14: Anomaly Detection Part 2: Multivariate Gaussian Distribution

Multivariate Gaussian Distribution

Previously we mentioned that a common problem with anomaly detection is that some anomalous examples have p(x) > ε. This decreases the performance of the anomaly detection algorithm.

[Image: per-feature Gaussian fits for x1 and x2 (right) and a contour plot of x1 against x2 (left), with anomalous examples in green]

From the picture, we see that the green anomalous examples, when each feature (x1, x2) is observed individually (right part of the picture), have a fairly decent probability under each univariate Gaussian distribution. But if we plot x1 against x2 on a contour plot (left part of the picture), we see the green examples lying far out from the center mean.

Therefore we introduce the Multivariate Gaussian Distribution, an enhanced model of the Gaussian distribution that captures the correlation between x1 and x2.

We will model p(x) as a whole rather than feature by feature. The mean μ becomes an n-dimensional vector, and σ becomes Σ, the n x n covariance matrix also used in PCA. (Previously, σ simply denoted the standard deviation of a single feature.)
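As a concrete sketch of the density being described (the function name here is mine, not from the course), the multivariate Gaussian takes the vector μ and the matrix Σ and scores a whole example x at once:

```python
import numpy as np

def multivariate_gaussian(x, mu, Sigma):
    """Density of the multivariate Gaussian at x.

    x, mu: length-n vectors; Sigma: n x n covariance matrix.
    p(x) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / ((2*pi)^(n/2) * sqrt(det(Sigma)))
    """
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# With the identity covariance, the density at the mean is the peak value
# 1 / (2*pi) for n = 2, matching the product of two unit univariate Gaussians.
p = multivariate_gaussian(np.array([0.0, 0.0]), np.array([0.0, 0.0]), np.eye(2))
print(p)  # ≈ 0.1592
```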

[Image: the multivariate Gaussian formula with example contour and surface plots]

Adjusting the diagonal elements of the covariance matrix changes the shape of the “hill”, like those discussed in the topic of PCA.


[Image: surface plots of the Gaussian “hill” for diagonal values 1 versus 0.6]

From the picture above, decreasing the diagonal values from 1 to 0.6 decreases the variance of x1 and x2 (the range of values they typically take on) and makes the hill skinnier but taller. The integral (total volume) under the surface of the hill must be 1, to be consistent with any probability distribution, so a skinnier “hill” becomes “taller” to make up for the lost volume.
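The skinnier-but-taller trade-off can be checked directly: for a 2-D Gaussian the peak height is 1 / (2π√det(Σ)), so shrinking the diagonal shrinks det(Σ) and raises the peak (a small illustrative sketch; the helper name is mine):

```python
import numpy as np

def peak_height(Sigma):
    # Density at the mean of a 2-D Gaussian: 1 / (2*pi*sqrt(det(Sigma))).
    return 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

wide = peak_height(np.eye(2))          # Sigma = diag(1, 1)
narrow = peak_height(0.6 * np.eye(2))  # Sigma = diag(0.6, 0.6)
print(narrow > wide)  # True: the skinnier hill is taller
```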


[Images: contour plots for positive and negative off-diagonal values of the covariance matrix]


From the pictures above, adjusting the off-diagonal values changes the correlation between x1 and x2. A positive value depicts a positive correlation and vice versa. As the magnitude increases, the oval becomes thinner, approaching an x1 = x2 or x1 = -x2 kind of correlation.
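To see the effect of a positive off-diagonal numerically (an illustrative sketch with values I chose, not the course's): with Σ having 0.8 off the diagonal, a point on the x1 = x2 line scores much higher than an equally distant point on the x1 = -x2 line.

```python
import numpy as np

Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])  # positive off-diagonal: x1 and x2 move together
Sigma_inv = np.linalg.inv(Sigma)

def log_density_shape(x):
    # Only the quadratic form matters when comparing two points
    # under the same Gaussian; the normalizing constant cancels.
    return -0.5 * x @ Sigma_inv @ x

along = log_density_shape(np.array([1.0, 1.0]))     # on the x1 = x2 line
against = log_density_shape(np.array([1.0, -1.0]))  # on the x1 = -x2 line
print(along > against)  # True: aligned points are far more probable
```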


[Image: contour plots for different values of μ]

Adjusting the values of μ moves the peak of the hill and shifts the center of the distribution, as illustrated in the picture.


Anomaly Detection using the Multivariate Gaussian Distribution

Compute μ and Σ from the training set, then plug a new example x into the formula, as illustrated in the picture below.

[Image: parameter fitting formulas for μ and Σ, and the density formula p(x)]


Note that μ here is an n-dimensional vector.
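The whole procedure can be sketched end to end: fit μ and Σ from the training matrix, then flag any new example whose density falls below ε (a minimal sketch; the function names, the toy data, and the ε value are mine, not the course's):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate mu and Sigma from an m x n matrix of training examples."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]  # maximum-likelihood covariance
    return mu, Sigma

def p(x, mu, Sigma):
    """Multivariate Gaussian density at x."""
    n = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
        (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

def is_anomaly(x, mu, Sigma, epsilon):
    return p(x, mu, Sigma) < epsilon

# Toy data: strongly correlated features clustered near (1, 1).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1, 1], [[1, 0.9], [0.9, 1]], size=500)
mu, Sigma = fit_gaussian(X)

# A point far off the correlation axis is flagged even though each
# feature value on its own looks unremarkable.
print(is_anomaly(np.array([2.5, -0.5]), mu, Sigma, epsilon=0.02))  # True
```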

Original model vs Multivariate Model

[Image: comparison table of the original model versus the multivariate model]

When we have a smaller training set, the original model is better, because when m < n, Σ cannot be inverted. If we can manually come up with new features, we should also stick with the original model. Note that a new feature does not mean x3 = x1 + x2; that combination does not capture new information or a new relationship, it simply adds the two values together. Contrast this with x3 = CPU load / memory: this ratio is more informative and gives us insight. A ratio much higher or lower than its mean suggests a high possibility that a particular machine is acting strangely.
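The m < n failure is easy to reproduce (a sketch with a deliberately tiny data matrix I made up): with m = 2 examples and n = 3 features, the maximum-likelihood Σ has rank at most m - 1, so inversion fails.

```python
import numpy as np

# m = 2 examples, n = 3 features: the estimated covariance is rank-deficient.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0]])
mu = X.mean(axis=0)
diff = X - mu
Sigma = diff.T @ diff / X.shape[0]

print(np.linalg.matrix_rank(Sigma))  # 1, which is less than n = 3
try:
    np.linalg.inv(Sigma)
except np.linalg.LinAlgError:
    print("Sigma is singular and cannot be inverted")
```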