Chapter 12: Dimensionality Reduction (Principal Component Analysis)

What is it?

An idea which we can use to reduce the number of features in x: we want to reduce x from an n-dimensional vector to a k-dimensional vector, where k < n.

Motivation

To

1. save memory / disk space –> somewhat trivial

2. make learning algorithms run faster –> a 100-dimensional vector is cheaper to compute with than a 1000-dimensional one

3. better data visualization –> it is easy to visualize a 2D or 3D plot but hard to visualize a higher-dimensional one

The technique used to reduce the dimensions is Principal Component Analysis (PCA), which works like a compression method.

Principal Component Analysis Problem Formulation

The idea is to find a line or k-dimensional plane onto which we can project our examples. If we want to reduce from n dimensions to k dimensions, we will have k direction vectors u(1), ..., u(k), each pointing in a certain direction, which together define the plane. The plane or line chosen is the one with the lowest cost error. It is the job of PCA to provide us with this plane.

Note: The cost error (projection error) for a specific x is the length of the line segment perpendicular to the plane: its first endpoint is the original x and its second endpoint is the projection of x onto the plane. The distance between these two points, measured perpendicular to the plane, is the cost error for that x.

This line is not the projection line which we saw in SVM.
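
To make the projection error concrete, here is a minimal Octave sketch (my own illustration, not from the lecture) that projects a single 2D example onto a direction vector u and measures the perpendicular distance:

x = [3; 4];                  % one training example
u = [1; 1] / sqrt(2);        % direction of the projection line, normalised to unit length

proj = (u' * x) * u;         % location of x projected onto the line
err  = norm(x - proj);       % perpendicular distance = projection (cost) error for this x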


PCA is not linear regression, although it may look like we are also trying to fit a straight line to the data. PCA and linear regression are totally different algorithms with totally different objectives. In linear regression we are trying to predict y from x. In PCA we are trying to reduce the number of dimensions of x; all examples are treated equally, there is no concept of y and nothing to predict, we merely want to “simplify” each x.

Principal Component Analysis Algorithm (How to implement PCA ?)

The vectors obtained by projecting the examples onto the plane or line will be called z, and the direction vectors which make up the plane will be called u.

Important:  It is best to perform feature scaling and mean normalization on the data before carrying out PCA.

1.  Compute the “covariance” matrix, denoted by capital sigma Σ, where:

Σ = (1/m) * Σ_{i=1}^{m} (x^(i)) (x^(i))^T

2. Compute the “eigenvectors” of the covariance matrix Σ

Σ will be an n x n matrix: x^(i) is an n x 1 feature vector and (x^(i))^T is a 1 x n matrix, so each product (x^(i))(x^(i))^T is an n x n matrix. Altogether we have m of these n x n matrices; sum them all up and take the average.

[U, S, V] = svd(Sigma)   (the function svd is “singular value decomposition”, which we can call from Octave/MATLAB)

After getting [U, S, V], the U matrix will contain all the direction vectors u. U is an n x n matrix where each column represents one direction vector u. So to reduce x from an n-dimensional to a k-dimensional vector, we take the first k columns of U and put them together into a matrix called Ureduce:

z^(i) = (Ureduce)^T * x^(i)
Ureduce is an n x k matrix, so transposing it produces a k x n matrix.
x^(i) is an n x 1 vector, so the product z^(i) is a k x 1 vector.
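
Putting these steps together, here is a minimal Octave sketch of the whole procedure. It assumes the data is stored in an m x n matrix X (one example per row) and that k has already been chosen; names like X_norm and Z are my own.

m      = size(X, 1);                 % number of examples, one per row of X
mu     = mean(X);                    % 1 x n row vector of feature means
X_norm = X - mu;                     % mean normalization (add feature scaling if ranges differ a lot)

Sigma     = (1/m) * (X_norm' * X_norm);   % n x n covariance matrix
[U, S, V] = svd(Sigma);                   % columns of U are the direction vectors u

Ureduce = U(:, 1:k);                 % n x k: keep the first k columns of U
Z       = X_norm * Ureduce;          % m x k: row i is z(i) = (Ureduce)' * x(i)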


 

The Quiz Question for this topic is the following slide:

[quiz slide]

Initially I was caught off guard and didn’t really understand it until I thought about it a bit more.

z_j represents the jth element of z, which is a k-dimensional vector, so z_j is a real number. This real number is the dot product of the jth row of (Ureduce)^T, i.e. (u^(j))^T, with x, which is an n x 1 vector.
The jth row of (Ureduce)^T is a 1 x n row vector, so the product of [1 x n] and [n x 1] is a [1 x 1] real number: z_j = (u^(j))^T * x. Therefore the answer is the 4th choice.
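
As a tiny Octave sanity check of the dimensions (assuming Ureduce, x and an index j are already defined):

u_j = Ureduce(:, j);      % jth column of Ureduce is u(j), an n x 1 vector
z_j = u_j' * x;           % [1 x n] * [n x 1] = a single real number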

 

Reconstruction from Compressed Representation

How do we retrieve the original x vector from the compressed z vector?

z = (Ureduce)^T * x
x_approx = Ureduce * z

Note: Ureduce is an n x k matrix and z is a k x 1 vector, so x_approx is an n x 1 vector (an approximation of the original x).
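
Continuing the Octave sketch from above, compression and reconstruction are each a single matrix product (Z and X_approx hold one example per row):

Z        = X_norm * Ureduce;    % compress: row i is the k-dimensional z(i)
X_approx = Z * Ureduce';        % reconstruct: row i approximates the mean-normalized x(i)
X_rec    = X_approx + mu;       % add the means back to return to the original scale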

Choosing the Number of Principal Components

How do we choose the value of k?

First we need to understand 2 terms:

1. Average squared projection error: (1/m) * Σ_{i=1}^{m} || x_approx^(i) – x^(i) ||^2

Above is the value which PCA is trying to minimise.

2. Total variation in the data: (1/m) * Σ_{i=1}^{m} || x^(i) ||^2

Above is the average squared distance of the examples from the origin. Each x is a vector shooting out from the origin, so || x^(i) ||^2 is its squared distance. Sum up the squared distances and compute the average.

We choose k such that

(1/m) * Σ_{i=1}^{m} || x_approx^(i) – x^(i) ||^2
————————————————————————    <= 0.01
(1/m) * Σ_{i=1}^{m} || x^(i) ||^2

0.01 means that 99% of the variance is retained, 0.05 means that 95% of the variance is retained, and so on.

A reasonable method to choose k would be to start off with k = 1, compute the Ureduce matrix, use it to compute the z vectors and then the x_approx vectors, and apply the check to see if the ratio is <= 0.01. If yes, we stop; otherwise we increment k by 1 and repeat the process. This is a troublesome method because we need to recompute Ureduce, z and x_approx in each iteration. Luckily there is a simpler method.
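
As a rough Octave sketch of this direct check for one candidate value of k (reusing U and X_norm from the earlier sketch):

Ureduce  = U(:, 1:k);
Z        = X_norm * Ureduce;
X_approx = Z * Ureduce';

proj_error = mean(sum((X_norm - X_approx) .^ 2, 2));   % average squared projection error
total_var  = mean(sum(X_norm .^ 2, 2));                % total variation in the data

retained_ok = (proj_error / total_var) <= 0.01;        % true if 99% of variance is retained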

[U, S , V] = svd(Sigma)

We can make use of S to apply the condition instead of recomputing the different vectors. S is a special n x n diagonal matrix: only the diagonal values S11, ..., Snn are non-zero (and they are sorted in decreasing order), while the rest of the elements are 0. We can replace:

(1/m) * Σ_{i=1}^{m} || x_approx^(i) – x^(i) ||^2
————————————————————————    <= 0.01
(1/m) * Σ_{i=1}^{m} || x^(i) ||^2

with

Σ_{i=1}^{k} Sii
————————    >= 1 – 0.01    for a given k
Σ_{i=1}^{n} Sii
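
A minimal Octave sketch of this shortcut, scanning k from 1 upward and stopping at the smallest k that retains 99% of the variance:

s = diag(S);                           % vector of diagonal entries S11, ..., Snn
n = length(s);
for k = 1:n
    if sum(s(1:k)) / sum(s) >= 0.99    % 99% of variance retained
        break;                         % k is now the smallest value satisfying the condition
    end
end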


 

 

Advice for Applying PCA

PCA should be applied on the training set only. This means the input Sigma to the svd function should be derived from the training-set data. Then use the matrix S to determine the value of k. After that we can take the learned Ureduce matrix and apply it to the CV and test sets to map Xcv, Xtest to Zcv, Ztest respectively.
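
In Octave this boils down to computing the normalization parameters and Ureduce once from the training set and reusing them everywhere; a rough sketch (Xtrain is my name for the training-set matrix, one example per row):

mu        = mean(Xtrain);                                 % learned from the training set only
Xn        = Xtrain - mu;
Sigma     = (1 / size(Xtrain, 1)) * (Xn' * Xn);
[U, S, V] = svd(Sigma);
Ureduce   = U(:, 1:k);                                    % k chosen using S, as above

Ztrain = (Xtrain - mu) * Ureduce;                         % the same mu and Ureduce are then
Zcv    = (Xcv    - mu) * Ureduce;                         % applied to the CV and test sets
Ztest  = (Xtest  - mu) * Ureduce;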

PCA is a BAD method to address overfitting; we should use regularization instead.

When designing an ML system, first try without applying PCA and see how the system performs using only the raw/original data. If it turns out to be slow or the data is hard to visualize, then add in PCA.
