Cost Function
In neural networks we have multiple hypotheses at the output layer to handle multiclass classification problems, so we need to adjust the cost function to sum up the cost of each hypothesis. Also, since we now have more than one layer of units and thus more than one set of thetas, we need to adjust the regularization term as well.
Previously (regularized logistic regression) we had
J(Θ) = -1/m [ Σ_{i=1}^{m} y^(i) log(h_Θ(x^(i))) + (1 - y^(i)) log(1 - h_Θ(x^(i))) ] + λ/2m Σ_{j=1}^{n} θ_j²
Now we have
J(Θ) = -1/m [ Σ_{i=1}^{m} Σ_{k=1}^{K} y_k^(i) log((h_Θ(x^(i)))_k) + (1 - y_k^(i)) log(1 - (h_Θ(x^(i)))_k) ] + λ/2m Σ_{l=1}^{L-1} Σ_{i=1}^{S_l} Σ_{j=1}^{S_{l+1}} (Θ_ji^(l))²
NOTE: The cost function of a neural network is non-convex.
Interpretation of cost function
K denotes the number of classes. Previously h_Θ(x) output a single real value; now, since we have K output units, h_Θ(x) is a K-dimensional vector and (h_Θ(x))_k denotes the kth output. So for each of the m examples, we have K (K >= 3) real values to sum instead of one, which is why the inner sum Σ_{k=1}^{K} was added to the cost function.
L denotes the total number of layers (input + hidden + output). S_l denotes the number of neurons/units (excluding the bias unit) in layer l. For example, for the neural network pictured below, we have S_1 = 4, S_2 = 5, S_3 = 1 and L = 3.
As for the regularization term, our thetas are now two-dimensional matrices for layers 1 and 2. Given the picture above, Theta1 is associated with layer 1 and has dimension 5 x (4+1), while Theta2 is a 1 x (5+1) matrix; we then add every theta² in the two matrices (simply summing over every i, j and l).
The number of rows in the theta matrix is determined by the number of units/neurons in layer l+1 (that is, S_{l+1}), and the current row is indexed by j.
The number of columns in the theta matrix is determined by the number of units/neurons in layer l (that is, S_l), and the current column is indexed by i.
Recall that if layer l has s neurons and the next layer l+1 has n neurons, then the theta matrix associated with layer l has dimension n x (s + 1). Therefore the row index j loops from j = 1 to S_{l+1}.
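Putting the double sum and the regularization term together, the cost function above can be sketched in code. This is a minimal sketch assuming a 3-layer sigmoid network like the 4-5-1 one in the picture; the function and variable names (`nn_cost`, `lam`, etc.) are my own, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized cost J(Theta) for a 3-layer network.
    Theta1: S2 x (S1+1), Theta2: K x (S2+1), X: m x S1, Y: m x K."""
    m = X.shape[0]
    # Forward propagation, vectorized over all m examples
    A1 = np.hstack([np.ones((m, 1)), X])          # add bias column
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)                    # m x K hypothesis values
    # Double sum over examples i and classes k
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: every theta squared, excluding the bias columns
    reg = np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2)
    return J + lam / (2 * m) * reg
```

With lam = 0 this reduces to the unregularized double sum; the bias columns are never regularized.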
Backpropagation Algorithm
Backpropagation is a method used to compute the gradient of the cost function. The gradient comes from the partial derivative terms of the cost function and is fed into gradient descent or an advanced optimization method. As with linear/logistic regression, the goal is to minimize the cost function J(Θ).
To understand this method, we first need to understand forward propagation. It is the method used to “activate” the units/neurons in the hidden and output layers. Given the same neural network as in the picture above, we have L = 3 and activations a1, a2 and a3, where:
a1 = x plus the bias unit (a1 = [1; x]), which will be a (4+1)-dimensional vector
z2 = Θ1 * a1
a2 = g(z2), which will be a 5-dimensional vector
then a2 = a2 plus the bias unit (a2 = [1; a2]) to become a 6-dimensional vector
z3 = Θ2 * a2
a3 = g(z3) = a real number
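The steps above translate almost line for line into NumPy. The weights and the input vector below are made-up values for the pictured 4-5-1 network, and `sigmoid` plays the role of g:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for the pictured 4-5-1 network:
# Theta1 is 5 x (4+1), Theta2 is 1 x (5+1)
rng = np.random.default_rng(42)
Theta1 = rng.standard_normal((5, 5)) * 0.1
Theta2 = rng.standard_normal((1, 6)) * 0.1

x = np.array([0.5, -1.2, 0.3, 2.0])   # one training example, 4 features

a1 = np.concatenate([[1.0], x])       # add bias unit -> (4+1)-dim vector
z2 = Theta1 @ a1
a2 = sigmoid(z2)                      # 5-dim vector
a2 = np.concatenate([[1.0], a2])      # a2 = [1; a2] -> 6-dim vector
z3 = Theta2 @ a2
a3 = sigmoid(z3)                      # the hypothesis, a single real value
```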
For backpropagation, we introduce a new term δ (delta), where δ_j^(l) denotes the “error” of unit j in layer l.
Example: Given our neural network with L=3, we will start to compute from the output layer.
δ_1^(3) = a_1^(3) - y_1, which computes the error for node 1 in layer 3. We can also vectorize this as δ^(3) = a^(3) - y.
Then:
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
Both (Θ^(2))^T δ^(3) and g'(z^(2)) (computed from the bias-included a^(2), see below) are 6 x 1 vectors here, and .* is an element-wise product.
So to generalize, we have δ^(l) = (Θ^(l))^T δ^(l+1) .* g'(z^(l))
g'(z^(2)) is the derivative of g(z^(2)), which can be computed as
g'(z^(2)) = a^(2) .* (1 - a^(2))
Note: the 1 here refers to a vector of ones.
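This identity can be checked numerically by comparing it against a two-sided difference quotient; the values of z here are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
a = sigmoid(z)
analytic = a * (1 - a)                 # g'(z) = a .* (1 - a)

# Two-sided numerical derivative of g at the same points;
# the two results agree to many decimal places
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```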
Also, there is no error term for layer 1, so δ^(1) does not exist: layer 1 is the input layer and uses the actual data, which contains no error, whereas the later layers are computed from previous layers and may contain error. (This is what I think, might be wrong 🙂 )
We see that we start computing the deltas from the output layer and work backwards, hence the name backpropagation.
After computing all the δ terms, we need to “accumulate” them. To do this we introduce Δ_ij^(l), a set of matrices initialized to all zeros. Δ is the capital of the Greek letter δ.
Then apply the update rule:
Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1)
To summarize
Set Δ_ij^(l) = 0 (for all l, i and j)
For i = 1 to m examples
1. Set a^(1) = x^(i)
2. Compute the forward activations a^(l) for l = 2 … L
3. Compute δ^(l) for l = L, L-1, … 2
4. Δ^(l) = Δ^(l) + δ^(l+1) (a^(l))^T –> vectorized version of the update rule
Then after looping over all examples:
D_ij^(l) = 1/m * (Δ_ij^(l) + λ Θ_ij^(l)) if j != 0
D_ij^(l) = 1/m * Δ_ij^(l) if j = 0
D_ij^(l) is the partial derivative of J(Θ) with respect to Θ_ij^(l), i.e. the gradient for Θ_ij^(l).
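The whole procedure (zero the Δ matrices, loop over the examples doing forward then backward propagation, accumulate, then average and regularize) can be sketched as below, assuming a 3-layer sigmoid network; the function name and variable shapes are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """Gradient matrices D1, D2 for a 3-layer network, one example at a time."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)     # Delta^(l) has the same shape as Theta^(l)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # forward propagation for example i
        a1 = np.concatenate([[1.0], X[i]])
        a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])
        a3 = sigmoid(Theta2 @ a2)
        # output-layer error, then hidden-layer error
        d3 = a3 - Y[i]
        d2 = (Theta2.T @ d3) * (a2 * (1 - a2))
        d2 = d2[1:]                    # drop the bias unit's "error"
        # accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # average over m, regularizing everything except the bias column
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```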
My own interpretation (might be wrong 🙂 )
It seems we only have Δ matrices for layers 1, 2 … L-1.
The number of Δ matrices equals the number of Θ matrices, and Δ^(l) has the same dimensions as Θ^(l). We forward propagate to compute a_j^(l) for all l and j using some initial thetas for all training examples, then backward-compute δ for all l and j as well and accumulate the δ terms. Given the sample picture, we should have 2 Δ matrices, Δ^(1) and Δ^(2), and each accumulates m per-example contributions.
The accumulation is Δ^(1) = Δ^(1) from the 1st example + … + Δ^(1) from the mth example,
and likewise Δ^(2) = Δ^(2) from the 1st example + … + Δ^(2) from the mth example.
Therefore, we need to divide by m to get the average.
Gradient Checking
Due to the subtle bugs that can occur when implementing backpropagation, we use gradient checking to verify that our computed derivatives (D_ij^(l)) are correct. The way we do it:
1. First choose an epsilon (ε) which is very small, say 10^-4.
2. For the particular theta (Θ) whose derivative backpropagation is computing, choose the 2 points Θ + ε and Θ - ε on the x axis. The slope between these 2 points is (J(Θ + ε) - J(Θ - ε)) / 2ε –> change in y / change in x.
3. If the gradient calculated for Θ_ij^(l) is approximately equal to (J(Θ + ε) - J(Θ - ε)) / 2ε, then we know backpropagation is implemented correctly.
4. Repeat 2 and 3 for all thetas.
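A sketch of steps 1 to 4, assuming the thetas are unrolled into one long vector so each one can be perturbed in turn; `cost` here is a bare unregularized 3-layer cost, and the helper names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta_vec, shapes, X, Y):
    """Unrolled-cost helper: theta_vec holds Theta1 and Theta2 flattened."""
    (r1, c1), (r2, c2) = shapes
    Theta1 = theta_vec[: r1 * c1].reshape(r1, c1)
    Theta2 = theta_vec[r1 * c1 :].reshape(r2, c2)
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    return -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

def numerical_gradient(theta_vec, shapes, X, Y, eps=1e-4):
    """Two-sided slope (J(theta + eps) - J(theta - eps)) / 2*eps per theta."""
    grad = np.zeros_like(theta_vec)
    for i in range(theta_vec.size):
        bump = np.zeros_like(theta_vec)
        bump[i] = eps
        grad[i] = (cost(theta_vec + bump, shapes, X, Y)
                   - cost(theta_vec - bump, shapes, X, Y)) / (2 * eps)
    return grad
```

Each entry of the result is compared against the corresponding D_ij^(l) from backpropagation; this check is slow, so it is only run once to validate the implementation.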
Random Initialization
Problem: if we were to set all elements of our theta matrices to 0, or to the same value such as 1, all activation units would have the same value and the derivative for each unit would be the same as well, so the network would ultimately be limited to effectively one feature.
Example, given our sample neural network:
We would have a 5 x (4+1) Theta1 matrix for layer 1. Suppose each element has the value 1. This would mean
a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3 + Θ_14^(1) x_4)
a_2^(2) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3 + Θ_24^(1) x_4)
If all the thetas above are the same, then a_1^(2) = a_2^(2) = … = a_5^(2), which also means δ_1^(2) = δ_2^(2) = … = δ_5^(2). (Why? See below.)
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
g'(z^(2)) = a^(2) .* (1 - a^(2)) –> if each element of a^(2) is the same, then each element of the g'(z^(2)) vector will be the same. We know each element of Θ^(2) is the same value, so (Θ^(2))^T δ^(3) produces a vector of identical values. (It is like [2;2;2;2;2] .* [3;3;3;3;3].)
This means the gradients of Θ_10^(1), Θ_20^(1), … are all equal, for all i and j of Θ^(1). After one iteration of gradient descent, all thetas in Θ^(1) will still be identical, so we again get a_1^(2) = a_2^(2) = … = a_5^(2) after activating a^(2). Please note this does not mean that the value of a^(2) after one iteration is the same as before the update rule. This example just illustrates that each element of a^(2) is still the same as the others, though they may differ from before the iteration. (Example: before one iteration, a^(2) = [2;2;2;2;2]; after one iteration and updating Θ^(1), a^(2) becomes [4;4;4;4;4].)
Therefore we need to perform random initialization of the theta matrices by randomly choosing values in [-ε, ε].
After-thoughts about the above explanation, after thinking it through a few times
The above example is based on the assumption that we only have one classifier. Suppose we have multiple classifiers (say 4) at the output layer; then not every element of δ^(3) will be the same. Consider the case y^(i) = [1;0;0;0] and a^(3) = [0;0;0;0]; then δ^(3) = [0;0;0;0] - [1;0;0;0] = [-1;0;0;0]. All this while we had been dealing with every element of the activation nodes being the same, so what happens if our δ^(3) is now [-1;0;0;0]? Does my own interpretation still hold? This made me think deeper for a while. It turns out it still holds even with K classifiers (K >= 3). a_1^(2) is the sum of all the weighted inputs in a^(1) (the 4 green nodes combine to become a_1^(2)), and the same goes for the other activation nodes in layer 2. Likewise δ_1^(2) is the sum of the weighted “errors” in δ^(3). Imagine we have 4 red nodes at the output layer and the error vector is H = [-1;0;0;0]. Since the weights/thetas all have the same value, then
δ_1^(2) = (W1*H(1) + W2*H(2) + W3*H(3) + W4*H(4)) * g'(z_1^(2))
…
δ_5^(2) = (W1*H(1) + W2*H(2) + W3*H(3) + W4*H(4)) * g'(z_5^(2))
which means δ_1^(2) = δ_2^(2) = δ_3^(2) = δ_4^(2) = δ_5^(2), because W1 = W2 = W3 = W4 and z_1^(2) = … = z_5^(2).
Therefore the initial explanation for K = 1 should still hold for K >= 3.
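A small numerical experiment seems to confirm this: take a 4-5-4 network (K = 4 outputs), set every weight to the same constant, run a few gradient steps on a one-hot label, and the five hidden units stay identical throughout. All the numbers here are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4-5-4 network with every weight set to the same constant
Theta1 = np.full((5, 5), 0.5)
Theta2 = np.full((4, 6), 0.5)

x = np.array([0.2, -0.4, 1.0, 0.7])
y = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot label, K = 4

for _ in range(3):                    # a few gradient steps
    a1 = np.concatenate([[1.0], x])
    a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])
    a3 = sigmoid(Theta2 @ a2)
    d3 = a3 - y                       # elements of d3 differ...
    d2 = (Theta2.T @ d3)[1:] * (a2 * (1 - a2))[1:]
    Theta2 -= 0.1 * np.outer(d3, a2)
    Theta1 -= 0.1 * np.outer(d2, a1)

# ...but every row of Theta1 stays identical, so every hidden unit
# still computes exactly the same value
hidden = sigmoid(Theta1 @ np.concatenate([[1.0], x]))
```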
Random method
1. Set initial_theta = rand(m, n) –> each element will be a value between 0 and 1 via the rand function
2. initial_theta = initial_theta * 2ε - ε
Therefore each element in initial_theta will lie in [-ε, ε]. (0 * 2ε - ε gives the lower bound -ε; 1 * 2ε - ε gives the upper bound ε.)
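In NumPy, `np.random.rand` behaves like the rand above, returning values in [0, 1); the value of ε here is a made-up choice:

```python
import numpy as np

eps_init = 0.12                  # hypothetical choice of epsilon
rows, cols = 5, 5                # dimensions of Theta1 for the 4-5-1 network
# rand gives uniform values in [0, 1); scale and shift into [-eps, eps)
initial_Theta1 = np.random.rand(rows, cols) * 2 * eps_init - eps_init
```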
Putting it all together
Training a neural network
1. Select a network architecture (basically, choose the number of hidden layers, with 1 as the default, and the number of hidden units in each layer; usually the more the better)
2. Randomly initialize weights
3. Write code to compute cost function J(Θ)
4. For i = 1 to m examples
Perform forward propagation for x^(i)
Perform backpropagation to compute the deltas for each layer
Accumulate the deltas into Δ
5. Compute the partial derivative terms
6. Use gradient checking to compare the partial derivatives from backprop against a numerical estimate of the gradient of the cost function
7. Disable the gradient checking code
8. Use gradient descent or an advanced optimization method with backprop to minimize J as a function of the thetas
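The steps above can be strung together into one end-to-end sketch. This is a toy run, learning the OR function with a 2-4-1 network and plain gradient descent; the data, learning rate, and iteration count are all made up, and the gradient-checking step is omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(Theta1, Theta2, X, Y, lam):
    """Steps 3-5: forward prop, backprop, averaged gradients (vectorized)."""
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    D3 = H - Y                                       # output-layer errors, m x K
    D2 = (D3 @ Theta2)[:, 1:] * (A2 * (1 - A2))[:, 1:]
    G2 = D3.T @ A2 / m
    G1 = D2.T @ A1 / m
    G2[:, 1:] += (lam / m) * Theta2[:, 1:]
    G1[:, 1:] += (lam / m) * Theta1[:, 1:]
    return J, G1, G2

# Steps 1-2: pick a 2-4-1 architecture, randomly initialize in [-eps, eps]
rng = np.random.default_rng(0)
eps = 0.12
Theta1 = rng.random((4, 3)) * 2 * eps - eps
Theta2 = rng.random((1, 5)) * 2 * eps - eps

# Toy training set: the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)

# Step 8: minimize J with plain gradient descent
for _ in range(5000):
    J, G1, G2 = cost_and_grad(Theta1, Theta2, X, Y, lam=0.0)
    Theta1 -= 1.0 * G1
    Theta2 -= 1.0 * G2
```

An advanced optimizer (for example scipy.optimize.minimize, fed the same cost and gradient with the thetas unrolled into a single vector) would replace the hand-written descent loop.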