Chapter 7: Neural Networks (Learning)

Cost Function

In neural networks we have multiple hypothesis outputs at the output layer to address multi-class classification problems, so we need to adjust the cost function to sum up the cost of every output. Also, since we now have more than one layer of units and thus more than one set of thetas, we need to adjust the regularization term as well.

Previously we have

J(Θ) = -(1/m) [ Σ_{i=1}^{m} y^(i) log(h_Θ(x^(i))) + (1 − y^(i)) log(1 − h_Θ(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} θ_j²

Now we have

J(Θ) = -(1/m) [ Σ_{i=1}^{m} Σ_{k=1}^{K} y_k^(i) log((h_Θ(x^(i)))_k) + (1 − y_k^(i)) log(1 − (h_Θ(x^(i)))_k) ]

+ (λ/2m) Σ_{l=1}^{L−1} Σ_{i=1}^{S_l} Σ_{j=1}^{S_{l+1}} (Θ_{ji}^(l))²

NOTE: The cost function of a neural network is non-convex.
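As a rough numpy sketch of this cost (the names H, Y, Thetas are mine: H would hold the m × K hypothesis outputs, Y the one-hot labels, and Thetas the list of weight matrices):

```python
import numpy as np

# Sketch of the multiclass cost. H: (m, K) hypothesis outputs,
# Y: (m, K) one-hot labels, Thetas: list of weight matrices, lam: lambda.
def nn_cost(H, Y, Thetas, lam):
    m = Y.shape[0]
    # Double sum over examples i and classes k
    unreg = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: sum every theta^2, skipping each bias column (i = 0)
    reg = (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return unreg + reg
```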

Interpretation of cost function

K denotes the number of classes we have. Previously h_Θ(x) output a single real value; now, since we have K output units, h_Θ(x) is a K-dimensional vector, and (h_Θ(x))_k denotes the kth output. So for each of the m examples we have K (K ≥ 3) real values to sum instead of one, which is why the Σ_{k=1}^{K} was added to the cost function.

L denotes the total number of layers (input + hidden + output). S_l denotes the number of neurons/units (excluding the bias unit) in layer l. For the example network below, we have S_1 = 4, S_2 = 5, S_3 = 1, and L = 3.

[image: neural-network]

As for the regularization term: our thetas are now two-dimensional matrices, one for layer 1 and one for layer 2. For the network pictured above, Theta1 (associated with layer 1) has dimensions 5 × (4 + 1) and Theta2 is a 1 × (5 + 1) matrix, and we add up every theta² in the two matrices (simply summing over every i, j and l).

The number of rows in the theta matrix is determined by the number of units/neurons in layer l + 1 (i.e. S_{l+1}), and the current row is indexed by j.

The number of columns is determined by the number of units/neurons in layer l (i.e. S_l, plus one for the bias), and the current column is indexed by i.

Recall that if layer l has S_l neurons and the next layer l + 1 has S_{l+1} neurons, then the theta matrix associated with layer l has dimensions S_{l+1} × (S_l + 1). Therefore the row index j runs from 1 to S_{l+1}.
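A quick numpy check of these dimensions for the sample network (S1 = 4, S2 = 5, S3 = 1):

```python
import numpy as np

# Units per layer, excluding bias. Theta(l) maps layer l (plus its bias unit)
# to layer l+1, so its shape is S_{l+1} x (S_l + 1).
S = [4, 5, 1]
thetas = [np.zeros((S[l + 1], S[l] + 1)) for l in range(len(S) - 1)]
print([T.shape for T in thetas])  # [(5, 5), (1, 6)]
```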

Backpropagation Algorithm 

It is a method for computing the gradient of the cost function. The gradient is the output of the partial derivative terms of the cost function and is fed into gradient descent or the advanced optimization methods. As with linear/logistic regression, the goal is to minimize the cost function J(Θ).

To understand this method, we first need to understand forward propagation. It is the method used to “activate” the units/neurons in the hidden and output layers. Given the same neural network as in the picture above, we have L = 3 and a^(1), a^(2), a^(3), where:

a^(1) = x plus the bias unit, which is a (4 + 1)-dimensional vector

z^(2) = Θ^(1) a^(1)

a^(2) = g(z^(2)), which is a 5-dimensional vector

then add the bias unit (a^(2) = [1; a^(2)]) to make it a 6-dimensional vector

z^(3) = Θ^(2) a^(2)

a^(3) = g(z^(3)) = a real number
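The steps above as a numpy sketch (assuming x is the 4-dimensional input vector and Theta1, Theta2 have shapes 5 × 5 and 1 × 6 as described earlier):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # Layer 1: input plus bias unit -> (4 + 1)-dimensional vector
    a1 = np.concatenate(([1.0], x))
    # Layer 2: activate, then prepend the bias -> 6-dimensional vector
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    # Layer 3: single output unit -> a real number
    z3 = Theta2 @ a2
    return sigmoid(z3)
```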

For backpropagation, we introduce a new term δ_j^(l) (delta), which denotes the “error” of unit j in layer l.

Example: Given our neural network with L=3, we will start to compute from the output layer.

δ_1^(3) = a_1^(3) − y_1, which computes the error for node 1 of layer 3. We can also vectorize this as δ^(3) = a^(3) − y.

Then:

δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))

Both g'(z^(2)) and (Θ^(2))^T δ^(3) are 6 × 1 vectors here (the bias unit included), and .* is an element-wise multiplication.

So to generalize, we have δ^(l) = (Θ^(l))^T δ^(l+1) .* g'(z^(l))

g'(z^(2)) is the derivative of g(z^(2)), which can be computed as:

g'(z^(2)) = a^(2) .* (1 − a^(2))

Note: The 1 here refers to a vector of ones.
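This identity can be verified numerically; a small sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, 0.0, 2.0])
a = sigmoid(z)
analytic = a * (1 - a)  # g'(z) via the identity above
# Two-sided numerical derivative for comparison
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(analytic, numeric))  # True
```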

Also, there is no error for layer 1, so δ^(1) is not defined: layer 1 is the input layer using the actual data and contains no error, whereas the later layers are computed from previous layers and can contain error. (This is what I think, might be wrong 🙂 )

We see that we start computing the deltas from the output layer and work backwards, hence the name backpropagation.

After computing all the δ's, we need to “accumulate” them. To do this we introduce Δ_ij^(l), a matrix initialized to all zeros. Δ is the capital of the Greek letter δ.

Then apply the update rule which is

Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)

To summarize

Set Δ_ij^(l) = 0 (for all l, i and j)

For i = 1 to m examples:

1. Set a^(1) = x^(i)
2. Compute the forward activations a^(l) for l = 2 … L
3. Compute δ^(l) for l = L, L−1, … 2
4. Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T –> vectorized version of the update rule

Then after looping all examples :

D_ij^(l) = (1/m) [ Δ_ij^(l) + λ Θ_ij^(l) ]  if j ≠ 0
D_ij^(l) = (1/m) Δ_ij^(l)  if j = 0

D_ij^(l) is the partial derivative of J(Θ) with respect to Θ_ij^(l), i.e. it outputs the gradient for Θ_ij^(l).

My own interpretation (might be wrong 🙂 )

It seems we only have Δ for layers 1, 2, … L−1.

The number of Δ matrices equals the number of Θ matrices, and Δ^(l) has the same dimensions as Θ^(l). Forward propagation computes a_j^(l) for all l and j using some initial thetas, for all training examples; the backward pass then computes δ for all l and j, which we accumulate. Given the sample picture, we have 2 Δ matrices, Δ^(1) and Δ^(2), with m per-example contributions to each.

The accumulation is Δ^(1) = Δ^(1) (1st example) + … + Δ^(1) (mth example)

Δ^(2) = Δ^(2) (1st example) + … + Δ^(2) (mth example)

Therefore, we need to divide by m to get the average.
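Putting the loop, the accumulation and the division by m together, a minimal numpy sketch for the sample 4-5-1 network (all names are mine, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, y, Theta1, Theta2, lam):
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)  # accumulators, same shapes as the thetas
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward pass
        a1 = np.concatenate(([1.0], X[i]))
        a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
        a3 = sigmoid(Theta2 @ a2)
        # Backward pass: output error, then hidden error
        d3 = a3 - y[i]
        d2 = (Theta2.T @ d3) * a2 * (1 - a2)  # g'(z2) = a2 .* (1 - a2)
        d2 = d2[1:]                           # drop the bias component
        # Accumulate: Delta(l) += delta(l+1) * a(l)^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # Gradients: divide by m; regularize every column except the bias (j = 0)
    D1 = Delta1 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2 = Delta2 / m
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```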

Gradient Checking

Due to the subtle problems associated with backpropagation, we use gradient checking to verify that our computed derivatives (D_ij^(l)) are correct. The way we do it:

1. First choose an epsilon (ε) which is very small, say 10^-4.

2. For the particular theta (Θ) that backpropagation is computing, choose 2 points on the x axis, Θ + ε and Θ − ε. The slope between these 2 points is (J(Θ + ε) − J(Θ − ε)) / 2ε –> change in y / change in x.

3. If the calculated gradient for Θ_ij^(l) is approximately equal to (J(Θ + ε) − J(Θ − ε)) / 2ε, then we know backpropagation is implemented correctly.

4.  Repeat 2 and 3 for all thetas.
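Steps 1–4 as a numpy sketch, assuming the cost is exposed as a function of a flat parameter vector (names are mine):

```python
import numpy as np

def numerical_gradient(cost, theta, eps=1e-4):
    """Two-sided numerical estimate of the gradient of cost at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = eps
        # Slope: (J(theta + eps) - J(theta - eps)) / (2 eps)
        grad[i] = (cost(theta + bump) - cost(theta - bump)) / (2 * eps)
    return grad
```

After confirming it agrees with the backprop gradient (e.g. via np.allclose), disable the check, since it is far too slow to run every iteration.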

Random Initialization  

Problem: if we were to set all elements of our theta matrices to 0, or to any single same value, all activation units would have the same value and the derivative of each unit would be the same as well; the network would effectively be limited to a single feature.

Example , given our sample neural network:

[image: neural-network]

We would have a 5 × (4 + 1) Theta1 matrix for layer 1. Suppose each element has the value 1. This would mean

a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3 + Θ_14^(1) x_4)

a_2^(2) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3 + Θ_24^(1) x_4)

If all the thetas above are the same, then a_1^(2) = a_2^(2) = … = a_5^(2), which also means δ_1^(2) = δ_2^(2) = … = δ_5^(2). (Why? See below.)

δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))

g'(z^(2)) = a^(2) .* (1 − a^(2)) –> if each element of a^(2) is the same, then each element of the g'(z^(2)) vector is the same. We know each element of Θ^(2) has the same value, so (Θ^(2))^T δ^(3) produces a vector of identical values. (It is like [2;2;2;2;2] .* [3;3;3;3;3].)

This means that for each column i, the gradient of Θ_1i^(1) = Θ_2i^(1) = … = Θ_5i^(1); every row of Θ^(1) receives the same update. After one iteration of gradient descent, all rows of Θ^(1) are still identical, so we again have a_1^(2) = a_2^(2) = … = a_5^(2) after activating a^(2). Note this does not mean the value of a^(2) after one iteration is the same value as before the update; the example just illustrates that each element of a^(2) is still the same, though they could differ from before the first iteration. (Example: before one iteration, a^(2) = [2;2;2;2;2]; after one iteration updating Θ^(1), a^(2) becomes [4;4;4;4;4].)

Therefore we need to perform random initialization of the theta matrices by randomly choosing values in [−ε, ε].
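A small numpy experiment illustrating the symmetry problem: with every weight set to the same value, the hidden activations stay identical to one another even after gradient updates (the learning rate and input values below are arbitrary, names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([0.5, -1.0, 2.0, 0.3]), 1.0
Theta1 = np.ones((5, 5))  # every weight identical
Theta2 = np.ones((1, 6))

for step in range(2):     # two gradient-descent steps
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
    a3 = sigmoid(Theta2 @ a2)
    d3 = a3 - y
    d2 = (Theta2.T @ d3) * a2 * (1 - a2)
    Theta2 -= 0.5 * np.outer(d3, a2)
    Theta1 -= 0.5 * np.outer(d2[1:], a1)
    # All five hidden units still share a single value
    print(np.allclose(a2[1:], a2[1]))  # prints True twice
```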

After-thoughts about the above explanation, after thinking it through a few times

The above example is based on the assumption that we only have one classifier. Suppose we have multiple classifiers (say 4) at the output layer; then not every element of δ^(3) will be the same. Consider the case y^(i) = [1;0;0;0] and a^(3) = [0;0;0;0]; then δ^(3) = [0;0;0;0] − [1;0;0;0] = [−1;0;0;0]. All this while we had been dealing with every element of the activation nodes being the same, so what happens if δ^(3) is now [−1;0;0;0]? Will my interpretation still hold? This had me thinking for a while. It turns out it still holds even with K classifiers (K ≥ 3). a_1^(2) is the sum of all the weighted inputs x (the 4 green nodes combine into a_1^(2)), and the same goes for the other activation nodes in layer 2. Likewise δ_1^(2) is the sum of the weighted “errors” in δ^(3). Imagine we have 4 red nodes at the output layer and the vector representing H is [−1;0;0;0]. Since the weights/thetas all have the same value, then

δ_1^(2) = (W_1 H(1) + W_2 H(2) + W_3 H(3) + W_4 H(4)) * g'(z_1^(2))

⋮

δ_5^(2) = (W_1 H(1) + W_2 H(2) + W_3 H(3) + W_4 H(4)) * g'(z_5^(2))

which means δ_1^(2) = δ_2^(2) = δ_3^(2) = δ_4^(2) = δ_5^(2), because W_1 = W_2 = W_3 = W_4 and z_1^(2) = … = z_5^(2).

Therefore the initial explanation for K = 1 should still hold for K ≥ 3.

Random method

1. Set initial_theta = rand(m, n) –> each element will be a value between 0 and 1 via the rand function.

2. initial_theta * 2ε − ε

Therefore each element in initial_theta will be in [−ε, ε]. (0 * 2ε − ε gives the lower bound −ε; 1 * 2ε − ε gives the upper bound ε.)
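The same recipe in numpy (0.12 is just an example ε):

```python
import numpy as np

def rand_init(rows, cols, eps=0.12):
    # rand gives values in [0, 1); scale by 2*eps and shift into [-eps, eps)
    return np.random.rand(rows, cols) * 2 * eps - eps

Theta1 = rand_init(5, 5)  # layer-1 weights for the sample network
Theta2 = rand_init(1, 6)
```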

Putting it all together

Training a neural network

1. Select a network architecture (basically choosing the number of hidden layers, 1 being the default, with some number of hidden units in each layer; usually the more the better).

2. Randomly initialize weights

3. Write code to compute cost function J(Θ)

4. For i = 1 to m examples:

Perform forward propagation for x^(i)

Perform backpropagation to compute the deltas (δ) for each layer

Accumulate the deltas using Δ

5. Compute the partial derivative terms

6. Use gradient checking to compare the partial derivatives from backprop against a numerical estimate of the gradient of the cost function.

7. Disable the gradient checking code.

8. Use gradient descent or an advanced optimization method with backprop to minimize J as a function of the thetas.