Cost Function
In neural networks we have multiple hypotheses at the output layer to handle multiclass classification problems, so we need to adjust the cost function to sum up the cost of each hypothesis. Also, since we now have more than one layer of units and thus more than one set of thetas, we need to adjust the regularization term as well.
Previously (regularized logistic regression) we had
J(Θ) = -1/m [ Σ_{i=1}^{m} y^(i) log(h_Θ(x^(i))) + (1 - y^(i)) log(1 - h_Θ(x^(i))) ] + λ/2m Σ_{j=1}^{n} θ_j²
Now we have
J(Θ) = -1/m [ Σ_{i=1}^{m} Σ_{k=1}^{K} y_k^(i) log((h_Θ(x^(i)))_k) + (1 - y_k^(i)) log(1 - (h_Θ(x^(i)))_k) ] + λ/2m Σ_{l=1}^{L-1} Σ_{i=1}^{S_l} Σ_{j=1}^{S_{l+1}} (Θ_ji^(l))²
NOTE: The cost function of a neural network is non-convex.
Interpretation of cost function
K denotes the number of classes. Previously h_Θ(x) output a single real value; now, since we have K output units, h_Θ(x) is a K-dimensional vector and (h_Θ(x))_k denotes the kth output. So for each of the m examples, we have K (K >= 3) real values to sum instead of one, which is why the inner sum Σ_{k=1}^{K} was added to the cost function.
L denotes the total number of layers (input + hidden + output). S_l denotes the number of neurons/units (excluding the bias unit) in layer l. For example, for the neural network pictured below, we have S_1 = 4, S_2 = 5, S_3 = 1 and L = 3.
As for the regularization term, our thetas are now two-dimensional matrices for layers 1 and 2. Given the picture above, Theta1 is associated with layer 1 and has dimension 5 x (4+1), while Theta2 is a 1 x (5+1) matrix; we then add every theta² in the two matrices (simply summing over every i, j and l).
The number of rows in the theta matrix is determined by the number of units/neurons in layer l+1 (that is, S_{l+1}), and the current row is indexed by j.
The number of columns in the theta matrix is determined by the number of units/neurons in layer l (that is, S_l), and the current column is indexed by i.
Recall that if layer l has s neurons and the next layer l+1 has n neurons, then the theta matrix associated with layer l has dimension n x (s + 1). Therefore the row index j loops from j = 1 to S_{l+1}.
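Putting the double sum and the regularization term together, the cost function above can be sketched in code. This is a minimal sketch assuming a 3-layer sigmoid network like the 4-5-1 one in the picture; the function and variable names (`nn_cost`, `lam`, etc.) are my own, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized cost J(Theta) for a 3-layer network.
    Theta1: S2 x (S1+1), Theta2: K x (S2+1), X: m x S1, Y: m x K."""
    m = X.shape[0]
    # Forward propagation, vectorized over all m examples
    A1 = np.hstack([np.ones((m, 1)), X])          # add bias column
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)                    # m x K hypothesis values
    # Double sum over examples i and classes k
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: every theta squared, excluding the bias columns
    reg = np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2)
    return J + lam / (2 * m) * reg
```

With lam = 0 this reduces to the unregularized double sum; the bias columns are never regularized.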
Backpropagation Algorithm
Backpropagation is a method used to compute the gradient of the cost function. The gradient comes from the partial derivative terms of the cost function and is fed into gradient descent or an advanced optimization method. As with linear/logistic regression, the goal is to minimize the cost function J(Θ).
To understand this method, we first need to understand forward propagation. It is the method used to “activate” the units/neurons in the hidden and output layers. Given the same neural network as in the picture above, we have L = 3 and activations a1, a2 and a3, where:
a1 = x plus the bias unit (a1 = [1; x]), which will be a (4+1)-dimensional vector
z2 = Θ1 * a1
a2 = g(z2), which will be a 5-dimensional vector
then a2 = a2 plus the bias unit (a2 = [1; a2]) to become a 6-dimensional vector
z3 = Θ2 * a2
a3 = g(z3) = a real number
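The steps above translate almost line for line into NumPy. The weights and the input vector below are made-up values for the pictured 4-5-1 network, and `sigmoid` plays the role of g:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for the pictured 4-5-1 network:
# Theta1 is 5 x (4+1), Theta2 is 1 x (5+1)
rng = np.random.default_rng(42)
Theta1 = rng.standard_normal((5, 5)) * 0.1
Theta2 = rng.standard_normal((1, 6)) * 0.1

x = np.array([0.5, -1.2, 0.3, 2.0])   # one training example, 4 features

a1 = np.concatenate([[1.0], x])       # add bias unit -> (4+1)-dim vector
z2 = Theta1 @ a1
a2 = sigmoid(z2)                      # 5-dim vector
a2 = np.concatenate([[1.0], a2])      # a2 = [1; a2] -> 6-dim vector
z3 = Theta2 @ a2
a3 = sigmoid(z3)                      # the hypothesis, a single real value
```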
For backpropagation, we introduce a new term δ (delta), where δ_j^(l) denotes the “error” of unit j in layer l.
Example: Given our neural network with L=3, we will start to compute from the output layer.
δ_1^(3) = a_1^(3) - y_1, which computes the error for node 1 in layer 3. We can also vectorize this as δ^(3) = a^(3) - y.
Then:
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
Both (Θ^(2))^T δ^(3) and g'(z^(2)) (computed from the bias-included a^(2), see below) are 6 x 1 vectors here, and .* is an element-wise product.
So to generalize, we have δ^(l) = (Θ^(l))^T δ^(l+1) .* g'(z^(l))
g'(z^(2)) is the derivative of g(z^(2)), which can be computed as
g'(z^(2)) = a^(2) .* (1 - a^(2))
Note: the 1 here refers to a vector of ones.
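This identity can be checked numerically by comparing it against a two-sided difference quotient; the values of z here are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
a = sigmoid(z)
analytic = a * (1 - a)                 # g'(z) = a .* (1 - a)

# Two-sided numerical derivative of g at the same points;
# the two results agree to many decimal places
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```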
Also, there is no error term for layer 1, so δ^(1) does not exist: layer 1 is the input layer and uses the actual data, which contains no error, whereas the later layers are computed from previous layers and may contain error. (This is what I think, might be wrong 🙂 )
We see that we start computing the deltas from the output layer and work backwards, hence the name backpropagation.
After computing all the δ terms, we need to “accumulate” them. To do this we introduce Δ_ij^(l), a set of matrices initialized to all zeros. Δ is the capital of the Greek letter δ.
Then apply the update rule:
Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1)
To summarize
Set Δ_ij^(l) = 0 (for all l, i and j)
For i = 1 to m examples
1. Set a^(1) = x^(i)
2. Compute the forward activations a^(l) for l = 2 … L
3. Compute δ^(l) for l = L, L-1, … 2
4. Δ^(l) = Δ^(l) + δ^(l+1) (a^(l))^T –> vectorized version of the update rule
Then after looping over all examples:
D_ij^(l) = 1/m * (Δ_ij^(l) + λ Θ_ij^(l)) if j != 0
D_ij^(l) = 1/m * Δ_ij^(l) if j = 0
D_ij^(l) is the partial derivative of J(Θ) with respect to Θ_ij^(l), i.e. the gradient for Θ_ij^(l).
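The whole procedure (zero the Δ matrices, loop over the examples doing forward then backward propagation, accumulate, then average and regularize) can be sketched as below, assuming a 3-layer sigmoid network; the function name and variable shapes are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """Gradient matrices D1, D2 for a 3-layer network, one example at a time."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)     # Delta^(l) has the same shape as Theta^(l)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # forward propagation for example i
        a1 = np.concatenate([[1.0], X[i]])
        a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])
        a3 = sigmoid(Theta2 @ a2)
        # output-layer error, then hidden-layer error
        d3 = a3 - Y[i]
        d2 = (Theta2.T @ d3) * (a2 * (1 - a2))
        d2 = d2[1:]                    # drop the bias unit's "error"
        # accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # average over m, regularizing everything except the bias column
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```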
My own interpretation (might be wrong 🙂 )
It seems we only have Δ matrices for layers 1, 2 … L-1.
The number of Δ matrices equals the number of Θ matrices, and Δ^(l) has the same dimensions as Θ^(l). We forward propagate to compute a_j^(l) for all l and j using some initial thetas for all training examples, then backward-compute δ for all l and j as well and accumulate the δ terms. Given the sample picture, we should have 2 Δ matrices, Δ^(1) and Δ^(2), and each accumulates m per-example contributions.
The accumulation is Δ^(1) = Δ^(1) from the 1st example + … + Δ^(1) from the mth example,
and likewise Δ^(2) = Δ^(2) from the 1st example + … + Δ^(2) from the mth example.
Therefore, we need to divide by m to get the average.
Gradient Checking
Due to the subtle bugs that can occur when implementing backpropagation, we use gradient checking to verify that our computed derivatives (D_ij^(l)) are correct. The way we do it:
1. First choose an epsilon (ε) which is very small, say 10^-4.
2. For the particular theta (Θ) whose derivative backpropagation is computing, choose the 2 points Θ + ε and Θ - ε on the x axis. The slope between these 2 points is (J(Θ + ε) - J(Θ - ε)) / 2ε –> change in y / change in x.
3. If the gradient calculated for Θ_ij^(l) is approximately equal to (J(Θ + ε) - J(Θ - ε)) / 2ε, then we know backpropagation is implemented correctly.
4. Repeat 2 and 3 for all thetas.
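A sketch of steps 1 to 4, assuming the thetas are unrolled into one long vector so each one can be perturbed in turn; `cost` here is a bare unregularized 3-layer cost, and the helper names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta_vec, shapes, X, Y):
    """Unrolled-cost helper: theta_vec holds Theta1 and Theta2 flattened."""
    (r1, c1), (r2, c2) = shapes
    Theta1 = theta_vec[: r1 * c1].reshape(r1, c1)
    Theta2 = theta_vec[r1 * c1 :].reshape(r2, c2)
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    return -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

def numerical_gradient(theta_vec, shapes, X, Y, eps=1e-4):
    """Two-sided slope (J(theta + eps) - J(theta - eps)) / 2*eps per theta."""
    grad = np.zeros_like(theta_vec)
    for i in range(theta_vec.size):
        bump = np.zeros_like(theta_vec)
        bump[i] = eps
        grad[i] = (cost(theta_vec + bump, shapes, X, Y)
                   - cost(theta_vec - bump, shapes, X, Y)) / (2 * eps)
    return grad
```

Each entry of the result is compared against the corresponding D_ij^(l) from backpropagation; this check is slow, so it is only run once to validate the implementation.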
Random Initialization
Problem: if we were to set all elements of our theta matrices to 0, or to the same value such as 1, all activation units would have the same value and the derivative for each unit would be the same as well, so the network would ultimately be limited to effectively one feature.
Example, given our sample neural network:
We would have a 5 x (4+1) Theta1 matrix for layer 1. Suppose each element has the value 1. This would mean
a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3 + Θ_14^(1) x_4)
a_2^(2) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3 + Θ_24^(1) x_4)
If all the thetas above are the same, then a_1^(2) = a_2^(2) = … = a_5^(2), which also means δ_1^(2) = δ_2^(2) = … = δ_5^(2). (Why? See below.)
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
g'(z^(2)) = a^(2) .* (1 - a^(2)) –> if each element of a^(2) is the same, then each element of the g'(z^(2)) vector will be the same. We know each element of Θ^(2) is the same value, so (Θ^(2))^T δ^(3) produces a vector of identical values. (It is like [2;2;2;2;2] .* [3;3;3;3;3].)
This means the gradients of Θ_10^(1), Θ_20^(1), … are all equal, for all i and j of Θ^(1). After one iteration of gradient descent, all thetas in Θ^(1) will still be identical, so we again get a_1^(2) = a_2^(2) = … = a_5^(2) after activating a^(2). Please note this does not mean that the value of a^(2) after one iteration is the same as before the update rule. This example just illustrates that each element of a^(2) is still the same as the others, though they may differ from before the iteration. (Example: before one iteration, a^(2) = [2;2;2;2;2]; after one iteration and updating Θ^(1), a^(2) becomes [4;4;4;4;4].)
Therefore we need to perform random initialization of the theta matrices by randomly choosing values in [-ε, ε].
After-thoughts about the above explanation, after thinking it through a few times
The above example is based on the assumption that we only have one classifier. Suppose we have multiple classifiers (say 4) at the output layer; then not every element of δ^(3) will be the same. Consider the case y^(i) = [1;0;0;0] and a^(3) = [0;0;0;0]; then δ^(3) = [0;0;0;0] - [1;0;0;0] = [-1;0;0;0]. All this while we had been dealing with every element of the activation nodes being the same, so what happens if our δ^(3) is now [-1;0;0;0]? Does my own interpretation still hold? This made me think deeper for a while. It turns out it still holds even with K classifiers (K >= 3). a_1^(2) is the sum of all the weighted inputs in a^(1) (the 4 green nodes combine to become a_1^(2)), and the same goes for the other activation nodes in layer 2. Likewise δ_1^(2) is the sum of the weighted “errors” in δ^(3). Imagine we have 4 red nodes at the output layer and the error vector is H = [-1;0;0;0]. Since the weights/thetas all have the same value, then
δ_1^(2) = (W1*H(1) + W2*H(2) + W3*H(3) + W4*H(4)) * g'(z_1^(2))
…
δ_5^(2) = (W1*H(1) + W2*H(2) + W3*H(3) + W4*H(4)) * g'(z_5^(2))
which means δ_1^(2) = δ_2^(2) = δ_3^(2) = δ_4^(2) = δ_5^(2), because W1 = W2 = W3 = W4 and z_1^(2) = … = z_5^(2).
Therefore the initial explanation for K = 1 should still hold for K >= 3.
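A small numerical experiment seems to confirm this: take a 4-5-4 network (K = 4 outputs), set every weight to the same constant, run a few gradient steps on a one-hot label, and the five hidden units stay identical throughout. All the numbers here are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4-5-4 network with every weight set to the same constant
Theta1 = np.full((5, 5), 0.5)
Theta2 = np.full((4, 6), 0.5)

x = np.array([0.2, -0.4, 1.0, 0.7])
y = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot label, K = 4

for _ in range(3):                    # a few gradient steps
    a1 = np.concatenate([[1.0], x])
    a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])
    a3 = sigmoid(Theta2 @ a2)
    d3 = a3 - y                       # elements of d3 differ...
    d2 = (Theta2.T @ d3)[1:] * (a2 * (1 - a2))[1:]
    Theta2 -= 0.1 * np.outer(d3, a2)
    Theta1 -= 0.1 * np.outer(d2, a1)

# ...but every row of Theta1 stays identical, so every hidden unit
# still computes exactly the same value
hidden = sigmoid(Theta1 @ np.concatenate([[1.0], x]))
```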
Random method
1. Set initial_theta = rand(m, n) –> each element will be a value between 0 and 1 via the rand function
2. initial_theta = initial_theta * 2ε - ε
Therefore each element in initial_theta will lie in [-ε, ε]. (0 * 2ε - ε gives the lower bound -ε; 1 * 2ε - ε gives the upper bound ε.)
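In NumPy, `np.random.rand` behaves like the rand above, returning values in [0, 1); the value of ε here is a made-up choice:

```python
import numpy as np

eps_init = 0.12                  # hypothetical choice of epsilon
rows, cols = 5, 5                # dimensions of Theta1 for the 4-5-1 network
# rand gives uniform values in [0, 1); scale and shift into [-eps, eps)
initial_Theta1 = np.random.rand(rows, cols) * 2 * eps_init - eps_init
```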
Putting it all together
Training a neural network
1. Select a network architecture (basically, choose the number of hidden layers, with 1 as the default, and the number of hidden units in each layer; usually the more the better)
2. Randomly initialize weights
3. Write code to compute cost function J(Θ)
4. For i = 1 to m examples
Perform forward propagation for x^(i)
Perform backpropagation to compute the deltas for each layer
Accumulate the deltas into Δ
5. Compute the partial derivative terms
6. Use gradient checking to compare the partial derivatives from backprop against a numerical estimate of the gradient of the cost function
7. Disable the gradient checking code
8. Use gradient descent or an advanced optimization method with backprop to minimize J as a function of the thetas
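The steps above can be strung together into one end-to-end sketch. This is a toy run, learning the OR function with a 2-4-1 network and plain gradient descent; the data, learning rate, and iteration count are all made up, and the gradient-checking step is omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(Theta1, Theta2, X, Y, lam):
    """Steps 3-5: forward prop, backprop, averaged gradients (vectorized)."""
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    D3 = H - Y                                       # output-layer errors, m x K
    D2 = (D3 @ Theta2)[:, 1:] * (A2 * (1 - A2))[:, 1:]
    G2 = D3.T @ A2 / m
    G1 = D2.T @ A1 / m
    G2[:, 1:] += (lam / m) * Theta2[:, 1:]
    G1[:, 1:] += (lam / m) * Theta1[:, 1:]
    return J, G1, G2

# Steps 1-2: pick a 2-4-1 architecture, randomly initialize in [-eps, eps]
rng = np.random.default_rng(0)
eps = 0.12
Theta1 = rng.random((4, 3)) * 2 * eps - eps
Theta2 = rng.random((1, 5)) * 2 * eps - eps

# Toy training set: the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)

# Step 8: minimize J with plain gradient descent
for _ in range(5000):
    J, G1, G2 = cost_and_grad(Theta1, Theta2, X, Y, lam=0.0)
    Theta1 -= 1.0 * G1
    Theta2 -= 1.0 * G2
```

An advanced optimizer (for example scipy.optimize.minimize, fed the same cost and gradient with the thetas unrolled into a single vector) would replace the hand-written descent loop.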