Chapter 6: Neural Networks (Representation)

General Idea

If our feature set is large, computing the hypothesis becomes very expensive. If we shrink the feature set, we can no longer learn a non-linear hypothesis. Many machine learning problems involve a large feature set whose examples lie in different regions of the input space, where a linear hypothesis will not work well. So we need a large feature set that we can compute efficiently, and neural networks let us do exactly that in order to derive a non-linear hypothesis.

Growth of the Number of Features

Suppose we have n = 100 features, x_1 … x_100. If we include all polynomial terms up to degree 2 (x_i · x_j with 1 ≤ i ≤ j ≤ 100), we get roughly 5000 features. (This is the number of ways to choose 2 items from 100 with repetition allowed: C(101, 2) = 5050.)

If we include polynomial terms up to degree 3 (x_a · x_i · x_j), we get 171,700 features (choosing 3 from 100 with repetition: C(102, 3)).

The number of features grows very quickly: roughly O(n²) for quadratic terms and O(n³) for cubic terms. Imagine a computer vision problem where each training example is just a 50 pixel × 50 pixel image. That already gives 2500 features (each pixel is one feature), and a degree-2 polynomial hypothesis results in roughly 2500²/2 ≈ 3.12 million features.
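
As a quick sanity check on these counts, here is a minimal sketch (the helper num_degree_d_terms is ours; the formula C(n + d − 1, d) counts the distinct degree-d monomials in n variables):

    from math import comb

    def num_degree_d_terms(n, d):
        # number of distinct degree-d monomials in n variables,
        # i.e. multisets of size d drawn from n items: C(n + d - 1, d)
        return comb(n + d - 1, d)

    print(num_degree_d_terms(100, 2))    # 5050    (roughly 5000)
    print(num_degree_d_terms(100, 3))    # 171700
    print(num_degree_d_terms(2500, 2))   # 3126250 (roughly 3.12 million)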

Neural Network Model

[Figure: neural network with 4 input units (green), 5 hidden units (blue), and 1 output unit (red)]

The green circles denote the input features x_1, x_2, x_3, x_4.

The blue and red circles can be treated as neurons.

More formally, each blue circle denotes an "activation" (a) computed from the green inputs, so we have 5 activations. Reading from top to bottom, they are a_1^{(2)}, a_2^{(2)}, a_3^{(2)}, a_4^{(2)}, a_5^{(2)}.

Note: we also have x_0 and a_0^{(2)}, known as the "bias" units. They always output the value 1, but we leave them out of the picture.

The superscript digit in parentheses denotes the layer number: the green circles form layer 1, the blue circles form layer 2, and so on. The subscript denotes the neuron (activation) number within that layer, so a_4^{(2)} denotes the activation of the fourth neuron in layer 2.

The red circle denotes the output, h_Θ(x).

a_1^{(2)} = g(Θ^{(1)}_{10} x_0 + Θ^{(1)}_{11} x_1 + Θ^{(1)}_{12} x_2 + Θ^{(1)}_{13} x_3 + Θ^{(1)}_{14} x_4)

a_2^{(2)} = g(Θ^{(1)}_{20} x_0 + Θ^{(1)}_{21} x_1 + Θ^{(1)}_{22} x_2 + Θ^{(1)}_{23} x_3 + Θ^{(1)}_{24} x_4)

a_3^{(2)} = g(Θ^{(1)}_{30} x_0 + Θ^{(1)}_{31} x_1 + Θ^{(1)}_{32} x_2 + Θ^{(1)}_{33} x_3 + Θ^{(1)}_{34} x_4)

a_4^{(2)} = g(Θ^{(1)}_{40} x_0 + Θ^{(1)}_{41} x_1 + Θ^{(1)}_{42} x_2 + Θ^{(1)}_{43} x_3 + Θ^{(1)}_{44} x_4)

a_5^{(2)} = g(Θ^{(1)}_{50} x_0 + Θ^{(1)}_{51} x_1 + Θ^{(1)}_{52} x_2 + Θ^{(1)}_{53} x_3 + Θ^{(1)}_{54} x_4)
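
Written out as code, the five sums above look like the following non-vectorized sketch (names are illustrative; Theta1 holds Θ^{(1)}, and x already includes the bias x_0 = 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def hidden_activations(Theta1, x):
        # Theta1: (5, 5) weight matrix, x: (5,) input vector [1, x1, x2, x3, x4]
        a2 = np.zeros(5)
        for i in range(5):              # one hidden neuron at a time
            z = 0.0
            for j in range(5):          # weighted sum over x0 .. x4
                z += Theta1[i, j] * x[j]
            a2[i] = sigmoid(z)          # a_{i+1}^{(2)}
        return a2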

Θ^{(1)} is a theta (weight) matrix. The superscript (1) denotes that this matrix is associated with layer 1, and by the same logic Θ^{(2)} denotes the theta matrix for layer 2.

If a layer L has j neurons and the next layer L + 1 has n neurons, then the theta matrix associated with layer L has dimension n × (j + 1); the extra column accounts for the bias unit.

So in order to compute the 5 activations above, Θ^{(1)} must be a 5 × (4 + 1) = 5 × 5 matrix.
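
Applying the same rule to the whole network in the figure, a short illustrative check (layer sizes exclude the bias units):

    layer_sizes = [4, 5, 1]   # input, hidden, output units (no bias)
    for L, (s_j, s_next) in enumerate(zip(layer_sizes, layer_sizes[1:]), start=1):
        print(f"Theta{L}: {s_next} x {s_j + 1}")
    # Theta1: 5 x 5
    # Theta2: 1 x 6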

Θ^{(1)}_{52}

The digit 5 denotes the row, and 2 denotes the 3rd column (remember, columns are indexed from 0). The superscript (1) refers to the first theta matrix. Altogether it is the entry in the 5th row and 3rd column of Θ^{(1)}.

Lastly:

h_Θ(x) = g(Θ^{(2)}_{10} a_0^{(2)} + Θ^{(2)}_{11} a_1^{(2)} + Θ^{(2)}_{12} a_2^{(2)} + Θ^{(2)}_{13} a_3^{(2)} + Θ^{(2)}_{14} a_4^{(2)} + Θ^{(2)}_{15} a_5^{(2)})

g is the sigmoid (logistic) function, g(z) = 1 / (1 + e^{-z}).

Vectorized Implementation

z^{(2)} = Θ^{(1)} x, which is a 5-dimensional vector, and a^{(2)} = g(z^{(2)}) with g applied element-wise. After adding the bias unit a_0^{(2)} = 1, a^{(2)} becomes 6-dimensional.

h_Θ(x) = g(Θ^{(2)} a^{(2)})
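
Putting both layers together, a minimal vectorized sketch of the forward pass (numpy; Theta1 (5 × 5) and Theta2 (1 × 6) are assumed to be already-learned weight matrices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(Theta1, Theta2, x):
        # x: raw input, shape (4,)
        a1 = np.concatenate(([1.0], x))              # prepend bias x0 = 1 -> (5,)
        z2 = Theta1 @ a1                             # (5,)
        a2 = np.concatenate(([1.0], sigmoid(z2)))    # prepend bias a0^(2) = 1 -> (6,)
        z3 = Theta2 @ a2                             # (1,)
        return sigmoid(z3)                           # h_theta(x)

The two matrix products replace the nested loops from the earlier sketch, which is both shorter and much faster.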

Multiclass Classification

Previously, with 4 classes, we had y ∈ {1, 2, 3, 4}.

Now we represent y as a 4-dimensional vector, where y is one of the following 4 vectors:

[1;0;0;0] , [0;1;0;0] , [0;0;1;0] or [0;0;0;1]

Therefore we want h_Θ(x) ≈ y.
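
A small sketch of this one-hot encoding (the helper one_hot is ours; labels are 1-indexed as in the notes):

    import numpy as np

    def one_hot(y, num_classes=4):
        # map a class label y in {1, ..., num_classes} to a unit vector
        v = np.zeros(num_classes)
        v[y - 1] = 1.0
        return v

    print(one_hot(3))   # [0. 0. 1. 0.]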
