Chapter 8: Deciding What to Try Next

Overview

Suppose our hypothesis H makes large errors in its predictions. What can we do? We can try to:

1. get more training examples

2. try smaller sets of features

3. try adding polynomial features

4. try decreasing/increasing the regularization parameter λ

…

But how do we know which method to use, so that the time we spend on it is worthwhile? We do not want to, say, spend 6 months collecting more training examples only to find out 6 months later that it did not help.

We need to learn how to evaluate a hypothesis and how to run machine learning diagnostics.

Diagnostic: a test you run to gain insight into what is or is not working with a learning algorithm, and guidance on how best to improve its performance.

 

Evaluating a Hypothesis

We know that if our hypothesis overfits the training set, it will fail to generalize to new examples. We could plot the hypothesis as a function of the input features, but if we have a lot of features, say 100, it is hard to plot and see what is going on.

A simpler method is to divide our dataset into 2 portions: a 70% training set and a 30% test set. It is best to randomize the data rather than rely on any logical ordering, so that we can simply take the first 70 percent as the training set and the rest as the test set. By holding out a test set, we avoid fitting the thetas so hard to the whole dataset that they fail to generalize to new examples. Suppose our model is a linear model (a degree-1 polynomial): we minimise Jtrain(θ), then use the learned thetas to predict the test set examples and compute the squared-error cost on them.
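A minimal sketch of the shuffle-and-split step, assuming the data is already loaded into NumPy arrays X and y (these names, and the helper itself, are illustrative):

```python
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    """Shuffle the examples, then take the first 70% as the training set
    and the remaining 30% as the test set."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    idx = rng.permutation(m)              # random ordering removes any logical ordering
    split = int(0.7 * m)
    return X[idx[:split]], y[idx[:split]], X[idx[split:]], y[idx[split:]]
```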

Procedure for linear regression:

1. Learn the parameters θ by minimising the cost function on the 70% training set

2. Use the θ learned in step 1 to compute the error on the 30% test set:

J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2
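A minimal sketch of this test-error computation, assuming θ has already been learned on the 70% training set (names are illustrative):

```python
import numpy as np

def j_test_linear(theta, X_test, y_test):
    """Squared-error test cost: (1 / (2 * m_test)) * sum((h - y)^2),
    with the linear hypothesis h_theta(x) = X @ theta."""
    m_test = X_test.shape[0]
    predictions = X_test @ theta          # h_theta(x_test) for every test example
    return np.sum((predictions - y_test) ** 2) / (2 * m_test)
```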

 

Procedure for logistic regression:

1. Learn the parameters θ by minimising the cost function on the 70% training set

2. Use the θ learned in step 1 to compute the error on the 30% test set, using the same cost function as in training:

J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)} \log\left( h_\theta(x_{test}^{(i)}) \right) + \left( 1 - y_{test}^{(i)} \right) \log\left( 1 - h_\theta(x_{test}^{(i)}) \right) \right]

3. An alternative to the cost function above is the misclassification error (0/1 misclassification error), where:

\text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \mathrm{err}\left( h_\theta(x_{test}^{(i)}),\ y_{test}^{(i)} \right)

\mathrm{err}\left( h_\theta(x), y \right) = 1 \quad \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1

\mathrm{err}\left( h_\theta(x), y \right) = 0 \quad \text{otherwise}

In other words, we add 1 to the test error for every misclassified example and 0 otherwise, then take the average.
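A minimal sketch of the 0/1 misclassification error, assuming a learned θ and a sigmoid hypothesis (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X_test, y_test):
    """0/1 test error: +1 for every thresholded prediction that
    disagrees with the label, then averaged over m_test examples."""
    h = sigmoid(X_test @ theta)
    predictions = (h >= 0.5).astype(int)  # predict 1 when h >= 0.5, else 0
    return np.mean(predictions != y_test)
```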

 

Model Selection and Training/Validation/Test Sets

How should we go about choosing a polynomial model? What degree d of polynomial (linear? cubic?) should we use?

Suppose we have a list of polynomial models ranging from degree 1 to, say, 10:

1. We use gradient descent or some other optimization method on each model (d in 1:10) to learn the thetas that minimise the cost function J on the training set. Thus we obtain Θ^(1), Θ^(2), Θ^(3), …, Θ^(10), where Θ^(d) is the set of thetas that best fits the training data using the degree-d polynomial model.

2. After that, logically, we choose the degree whose model has the lowest cost error on the test set compared to the other degrees:

\text{chosen } d = \arg\min_{d \in \{1,\dots,10\}} J_{test}\left( \Theta^{(d)} \right)

Let us say the chosen d is 6.

3. Then we ask: how well does the degree-6 model generalize? We could check the cost error Jtest(Θ^(6)). However, here is where the problem lies. Because we chose d to fit the test set, it is highly likely that Jtest(Θ^(6)) is an overly optimistic estimate of the generalization error. The problem can be understood by asking ourselves:

1. Why did we choose the degree-6 model?

Because it provided the best performance on the test set.

2. So does that mean that if we swapped in a different test set containing different examples, we would choose a different degree?

Quite possibly. We chose d = 6 partly because these particular test set examples happened to work best, by chance, at d = 6, so the degree-6 model might not work as well on new examples it has never seen before, and those are what we care about most. The logic is the same as the reason we split the data into training and test sets under “Evaluating a Hypothesis”. The difference here is that we have gone one step higher: we are choosing which degree of model to use, whereas in that section the model was already fixed.

3. To solve the overly optimistic generalization estimate, instead of splitting the data into 2 sets we split it into 3 sets: a training set, a cross-validation (CV) set and a test set (for example 60%/20%/20%). Thus we have Jtrain(Θ), Jcv(Θ) and Jtest(Θ). We then use step 1 to derive Θ^(1), Θ^(2), …, Θ^(10) by minimising Jtrain(Θ). Next, to choose which d to use, we evaluate Jcv(Θ) on the CV set instead:

\text{chosen } d = \arg\min_{d \in \{1,\dots,10\}} J_{cv}\left( \Theta^{(d)} \right) \quad \text{(say } d = 5\text{)}

4. Next we ask the same question: how well does the degree-5 model generalize? Now we can evaluate Jtest(Θ^(5)) on the test set, and it will be a fairer estimate, since the chosen d was fitted to the CV set but not to the test set.
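A minimal sketch of the whole selection loop for a single input feature, using np.polyfit as a stand-in for gradient descent (all names are illustrative):

```python
import numpy as np

def select_degree(x_train, y_train, x_cv, y_cv, x_test, y_test, max_degree=10):
    """Fit a polynomial of each degree on the training set, pick the degree
    with the lowest CV error, then report the test error of that choice."""
    def half_mse(coeffs, x, y):
        return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2.0

    models = [np.polyfit(x_train, y_train, d) for d in range(1, max_degree + 1)]
    cv_errors = [half_mse(c, x_cv, y_cv) for c in models]
    best_d = int(np.argmin(cv_errors)) + 1                      # degree chosen on the CV set
    test_error = half_mse(models[best_d - 1], x_test, y_test)   # fairer generalization estimate
    return best_d, test_error
```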

Diagnosing Bias vs Variance 

How do we determine whether our hypothesis is suffering from a high-bias (underfitting) problem or a high-variance (overfitting) problem?

Bias (underfitting):

Jtrain will be high

Jcv will be high

Jtrain ≈ Jcv

Variance (overfitting):

Jtrain will be low

Jcv will be high

Jcv >> Jtrain

>> means “much greater than”
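These rules can be expressed as a rough rule of thumb in code; the thresholds below are arbitrary placeholders, not universal values:

```python
def diagnose(j_train, j_cv, high=1.0, gap=1.0):
    """Crude bias/variance diagnosis from the training and CV errors.
    The thresholds 'high' and 'gap' are problem-specific, not universal."""
    if j_train > high and abs(j_cv - j_train) < gap:
        return "high bias (underfitting): J_train and J_cv are both high and close"
    if j_train <= high and j_cv - j_train > gap:
        return "high variance (overfitting): J_cv >> J_train"
    return "no single clear diagnosis"
```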

Regularization and Bias/Variance 

How do we choose the regularization parameter λ?

Suppose our hypothesis is a degree-3 polynomial and the cost functions are as follows:

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2

J_{train}(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2

J_{cv}(\Theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\Theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2

J_{test}(\Theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\Theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2
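A minimal sketch of these definitions for a linear hypothesis (function names are illustrative); note that only the training objective J(Θ) carries the λ term, while Jtrain, Jcv and Jtest do not:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Training objective J(Θ): squared error plus the λ penalty
    (theta[0], the bias term, is not penalized)."""
    m = X.shape[0]
    errors = X @ theta - y
    return np.sum(errors ** 2) / (2 * m) + lam * np.sum(theta[1:] ** 2) / (2 * m)

def evaluation_cost(theta, X, y):
    """J_train / J_cv / J_test: squared error only, with no λ term."""
    m = X.shape[0]
    errors = X @ theta - y
    return np.sum(errors ** 2) / (2 * m)
```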

And we have a range of λ values to try, roughly doubling each time:

1. lambda = 0

2. lambda = 0.01

3. lambda = 0.02

…

12. lambda = 10.24

1. For each choice of λ, minimise J(Θ); this gives 12 sets of thetas: Θ^(1), Θ^(2), …, Θ^(12).

2. For each set of thetas, compute Jcv(Θ) on the CV set; this gives 12 cost errors. Pick the thetas, say Θ^(2), that produce the lowest CV error.

3. Apply the chosen Θ^(2) vector to Jtest(Θ) on the test set to see how well this choice of θ generalizes. (Θ^(2) was learned by minimising the usual average cost error plus the regularization term, with λ = 0.01 in our example; Jcv and Jtest themselves contain no regularization term.)
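A minimal sketch of this λ-selection loop for linear regression, using the closed-form regularized normal equation as a stand-in for gradient descent (all names are illustrative):

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimise the regularized squared-error cost in closed form
    (regularized normal equation; the bias term is not penalized)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def eval_cost(theta, X, y):
    """Unregularized squared-error cost, used for J_cv and J_test."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

def select_lambda(X_train, y_train, X_cv, y_cv, X_test, y_test):
    lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]    # 0, 0.01, 0.02, ..., 10.24
    thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
    cv_errors = [eval_cost(t, X_cv, y_cv) for t in thetas]
    best = int(np.argmin(cv_errors))                        # λ chosen on the CV set
    return lambdas[best], eval_cost(thetas[best], X_test, y_test)
```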

When λ is low:

Jtrain will be low and Jcv will be high (overfitting / high variance)

When λ is high:

Jtrain will be high and Jcv will be high (underfitting / high bias)

[Figure: Features-and-polynom-degree-fix]

 

Learning Curves

A learning curve is a plot of the error as a function of the number of training examples m. To generate it, we train on increasing subsets of the training set and, for each subset size, compute Jtrain(Θ) on that subset and Jcv(Θ) on the full cross-validation set. From the curves of Jtrain(Θ) vs Jcv(Θ), we can tell whether our hypothesis is suffering from high bias, high variance, or a mixture of both.
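A minimal sketch of how the two curves can be generated for linear regression, with np.linalg.lstsq standing in for whatever training procedure is actually used (names are illustrative):

```python
import numpy as np

def learning_curves(X_train, y_train, X_cv, y_cv):
    """For m = 1..m_train, train on the first m examples, then record
    J_train on those m examples and J_cv on the full CV set."""
    def cost(theta, X, y):
        return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

    j_train, j_cv = [], []
    for m in range(1, X_train.shape[0] + 1):
        theta = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)[0]  # least-squares fit
        j_train.append(cost(theta, X_train[:m], y_train[:m]))
        j_cv.append(cost(theta, X_cv, y_cv))
    return j_train, j_cv
```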

Insights

If a learning algorithm is suffering from high bias, getting more training data will not help much.

If a learning algorithm is suffering from high variance, getting more training data will likely help.

[Figure: High-variance-high-bias]

 

 

Deciding what to do revisited

Below are the methods discussed at the start for improving the performance of the hypothesis, along with the problem each one fixes:

1. Get more training examples -> fixes high variance

2. Try smaller sets of features -> fixes high variance

3. Try getting additional features -> fixes high bias

4. Try adding polynomial features -> fixes high bias

5. Try decreasing lambda -> fixes high bias

6. Try increasing lambda -> fixes high variance

Neural Networks Diagnosis

“Small” neural networks with fewer parameters, meaning few hidden layers and few neurons per hidden layer, are prone to underfitting. A smaller neural network is like a simpler hypothesis model. The benefit is that it is computationally cheaper.

“Bigger” neural networks are prone to overfitting. However, since more hidden layers and more neurons per hidden layer generally improve the performance of the learned hypothesis, we can use regularization to address the overfitting problem. The trade-off is that a bigger network is more computationally intensive.

So how do we choose the number of hidden layers?

We can

1. split the dataset into train, CV and test sets

2. train the various types of neural networks (1, 2 or 3 hidden layers) on the training set

3. see which neural network architecture performs best on the CV set
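A minimal sketch of that comparison using scikit-learn's MLPClassifier; the use of scikit-learn, the 25-unit layer sizes and the α value are illustrative assumptions, not part of the original notes:

```python
from sklearn.neural_network import MLPClassifier

def pick_architecture(X_train, y_train, X_cv, y_cv):
    """Train one network per candidate architecture on the training set,
    then keep the one with the best accuracy on the CV set."""
    candidates = [(25,), (25, 25), (25, 25, 25)]       # 1, 2 and 3 hidden layers (illustrative sizes)
    best_size, best_score = None, -1.0
    for hidden in candidates:
        net = MLPClassifier(hidden_layer_sizes=hidden, alpha=1.0, max_iter=1000)
        net.fit(X_train, y_train)                      # step 2: fit on the training set
        score = net.score(X_cv, y_cv)                  # step 3: evaluate on the CV set
        if score > best_score:
            best_size, best_score = hidden, score
    return best_size, best_score
```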

 
