Chapter 8: Deciding What to Try Next

Overview

Suppose our hypothesis H makes large errors in its predictions. What can we do? We can try to:

1. get more training examples

2. try smaller sets of features

3. try adding polynomial features

4. try decreasing/increasing the regularization parameter λ

…

But how do we know which method to use, so that the time we spend on it is worthwhile? We do not want to, say, spend 6 months collecting more training examples only to find out 6 months later that it did not help.

We need to learn how to evaluate a hypothesis and how to run machine learning diagnostics.

Diagnostic: a test you run to gain insight into what is or is not working with a learning algorithm, and guidance on how best to improve its performance.

 

Evaluating a Hypothesis

We know that if our hypothesis overfits the training set, it will fail to generalize to new examples. We could plot the hypothesis as a function of the input features, but if we have a lot of features, say 100, it is hard to plot and see what is going on.

A simpler method is to divide our dataset into 2 portions: a 70% training set and a 30% test set. It is best to randomize the data rather than rely on any logical ordering, so that we can simply take the first 70 percent as the training set and the rest as the test set. By holding out a test set, we avoid fitting the thetas so hard to the whole dataset that they fail to generalize to new examples. Suppose our model is a linear model (a degree-1 polynomial): we minimise Jtrain(θ), then use the learned thetas to predict the test set examples and compute the squared-error cost on them.
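A minimal sketch of the shuffle-and-split step, assuming the data is already loaded into NumPy arrays X and y (these names, and the helper itself, are illustrative):

```python
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    """Shuffle the examples, then take the first 70% as the training set
    and the remaining 30% as the test set."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    idx = rng.permutation(m)              # random ordering removes any logical ordering
    split = int(0.7 * m)
    return X[idx[:split]], y[idx[:split]], X[idx[split:]], y[idx[split:]]
```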

Procedure for linear regression:

1. Learn the parameters θ by minimising the cost function on the 70% training set

2. Use the θ learned in step 1 to compute the error on the 30% test set:

J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2
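A minimal sketch of this test-error computation, assuming θ has already been learned on the 70% training set (names are illustrative):

```python
import numpy as np

def j_test_linear(theta, X_test, y_test):
    """Squared-error test cost: (1 / (2 * m_test)) * sum((h - y)^2),
    with the linear hypothesis h_theta(x) = X @ theta."""
    m_test = X_test.shape[0]
    predictions = X_test @ theta          # h_theta(x_test) for every test example
    return np.sum((predictions - y_test) ** 2) / (2 * m_test)
```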

 

Procedure for logistic regression:

1. Learn the parameters θ by minimising the cost function on the 70% training set

2. Use the θ learned in step 1 to compute the error on the 30% test set, using the same cost function as in training:

J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)} \log\left( h_\theta(x_{test}^{(i)}) \right) + \left( 1 - y_{test}^{(i)} \right) \log\left( 1 - h_\theta(x_{test}^{(i)}) \right) \right]

3. An alternative to the cost function above is the misclassification error (0/1 misclassification error), where:

\text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \mathrm{err}\left( h_\theta(x_{test}^{(i)}),\ y_{test}^{(i)} \right)

\mathrm{err}\left( h_\theta(x), y \right) = 1 \quad \text{if } h_\theta(x) \ge 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1

\mathrm{err}\left( h_\theta(x), y \right) = 0 \quad \text{otherwise}

In other words, we add 1 to the test error for every misclassified example and 0 otherwise, then take the average.
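A minimal sketch of the 0/1 misclassification error, assuming a learned θ and a sigmoid hypothesis (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X_test, y_test):
    """0/1 test error: +1 for every thresholded prediction that
    disagrees with the label, then averaged over m_test examples."""
    h = sigmoid(X_test @ theta)
    predictions = (h >= 0.5).astype(int)  # predict 1 when h >= 0.5, else 0
    return np.mean(predictions != y_test)
```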

 

Model Selection and Training/Validation/Test Sets

How should we go about choosing a polynomial model? What degree d of polynomial (linear? cubic?) should we use?

Suppose we have a list of polynomial models ranging from degree 1 to, say, 10:

1. We use gradient descent or some other optimization method on each model (d in 1:10) to learn the thetas that minimise the cost function J on the training set. Thus we obtain Θ^(1), Θ^(2), Θ^(3), …, Θ^(10), where Θ^(d) is the set of thetas that best fits the training data using the degree-d polynomial model.

2. After that, logically, we choose the degree whose model has the lowest cost error on the test set compared to the other degrees:

\text{chosen } d = \arg\min_{d \in \{1,\dots,10\}} J_{test}\left( \Theta^{(d)} \right)

Let us say the chosen d is 6.

3. Then we ask: how well does the degree-6 model generalize? We could check the cost error Jtest(Θ^(6)). However, here is where the problem lies. Because we chose d to fit the test set, it is highly likely that Jtest(Θ^(6)) is an overly optimistic estimate of the generalization error. The problem can be understood by asking ourselves:

1. Why did we choose the degree-6 model?

Because it provided the best performance on the test set.

2. So does that mean that if we swapped in a different test set containing different examples, we would choose a different degree?

Quite possibly. We chose d = 6 partly because these particular test set examples happened to work best, by chance, at d = 6, so the degree-6 model might not work as well on new examples it has never seen before, and those are what we care about most. The logic is the same as the reason we split the data into training and test sets under “Evaluating a Hypothesis”. The difference here is that we have gone one step higher: we are choosing which degree of model to use, whereas in that section the model was already fixed.

3. To solve the overly optimistic generalization estimate, instead of splitting the data into 2 sets we split it into 3 sets: a training set, a cross-validation (CV) set and a test set (for example 60%/20%/20%). Thus we have Jtrain(Θ), Jcv(Θ) and Jtest(Θ). We then use step 1 to derive Θ^(1), Θ^(2), …, Θ^(10) by minimising Jtrain(Θ). Next, to choose which d to use, we evaluate Jcv(Θ) on the CV set instead:

\text{chosen } d = \arg\min_{d \in \{1,\dots,10\}} J_{cv}\left( \Theta^{(d)} \right) \quad \text{(say } d = 5\text{)}

4. Next we ask the same question: how well does the degree-5 model generalize? Now we can evaluate Jtest(Θ^(5)) on the test set, and it will be a fairer estimate, since the chosen d was fitted to the CV set but not to the test set.
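A minimal sketch of the whole selection loop for a single input feature, using np.polyfit as a stand-in for gradient descent (all names are illustrative):

```python
import numpy as np

def select_degree(x_train, y_train, x_cv, y_cv, x_test, y_test, max_degree=10):
    """Fit a polynomial of each degree on the training set, pick the degree
    with the lowest CV error, then report the test error of that choice."""
    def half_mse(coeffs, x, y):
        return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2.0

    models = [np.polyfit(x_train, y_train, d) for d in range(1, max_degree + 1)]
    cv_errors = [half_mse(c, x_cv, y_cv) for c in models]
    best_d = int(np.argmin(cv_errors)) + 1                      # degree chosen on the CV set
    test_error = half_mse(models[best_d - 1], x_test, y_test)   # fairer generalization estimate
    return best_d, test_error
```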

Diagnosing Bias vs Variance 

How do we determine whether our hypothesis is suffering from a high-bias (underfitting) problem or a high-variance (overfitting) problem?

Bias (underfitting):

Jtrain will be high

Jcv will be high

Jtrain ≈ Jcv

Variance (overfitting):

Jtrain will be low

Jcv will be high

Jcv >> Jtrain

>> means “much greater than”
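These rules can be expressed as a rough rule of thumb in code; the thresholds below are arbitrary placeholders, not universal values:

```python
def diagnose(j_train, j_cv, high=1.0, gap=1.0):
    """Crude bias/variance diagnosis from the training and CV errors.
    The thresholds 'high' and 'gap' are problem-specific, not universal."""
    if j_train > high and abs(j_cv - j_train) < gap:
        return "high bias (underfitting): J_train and J_cv are both high and close"
    if j_train <= high and j_cv - j_train > gap:
        return "high variance (overfitting): J_cv >> J_train"
    return "no single clear diagnosis"
```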

Regularization and Bias/Variance 

How do we choose the regularization parameter λ?

Suppose our hypothesis is a degree-3 polynomial and the cost functions are as follows:

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2

J_{train}(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2

J_{cv}(\Theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\Theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2

J_{test}(\Theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\Theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2
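A minimal sketch of these definitions for a linear hypothesis (function names are illustrative); note that only the training objective J(Θ) carries the λ term, while Jtrain, Jcv and Jtest do not:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Training objective J(Θ): squared error plus the λ penalty
    (theta[0], the bias term, is not penalized)."""
    m = X.shape[0]
    errors = X @ theta - y
    return np.sum(errors ** 2) / (2 * m) + lam * np.sum(theta[1:] ** 2) / (2 * m)

def evaluation_cost(theta, X, y):
    """J_train / J_cv / J_test: squared error only, with no λ term."""
    m = X.shape[0]
    errors = X @ theta - y
    return np.sum(errors ** 2) / (2 * m)
```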

And we have a range of λ values to try, roughly doubling each time:

1. lambda = 0

2. lambda = 0.01

3. lambda = 0.02

…

12. lambda = 10.24

1. For each choice of λ, minimise J(Θ); this gives 12 sets of thetas: Θ^(1), Θ^(2), …, Θ^(12).

2. For each set of thetas, compute Jcv(Θ) on the CV set; this gives 12 cost errors. Pick the thetas, say Θ^(2), that produce the lowest CV error.

3. Apply the chosen Θ^(2) vector to Jtest(Θ) on the test set to see how well this choice of θ generalizes. (Θ^(2) was learned by minimising the usual average cost error plus the regularization term, with λ = 0.01 in our example; Jcv and Jtest themselves contain no regularization term.)
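A minimal sketch of this λ-selection loop for linear regression, using the closed-form regularized normal equation as a stand-in for gradient descent (all names are illustrative):

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimise the regularized squared-error cost in closed form
    (regularized normal equation; the bias term is not penalized)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def eval_cost(theta, X, y):
    """Unregularized squared-error cost, used for J_cv and J_test."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

def select_lambda(X_train, y_train, X_cv, y_cv, X_test, y_test):
    lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]    # 0, 0.01, 0.02, ..., 10.24
    thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
    cv_errors = [eval_cost(t, X_cv, y_cv) for t in thetas]
    best = int(np.argmin(cv_errors))                        # λ chosen on the CV set
    return lambdas[best], eval_cost(thetas[best], X_test, y_test)
```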

When λ is low:

Jtrain will be low and Jcv will be high (overfitting / high variance)

When λ is high:

Jtrain will be high and Jcv will be high (underfitting / high bias)

[Figure: Features-and-polynom-degree-fix]

 

Learning Curves

A learning curve is a plot of the error as a function of the number of training examples m. To generate it, we train on increasing subsets of the training set and, for each subset size, compute Jtrain(Θ) on that subset and Jcv(Θ) on the full cross-validation set. From the curves of Jtrain(Θ) vs Jcv(Θ), we can tell whether our hypothesis is suffering from high bias, high variance, or a mixture of both.
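A minimal sketch of how the two curves can be generated for linear regression, with np.linalg.lstsq standing in for whatever training procedure is actually used (names are illustrative):

```python
import numpy as np

def learning_curves(X_train, y_train, X_cv, y_cv):
    """For m = 1..m_train, train on the first m examples, then record
    J_train on those m examples and J_cv on the full CV set."""
    def cost(theta, X, y):
        return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

    j_train, j_cv = [], []
    for m in range(1, X_train.shape[0] + 1):
        theta = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)[0]  # least-squares fit
        j_train.append(cost(theta, X_train[:m], y_train[:m]))
        j_cv.append(cost(theta, X_cv, y_cv))
    return j_train, j_cv
```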

Insights

If a learning algorithm is suffering from high bias, getting more training data will not help much.

If a learning algorithm is suffering from high variance, getting more training data will likely help.

[Figure: High-variance-high-bias]

 

 

Deciding what to do revisited

Below are the methods discussed at the start for improving the performance of the hypothesis, along with the problem each one fixes:

1. Get more training examples -> fixes high variance

2. Try smaller sets of features -> fixes high variance

3. Try getting additional features -> fixes high bias

4. Try adding polynomial features -> fixes high bias

5. Try decreasing lambda -> fixes high bias

6. Try increasing lambda -> fixes high variance

Neural Networks Diagnosis

“Small” neural networks with fewer parameters, meaning few hidden layers and few neurons per hidden layer, are prone to underfitting. A smaller neural network is like a simpler hypothesis model. The benefit is that it is computationally cheaper.

“Bigger” neural networks are prone to overfitting. However, since more hidden layers and more neurons per hidden layer generally improve the performance of the learned hypothesis, we can use regularization to address the overfitting problem. The trade-off is that a bigger network is more computationally intensive.

So how do we choose the number of hidden layers?

We can

1. split the dataset into train, CV and test sets

2. train the various types of neural networks (1, 2 or 3 hidden layers) on the training set

3. see which neural network architecture performs best on the CV set
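A minimal sketch of that comparison using scikit-learn's MLPClassifier; the use of scikit-learn, the 25-unit layer sizes and the α value are illustrative assumptions, not part of the original notes:

```python
from sklearn.neural_network import MLPClassifier

def pick_architecture(X_train, y_train, X_cv, y_cv):
    """Train one network per candidate architecture on the training set,
    then keep the one with the best accuracy on the CV set."""
    candidates = [(25,), (25, 25), (25, 25, 25)]       # 1, 2 and 3 hidden layers (illustrative sizes)
    best_size, best_score = None, -1.0
    for hidden in candidates:
        net = MLPClassifier(hidden_layer_sizes=hidden, alpha=1.0, max_iter=1000)
        net.fit(X_train, y_train)                      # step 2: fit on the training set
        score = net.score(X_cv, y_cv)                  # step 3: evaluate on the CV set
        if score > best_score:
            best_size, best_score = hidden, score
    return best_size, best_score
```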

 
