Chapter 5: Regularization

The Problem of Overfitting

If we have too many features or a high-degree polynomial model for a linear/logistic regression problem, we might end up with a wiggly hypothesis curve that predicts every example in the training set perfectly but fails to generalise to new, unseen examples.

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0

[Figure: Good Fit vs Overfit (Linear Regression)]

J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log\left( H_\Theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - H_\Theta(x^{(i)}) \right) \right] \approx 0

[Figure: Good Fit vs Overfit (Logistic Regression)]
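For intuition, here is a minimal sketch (not from the original notes; the synthetic data and the use of numpy.polyfit are my own) showing how a high-degree polynomial drives the training cost toward zero while typically doing worse on held-out points:

# Minimal sketch: overfitting a high-degree polynomial on a small synthetic training set.
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy training points drawn from an underlying quadratic trend.
x_train = np.linspace(0, 1, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.1, size=10)

# Held-out points from the same trend.
x_test = np.linspace(0.05, 0.95, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + rng.normal(0, 0.1, size=50)

def cost(theta, x, y):
    """Unregularized J(theta): (1/2m) * average squared error."""
    pred = np.polyval(theta, x)
    return np.mean((pred - y) ** 2) / 2

for degree in (2, 9):
    theta = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    print(f"degree {degree}: train J = {cost(theta, x_train, y_train):.4f}, "
          f"test J = {cost(theta, x_test, y_test):.4f}")

# The degree-9 hypothesis drives the training cost toward 0, but its cost on
# the held-out points is typically much larger -- the wiggly, overfit case described above.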

We can address this by:

1. Reducing the number of features (either manually or with an automated selection method)

2. Regularization

Idea of Regularization

We try to reduce the values of the theta parameters in order to produce a simpler hypothesis that does not output a wiggly, wavy line.

Suppose our H_\Theta(x) = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_1^2 + \Theta_3 x_1^3 + \Theta_4 x_1^4

If we make \Theta_3 and \Theta_4 very small, then \Theta_3 x_1^3 + \Theta_4 x_1^4 \approx 0, which means our hypothesis effectively becomes

H_\Theta(x) = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_1^2

which is a simpler hypothesis.
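As a quick numeric sketch (the theta values below are hypothetical, chosen only for illustration), tiny \Theta_3 and \Theta_4 really do make the cubic and quartic terms negligible:

# Illustration with made-up numbers: tiny theta_3 and theta_4 contribute almost nothing.
import numpy as np

theta = np.array([1.0, 2.0, 0.5, 0.001, 0.0005])   # hypothetical theta_0 .. theta_4
x1 = 3.0

powers = np.array([1.0, x1, x1**2, x1**3, x1**4])   # x_0 = 1 (bias term)
full_h = theta @ powers                             # full 4th-degree hypothesis
quadratic_h = theta[:3] @ powers[:3]                # drop the shrunken terms

print(full_h, quadratic_h)   # 11.5675 vs 11.5 -> almost identical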

Cost Function

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2

\frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2 is the regularization term.

\lambda is the regularization parameter: it controls how heavily large theta values are penalised.

Suppose \lambda is 1000: if we do not keep the thetas small, J(\Theta) shoots up. Minimising the cost therefore keeps the thetas small while still fitting the training data.

Note that j starts from 1 instead of 0 because we do not regularize \Theta_0.

This regularized cost function lets us shrink all of the thetas, since we do not know in advance which features to remove. The objective is to make H_\Theta(x) simpler, resulting in a smoother, less wiggly curve.
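As a sketch of how this cost could be computed (the function name and example data are my own; it assumes X carries a leading column of ones and \Theta_0 is not regularized):

# Minimal sketch of the regularized linear regression cost J(theta) above.
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * sum((X@theta - y)^2) + (lam/(2m)) * sum(theta[1:]^2)."""
    m = len(y)
    errors = X @ theta - y
    data_term = (errors @ errors) / (2 * m)
    reg_term = lam * (theta[1:] @ theta[1:]) / (2 * m)   # theta_0 is not regularized
    return data_term + reg_term

# Example usage with made-up numbers; X has a leading column of ones.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.8])
print(regularized_cost(theta, X, y, lam=10.0))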

Why does regularization make H_\Theta(x) simpler?

I think:

Suppose our H_\Theta(x) = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_1^2 + \Theta_3 x_1^3 + \Theta_4 x_1^4

If all the thetas are smaller, H_\Theta(x) changes more slowly and by smaller amounts as x_1 changes, which prevents sharp bends in the curve. This behaviour should result in a smoother, less wavy line.

Regularized Linear Regression

Since we added the regularization term to the cost function, the partial derivative in the gradient descent update rule needs to be modified to reflect it.

\Theta_j := \Theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \Theta_j \right] \quad \text{for } j = 1, \dots, n

Note: we do not regularize \Theta_0, therefore j starts from 1 and \Theta_0 keeps its original, unregularized update.

The update above is equivalent to

\Theta_j := \Theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Since 1 - \alpha \frac{\lambda}{m} < 1, every iteration shrinks \Theta_j slightly towards 0.

The remaining term, \alpha \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, is exactly the same as in the unregularized update rule.
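A minimal sketch of this regularized update in code (the function name and example data are assumptions; X is expected to have a leading bias column so that \Theta_0 is left unregularized):

# One regularized gradient descent step for linear regression.
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """Update every theta_j using the rule above; theta_0 gets no lambda/m term."""
    m = len(y)
    errors = X @ theta - y                 # H_theta(x) - y for every example
    grad = (X.T @ errors) / m              # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not regularize theta_0
    return theta - alpha * (grad + reg)

# Example usage: run a number of iterations on tiny made-up data.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1, lam=1.0)
print(theta)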

Regularized Logistic Regression

The cost function of logistic regression with the regularization term becomes

J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log\left( H_\Theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - H_\Theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2

The gradient descent update rule becomes

\Theta_j := \Theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \Theta_j \right] \quad \text{for } j = 1, \dots, n

Note: although this looks identical to the linear regression update, H_\Theta(x^{(i)}) here denotes the sigmoid hypothesis, H_\Theta(x) = \frac{1}{1 + e^{-\Theta^T x}}.
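A minimal sketch of the regularized logistic regression cost and one update step (function names and example data are my own; the hypothesis is the sigmoid of \Theta^T x):

# Regularized logistic regression: cost and one gradient descent step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized cross-entropy cost J(theta) from the formula above."""
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg_term = lam * (theta[1:] @ theta[1:]) / (2 * m)    # theta_0 is not regularized
    return cross_entropy + reg_term

def logistic_step(theta, X, y, alpha, lam):
    """Same update rule as regularized linear regression, but with the sigmoid hypothesis."""
    m = len(y)
    errors = sigmoid(X @ theta) - y
    grad = (X.T @ errors) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                                          # do not regularize theta_0
    return theta - alpha * (grad + reg)

# Example usage on a tiny made-up binary dataset (leading column of ones in X).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = logistic_step(theta, X, y, alpha=0.1, lam=0.1)
print(theta, logistic_cost(theta, X, y, lam=0.1))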
