Chapter 5: Regularization

The Problem of Overfitting

If we have too many features or a high-degree polynomial model for a linear/logistic regression problem, we might end up with a wiggly hypothesis curve that predicts every example in the training set perfectly but fails to generalise to new, unseen examples.

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0

[Figure: Good Fit vs Overfit (Linear Regression)]

J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log\left( H_\Theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - H_\Theta(x^{(i)}) \right) \right] \approx 0

[Figure: Good Fit vs Overfit (Logistic Regression)]
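For intuition, here is a minimal sketch (not from the original notes; the synthetic data and the use of numpy.polyfit are my own) showing how a high-degree polynomial drives the training cost toward zero while typically doing worse on held-out points:

# Minimal sketch: overfitting a high-degree polynomial on a small synthetic training set.
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy training points drawn from an underlying quadratic trend.
x_train = np.linspace(0, 1, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.1, size=10)

# Held-out points from the same trend.
x_test = np.linspace(0.05, 0.95, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + rng.normal(0, 0.1, size=50)

def cost(theta, x, y):
    """Unregularized J(theta): (1/2m) * average squared error."""
    pred = np.polyval(theta, x)
    return np.mean((pred - y) ** 2) / 2

for degree in (2, 9):
    theta = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    print(f"degree {degree}: train J = {cost(theta, x_train, y_train):.4f}, "
          f"test J = {cost(theta, x_test, y_test):.4f}")

# The degree-9 hypothesis drives the training cost toward 0, but its cost on
# the held-out points is typically much larger -- the wiggly, overfit case described above.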

We can address this by:

1. Reducing the number of features (either manually or with an automated selection method)

2. Regularization

Idea of Regularization

We try to reduce the values of the theta parameters in order to produce a simpler hypothesis that does not output a wiggly, wavy line.

Suppose our H_\Theta(x) = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_1^2 + \Theta_3 x_1^3 + \Theta_4 x_1^4

If we make \Theta_3 and \Theta_4 very small, then \Theta_3 x_1^3 + \Theta_4 x_1^4 \approx 0, which means our hypothesis effectively becomes

H_\Theta(x) = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_1^2

which is a simpler hypothesis.
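As a quick numeric sketch (the theta values below are hypothetical, chosen only for illustration), tiny \Theta_3 and \Theta_4 really do make the cubic and quartic terms negligible:

# Illustration with made-up numbers: tiny theta_3 and theta_4 contribute almost nothing.
import numpy as np

theta = np.array([1.0, 2.0, 0.5, 0.001, 0.0005])   # hypothetical theta_0 .. theta_4
x1 = 3.0

powers = np.array([1.0, x1, x1**2, x1**3, x1**4])   # x_0 = 1 (bias term)
full_h = theta @ powers                             # full 4th-degree hypothesis
quadratic_h = theta[:3] @ powers[:3]                # drop the shrunken terms

print(full_h, quadratic_h)   # 11.5675 vs 11.5 -> almost identical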

Cost Function

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2

\frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2 is the regularization term.

\lambda is the regularization parameter: it controls how heavily large theta values are penalised.

Suppose \lambda is 1000: if we do not keep the thetas small, J(\Theta) shoots up. Minimising the cost therefore keeps the thetas small while still fitting the training data.

Note that j starts from 1 instead of 0 because we do not regularize \Theta_0.

This regularized cost function lets us shrink all of the thetas, since we do not know in advance which features to remove. The objective is to make H_\Theta(x) simpler, resulting in a smoother, less wiggly curve.
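As a sketch of how this cost could be computed (the function name and example data are my own; it assumes X carries a leading column of ones and \Theta_0 is not regularized):

# Minimal sketch of the regularized linear regression cost J(theta) above.
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * sum((X@theta - y)^2) + (lam/(2m)) * sum(theta[1:]^2)."""
    m = len(y)
    errors = X @ theta - y
    data_term = (errors @ errors) / (2 * m)
    reg_term = lam * (theta[1:] @ theta[1:]) / (2 * m)   # theta_0 is not regularized
    return data_term + reg_term

# Example usage with made-up numbers; X has a leading column of ones.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.8])
print(regularized_cost(theta, X, y, lam=10.0))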

Why does regularization make H_\Theta(x) simpler?

I think:

Suppose our H_\Theta(x) = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_1^2 + \Theta_3 x_1^3 + \Theta_4 x_1^4

If all the thetas are smaller, H_\Theta(x) changes more slowly and by smaller amounts as x_1 changes, which prevents sharp bends in the curve. This behaviour should result in a smoother, less wavy line.

Regularized Linear Regression

Since we added the regularization term to the cost function, the partial derivative in the gradient descent update rule needs to be modified to reflect it.

\Theta_j := \Theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \Theta_j \right] \quad \text{for } j = 1, \dots, n

Note: we do not regularize \Theta_0, therefore j starts from 1 and \Theta_0 keeps its original, unregularized update.

The update above is equivalent to

\Theta_j := \Theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Since 1 - \alpha \frac{\lambda}{m} < 1, every iteration shrinks \Theta_j slightly towards 0.

The remaining term, \alpha \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, is exactly the same as in the unregularized update rule.
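A minimal sketch of this regularized update in code (the function name and example data are assumptions; X is expected to have a leading bias column so that \Theta_0 is left unregularized):

# One regularized gradient descent step for linear regression.
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """Update every theta_j using the rule above; theta_0 gets no lambda/m term."""
    m = len(y)
    errors = X @ theta - y                 # H_theta(x) - y for every example
    grad = (X.T @ errors) / m              # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not regularize theta_0
    return theta - alpha * (grad + reg)

# Example usage: run a number of iterations on tiny made-up data.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1, lam=1.0)
print(theta)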

Regularized Logistic Regression

The cost function of logistic regression with the regularization term becomes

J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log\left( H_\Theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - H_\Theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2

The gradient descent update rule becomes

\Theta_j := \Theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( H_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \Theta_j \right] \quad \text{for } j = 1, \dots, n

Note: although this looks identical to the linear regression update, H_\Theta(x^{(i)}) here denotes the sigmoid hypothesis, H_\Theta(x) = \frac{1}{1 + e^{-\Theta^T x}}.
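A minimal sketch of the regularized logistic regression cost and one update step (function names and example data are my own; the hypothesis is the sigmoid of \Theta^T x):

# Regularized logistic regression: cost and one gradient descent step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized cross-entropy cost J(theta) from the formula above."""
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg_term = lam * (theta[1:] @ theta[1:]) / (2 * m)    # theta_0 is not regularized
    return cross_entropy + reg_term

def logistic_step(theta, X, y, alpha, lam):
    """Same update rule as regularized linear regression, but with the sigmoid hypothesis."""
    m = len(y)
    errors = sigmoid(X @ theta) - y
    grad = (X.T @ errors) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                                          # do not regularize theta_0
    return theta - alpha * (grad + reg)

# Example usage on a tiny made-up binary dataset (leading column of ones in X).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = logistic_step(theta, X, y, alpha=0.1, lam=0.1)
print(theta, logistic_cost(theta, X, y, lam=0.1))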
