Chapter 2: Linear Regression With One Variable

Model Representation 

Training Set — feeds into —> Learning Algorithm — produces —> Hypothesis HΘ

HΘ(X) is a mapping function which takes an input value X and produces an output value, which is the prediction.

X in this case (one variable scenario) represents a real number.

Θ (Theta) represents the parameters or weights of the Hypothesis function.
Θ in this case (one variable scenario) represents a 2-dimensional vector [Θ0; Θ1].
Θ0 acts like the constant (intercept) term of H in this case.

Therefore HΘ(X) = Θ0 + Θ1X, which is a real number.

HΘ(X) = Θ0 + Θ1X produces a straight line (linear function) fitted to our training data set.
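As a minimal sketch in Python (the function and variable names are my own, not from the course), the hypothesis is just the equation of a line:

```python
def h(theta0, theta1, x):
    """Hypothesis for one-variable linear regression: a straight line."""
    return theta0 + theta1 * x

# Example: with theta0 = 1 and theta1 = 2, the prediction for x = 3 is 7
print(h(1.0, 2.0, 3.0))  # 7.0
```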

What do we want, ideally?

We want an HΘ(X) which gives us the best possible straight-line fit to our data. It is the job of the learning algorithm to give us this function H.

How is the learning algorithm going to do that?

The training examples of past data are fixed. What can change are the values of Θ. Therefore, given a particular choice of Θ, we end up with a different hypothesis function. We first need a way to determine whether our choice of Θ is optimal or not, and the way to do that is via the cost function (J).

Cost Function

J(Θ0, Θ1) = 1/(2m) × ∑ᵢ₌₁ᵐ (HΘ(Xᵢ) – yᵢ)²

m denotes the number of rows in the training set
i is the index into the training set, denoting which row (like an array index)
yᵢ is the correct or right answer for that particular row
HΘ(Xᵢ) computes the predicted value determined by the chosen theta Θ

Interpretation of Cost Function J(Θ0, Θ1)

HΘ(Xᵢ) – yᵢ gives us the difference between the predicted value HΘ(Xᵢ) and the correct value yᵢ for one particular row i (for i = 1 to m) of the training set.

(HΘ(Xᵢ) – yᵢ)² squares the difference, which produces a real number >= 0.

The summation ∑ sums up all the squared differences (errors) from row 1 to m; let's say the total is S.

1/m of S gives the average of the m squared differences (1/m × S); let's say the average is V. This value is the mean squared error.

1/2 of V reduces it by half.

Therefore the cost function is computing half of the mean squared error.
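The computation above can be sketched directly in Python (names are my own); the function returns half the mean of the squared errors:

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1): half the mean squared error over the training set."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)

# Toy training set where y = x exactly
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
print(cost(0.0, 1.0, xs, ys))  # 0.0 -- this choice of theta fits perfectly
print(cost(0.0, 0.5, xs, ys))  # positive -- a worse choice costs more
```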

Idea behind cost function J(Θ0, Θ1)

If the cost function outputs 0 for a particular choice of Θ0 and Θ1, then we have found the perfect hypothesis, because HΘ(Xᵢ) – yᵢ is 0 for i = 1 to m, which means we correctly predicted the output values for all input values.

Suppose the cost function for one particular choice (A) of theta (say Θ0 = 10 and Θ1 = 2) outputs 100000, and for another particular choice (B) of theta (say Θ0 = 6 and Θ1 = 3) outputs 10. Then we can say that choice B is better than A because the cost is lower.

Therefore we want to minimize the cost function J(Θ0, Θ1): we want to find the values of Θ0 and Θ1 which produce the smallest possible cost. With that optimal choice of theta, the learning algorithm produces the H function which gives us the best possible straight-line fit to the data.

Convex Function 

Suppose we had only one parameter Θ1. If we plot J(Θ1) as a function of Θ1, which takes on values from negative to positive, J(Θ1) turns out to be a U-shaped function. So the best choice of Θ1, the one which minimizes the cost function, is the value at the lowest point of the U shape. Luckily the cost function J for linear regression produces a U shape, and there is only one global optimum. (If we had two thetas, then the graph would be a 3D bowl-like shape where the global optimum is the bottom of the bowl; contour graphs produce ellipses.)
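We can see that U shape numerically with a small sketch (my own illustration, fixing Θ0 = 0 and using a toy data set where y = x): sweeping Θ1 over a range and evaluating J should bottom out at Θ1 = 1.

```python
# Toy training set where y = x, so the best slope is theta1 = 1
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def J(theta1):
    """Cost as a function of theta1 alone (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

sweep = [round(t * 0.5, 1) for t in range(-4, 7)]  # theta1 from -2.0 to 3.0
costs = [J(t) for t in sweep]
best = sweep[costs.index(min(costs))]
print(best)  # 1.0 -- the bottom of the "U"
```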

Gradient Descent  

For the different possible choices of Θ0, Θ1, we end up with a different cost produced by the cost function (J) and a different HΘ(X) function which produces a different straight line. The shape of J(Θ0, Θ1) is like a bowl, and our goal is to minimize J(Θ0, Θ1). We can use gradient descent to help us arrive at the bottom of the “bowl”, which reveals the optimal values of Θ0, Θ1.

Outline of gradient descent method

Start with some Θ0, Θ1.
Keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) till we converge at the bottom of the “bowl”.

Initialize Θ0, Θ1 to some values, then

repeat until convergence {

Θj := Θj – α × [∂/∂Θj J(Θ0, Θ1)]    (for j = 0 and j = 1) –> simultaneous update for all theta

}
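A sketch of one update step in Python (names are my own; the gradient formulas used here are the derivative terms worked out later in these notes). The key detail is the simultaneous update: both partial derivatives are computed from the old (theta0, theta1) before either parameter is overwritten.

```python
def step(theta0, theta1, xs, ys, alpha):
    """One gradient-descent step with a simultaneous update of both thetas."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                             # d/d(theta0) of J
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # d/d(theta1) of J
    # Both new values are computed from the old thetas, then assigned together
    return theta0 - alpha * grad0, theta1 - alpha * grad1

t0, t1 = step(0.0, 0.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.1)
```

Updating theta0 in place and then reusing it while computing grad1 would be the classic bug this rule guards against.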

α denotes the learning rate (alpha)

[∂/∂Θj J(Θ0, Θ1)] denotes the partial derivative term of the cost function J(Θ0, Θ1)

The derivative term gives us the slope at a point on the bowl-shaped plot.

Intuition of gradient descent method

Suppose we had only one theta and our cost function is J(Θ1) = 1/(2m) × ∑ᵢ₌₁ᵐ (HΘ(Xᵢ) – yᵢ)², and we know this produces a U-shaped plot.

If the slope is positive, the gradient * alpha  will give us a positive value.

If  the slope is negative, the gradient * alpha will give us a negative value.

The further away from the global optimum, the larger the gradient, which means the steeper the slope.

Suppose we start at some value which corresponds to the upper right side of the U bowl. The slope will thus be positive and relatively steep, so the update Θ1 := Θ1 – (a large positive number) takes us one big step left towards the bottom and reduces the value of Θ1. As we approach the global optimum, the slope becomes less and less steep, which means the gradient gets lower and lower, which means the steps taken become smaller and smaller. When we reach the global optimum, the slope is flat and the gradient is 0, and thus Θ1 converges at that moment because Θ1 will not change after that: Θ1 = Θ1 – 0.

The same idea applies if we start at some value which corresponds to the upper left side of the U bowl. The only difference is that we get a negative slope, which means Θ1 – (the negative product of alpha and the gradient). This means Θ1 = Θ1 + a positive number, which takes us one step to the right.

Because the slope becomes smaller and smaller as we approach the global optimum, we can keep a constant alpha learning rate. When we are very far away from the global optimum, we take bigger steps. As we approach the bottom, we take smaller and smaller steps till we converge.

Partial Derivative Term

∂/∂Θ0 J(Θ0, Θ1) = 1/m × ∑ᵢ₌₁ᵐ (HΘ(Xᵢ) – yᵢ)

∂/∂Θ1 J(Θ0, Θ1) = 1/m × ∑ᵢ₌₁ᵐ [(HΘ(Xᵢ) – yᵢ) × Xᵢ]
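Plugging these two derivative terms into the update rule gives the full algorithm. A minimal sketch in Python (names, starting point, and toy data are my own):

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Minimise J(theta0, theta1) by repeated simultaneous updates."""
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # d/d(theta0) of J
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # d/d(theta1) of J
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying exactly on y = 2x + 1, so descent should find
# theta0 close to 1 and theta1 close to 2
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```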

To Conclude

The cost function computes the cost for a particular choice of theta.

After every iteration of gradient descent, we can take the thetas and plug them into the cost function to check whether the cost is lower than the previous cost. We should see the cost reducing after each iteration.
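That check can be sketched by recording J after every iteration and confirming it never increases (again my own illustration on a toy data set, not code from the course):

```python
def descend_with_history(xs, ys, alpha=0.1, iterations=100):
    """Run gradient descent and record the cost J after every iteration."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(
            sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
        )
    return history

hist = descend_with_history([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(all(a >= b for a, b in zip(hist, hist[1:])))  # True -- cost never increases
```

If the recorded cost ever goes up, that is the usual sign that the learning rate alpha is too large.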

When we converge in the end, we should have achieved our goal of identifying the Θ0 and Θ1 which minimise our cost function.

 

Last Notes 

Please leave some comments and point out my mistakes/misinterpretations/misunderstandings.

Reserved the right to be wrong 🙂
