Chapter 2: Linear Regression With One Variable

Model Representation 

Training Set — feeds into —> Learning Algorithm — produces —> Hypothesis HΘ

HΘ(X) is a mapping function which takes an input value X and produces an output value, which is the prediction.

X in this case (one variable scenario) represents a real number.

Θ (Theta) represents the parameters or weights of the Hypothesis function.
Θ in this case (one variable scenario) represents a 2-dimensional vector [Θ0; Θ1].
Θ0 acts like the constant (intercept) term of H in this case.

Therefore HΘ(X) = Θ0 + Θ1X, which is a real number.

HΘ(X) = Θ0 + Θ1X produces a straight line (linear function) fitted to our training data set.
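As a minimal sketch in Python (the function and variable names are my own, not from the course), the hypothesis is just the equation of a line:

```python
def h(theta0, theta1, x):
    """Hypothesis for one-variable linear regression: a straight line."""
    return theta0 + theta1 * x

# Example: with theta0 = 1 and theta1 = 2, the prediction for x = 3 is 7
print(h(1.0, 2.0, 3.0))  # 7.0
```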

What do we want, ideally?

We want an HΘ(X) which gives us the best possible straight-line fit to our data. It is the job of the learning algorithm to give us this function H.

How is the learning algorithm going to do that?

The training examples of past data are fixed. What can change are the values of Θ. Therefore, given a particular choice of Θ, we end up with a different hypothesis function. We first need a way to determine whether our choice of Θ is optimal or not, and the way to do that is via the cost function (J).

Cost Function

J(Θ0, Θ1) = 1/(2m) × ∑ᵢ₌₁ᵐ (HΘ(Xᵢ) – yᵢ)²

m denotes the number of rows in the training set
i is the index into the training set, denoting which row (like an array index)
yᵢ is the correct or right answer for that particular row
HΘ(Xᵢ) computes the predicted value determined by the chosen theta Θ

Interpretation of Cost Function J(Θ0, Θ1)

HΘ(Xᵢ) – yᵢ gives us the difference between the predicted value HΘ(Xᵢ) and the correct value yᵢ for one particular row i (for i = 1 to m) of the training set.

(HΘ(Xᵢ) – yᵢ)² squares the difference, which produces a real number >= 0.

The summation ∑ sums up all the squared differences (errors) from row 1 to m; let's say the total is S.

1/m of S gives the average of the m squared differences (1/m × S); let's say the average is V. This value is the mean squared error.

1/2 of V reduces it by half.

Therefore the cost function is computing half of the mean squared error.
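The computation above can be sketched directly in Python (names are my own); the function returns half the mean of the squared errors:

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1): half the mean squared error over the training set."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)

# Toy training set where y = x exactly
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
print(cost(0.0, 1.0, xs, ys))  # 0.0 -- this choice of theta fits perfectly
print(cost(0.0, 0.5, xs, ys))  # positive -- a worse choice costs more
```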

Idea behind cost function J(Θ0, Θ1)

If the cost function outputs 0 for a particular choice of Θ0 and Θ1, then we have found the perfect hypothesis, because HΘ(Xᵢ) – yᵢ is 0 for i = 1 to m, which means we correctly predicted the output values for all input values.

Suppose the cost function for one particular choice (A) of theta (say Θ0 = 10 and Θ1 = 2) outputs 100000, and for another particular choice (B) of theta (say Θ0 = 6 and Θ1 = 3) outputs 10. Then we can say that choice B is better than A because the cost is lower.

Therefore we want to minimize the cost function J(Θ0, Θ1): we want to find the values of Θ0 and Θ1 which produce the smallest possible cost. With that optimal choice of theta, the learning algorithm produces the H function which gives us the best possible straight-line fit to the data.

Convex Function 

Suppose we had only one parameter Θ1. If we plot J(Θ1) as a function of Θ1, which takes on values from negative to positive, J(Θ1) turns out to be a U-shaped function. So the best choice of Θ1, the one which minimizes the cost function, is the value at the lowest point of the U shape. Luckily the cost function J for linear regression produces a U shape, and there is only one global optimum. (If we had two thetas, then the graph would be a 3D bowl-like shape where the global optimum is the bottom of the bowl; contour graphs produce ellipses.)
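We can see that U shape numerically with a small sketch (my own illustration, fixing Θ0 = 0 and using a toy data set where y = x): sweeping Θ1 over a range and evaluating J should bottom out at Θ1 = 1.

```python
# Toy training set where y = x, so the best slope is theta1 = 1
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def J(theta1):
    """Cost as a function of theta1 alone (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

sweep = [round(t * 0.5, 1) for t in range(-4, 7)]  # theta1 from -2.0 to 3.0
costs = [J(t) for t in sweep]
best = sweep[costs.index(min(costs))]
print(best)  # 1.0 -- the bottom of the "U"
```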

Gradient Descent  

For the different possible choices of Θ0, Θ1, we end up with a different cost produced by the cost function (J) and a different HΘ(X) function which produces a different straight line. The shape of J(Θ0, Θ1) is like a bowl, and our goal is to minimize J(Θ0, Θ1). We can use gradient descent to help us arrive at the bottom of the “bowl”, which reveals the optimal values of Θ0, Θ1.

Outline of gradient descent method

Start with some Θ0, Θ1.
Keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) till we converge at the bottom of the “bowl”.

Initialize Θ0, Θ1 to some values, then

repeat until convergence {

Θj := Θj – α × [∂/∂Θj J(Θ0, Θ1)]    (for j = 0 and j = 1) –> simultaneous update for all theta

}
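A sketch of one update step in Python (names are my own; the gradient formulas used here are the derivative terms worked out later in these notes). The key detail is the simultaneous update: both partial derivatives are computed from the old (theta0, theta1) before either parameter is overwritten.

```python
def step(theta0, theta1, xs, ys, alpha):
    """One gradient-descent step with a simultaneous update of both thetas."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                             # d/d(theta0) of J
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # d/d(theta1) of J
    # Both new values are computed from the old thetas, then assigned together
    return theta0 - alpha * grad0, theta1 - alpha * grad1

t0, t1 = step(0.0, 0.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.1)
```

Updating theta0 in place and then reusing it while computing grad1 would be the classic bug this rule guards against.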

α denotes the learning rate (alpha)

[∂/∂Θj J(Θ0, Θ1)] denotes the partial derivative term of the cost function J(Θ0, Θ1)

The derivative term gives us the slope at a point on the bowl-shaped plot.

Intuition of gradient descent method

Suppose we had only one theta and our cost function is J(Θ1) = 1/(2m) × ∑ᵢ₌₁ᵐ (HΘ(Xᵢ) – yᵢ)², and we know this produces a U-shaped plot.

If the slope is positive, the gradient * alpha  will give us a positive value.

If  the slope is negative, the gradient * alpha will give us a negative value.

The further away from the global optimum, the larger the gradient, which means the steeper the slope.

Suppose we start at some value which corresponds to the upper right side of the U bowl. The slope will thus be positive and relatively steep, so the update Θ1 := Θ1 – (a large positive number) takes us one big step left towards the bottom and reduces the value of Θ1. As we approach the global optimum, the slope becomes less and less steep, which means the gradient gets lower and lower, which means the steps taken become smaller and smaller. When we reach the global optimum, the slope is flat and the gradient is 0, and thus Θ1 converges at that moment because Θ1 will not change after that: Θ1 = Θ1 – 0.

The same idea applies if we start at some value which corresponds to the upper left side of the U bowl. The only difference is that we get a negative slope, which means Θ1 – (the negative product of alpha and the gradient). This means Θ1 = Θ1 + a positive number, which takes us one step to the right.

Because the slope becomes smaller and smaller as we approach the global optimum, we can keep a constant alpha learning rate. When we are very far away from the global optimum, we take bigger steps. As we approach the bottom, we take smaller and smaller steps till we converge.

Partial Derivative Term

∂/∂Θ0 J(Θ0, Θ1) = 1/m × ∑ᵢ₌₁ᵐ (HΘ(Xᵢ) – yᵢ)

∂/∂Θ1 J(Θ0, Θ1) = 1/m × ∑ᵢ₌₁ᵐ [(HΘ(Xᵢ) – yᵢ) × Xᵢ]
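Plugging these two derivative terms into the update rule gives the full algorithm. A minimal sketch in Python (names, starting point, and toy data are my own):

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Minimise J(theta0, theta1) by repeated simultaneous updates."""
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # d/d(theta0) of J
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # d/d(theta1) of J
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying exactly on y = 2x + 1, so descent should find
# theta0 close to 1 and theta1 close to 2
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```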

To Conclude

The cost function computes the cost for a particular choice of theta.

After every iteration of gradient descent, we can take the thetas and plug them into the cost function to check whether the cost is lower than the previous cost. We should see the cost reducing after each iteration.
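That check can be sketched by recording J after every iteration and confirming it never increases (again my own illustration on a toy data set, not code from the course):

```python
def descend_with_history(xs, ys, alpha=0.1, iterations=100):
    """Run gradient descent and record the cost J after every iteration."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(
            sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
        )
    return history

hist = descend_with_history([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(all(a >= b for a, b in zip(hist, hist[1:])))  # True -- cost never increases
```

If the recorded cost ever goes up, that is the usual sign that the learning rate alpha is too large.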

When we converge in the end, we should have achieved our goal of identifying the Θ0 and Θ1 which minimise our cost function.

 

Last Notes 

Please leave some comments and point out my mistakes/misinterpretations/misunderstandings.

Reserved the right to be wrong 🙂
