Chapter 3: Linear Regression With Multiple Variables

Multiple Features

Previously we have

HΘ(X) = Θ0 + Θ1X

Now we have

HΘ(X) = Θ0X0 + Θ1X1 + Θ2X2 + … + ΘnXn

= Θ^T X

X0 is an additional feature/column added to the training set; feature 0 has a fixed value of 1 for all rows.

Including X0, we have n+1 features, so X is an (n+1)-dimensional vector [X0; X1; …; Xn]

We also have n+1 thetas, so Θ is an (n+1)-dimensional vector too: [Θ0; Θ1; …; Θn]
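As a quick sketch of the vectorized hypothesis (assuming NumPy; the numbers are made up for illustration), HΘ(X) = Θ^T X is just a dot product:

import numpy as np

# One training example with n = 3 features, plus the fixed X0 = 1 in front.
x = np.array([1.0, 2104.0, 5.0, 45.0])      # [X0, X1, X2, X3]
theta = np.array([80.0, 0.1, 0.05, -3.0])   # [Θ0, Θ1, Θ2, Θ3]

# HΘ(X) = Θ^T X — a single dot product replaces the sum Θ0X0 + ... + ΘnXn.
h = theta @ x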

Gradient Descent for Multiple Variables

Previously our cost function was

J(Θ0, Θ1) = 1/(2m) * Σ(i=1 to m) (HΘ(X(i)) – y(i))^2

Now we have

J(Θ) = 1/(2m) * Σ(i=1 to m) (HΘ(X(i)) – y(i))^2

Previously we had the following for 1 feature

repeat  {

Θ0 = Θ0 – alpha * [1/m * Σ(i=1 to m) (HΘ(X(i)) – y(i))]

Θ1 = Θ1 – alpha * [1/m * Σ(i=1 to m) (HΘ(X(i)) – y(i)) * X(i)]

}

Now we have

repeat {

Θj = Θj – alpha * [1/m * Σ(i=1 to m) (HΘ(X(i)) – y(i)) * Xj(i)]  –> simultaneous update of Θj for j = 0 to n

}

Since X0 is an m-dimensional vector containing all ones, when updating Θ0 we have

Θ0 = Θ0 – alpha * [1/m * Σ(i=1 to m) (HΘ(X(i)) – y(i)) * 1]

so the computation of Θ0 is unchanged, since we are just multiplying by 1.
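Here is a minimal NumPy sketch of the simultaneous update (my own function and variable names; X is the m x (n+1) design matrix with the column of ones as X0):

import numpy as np

def gradient_descent_step(X, y, theta, alpha):
    # X: (m, n+1) design matrix, y: (m,) targets, theta: (n+1,) parameters.
    m = len(y)
    errors = X @ theta - y              # HΘ(X(i)) - y(i) for every example at once
    gradient = (X.T @ errors) / m       # 1/m * Σ (error * Xj(i)) for every j
    return theta - alpha * gradient     # all Θj updated simultaneously

Because the whole X matrix is used, the Θ0 case (multiplying by 1) falls out automatically from the column of ones.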

Feature Scaling

The idea is to make the features take on values in roughly the same range, e.g. -1 <= X <= 1 or -3 <= X <= 2, but nothing as extreme as -100 <= X <= 100 or -0.0001 <= X <= 0.0001. If the features are roughly on the same scale, gradient descent converges faster.

Example

Feature 1: Size of house (0-2000 feet^2), then we can take size/2000

Feature 2: Number of bedrooms (0-5), then we take numofbedrooms/5

So features 1 and 2 will both be in the range of 0 to 1.
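A quick sketch of this divide-by-range scaling (made-up numbers):

import numpy as np

size = np.array([852.0, 1416.0, 1534.0, 2000.0])   # square feet, roughly 0-2000
bedrooms = np.array([2.0, 3.0, 3.0, 5.0])          # roughly 0-5

size_scaled = size / 2000.0        # size / 2000
bedrooms_scaled = bedrooms / 5.0   # number of bedrooms / 5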

Mean Normalization

Idea: I think this is to standardize the values of the features so that gradient descent converges faster. We take (feature value – average value) / range.

Example

(size of house – 1000) / 2000, so we get a range of -0.5 to 0.5

We can also divide by the standard deviation instead to get the z-score.
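A sketch of mean normalization in NumPy, using either the range or the standard deviation as the denominator (made-up numbers again):

import numpy as np

size = np.array([852.0, 1416.0, 1534.0, 2000.0])

# (feature value – average value) / range
range_norm = (size - size.mean()) / (size.max() - size.min())

# (feature value – average value) / standard deviation  -> z-score
z_score = (size - size.mean()) / size.std()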

Learning Rate

Suppose we plot J(Θ) as a function of the number of iterations when running gradient descent. If gradient descent is working correctly, we should see J(Θ) decrease on every iteration.

If the curve trends upwards or oscillates in waves as the number of iterations increases, then alpha is too big.

If alpha is too small, convergence will be slow.
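A small self-contained sketch of this debugging plot (assuming NumPy and matplotlib; the data, alpha, and iteration count are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

# Tiny made-up data set: a column of ones (X0) plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.zeros(2)
alpha, m = 0.1, len(y)

costs = []
for _ in range(200):
    errors = X @ theta - y
    costs.append((errors @ errors) / (2 * m))   # J(Θ) = 1/(2m) * Σ (HΘ(X(i)) – y(i))^2
    theta -= alpha * (X.T @ errors) / m         # simultaneous update of all Θj

plt.plot(costs)   # should decrease on every iteration if alpha is well chosen
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.show()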

Features and Polynomial Regression

We can create new features with polynomial terms. For example, if we only have one feature (size of house) in our training set, we can set HΘ(X) = Θ0X0 + Θ1X1 + Θ2X2 + Θ3X3, where

X1= size

X2= size^2

X3= size^3

Note: if we use polynomial terms for the features, then feature scaling becomes even more important, since size, size^2, and size^3 have very different ranges.
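A sketch of building the polynomial columns by hand and then scaling them (made-up sizes):

import numpy as np

size = np.array([852.0, 1416.0, 1534.0, 2000.0])

# New features from polynomial terms of the single original feature.
X = np.column_stack([size, size ** 2, size ** 3])   # X1, X2, X3

# Mean-normalize each column, then prepend the X0 = 1 column.
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.column_stack([np.ones(len(size)), X])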

 Normal Equation 

Another method of minimizing J(Θ) without solving for Θ iteratively like gradient descent. Looks to me like a one-shot (closed-form) method, using the following formula

Θ = (X^T X)^-1 * X^T y

(X^T X)^-1 is the inverse of the matrix X^T X
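A sketch of the normal equation in NumPy (made-up data; np.linalg.pinv is used instead of a plain inverse so it still works when X^T X is not invertible):

import numpy as np

# X: (m, n+1) design matrix with the X0 = 1 column, y: (m,) targets.
X = np.array([[1.0, 2104.0, 5.0],
              [1.0, 1416.0, 3.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Θ = (X^T X)^-1 * X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y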
