Multiple Features
Previously we had
HΘ(X) = Θ0 + Θ1X1
Now we have
HΘ(X) = Θ0X0 + Θ1X1 + Θ2X2 + … + ΘnXn
= ΘᵀX
X0 is an additional feature / column in the training set; feature 0 has a fixed value of 1 for all rows.
We have n+1 features, so X is an (n+1)-dimensional vector [X0; X1; …; Xn]
We have n+1 thetas, so Θ is also an (n+1)-dimensional vector [Θ0; Θ1; …; Θn]
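A minimal NumPy sketch of the vectorized hypothesis ΘᵀX (the variable names and numbers here are made up for illustration):

import numpy as np

# One training example: X0 = 1 prepended, then two real features
x = np.array([1.0, 2104.0, 3.0])       # [X0, X1 = size, X2 = bedrooms]
theta = np.array([50.0, 0.1, 20.0])    # [Θ0, Θ1, Θ2], made-up values

h = theta @ x                          # ΘᵀX: Θ0*1 + Θ1*X1 + Θ2*X2
print(h)                               # 320.4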
Gradient Descent for Multiple Variables
Previously our cost function was
J(Θ0, Θ1) = 1/(2m) * ∑(i=1 to m) (HΘ(Xi) – yi)^2
Now we have
J(Θ) = 1/(2m) * ∑(i=1 to m) (HΘ(Xi) – yi)^2
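As a sketch, the cost can be computed in one shot with NumPy, assuming X already includes the column of ones:

import numpy as np

def compute_cost(X, y, theta):
    # X: m x (n+1) design matrix with the X0 = 1 column; y: length-m targets
    m = len(y)
    errors = X @ theta - y              # HΘ(Xi) – yi for every example at once
    return (errors @ errors) / (2 * m)  # 1/(2m) * sum of squared errors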
Previously we had the following for 1 feature
repeat {
Θ0 = Θ0 – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi)]
Θ1 = Θ1 – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi) * Xi]
}
Now we have
repeat {
Θj = Θj – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi) * Xij] –> simultaneous update of Θj for j = 0 to n, where Xij is feature j of example i
}
Since X0 is an m-dimensional vector containing all ones, when updating Θ0 we have
Θ0 = Θ0 – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi) * 1]
so nothing changes in the computation since we are just multiplying by 1.
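A sketch of the simultaneous update in NumPy: X.T @ errors computes ∑(HΘ(Xi) – yi) * Xij for every j at once, so Θ0 through Θn are all updated from the same pass.

import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y                      # HΘ(Xi) – yi for all i
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update, j = 0..n
    return theta

Because the X0 column is all ones, the Θ0 row of this update is exactly the multiply-by-1 case described above.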
Feature Scaling
The idea is to make the features take on values roughly in the same range, e.g. -1 <= X <= 1 or even -3 <= X <= 2, but not something extreme like -100 <= X <= 100 or -0.0001 <= X <= 0.0001. When the features are roughly on the same scale, gradient descent converges faster.
Example
Feature 1: Size of house (0-2000 feet^2), then we can take size/2000
Feature 2: Number of bedrooms (0-5), then we take numofbedrooms/5
So features 1 and 2 will both be in the range of 0 to 1
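A quick sketch with toy values (the numbers are made up):

import numpy as np

size = np.array([1416.0, 852.0, 1534.0])  # square feet, within 0-2000
bedrooms = np.array([3.0, 2.0, 1.0])

size_scaled = size / 2000.0        # feature 1: size / 2000
bedrooms_scaled = bedrooms / 5.0   # feature 2: bedrooms / 5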
Mean Normalization
Idea: standardize the values of the features (centre them around zero) so that gradient descent converges faster. We take (feature value – average value) / range.
Example
(size of house – 1000) / 2000, which gives a range of -0.5 to 0.5
We can also divide by the standard deviation instead of the range to get the z-score.
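A sketch of both variants, assuming feature is a NumPy array:

import numpy as np

def mean_normalize(feature):
    # (feature value – average value) / range
    return (feature - feature.mean()) / (feature.max() - feature.min())

def z_score(feature):
    # divide by the standard deviation instead of the range
    return (feature - feature.mean()) / feature.std()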
Learning Rate
Suppose we plot J(Θ) as a function of the number of iterations while running gradient descent. If gradient descent is working correctly, we should see J(Θ) decrease on every iteration.
If the curve trends upwards, or oscillates in waves as the number of iterations increases, then alpha is too big.
If alpha is too small, convergence will take a long time.
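One way to produce that plot is to record J(Θ) after every update; this sketch reuses the update step from above and assumes matplotlib for plotting:

import numpy as np
import matplotlib.pyplot as plt

def gradient_descent_with_history(X, y, theta, alpha, num_iters):
    # Same update as before, but also records J(Θ) after each iteration
    m = len(y)
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))  # J(Θ) after the update
    return theta, history

# plt.plot(history); plt.xlabel("iteration"); plt.ylabel("J(theta)"); plt.show()
# The curve should slope downwards; if it climbs or oscillates, reduce alpha.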
Features and Polynomial Regression
We can create new features from polynomial terms. For example, if we only have one feature (size of house) in our training set, we can set HΘ(X) = Θ0X0 + Θ1X1 + Θ2X2 + Θ3X3 where
X1= size
X2= size^2
X3= size^3
Note: if we use polynomial terms as features, feature scaling becomes even more important, since the ranges of size, size^2, and size^3 differ by orders of magnitude.
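A sketch of building the polynomial design matrix (the sizes are toy values):

import numpy as np

size = np.array([1000.0, 1500.0, 2000.0])
X_poly = np.column_stack([np.ones_like(size), size, size**2, size**3])
# Columns are X0 = 1, X1 = size, X2 = size^2, X3 = size^3.
# Their ranges now span roughly 10^3 to 10^9, hence the need for scaling.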
Normal Equation
Another method of minimizing J(Θ) without solving it in an iterative manner like gradient descent. It is a one-step closed-form solution (no iterations), though computing the inverse costs roughly O(n^3) in the number of features:
Θ = (XᵀX)⁻¹ * Xᵀy
(XᵀX)⁻¹ is the inverse of the matrix XᵀX
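A sketch in NumPy: np.linalg.solve avoids forming the inverse explicitly, and the pseudo-inverse covers the case where XᵀX is not invertible.

import numpy as np

def normal_equation(X, y):
    # Θ = (XᵀX)⁻¹ Xᵀ y, solved as a linear system instead of inverting
    return np.linalg.solve(X.T @ X, X.T @ y)

# If XᵀX is singular (e.g. redundant features, or m < n), use the pseudo-inverse:
# theta = np.linalg.pinv(X) @ y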