Multiple Features
Previously we had
HΘ(X) = Θ0 + Θ1X1
Now we have
HΘ(X) = Θ0X0 + Θ1X1 + Θ2X2 + … + ΘnXn
= ΘᵀX
X0 is an additional feature / column in the training set; feature 0 has a fixed value of 1 for all rows.
We have n+1 features, so X is an (n+1)-dimensional vector [X0; X1; …; Xn]
We have n+1 thetas, so Θ is also an (n+1)-dimensional vector [Θ0; Θ1; …; Θn]
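A minimal NumPy sketch of the vectorized hypothesis ΘᵀX (the variable names and numbers here are made up for illustration):

import numpy as np

# One training example: X0 = 1 prepended, then two real features
x = np.array([1.0, 2104.0, 3.0])       # [X0, X1 = size, X2 = bedrooms]
theta = np.array([50.0, 0.1, 20.0])    # [Θ0, Θ1, Θ2], made-up values

h = theta @ x                          # ΘᵀX: Θ0*1 + Θ1*X1 + Θ2*X2
print(h)                               # 320.4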
Gradient Descent for Multiple Variables
Previously our cost function was
J(Θ0, Θ1) = 1/(2m) * ∑(i=1 to m) (HΘ(Xi) – yi)^2
Now we have
J(Θ) = 1/(2m) * ∑(i=1 to m) (HΘ(Xi) – yi)^2
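As a sketch, the cost can be computed in one shot with NumPy, assuming X already includes the column of ones:

import numpy as np

def compute_cost(X, y, theta):
    # X: m x (n+1) design matrix with the X0 = 1 column; y: length-m targets
    m = len(y)
    errors = X @ theta - y              # HΘ(Xi) – yi for every example at once
    return (errors @ errors) / (2 * m)  # 1/(2m) * sum of squared errors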
Previously we had the following for 1 feature
repeat {
Θ0 = Θ0 – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi)]
Θ1 = Θ1 – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi) * Xi]
}
Now we have
repeat {
Θj = Θj – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi) * Xij] –> simultaneous update of Θj for j = 0 to n, where Xij is feature j of example i
}
Since X0 is an m-dimensional vector containing all ones, when updating Θ0 we have
Θ0 = Θ0 – alpha * [1/m * ∑(i=1 to m) (HΘ(Xi) – yi) * 1]
so nothing changes in the computation since we are just multiplying by 1.
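A sketch of the simultaneous update in NumPy: X.T @ errors computes ∑(HΘ(Xi) – yi) * Xij for every j at once, so Θ0 through Θn are all updated from the same pass.

import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y                      # HΘ(Xi) – yi for all i
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update, j = 0..n
    return theta

Because the X0 column is all ones, the Θ0 row of this update is exactly the multiply-by-1 case described above.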
Feature Scaling
The idea is to make the features take on values roughly in the same range, e.g. -1 <= X <= 1 or even -3 <= X <= 2, but not something extreme like -100 <= X <= 100 or -0.0001 <= X <= 0.0001. When the features are roughly on the same scale, gradient descent converges faster.
Example
Feature 1: Size of house (0-2000 feet^2), then we can take size/2000
Feature 2: Number of bedrooms (0-5), then we take numofbedrooms/5
So features 1 and 2 will both be in the range of 0 to 1
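A quick sketch with toy values (the numbers are made up):

import numpy as np

size = np.array([1416.0, 852.0, 1534.0])  # square feet, within 0-2000
bedrooms = np.array([3.0, 2.0, 1.0])

size_scaled = size / 2000.0        # feature 1: size / 2000
bedrooms_scaled = bedrooms / 5.0   # feature 2: bedrooms / 5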
Mean Normalization
Idea: standardize the values of the features (centre them around zero) so that gradient descent converges faster. We take (feature value – average value) / range.
Example
(size of house – 1000) / 2000, which gives a range of -0.5 to 0.5
We can also divide by the standard deviation instead of the range to get the z-score.
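A sketch of both variants, assuming feature is a NumPy array:

import numpy as np

def mean_normalize(feature):
    # (feature value – average value) / range
    return (feature - feature.mean()) / (feature.max() - feature.min())

def z_score(feature):
    # divide by the standard deviation instead of the range
    return (feature - feature.mean()) / feature.std()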
Learning Rate
Suppose we plot J(Θ) as a function of the number of iterations while running gradient descent. If gradient descent is working correctly, we should see J(Θ) decrease on every iteration.
If the curve trends upwards, or oscillates in waves as the number of iterations increases, then alpha is too big.
If alpha is too small, convergence will take a long time.
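One way to produce that plot is to record J(Θ) after every update; this sketch reuses the update step from above and assumes matplotlib for plotting:

import numpy as np
import matplotlib.pyplot as plt

def gradient_descent_with_history(X, y, theta, alpha, num_iters):
    # Same update as before, but also records J(Θ) after each iteration
    m = len(y)
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))  # J(Θ) after the update
    return theta, history

# plt.plot(history); plt.xlabel("iteration"); plt.ylabel("J(theta)"); plt.show()
# The curve should slope downwards; if it climbs or oscillates, reduce alpha.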
Features and Polynomial Regression
We can create new features from polynomial terms. For example, if we only have one feature (size of house) in our training set, we can set HΘ(X) = Θ0X0 + Θ1X1 + Θ2X2 + Θ3X3 where
X1= size
X2= size^2
X3= size^3
Note: if we use polynomial terms as features, feature scaling becomes even more important, since the ranges of size, size^2, and size^3 differ by orders of magnitude.
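A sketch of building the polynomial design matrix (the sizes are toy values):

import numpy as np

size = np.array([1000.0, 1500.0, 2000.0])
X_poly = np.column_stack([np.ones_like(size), size, size**2, size**3])
# Columns are X0 = 1, X1 = size, X2 = size^2, X3 = size^3.
# Their ranges now span roughly 10^3 to 10^9, hence the need for scaling.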
Normal Equation
Another method of minimizing J(Θ) without solving it in an iterative manner like gradient descent. It is a one-step closed-form solution (no iterations), though computing the inverse costs roughly O(n^3) in the number of features:
Θ = (XᵀX)⁻¹ * Xᵀy
(XᵀX)⁻¹ is the inverse of the matrix XᵀX
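A sketch in NumPy: np.linalg.solve avoids forming the inverse explicitly, and the pseudo-inverse covers the case where XᵀX is not invertible.

import numpy as np

def normal_equation(X, y):
    # Θ = (XᵀX)⁻¹ Xᵀ y, solved as a linear system instead of inverting
    return np.linalg.solve(X.T @ X, X.T @ y)

# If XᵀX is singular (e.g. redundant features, or m < n), use the pseudo-inverse:
# theta = np.linalg.pinv(X) @ y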