Classification
We are no longer trying to predict a real value as in linear regression. For classification problems we are trying to predict discrete values, 0 or 1. Examples: spam or not spam, benign or malignant tumor, etc. For linear regression we have HΘ(X) = ΘᵀX = some real number, which can be < 0 or > 1; that is strange if we use the same model for a classification problem, since y should be ∈ {0, 1}. We will use logistic regression for classification problems.
Hypothesis Representation
The H function for logistic regression wraps linear regression's output in a function G: HΘ(X) = G(ΘᵀX) = 1 / (1 + e^(-ΘᵀX))
G(z) = 1 / (1 + e^(-z)) --> known as the sigmoid or logistic function.
From the plot, we see that 1 / (1 + e^(-ΘᵀX)) produces an output between 0 and 1. The output can be treated as a probability value.
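The sigmoid is simple to code up; a minimal NumPy sketch (the function name is just illustrative):

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(0) is exactly 0.5; large positive inputs approach 1,
# large negative inputs approach 0
```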
Decision Boundary
The line which separates the region where we predict y = 0 from the region where we predict y = 1.
From the plot of the sigmoid function, we see that
if the input z = ΘᵀX to the sigmoid is >= 0, then 0.5 <= HΘ(X) <= 1
if the input z = ΘᵀX to the sigmoid is < 0, then 0 <= HΘ(X) < 0.5
We predict y = 1 whenever HΘ(X) >= 0.5, i.e. ΘᵀX >= 0
We predict y = 0 whenever HΘ(X) < 0.5, i.e. ΘᵀX < 0
HΘ(X) itself is interpreted as P(y = 1 | X; Θ), the probability that y = 1 given X, parameterized by Θ.
The decision boundary is a property of the chosen theta, not of the data set.
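Since the prediction rule reduces to checking the sign of ΘᵀX, it can be sketched as below. The theta and the two data rows are made-up values for illustration: with an intercept column of ones, this theta puts the boundary at x1 + x2 = 3.

```python
import numpy as np

def predict(theta, X):
    # predict 1 exactly when theta . x >= 0, which is the same as h(x) >= 0.5
    return (X @ theta >= 0).astype(int)

theta = np.array([-3.0, 1.0, 1.0])   # hypothetical parameters: boundary x1 + x2 = 3
X = np.array([[1.0, 1.0, 1.0],       # x1 + x2 = 2, below the boundary
              [1.0, 2.0, 2.0]])      # x1 + x2 = 4, above the boundary
```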
Cost Function of Logistic Regression
Suppose we used the cost function of linear regression for logistic regression: the plot would be non-convex, because the non-linear sigmoid now sits inside the squared term. A non-convex function does not guarantee that we reach the global optimum.
J(Θ) = 1/(2m) * Σᵢ₌₁ᵐ (HΘ(Xi) - yi)² --> linear regression's cost function; if we substitute the sigmoid for the H function, it becomes non-convex. Therefore we need another cost function.
Cost of Logistic Regression
Cost(HΘ(X), y) = {
-log(HΘ(X)) if y = 1
-log(1 – HΘ(X)) if y = 0
}
Interpretation
First we know HΘ(X) = 1 / (1 + e^(-ΘᵀX)) = some value r in (0, 1), therefore
if y = 1, the cost is -log(r)
if y = 0, the cost is -log(1 - r)
Why does it work?
We know the output of sigmoid is between 0 and 1.
If y = 1 and we predicted as 0 , the cost would shoot up and vice versa. If the cost is high due to wrong prediction, it means the choice of theta is bad which mean the penalty is justified. This function has the convex property too.
Simplified Cost Function
Cost(HΘ(X), y) = – y log(HΘ(X)) – (1 – y) log (1 – HΘ(X))
Therefore we have
J(Θ) = -(1/m) * Σᵢ₌₁ᵐ [ yi log(HΘ(Xi)) + (1 - yi) log(1 - HΘ(Xi)) ]
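The simplified cost vectorizes naturally; a NumPy sketch (compute_cost is an illustrative name, and the check data are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    # vectorized J(theta): average logistic cost over all m examples
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# sanity check: with theta = 0 every h is 0.5, so J = -log(0.5) = log(2)
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0])
```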
Gradient Descent
The update rule looks the same as linear regression's. However, do take note that HΘ is now the sigmoid function 1 / (1 + e^(-ΘᵀX)) and not ΘᵀX. The goal of minimizing the cost with the optimal choice of theta is still the same.
repeat {
Θj = Θj - alpha * [1/m * Σᵢ₌₁ᵐ (HΘ(Xi) - yi) * Xi,j] --> simultaneous update of Θj for j = 0 to n, where Xi,j is the j-th feature of the i-th example
}
[1/m * Σᵢ₌₁ᵐ (HΘ(Xi) - yi) * Xi,j] --> the derivative term, which outputs the gradient
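The whole loop can be sketched in vectorized form, where one matrix expression computes the derivative term and updates every Θj simultaneously (alpha, the iteration count, and the tiny data set below are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha, iters):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # the derivative term
        theta = theta - alpha * grad                # simultaneous update of all theta_j
    return theta

# toy 1-feature data (with an intercept column): x < 1.5 -> 0, x > 1.5 -> 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y, alpha=0.5, iters=5000)
```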
Optimization
There exist other methods to minimize the cost function without needing us to choose alpha (the learning rate). Octave provides the fminunc (unconstrained minimization) function, which is described as gradient descent on steroids. It is a higher-order function which takes a function G, an initial theta vector and an options vector. Function G should return a cost and a gradient vector for fminunc to work with. The options indicate the number of iterations, etc.
Gradient vector[0] = 1/m * Σᵢ₌₁ᵐ (HΘ(Xi) - yi) * Xi,0
Gradient vector[1] = 1/m * Σᵢ₌₁ᵐ (HΘ(Xi) - yi) * Xi,1
.
.
Gradient vector[n] = 1/m * Σᵢ₌₁ᵐ (HΘ(Xi) - yi) * Xi,n
Note: We did not pass in the learning rate alpha to fminunc.
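Octave's fminunc has no exact Python twin; as a rough analogue (an assumption for illustration, not the course's setup), scipy.optimize.minimize with jac=True accepts the same kind of function that returns both a cost and a gradient vector, and likewise needs no alpha. The data below are made up:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # returns (cost, gradient vector), the contract fminunc also expects
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# small non-separable toy data so the optimum is finite
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
res = minimize(cost_and_grad, np.zeros(2), args=(X, y), jac=True, method='BFGS')
# res.x is the fitted theta; note no learning rate was passed in
```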
Multiclass Classification: One vs all
The idea is to take n classes and convert the problem into n separate binary (two-class) classification problems.
Suppose we want to predict which of the following groups a person falls into, based on some features.
friend, lover, relatives, foe
We ask separately
1. Is he/she a friend? (Yes/No) --> Classifier 1, HΘ⁽¹⁾(x)
2. Is he/she a lover? (Yes/No) --> Classifier 2, HΘ⁽²⁾(x)
3. Is he/she a relative? (Yes/No) --> Classifier 3, HΘ⁽³⁾(x)
4. Is he/she our foe? (Yes/No) --> Classifier 4, HΘ⁽⁴⁾(x)
4 hypotheses, each predicting the probability that the example belongs to its class. To classify a new input x, run all 4 and pick the class i whose HΘ⁽ⁱ⁾(x) is highest.
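The "ask each classifier separately, then compare" step can be sketched as follows. Here all_theta holds one already-trained theta row per class; the values and shapes are made up so each winner is obvious:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(all_theta, X):
    # run every binary classifier and pick the class with the highest probability
    probs = sigmoid(X @ all_theta.T)   # shape (m, n_classes)
    return np.argmax(probs, axis=1)

# hypothetical parameters for 3 classes over 1 feature plus an intercept
all_theta = np.array([[ 5.0, -1.0],
                      [-5.0,  1.0],
                      [ 0.0,  0.0]])
X = np.array([[1.0, 0.0],    # class 0 scores highest here
              [1.0, 10.0]])  # class 1 scores highest here
```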