Optimization Objective
Previously our cost function for logistic regression is -(y log(h(x)) + (1-y) log(1 - h(x)))
hΘ(x) = g(ΘTx) = sigmoid(ΘTx)
if y = 1 , we want ΘTx >> 0 so that sigmoid(ΘTx) = r will be close to 1 and
-log(r) = cost (will be small). As r gets closer to 0 , the cost goes up, resulting in an arc-shaped plot like "(" rotated slightly anticlockwise
if y = 0 , we want ΘTx << 0 so that sigmoid(ΘTx) = r will be close to 0 and
-log(1 - r) = cost (will be small). As r gets closer to 1 , the cost goes up, resulting in an arc-shaped plot like ")" rotated slightly clockwise
For the support vector machine , we want the plot to look like \_ for y = 1 and _/ for y = 0.
The "_" part of \_ denotes that the cost is 0 if the input z (which is actually ΘTx) is >= 1
The "_" part of _/ denotes that the cost is 0 if the input z (which is actually ΘTx) is <= -1
The new cost lines \_ and _/ are not produced with log, so the definition of the SVM cost function will be different.
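The course only defines cost1 and cost0 by their shapes. One common concrete choice that produces the \_ and _/ shapes is a pair of hinge functions; here is a minimal Python sketch (the slope of 1 is my own assumption, not something given in the lecture):

```python
import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once z >= 1, grows linearly as z decreases (the "\_" shape).
    return np.maximum(0, 1 - z)

def cost0(z):
    # Cost used when y = 0: zero once z <= -1, grows linearly as z increases (the "_/" shape).
    return np.maximum(0, 1 + z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(cost1(z))  # [3. 2. 1. 0. 0.]
print(cost0(z))  # [0. 0. 1. 2. 3.]
```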
Cost function of logistic regression with regularization:
J(Θ) = -(1/m) * [ ∑_{i=1}^{m} y^(i) log(hΘ(x^(i))) + (1 - y^(i)) log(1 - hΘ(x^(i))) ] + (λ/2m) * ∑_{j=1}^{n} Θj^2
Cost function of SVM:
J(Θ) = C * [ ∑_{i=1}^{m} y^(i) cost1(ΘTx^(i)) + (1 - y^(i)) cost0(ΘTx^(i)) ] + (1/2) * ∑_{j=1}^{n} Θj^2
The constant "1/m" is removed from the first term and from the regularization term.
λ is removed from the regularisation term too.
A new constant C is added in front of the first term.
Logic of SVM cost function
Previously for logistic regression , we had the form A + λB, where
A = -(1/m) * [ ∑_{i=1}^{m} y^(i) log(hΘ(x^(i))) + (1 - y^(i)) log(1 - hΘ(x^(i))) ]
B = (1/2m) * ∑_{j=1}^{n} Θj^2
If λ is large, the λB term gives B a heavier weight than A.
The SVM cost has the form C.A + B (no more lambda).
C plays the role of 1/λ: making C small lowers the weight of A relative to B, which is just another way to maintain the same trade-off. Making B heavier than A versus making A lighter than B.
Also, removing the constant 1/m does not change the thetas learned. The optimization objective is to minimize this new cost function.
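Putting the pieces together, here is a minimal sketch of the SVM objective C.A + B in Python (it reuses the hinge-style cost1/cost0 above; the function and variable names are mine, not from the course):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    # X has a leading column of ones so that z = theta' * x includes theta_0.
    z = X @ theta
    A = np.sum(y * np.maximum(0, 1 - z)            # cost1 term, active when y = 1
               + (1 - y) * np.maximum(0, 1 + z))   # cost0 term, active when y = 0
    B = 0.5 * np.sum(theta[1:] ** 2)               # regularization term, skipping theta_0
    return C * A + B                               # no 1/m and no lambda, as noted above
```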
Large Margin Intuition
The optimisation problem, as I see it, consists of 2 objectives.
Firstly we want our thetas to satisfy the following conditions:
ΘTx >= 1 whenever y is 1
ΘTx <= -1 whenever y is 0
Secondly we want to minimize the thetas which parameterize the cost function. We saw above that our cost function has the form C.A + B, where A is the accumulated cost (error) term. If C is huge , we really want A to be 0, otherwise the cost will be high.
Therefore, solving the optimisation problem means finding thetas which keep A close to 0 while satisfying the 2 conditions above. We end up with a hypothesis whose decision boundary tries to separate the positive and negative examples by the largest gap. Even if there are a few positive outliers inside the negative region , the SVM (provided C is not too large) will ignore those outliers and maintain the largest-gap decision boundary, like the following:
(We have a positive "x" in the negative "o" region, but the SVM still produces the linearly separable decision boundary that maintains the largest gap.)
Mathematics Behind Large Margin Classification
An n-dimensional vector can be represented as the hypotenuse of a right-angled triangle. The inner product of 2 vectors (U and V as depicted in the picture) is UTV and can be calculated as p . ||U||
In the picture:
||U|| is the length of vector u which is computed by Pythagoras‘ theorem.
P = length of V projected onto U (red line) -> I do not understand how P is computed, but will just take it as given for now
Things to note:
1. UTV is commutative: UTV == VTU (both U and V are n-dimensional vectors in the picture)
2. P is negative (< 0) if the angle between vectors U and V is > 90 degrees
(I cannot work out why UTV = P * || U || = U1V1 + U2V2 , please help if you know)
=======================================================================
Inserted at a later date, after some googling on vectors regarding the above doubts
Reference sites
1. Dot Product
P = Proj(V) onto U = || V || * cos Θ , where Θ is the angle between V and U
The length of U does not determine the length of P. It is like shining a torch onto V at a 90-degree angle perpendicular to U, and the shadow formed on U will be P. (The green light perpendicular to U can be treated as the line formed by the light ray.)
A unit vector is a vector whose length is 1, and vector U has an associated unit vector :u: (a unit vector is written with a ^ on top of the u , but since I do not know how to do that in wp , I just use :u: )
Vector U can be represented by some scalar multiple (c) of :u: because :u: has the same direction as U. So:
|| :u: || = 1
U = c * :u: and so we have || U || = c * || :u: ||
( U = c * :u: and || U || = c * || :u: || are just the vector and length versions of the same relationship ; the scalar multiple c is the same in both )
|| :u: || = || U || / c = 1
c = || U || = (U1^2 + U2^2)^(1/2)
:u: = U / c = [U1 / ||U|| , U2 / ||U||]
To generalize unit vector :u: = U / || U ||
P (the projection of V onto U) can also be computed via P = :u: . V –> P can also be treated as a measure of how much of V is in the same direction as :u: or U
(Both :u: and U have the same direction ; the length of :u: or U does not matter. If we shine a torch on V , only the angle theta and the length of V can affect P. )
So:
P = Proj(V) onto U = || V || * cos Θ
P = :u: . V = U/|| U || . V
[ U / || U || ] . V = || V || * cos Θ
( U.V = some real number , || U || = some real number, U/|| U || is the unit vector :u: , V and U are vectors, (U/ || U ||) .V = (U.V) / || U || = some real number)
Multiplying both sides by || U || produces
U.V = || V || * cos Θ * || U ||
= P * || U || (where P = || V || * cos Θ)
if we substitute P with V. U/|| U || then
V. U / || U || = || V || * cos Θ and multiplying both sides by || U || gives
V. U = || V || * cos Θ * || U || , so V.U = U.V
Of course we already knew U.V = U1V1 + U2V2 = V1U1 + V2U2 = V.U ; now we can see why UTV = P . || U || in the slide.
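A quick numerical check of this identity (a throwaway sketch; the two vectors are arbitrary):

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
p = np.linalg.norm(v) * cos_theta       # projection of V onto U = ||V|| * cos(theta)

print(u @ v)                            # U1*V1 + U2*V2 = 10.0
print(p * np.linalg.norm(u))            # P * ||U||      = 10.0 (same value)
```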
=======================================================================
If C is huge, then term A ( [ ∑_{i=1}^{m} y^(i) cost1(ΘTx^(i)) + (1 - y^(i)) cost0(ΘTx^(i)) ] ) needs to be 0, so we end up minimizing term B. Of all the possible choices of thetas which satisfy the 2 constraints, we choose the smallest theta. If our choice of theta satisfies the 2 constraints, term A will be 0.
Using the information from the previous slide , we treat the theta vector as the U vector and the example vector x^(i) as V, so
ΘTx = p . || Θ || >= 1 if y = 1
ΘTx = p . || Θ || <= -1 if y = 0
p^(i) is the projection of the example x^(i) onto the vector Θ
If we have the decision boundary like the green line on the left,
we will end up with a small p (red line), which means that to satisfy the positive or negative constraint we need || Θ || to be big.
If our decision boundary is the one on the right, then we will end up with a larger p (red line), which means we can have a smaller || Θ ||.
Because the SVM can learn a smaller || Θ ||, this results in the largest-margin hypothesis. This optimization argument is based on the assumption that C is large.
Summary (Not quite sure if my understanding is correct)
So if C is large -> we want term A to be 0. In order for term A to be 0 , we need to find the thetas that satisfy the 2 constraints.
ΘTx >= 1 whenever y is 1
ΘTx <= -1 whenever y is 0
We also saw that ΘTx can be substituted with || Θ || . p. If p is small, || Θ || needs to be large to meet the 2 constraints, but we want to minimize || Θ ||, not maximize it. So the trained SVM gives us the smallest possible theta vector, which in turn forces p to be larger and produces the largest-margin boundary.
How do we train the SVM to learn the optimal values of theta ?
It was not covered, but we can just use the octave/matlab SVM library 🙂
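In Python the rough equivalent would be scikit-learn (a sketch, assuming scikit-learn is installed; the toy data below is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)   # C plays the 1/lambda role discussed above
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # the learned thetas (weights) and theta_0 (intercept)
print(clf.predict([[2, 2], [5, 4]]))
```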
Kernels (Part 1)
We know we can create polynomial terms (X1X2 , X1^2 , X2^2 etc) to have more features. However, we do not know whether these features are useful or not, and the question is:
Is there a different way, other than the polynomial method, to create more, different features ?
A kernel is a similarity function used to compute how similar the features of an example are to a landmark example. The output value of the kernel function becomes a new feature.
Suppose we draw 2 landmark points, l1 and l2, on the 2D plane (x1 denotes the horizontal axis and x2 denotes the vertical axis):
x2| l2
| l1
|_________ x1
Given a particular example x (assume x has only 3 features and is therefore a 3-dimensional vector), we can have:
f1 = kernel(x , l1) = exp( -||x - l1||^2 / (2σ^2) )
f2 = kernel(x , l2) = exp( -||x - l2||^2 / (2σ^2) )
more generally :
fi = kernel(x , li)
= exp( -||x - li||^2 / (2σ^2) )
= exp( -∑_{a=1}^{n} (x_a - l^(i)_a)^2 / (2σ^2) ) where n is the number of features
||x – li|| = euclidean distance between 2 vectors.
||x - li||^2 = (euclidean distance)^2
(x1 - l1)^2 + (x2 - l2)^2 + (x3 - l3)^2 = ∑_{a=1}^{3} (x_a - l_a)^2
The kernel definition exp( -||x - li||^2 / (2σ^2) ) is one of the many types of kernels available. The one we are using here is called the Gaussian kernel.
If the given x is close to l1, then the euclidean distance will be close to 0, which results in
f1 = exp( -(~0) / (2σ^2) ) ≈ 1
and, since x is then far away from l2 (a large distance),
f2 = exp( -(large distance)^2 / (2σ^2) ) ≈ 0
because e^(-some large number) = 1 / e^(some large number) ≈ 0
So f, where 0 <= f <= 1, is a measure of how close x is to a landmark l.
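A minimal Gaussian kernel in Python (the function and variable names are my own):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # Similarity between example x and landmark l: 1 when x == l, approaches 0 as they move apart.
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x  = np.array([1.0, 2.0, 1.0])
l1 = np.array([1.0, 2.0, 1.0])   # identical to x
l2 = np.array([6.0, 9.0, 4.0])   # far away from x

print(gaussian_kernel(x, l1, sigma=1.0))   # 1.0
print(gaussian_kernel(x, l2, sigma=1.0))   # ~0
```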
Given:
x2| l2
| l1
|_________ x1
And suppose our hypothesis predicts 1 when:
Θ0 + Θ1f1 + Θ2f2 >= 0
Just to restate:
if x is near l1, then f1 will be close to 1
else f1 will be close to 0
the same logic applies for f2 and l2
if the thetas learned are -0.5 , 1 and 1 respectively for theta0 , theta1 and theta2
then we see that x will be a positive example if x is near either landmark
example: if x is near l1 , then
-0.5 + 1.(~1) + 1.(~0) = 0.5 >= 0 (positive)
if x is near l2 , then
-0.5 + 1.(~0) + 1.(~1) = 0.5 >= 0 (positive)
if x is far away from l1 and l2, then
-0.5 + 1.(~0) + 1.(~0) = -0.5 < 0 (negative)
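The same worked example in code, using the Gaussian kernel from above and the hypothetical thetas -0.5, 1, 1 (the landmark coordinates are made up for illustration):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

l1 = np.array([2.0, 2.0])
l2 = np.array([8.0, 8.0])
theta = np.array([-0.5, 1.0, 1.0])          # theta_0, theta_1, theta_2

for x in [np.array([2.1, 1.9]),             # near l1 -> positive
          np.array([7.9, 8.2]),             # near l2 -> positive
          np.array([5.0, 0.0])]:            # far from both -> negative
    f = np.array([1.0, gaussian_kernel(x, l1), gaussian_kernel(x, l2)])  # f_0 = 1
    score = theta @ f
    print(round(score, 2), "positive" if score >= 0 else "negative")
```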
The effect of the value of sigma (σ)
Suppose we plot f1 as a function of x (with the landmark l1 fixed) using the kernel function
f1 = kernel(x , l1) = exp( -||x - l1||^2 / (2σ^2) )
Then we get a 3D plot with x1 on the x axis , x2 on the z axis and f1 on the y axis.
The height of the 3D or contour plot denotes the value of f given some x and a fixed l.
If sigma is 1 , then we get a nice, slim, hill-shaped plot.
If sigma is increased , then we get a fatter , less steep hill –> the effect is that the value of f drops at a slower rate as x moves away from the landmark, because the hill is not so steep.
If sigma is less than 1 , then we get a thinner , steeper hill –> the effect is that the value of f drops at a faster rate as x moves further away from the landmark, because the hill is steeper.
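The same effect viewed numerically: how quickly f decays with the distance ||x - l|| for a few values of σ (a quick sketch; the distances are arbitrary):

```python
import numpy as np

d = np.array([0.0, 1.0, 2.0, 3.0])           # distance ||x - l|| from the landmark
for sigma in [0.5, 1.0, 2.0]:
    f = np.exp(-d ** 2 / (2 * sigma ** 2))
    print(sigma, np.round(f, 3))
# small sigma -> f falls off quickly (thin, steep hill)
# large sigma -> f falls off slowly (fat, flat hill)
```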
Kernels (Part 2)
How do we choose our landmarks?
In practice , we treat each example x in the training set as a landmark l.
Given a training set with m examples:
we will have m landmarks.
l1 = x1, l2 = x2,….lm = xm
For each example x^(i) in the training set , we can compute the new feature vector f associated with this particular x^(i):
f1 = kernel(x^(i) , l1) = exp( -||x^(i) - l1||^2 / (2σ^2) ) –> real value between 0 and 1
f2 = kernel(x^(i) , l2) = exp( -||x^(i) - l2||^2 / (2σ^2) ) –> real value between 0 and 1
f3 = kernel(x^(i) , l3) = exp( -||x^(i) - l3||^2 / (2σ^2) ) –> real value between 0 and 1
.
fi = kernel(x^(i) , li) = exp( -||x^(i) - li||^2 / (2σ^2) ) –> since x^(i) == li , the similarity value is 1 because the euclidean distance is 0
.
.
.
fm = kernel(x^(i) , lm) = exp( -||x^(i) - lm||^2 / (2σ^2) ) –> real value between 0 and 1
Therefore f is an (m+1)-dimensional vector with f0 = 1
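As a sketch, computing every example's new feature vector at once (each training example mapped against every landmark, i.e. every other example) could look like this in Python (my own names, not from the course):

```python
import numpy as np

def gaussian_feature_map(X, sigma):
    # Map each x^(i) to f^(i) = [1, k(x^(i), l^(1)), ..., k(x^(i), l^(m))],
    # where the landmarks l^(1)..l^(m) are the training examples themselves.
    m = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # m x m squared distances
    F = np.exp(-sq_dist / (2 * sigma ** 2))                          # diagonal is all 1s (x^(i) == l^(i))
    return np.hstack([np.ones((m, 1)), F])                           # prepend f_0 = 1 -> m x (m+1)
```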
Thoughts about what we are actually computing ?
Since l1 = x1, l2 = x2, …, lm = xm, let's say we take the first example x1 and compute f^(1), which is an (m+1)-dimensional vector. Each element in the vector f^(1) is a measure of how similar x1 is to each of the training examples. Sort of like:
x1 vs x1 -> f1^(1) , x1 vs x2 -> f2^(1) , x1 vs x3 -> f3^(1) , …… , x1 vs xm -> fm^(1)
x1 is then replaced with f^(1)
x2 is replaced with f^(2)
x3 is replaced with f^(3)
.
.
.
xm is replaced with f^(m)
How do we get the theta vector, which is (m+1)-dimensional ?
The SVM uses some learning algorithm to solve the minimization problem. We want the smallest (optimal) theta which can
1. give a 0 cost (if C is large) and
2. satisfy the 2 constraints
C * [ ∑_{i=1}^{m} y^(i) cost1(ΘTf^(i)) + (1 - y^(i)) cost0(ΘTf^(i)) ] + (1/2) * ∑_{j=1}^{m} Θj^2
A minor change is that instead of computing ΘTx , we are now computing ΘTf ; therefore the constraints become
ΘTf >= 1 whenever y is 1
ΘTf <= -1 whenever y is 0
And we now have m features (plus f0) and m+1 thetas. Do note that we still do not include theta 0 in the regularization term , thus j starts from 1 instead of 0.
Minimising the cost function can be done using libraries …Phew!!!
The effects of the choice of the SVM parameters are summarized in the pic.
Using an SVM
Things to note when using an SVM
1. choice of C
2. choice of kernel
If we are using the Gaussian kernel , we need to perform feature scaling
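In scikit-learn terms, a sketch would look like the following (sklearn's rbf kernel takes a gamma parameter, where gamma = 1 / (2σ^2) corresponds to the σ used above; the toy data is made up):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

sigma = 1.0
model = make_pipeline(
    StandardScaler(),                                   # feature scaling, as noted above
    SVC(kernel='rbf', C=1.0, gamma=1 / (2 * sigma ** 2)),
)
model.fit(X, y)
print(model.predict([[2, 2], [5, 4]]))
```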
Logistic regression vs SVMs
How do we choose which one to use for a classification problem ?
Last notes
The algorithms, parameters etc which we choose do matter ; however, what matters more are things like
1. The amount of data that we have.
2. How skilled we are at doing error analysis and debugging.
3. Figuring out what new features to add / remove to improve performance, etc.