Chapter 10: Large Margin Classification (SVM)

Optimization Objective


Previously, our cost function for logistic regression was -(y log(h(x)) + (1-y) log(1 - h(x)))

h_Θ(x) = sigmoid(Θ^T x)

if y = 1, we want Θ^T x >> 0 so that sigmoid(Θ^T x) = r will be close to 1, and

-log(r) = cost (will be small). As r gets closer to 0, the cost goes up, resulting in an arc-shaped plot like “(” rotated slightly anticlockwise.

if y = 0, we want Θ^T x << 0 so that sigmoid(Θ^T x) = r will be close to 0, and

-log(1 - r) = cost (will be small). As r gets closer to 1 (so 1 - r closer to 0), the cost goes up, resulting in an arc-shaped plot like “)” rotated slightly clockwise.

For the support vector machine , we want the plot to look like \_ for y =1 and _/ for y = 0.

The “_” part of \_ means the cost is 0 when the input z (which is actually Θ^T x) is >= 1

The “_” part of _/ means the cost is 0 when the input z (which is actually Θ^T x) is <= -1

The new piecewise-linear shapes \_ and _/ are not produced with log, so the SVM's cost functions (call them cost1(z) for y = 1 and cost0(z) for y = 0) are defined differently.
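As a sketch, the usual piecewise-linear way to write these two shapes (the lecture only draws them, so the exact formula here is my assumption):

```python
import numpy as np

# Hinge-style stand-ins for the SVM costs drawn in the lecture (my assumption
# of the exact formula; the course only shows the "\_" and "_/" shapes).
def cost1(z):
    """Cost used when y = 1: zero once z = theta'x >= 1, linear below that."""
    return np.maximum(0, 1 - z)

def cost0(z):
    """Cost used when y = 0: zero once z = theta'x <= -1, linear above that."""
    return np.maximum(0, 1 + z)

z = np.linspace(-3, 3, 7)
print(cost1(z))  # [4. 3. 2. 1. 0. 0. 0.]
print(cost0(z))  # [0. 0. 0. 1. 2. 3. 4.]
```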

Cost function of logistic regression with regularization:

J(Θ) = -1/m * [ Σ_{i=1}^{m} y^(i) log(h_Θ(x^(i))) + (1 - y^(i)) log(1 - h_Θ(x^(i))) ] + λ/(2m) * Σ_{j=1}^{n} Θ_j^2

Cost function of SVM:

J(Θ) = C * [ Σ_{i=1}^{m} y^(i) cost1(Θ^T x^(i)) + (1 - y^(i)) cost0(Θ^T x^(i)) ] + 1/2 * Σ_{j=1}^{n} Θ_j^2

The constant 1/m is removed from the first term and from the regularization term.

λ is removed from the regularization term too.

A new constant C is added in front of the first term.

Logic of SVM cost function

Previously, for logistic regression, we had the form A + λB

A = -1/m * [ Σ_{i=1}^{m} y^(i) log(h_Θ(x^(i))) + (1 - y^(i)) log(1 - h_Θ(x^(i))) ]

B = 1/(2m) * Σ_{j=1}^{n} Θ_j^2  (so the regularized objective A + λB matches the cost function above)

A large λ gives B a heavier weight than A.

Now the SVM cost has the form C·A + B (no more lambda).

C (which takes the role of 1/λ) can be set to a small value so that A gets less weight than B, which is another way to maintain the same tradeoff: making B heavier than A vs making A lighter than B.

Also, removing the constant 1/m does not change the Θ that is learned, since scaling the objective by a positive constant does not move its minimum. The optimization objective is to minimize this new cost function.
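Putting the C·A + B form into a small numpy sketch (the hinge-style cost1/cost0 below are my assumption of the exact piecewise costs; Θ_0 is left out of the regularization):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """C * sum_i [ y_i*cost1(theta'x_i) + (1-y_i)*cost0(theta'x_i) ] + 1/2 * sum_{j>=1} theta_j^2.

    X is (m, n+1) with a leading column of ones, y is in {0, 1}.
    cost1/cost0 are hinge-style stand-ins for the piecewise costs in the lecture.
    """
    z = X @ theta
    cost1 = np.maximum(0, 1 - z)   # used for y = 1 examples
    cost0 = np.maximum(0, 1 + z)   # used for y = 0 examples
    A = np.sum(y * cost1 + (1 - y) * cost0)
    B = 0.5 * np.sum(theta[1:] ** 2)   # theta_0 is not regularized
    return C * A + B

# Tiny usage example with made-up numbers
X = np.array([[1.0, 2.0], [1.0, -2.0]])   # first column is the intercept term
y = np.array([1, 0])
theta = np.array([0.0, 1.0])
print(svm_cost(theta, X, y, C=1.0))       # both margins satisfied -> A = 0, so cost = B = 0.5
```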

Large Margin Intuition

The optimisation problem, as I see it, consists of 2 objectives.

Firstly we want our thetas to satisfy the following conditions:

Θ^T x >= 1 whenever y = 1

Θ^T x <= -1 whenever y = 0

Secondly, we want to minimize the thetas which parameterize the cost function. We saw above that the form of our cost function is C·A + B, where A is the accumulated error-cost term. If C is huge, we really want A to be 0, otherwise the cost will be high.

Therefore, solving the optimisation problem means finding a Θ which keeps A close to 0 while satisfying the above 2 conditions. We end up with a hypothesis whose decision boundary separates the positive and negative examples with the largest possible gap. Even if there are a few positive outliers inside the negative region, the SVM (provided C is not too large) will ignore those outliers and maintain the largest-gap decision boundary, like the following:

(We have a positive “x” in the negative “o” region, but the SVM still produces the linearly separable decision boundary that maintains the largest gap.)


Mathematics Behind Large Margin Classification

(Slide: inner product of two vectors U and V, with the projection p of V onto U)

 

An n-dimensional vector can be pictured as the hypotenuse of a right-angled triangle. The inner product of 2 vectors (U and V as depicted in the picture) is written as U^T V and can be calculated as P · ||U||.

In the picture:

||U|| is the length of vector U, which is computed using Pythagoras' theorem.

P = length of V projected onto U (red line) -> I do not understand how P is computed, but will just take it as it is for now.

Things to note:

1. U^T V is commutative: U^T V == V^T U (both U and V are n-dimensional vectors in the picture)

2. P is negative (< 0) if the angle between vectors U and V is > 90 degrees

(I cannot work out how U^T V = P * ||U|| = U1V1 + U2V2, please help if you know)

=======================================================================

Inserted at a later date, after some googling on vectors regarding the above doubts.

Reference sites

1. Dot Product 

2. Unit Vector (Khan Academy)

3. Dot Product (Math is Fun)

P = Proj(V) onto U = || V || * cos Θ , where Θ is the angle between V and U

The length of U does not determine the length of P. It is like shining a torch onto V at a 90-degree angle perpendicular to U, and the shadow formed on U will be P. (The green line perpendicular to U can be treated as the line formed by the light rays.)

Unit Vector: a vector whose length is 1. Vector U has an associated unit vector :u: (a unit vector is written with a ^ on top of u, but since I do not know how to do that in WP, I just use :u: ).

Vector U can be represented as some scalar multiple (c) of :u:, because :u: has the same direction as U. So:

|| :u: || = 1 

U = c * :u:   and so we have  || U || = c * || :u: ||  

(Taking the length of both sides of U = c * :u: ; the scalar multiple c stays the same.)

|| :u: || = || U || / c = 1

c = || U || = (U1^2 + U2^2)^(1/2)

:u: = U / c = [ U1 / ||U|| , U2 / ||U|| ]

To generalize unit vector :u:  = U / || U ||

P (the projection of V onto U) can also be computed via P = :u: · V -> P can be treated as a measure of how much of V points in the same direction as :u: (or U).

(Both :u: and U have the same direction; the length of :u: or U does not matter. If we shine a torch on V, only the angle theta and the length of V affect P.)

So:

P = Proj(V) onto U = || V || * cos Θ

P = :u: . V = U/|| U || . V

[ U / || U || ] . V = || V || * cos Θ

( U.V = some real number , || U || = some real number, U/|| U || is the unit vector :u: , V and U are vectors, (U/ || U ||) .V = (U.V)  / || U || = some real number)

Multiplying || U || on both sides will produce

U.V = || V || * cos Θ * || U ||

        = P * || U ||  (where  P = || V || * cos Θ)

If we instead substitute P with V · U/||U||, then

V · U / ||U|| = ||V|| * cos Θ, and multiplying both sides by ||U|| gives

V · U = ||V|| * cos Θ * ||U||, so V·U = U·V.

Of course we already know U·V = U1V1 + U2V2 = V1U1 + V2U2 = V·U; the point is that now we can see why U^T V = P · ||U|| in the slide.
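A quick numeric check of the identity derived above, using made-up vectors:

```python
import numpy as np

# Quick numeric check of the identity derived above: u'v = p * ||u||,
# where p = ||v|| * cos(theta) is the (signed) projection of v onto u.
u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

norm_u = np.linalg.norm(u)          # ||u|| = 5
u_hat = u / norm_u                  # unit vector in the direction of u
p = u_hat @ v                       # projection of v onto u = (3*2 + 4*1)/5 = 2
print(p * norm_u)                   # 10.0
print(u @ v)                        # 10.0 -> same value, as derived
```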

=======================================================================

(Slide: simplified SVM objective when C is large)

If C is huge, then term A ( Σ_{i=1}^{m} y^(i) cost1(Θ^T x^(i)) + (1 - y^(i)) cost0(Θ^T x^(i)) ) needs to be 0, so we end up just minimizing term B. Of all the possible choices of Θ that satisfy the 2 constraints, we choose the one with the smallest norm. If our choice of Θ satisfies the 2 constraints, term A will be 0.

Using the information from the previous slide, we treat the Θ vector as U and the example vector x^(i) as V, so

Θ^T x^(i) = p^(i) · ||Θ|| >= 1 if y^(i) = 1

Θ^T x^(i) = p^(i) · ||Θ|| <= -1 if y^(i) = 0

where p^(i) is the projection of x^(i) onto the vector Θ.

(Slide: two candidate decision boundaries and the projections p of the examples onto Θ)

If we have a decision boundary like the green line on the left,

we end up with a small p (red line), which means that to satisfy the positive or negative constraints we need ||Θ|| to be big.

If our decision boundary is the one on the right, then we end up with a larger p (red line), which means we can have a smaller ||Θ||.

Because the SVM learns the smallest possible ||Θ||, this results in the largest-margin hypothesis. This reasoning is based on the assumption that C is large.

Summary (Not quite sure if my understanding is correct)

So if C is large -> we want term A to be 0. In order for term A to be 0, we need to find the Θ that satisfies the 2 constraints:

Θ^T x >= 1 whenever y = 1

Θ^T x <= -1 whenever y = 0

We also saw that Θ^T x can be rewritten as ||Θ|| · p. If we have a small p, ||Θ|| needs to be large to meet the 2 constraints, but we want to minimize ||Θ||, not maximize it. So the SVM, when trained, gives us the smallest possible Θ vector. This small Θ vector corresponds to a larger p and produces the largest-margin boundary.

How do we train the SVM to learn the optimal values of Θ?

It was not covered, but we can just use an Octave/MATLAB SVM library 🙂
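The course points to Octave/MATLAB packages; as an illustration of the same idea in Python, scikit-learn's svm module can do the minimization for us (the toy data here is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, made-up data: two clusters that are linearly separable.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM; C plays the role described above (roughly 1/lambda).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[1.5, 1.5], [5.5, 5.5]]))   # expected: [0 1]
```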

Kernels (Part 1)

We know we can create polynomial terms (x1x2, x1^2, x2^2, etc.) to get more features. However, we do not know whether these features are useful, and the question is:

Is there a different way, other than the polynomial method, to create different features?

(Slide: landmarks and kernel similarity features)

A kernel is a similarity function used to compute how similar an example's features are to a landmark example. The output value of the kernel function becomes a new feature.

Suppose we draw 2 landmark points, l1 and l2, on the 2D plane (x1 denotes the horizontal axis and x2 denotes the vertical axis):

x2|                l2
|   l1
|_________ x1

Given a particular example x (assume x has only 3 features and is therefore a 3-dimensional vector), we can have:

f1 = kernel(x , l1) = exp( -||x - l1||^2 / (2σ^2) )

f2 = kernel(x , l2) = exp( -||x - l2||^2 / (2σ^2) )

more generally:

fi = kernel(x , l^(i))

= exp( -||x - l^(i)||^2 / (2σ^2) )

= exp( - Σ_{a=1}^{n} (x_a - l^(i)_a)^2 / (2σ^2) ), where n is the number of features

||x - l^(i)|| = the euclidean distance between the 2 vectors.

||x - l^(i)||^2 = (euclidean distance)^2:

(x_1 - l_1)^2 + (x_2 - l_2)^2 + (x_3 - l_3)^2 = Σ_{a=1}^{3} (x_a - l_a)^2

The kernel definition exp( -||x - l^(i)||^2 / (2σ^2) ) is one of many types of kernels available. The one we are using here is called the Gaussian kernel.

If the given x is close to l1, then the euclidean distance will be close to 0, which results in

f1 = exp( -(~0) / (2σ^2) ) ≈ 1

and if x is far from l2, the distance is large, which means

f2 = exp( -(large distance)^2 / (2σ^2) ) ≈ 0

e^(-some large number) = 1 / e^(some large number) ≈ 0

So 0 <= f <= 1, and f is a measure of how close x is to a landmark l.
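As a minimal sketch of the Gaussian kernel defined above (written in numpy rather than the course's Octave; the sample points are made up):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l: exp(-||x - l||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

l1 = np.array([3.0, 5.0])
print(gaussian_kernel(np.array([3.0, 5.0]), l1, sigma=1.0))   # x == l1 -> 1.0
print(gaussian_kernel(np.array([9.0, 0.0]), l1, sigma=1.0))   # far from l1 -> ~0.0
```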

Given:

x2|                l2
|   l1
|_________ x1

And suppose our hypothesis predicts 1 when:

Θ0 + Θ1·f1 + Θ2·f2 >= 0

Just to restate:

if x is near l1 then f1 will be close to 1

else f1 will be close to 0

same logic for f2

If the thetas learned are -0.5, 1 and 1 respectively for Θ0, Θ1 and Θ2,

then we see that x will be predicted as a positive example if x is near either landmark.

Example: if x is near l1, then

-0.5 + 1·(~1) + 1·(~0) ≈ 0.5 >= 0 (predict positive)

if x is near l2, then

-0.5 + 1·(~0) + 1·(~1) ≈ 0.5 >= 0 (predict positive)

if x is far away from both l1 and l2, then

-0.5 + 1·(~0) + 1·(~0) ≈ -0.5 < 0 (predict negative)
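The same arithmetic as a small sketch (the ~1 / ~0 similarities are idealized to exactly 1 and 0, and the helper name predict is mine):

```python
import numpy as np

# The worked example above, in code: theta = (-0.5, 1, 1) and f1, f2 are the
# Gaussian-kernel similarities to l1 and l2 (values ~1 or ~0 as in the text).
theta = np.array([-0.5, 1.0, 1.0])

def predict(f1, f2):
    score = theta @ np.array([1.0, f1, f2])   # theta0*1 + theta1*f1 + theta2*f2
    return 1 if score >= 0 else 0

print(predict(1.0, 0.0))   # x near l1        -> score  0.5 -> predict 1
print(predict(0.0, 1.0))   # x near l2        -> score  0.5 -> predict 1
print(predict(0.0, 0.0))   # x far from both  -> score -0.5 -> predict 0
```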

The effect of the value of sigma (σ)

Suppose we plot f as a function of x, for a fixed landmark l1, using the kernel function

f1 = kernel(x , l1) = exp( -||x - l1||^2 / (2σ^2) )

Then we should get a 3D plot with x1 on one horizontal axis, x2 on the other horizontal axis and f1 on the vertical axis.

The height of the 3D or contour plot denotes the value of f given some x and a fixed l.

If sigma is 1, we get a nice, fairly slim hill-shaped plot.

If sigma is increased, we get a fatter, less steep hill -> the effect is that the value of f drops at a slower rate as x moves away from the landmark, because the hill is not so steep.

If sigma is less than 1, we get a thinner, steeper hill -> the effect is that the value of f drops at a faster rate as x moves away from the landmark, because the hill is steeper.
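To see this fall-off numerically, here is a small sketch that evaluates the kernel at a few distances from the landmark for different sigmas (the distance and sigma values are made up):

```python
import numpy as np

# How fast the similarity falls off as x moves away from a landmark,
# for a few values of sigma (the distances and sigmas are made-up numbers).
kernel = lambda d, sigma: np.exp(-d ** 2 / (2 * sigma ** 2))   # d = ||x - l||

for sigma in (0.5, 1.0, 2.0):
    values = [round(kernel(d, sigma), 3) for d in (0.0, 1.0, 2.0)]
    print(f"sigma = {sigma}: f at distance 0, 1, 2 -> {values}")
# Smaller sigma -> f drops off faster; larger sigma -> f drops off more slowly.
```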

(Slide: 3D and contour plots of the Gaussian kernel for different values of σ^2)

 

Kernels (Part 2)

How do we choose our landmarks?

In practice , we treat each example x in the training set as a landmark l.

Given a training set with m examples:

we will have m landmarks.

l^(1) = x^(1), l^(2) = x^(2), …, l^(m) = x^(m)

For each example x^(i) in the training set, we can compute the new feature vector f associated with that particular x^(i):

f_1 = kernel(x^(i) , l^(1)) = exp( -||x^(i) - l^(1)||^2 / (2σ^2) ) -> real value between 0 and 1

f_2 = kernel(x^(i) , l^(2)) = exp( -||x^(i) - l^(2)||^2 / (2σ^2) ) -> real value between 0 and 1

f_3 = kernel(x^(i) , l^(3)) = exp( -||x^(i) - l^(3)||^2 / (2σ^2) ) -> real value between 0 and 1

.

f_i = kernel(x^(i) , l^(i)) = exp( -||x^(i) - l^(i)||^2 / (2σ^2) ) -> since x^(i) == l^(i), the similarity value is 1 because the euclidean distance is 0.

.

.

.

f_m = kernel(x^(i) , l^(m)) = exp( -||x^(i) - l^(m)||^2 / (2σ^2) )

Therefore f is an (m+1)-dimensional vector with f_0 = 1.
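A minimal sketch of this feature construction, assuming the Gaussian kernel from earlier (the function name and the toy training matrix are mine):

```python
import numpy as np

# Sketch of turning one training example x_i into its new feature vector f,
# using every training example as a landmark (f_0 = 1 is the intercept term).
def feature_vector(x_i, X_train, sigma):
    sq_dists = np.sum((X_train - x_i) ** 2, axis=1)      # ||x_i - l_k||^2 for every landmark
    f = np.exp(-sq_dists / (2 * sigma ** 2))             # Gaussian similarities, one per landmark
    return np.concatenate(([1.0], f))                    # prepend f_0 = 1 -> (m+1)-dimensional

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # made-up data, m = 3 examples
f1 = feature_vector(X_train[0], X_train, sigma=1.0)
print(f1.shape)   # (4,)  -> m + 1 features
print(f1[1])      # 1.0   -> x_1 compared with itself (l_1 = x_1)
```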

Thoughts about what we are actually computing  ?

Since l^(1) = x^(1), l^(2) = x^(2), …, l^(m) = x^(m), let's say we take the first example x^(1) and compute f^(1), which is an (m+1)-dimensional vector. Each element in vector f^(1) is a measure of how similar x^(1) is to each of the examples. Sort of like:

x^(1) vs x^(1), x^(1) vs x^(2), x^(1) vs x^(3), ……, x^(1) vs x^(m)

↓               ↓               ↓                      ↓

f^(1)_1         f^(1)_2         f^(1)_3                f^(1)_m

x^(1) is then replaced with f^(1)

x^(2) is replaced with f^(2)

x^(3) is replaced with f^(3)

.

.

.

x^(m) is replaced with f^(m)

How do we get the Θ vector, which is (m+1)-dimensional?

(Slide: SVM with kernels, cost function and constraints)

The SVM uses some learning algorithm to solve the minimization problem. We want the smallest (optimal) Θ which can

1. give a cost of 0 for the first term (if C is large) and

2. satisfy the 2 constraints

C * [ Σ_{i=1}^{m} y^(i) cost1(Θ^T f^(i)) + (1 - y^(i)) cost0(Θ^T f^(i)) ] + 1/2 * Σ_{j=1}^{m} Θ_j^2

A minor change is that instead of computing Θ^T x, we are computing Θ^T f, therefore the constraints will be

Θ^T f >= 1 whenever y = 1

Θ^T f <= -1 whenever y = 0

And we now have m features and an (m+1)-dimensional Θ. Do note that we still do not include Θ_0 in the regularization term, thus j starts from 1 instead of 0.

Minimising the cost function can be done using libraries …Phew!!!

The effects of the choice of SVM parameters are summarized in the slide:

(Slide: large C -> lower bias, higher variance; small C -> higher bias, lower variance; large σ^2 -> features f vary more smoothly -> higher bias, lower variance; small σ^2 -> lower bias, higher variance.)

 

 

Using a SVM

Things to note when using an SVM:

1. choice of C

2. choice of kernel

If we are using the Gaussian kernel, we need to perform feature scaling first.

(Slide: using an SVM, choice of kernel)
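As an illustration of both notes in Python (the course itself uses Octave/MATLAB libraries), here is a sketch using scikit-learn with an RBF (Gaussian) kernel and feature scaling; the toy data and parameter values are made up, and sklearn's gamma corresponds to 1/(2σ^2):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Gaussian (RBF) kernel SVM with feature scaling applied first.
# The second feature is on a much larger scale, which is why scaling matters here.
X = np.array([[1, 200], [2, 180], [1, 220], [8, 900], [9, 950], [8, 880]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
model.fit(X, y)
print(model.predict([[2, 210], [9, 920]]))   # expected: [0 1]
```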

 

Logistic regression vs SVMs

How do we choose which one to use for a classification problem ?

(Slide: roughly, if the number of features n is large relative to the number of examples m, use logistic regression or an SVM with a linear kernel; if n is small and m is intermediate, use an SVM with a Gaussian kernel; if n is small and m is very large, create more features and then use logistic regression or an SVM with a linear kernel.)

Last notes

The algorithms and parameters we choose do matter; however, what matters more are things like:

1. The amount of data that we have.

2. How skilled we are at error analysis and debugging.

3. Figuring out what new features to add or remove to improve performance, etc.

 

 
