Optimization Objective
Previously our cost function for logistic regression is -(y log(h(x)) + (1-y) log(1 - h(x)))
hΘ(x) = g(ΘTx) = sigmoid(ΘTx)
if y = 1 , we want ΘTx >> 0 so that sigmoid(ΘTx) = r will be close to 1 and
-log(r) = cost (will be small). As r gets closer to 0 , the cost goes up, resulting in an arc-shaped plot like "(" rotated slightly anticlockwise
if y = 0 , we want ΘTx << 0 so that sigmoid(ΘTx) = r will be close to 0 and
-log(1 - r) = cost (will be small). As r gets closer to 1 , the cost goes up, resulting in an arc-shaped plot like ")" rotated slightly clockwise
For the support vector machine , we want the plot to look like \_ for y = 1 and _/ for y = 0.
The "_" part of \_ denotes that the cost is 0 if the input z (which is actually ΘTx) is >= 1
The "_" part of _/ denotes that the cost is 0 if the input z (which is actually ΘTx) is <= -1
The new cost lines \_ and _/ are not produced with log, so the definition of the SVM cost function will be different.
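The course only defines cost1 and cost0 by their shapes. One common concrete choice that produces the \_ and _/ shapes is a pair of hinge functions; here is a minimal Python sketch (the slope of 1 is my own assumption, not something given in the lecture):

```python
import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once z >= 1, grows linearly as z decreases (the "\_" shape).
    return np.maximum(0, 1 - z)

def cost0(z):
    # Cost used when y = 0: zero once z <= -1, grows linearly as z increases (the "_/" shape).
    return np.maximum(0, 1 + z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(cost1(z))  # [3. 2. 1. 0. 0.]
print(cost0(z))  # [0. 0. 1. 2. 3.]
```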
Cost function of logistic regression with regularization:
J(Θ) = -(1/m) * [ ∑_{i=1}^{m} y^(i) log(hΘ(x^(i))) + (1 - y^(i)) log(1 - hΘ(x^(i))) ] + (λ/2m) * ∑_{j=1}^{n} Θj^2
Cost function of SVM:
J(Θ) = C * [ ∑_{i=1}^{m} y^(i) cost1(ΘTx^(i)) + (1 - y^(i)) cost0(ΘTx^(i)) ] + (1/2) * ∑_{j=1}^{n} Θj^2
The constant "1/m" is removed from the first term and from the regularization term.
λ is removed from the regularisation term too.
A new constant C is added in front of the first term.
Logic of SVM cost function
Previously for logistic regression , we had the form A + λB, where
A = -(1/m) * [ ∑_{i=1}^{m} y^(i) log(hΘ(x^(i))) + (1 - y^(i)) log(1 - hΘ(x^(i))) ]
B = (1/2m) * ∑_{j=1}^{n} Θj^2
If λ is large, the λB term gives B a heavier weight than A.
The SVM cost has the form C.A + B (no more lambda).
C plays the role of 1/λ: making C small lowers the weight of A relative to B, which is just another way to maintain the same trade-off. Making B heavier than A versus making A lighter than B.
Also, removing the constant 1/m does not change the thetas learned. The optimization objective is to minimize this new cost function.
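Putting the pieces together, here is a minimal sketch of the SVM objective C.A + B in Python (it reuses the hinge-style cost1/cost0 above; the function and variable names are mine, not from the course):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    # X has a leading column of ones so that z = theta' * x includes theta_0.
    z = X @ theta
    A = np.sum(y * np.maximum(0, 1 - z)            # cost1 term, active when y = 1
               + (1 - y) * np.maximum(0, 1 + z))   # cost0 term, active when y = 0
    B = 0.5 * np.sum(theta[1:] ** 2)               # regularization term, skipping theta_0
    return C * A + B                               # no 1/m and no lambda, as noted above
```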
Large Margin Intuition
The optimisation problem, as I see it, consists of 2 objectives.
Firstly we want our thetas to satisfy the following conditions:
ΘTx >= 1 whenever y is 1
ΘTx <= -1 whenever y is 0
Secondly we want to minimize the thetas which parameterize the cost function. We saw above that our cost function has the form C.A + B, where A is the accumulated cost (error) term. If C is huge , we really want A to be 0, otherwise the cost will be high.
Therefore, solving the optimisation problem means finding thetas which keep A close to 0 while satisfying the 2 conditions above. We end up with a hypothesis whose decision boundary tries to separate the positive and negative examples by the largest gap. Even if there are a few positive outliers inside the negative region , the SVM (provided C is not too large) will ignore those outliers and maintain the largest-gap decision boundary, like the following:
(We have a positive "x" in the negative "o" region, but the SVM still produces the linearly separable decision boundary that maintains the largest gap.)
Mathematics Behind Large Margin Classification
An n-dimensional vector can be represented as the hypotenuse of a right-angled triangle. The inner product of 2 vectors (U and V as depicted in the picture) is UTV and can be calculated as p . ||U||
In the picture:
||U|| is the length of vector u which is computed by Pythagoras‘ theorem.
P = length of V projected onto U (red line) -> I do not understand how P is computed, but will just take it as given for now
Things to note:
1. UTV is commutative: UTV == VTU (both U and V are n-dimensional vectors in the picture)
2. P is negative (< 0) if the angle between vectors U and V is > 90 degrees
(I cannot work out why UTV = P * || U || = U1V1 + U2V2 , please help if you know)
=======================================================================
Inserted at a later date, after some googling on vectors regarding the above doubts
Reference sites
1. Dot Product
P = Proj(V) onto U = || V || * cos Θ , where Θ is the angle between V and U
The length of U does not determine the length of P. It is like shining a torch onto V at a 90-degree angle perpendicular to U, and the shadow formed on U will be P. (The green light perpendicular to U can be treated as the line formed by the light ray.)
A unit vector is a vector whose length is 1, and vector U has an associated unit vector :u: (a unit vector is written with a ^ on top of the u , but since I do not know how to do that in wp , I just use :u: )
Vector U can be represented by some scalar multiple (c) of :u: because :u: has the same direction as U. So:
|| :u: || = 1
U = c * :u: and so we have || U || = c * || :u: ||
( U = c * :u: and || U || = c * || :u: || are just the vector and length versions of the same relationship ; the scalar multiple c is the same in both )
|| :u: || = || U || / c = 1
c = || U || = (U1^2 + U2^2)^(1/2)
:u: = U / c = [U1 / ||U|| , U2 / ||U||]
To generalize unit vector :u: = U / || U ||
P (the projection of V onto U) can also be computed via P = :u: . V –> P can also be treated as a measure of how much of V is in the same direction as :u: or U
(Both :u: and U have the same direction ; the length of :u: or U does not matter. If we shine a torch on V , only the angle theta and the length of V can affect P. )
So:
P = Proj(V) onto U = || V || * cos Θ
P = :u: . V = U/|| U || . V
[ U / || U || ] . V = || V || * cos Θ
( U.V = some real number , || U || = some real number, U/|| U || is the unit vector :u: , V and U are vectors, (U/ || U ||) .V = (U.V) / || U || = some real number)
Multiplying both sides by || U || produces
U.V = || V || * cos Θ * || U ||
= P * || U || (where P = || V || * cos Θ)
if we substitute P with V. U/|| U || then
V. U / || U || = || V || * cos Θ and multiplying both sides by || U || gives
V. U = || V || * cos Θ * || U || , so V.U = U.V
Of course we already knew U.V = U1V1 + U2V2 = V1U1 + V2U2 = V.U ; now we can see why UTV = P . || U || in the slide.
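A quick numerical check of this identity (a throwaway sketch; the two vectors are arbitrary):

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
p = np.linalg.norm(v) * cos_theta       # projection of V onto U = ||V|| * cos(theta)

print(u @ v)                            # U1*V1 + U2*V2 = 10.0
print(p * np.linalg.norm(u))            # P * ||U||      = 10.0 (same value)
```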
=======================================================================
If C is huge, then term A ( [ ∑_{i=1}^{m} y^(i) cost1(ΘTx^(i)) + (1 - y^(i)) cost0(ΘTx^(i)) ] ) needs to be 0, so we end up minimizing term B. Of all the possible choices of thetas which satisfy the 2 constraints, we choose the smallest theta. If our choice of theta satisfies the 2 constraints, term A will be 0.
Using the information from the previous slide , we treat the theta vector as the U vector and the example vector x^(i) as V, so
ΘTx = p . || Θ || >= 1 if y = 1
ΘTx = p . || Θ || <= -1 if y = 0
p^(i) is the projection of the example x^(i) onto the vector Θ
If we have the decision boundary like the green line on the left,
we will end up with a small p (red line), which means that to satisfy the positive or negative constraint we need || Θ || to be big.
If our decision boundary is the one on the right, then we will end up with a larger p (red line), which means we can have a smaller || Θ ||.
Because the SVM can learn a smaller || Θ ||, this results in the largest-margin hypothesis. This optimization argument is based on the assumption that C is large.
Summary (Not quite sure if my understanding is correct)
So if C is large -> we want term A to be 0. In order for term A to be 0 , we need to find the thetas that satisfy the 2 constraints.
ΘTx >= 1 whenever y is 1
ΘTx <= -1 whenever y is 0
We also saw that ΘTx can be substituted with || Θ || . p. If p is small, || Θ || needs to be large to meet the 2 constraints, but we want to minimize || Θ ||, not maximize it. So the trained SVM gives us the smallest possible theta vector, which in turn forces p to be larger and produces the largest-margin boundary.
How do we train the SVM to learn the optimal values of theta ?
It was not covered, but we can just use the octave/matlab SVM library 🙂
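In Python the rough equivalent would be scikit-learn (a sketch, assuming scikit-learn is installed; the toy data below is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)   # C plays the 1/lambda role discussed above
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # the learned thetas (weights) and theta_0 (intercept)
print(clf.predict([[2, 2], [5, 4]]))
```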
Kernels (Part 1)
We know we can create polynomial terms (X1X2 , X1^2 , X2^2 etc) to have more features. However, we do not know whether these features are useful or not, and the question is:
Is there a different way, other than the polynomial method, to create more, different features ?
A kernel is a similarity function used to compute how similar the features of an example are to a landmark example. The output value of the kernel function becomes a new feature.
Suppose we draw 2 landmark points, l1 and l2, on the 2D plane (x1 denotes the horizontal axis and x2 denotes the vertical axis):
x2| l2
| l1
|_________ x1
Given a particular example x (assume x has only 3 features and is therefore a 3-dimensional vector), we can have:
f1 = kernel(x , l1) = exp( -||x - l1||^2 / (2σ^2) )
f2 = kernel(x , l2) = exp( -||x - l2||^2 / (2σ^2) )
more generally :
fi = kernel(x , li)
= exp( -||x - li||^2 / (2σ^2) )
= exp( -∑_{a=1}^{n} (x_a - l^(i)_a)^2 / (2σ^2) ) where n is the number of features
||x – li|| = euclidean distance between 2 vectors.
||x - li||^2 = (euclidean distance)^2
(x1 - l1)^2 + (x2 - l2)^2 + (x3 - l3)^2 = ∑_{a=1}^{3} (x_a - l_a)^2
The kernel definition exp( -||x - li||^2 / (2σ^2) ) is one of the many types of kernels available. The one we are using here is called the Gaussian kernel.
If the given x is close to l1, then the euclidean distance will be close to 0, which results in
f1 = exp( -(~0) / (2σ^2) ) ≈ 1
and, since x is then far away from l2 (a large distance),
f2 = exp( -(large distance)^2 / (2σ^2) ) ≈ 0
because e^(-some large number) = 1 / e^(some large number) ≈ 0
So f, where 0 <= f <= 1, is a measure of how close x is to a landmark l.
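A minimal Gaussian kernel in Python (the function and variable names are my own):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # Similarity between example x and landmark l: 1 when x == l, approaches 0 as they move apart.
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x  = np.array([1.0, 2.0, 1.0])
l1 = np.array([1.0, 2.0, 1.0])   # identical to x
l2 = np.array([6.0, 9.0, 4.0])   # far away from x

print(gaussian_kernel(x, l1, sigma=1.0))   # 1.0
print(gaussian_kernel(x, l2, sigma=1.0))   # ~0
```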
Given:
x2| l2
| l1
|_________ x1
And suppose our hypothesis predicts 1 when:
Θ0 + Θ1f1 + Θ2f2 >= 0
Just to restate:
if x is near l1, then f1 will be close to 1
else f1 will be close to 0
the same logic applies for f2 and l2
if the thetas learned are -0.5 , 1 and 1 respectively for theta0 , theta1 and theta2
then we see that x will be a positive example if x is near either landmark
example: if x is near l1 , then
-0.5 + 1.(~1) + 1.(~0) = 0.5 >= 0 (positive)
if x is near l2 , then
-0.5 + 1.(~0) + 1.(~1) = 0.5 >= 0 (positive)
if x is far away from l1 and l2, then
-0.5 + 1.(~0) + 1.(~0) = -0.5 < 0 (negative)
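The same worked example in code, using the Gaussian kernel from above and the hypothetical thetas -0.5, 1, 1 (the landmark coordinates are made up for illustration):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

l1 = np.array([2.0, 2.0])
l2 = np.array([8.0, 8.0])
theta = np.array([-0.5, 1.0, 1.0])          # theta_0, theta_1, theta_2

for x in [np.array([2.1, 1.9]),             # near l1 -> positive
          np.array([7.9, 8.2]),             # near l2 -> positive
          np.array([5.0, 0.0])]:            # far from both -> negative
    f = np.array([1.0, gaussian_kernel(x, l1), gaussian_kernel(x, l2)])  # f_0 = 1
    score = theta @ f
    print(round(score, 2), "positive" if score >= 0 else "negative")
```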
The effect of the value of sigma (σ)
Suppose we plot f1 as a function of x (with the landmark l1 fixed) using the kernel function
f1 = kernel(x , l1) = exp( -||x - l1||^2 / (2σ^2) )
Then we get a 3D plot with x1 on the x axis , x2 on the z axis and f1 on the y axis.
The height of the 3D or contour plot denotes the value of f given some x and a fixed l.
If sigma is 1 , then we get a nice, slim, hill-shaped plot.
If sigma is increased , then we get a fatter , less steep hill –> the effect is that the value of f drops at a slower rate as x moves away from the landmark, because the hill is not so steep.
If sigma is less than 1 , then we get a thinner , steeper hill –> the effect is that the value of f drops at a faster rate as x moves further away from the landmark, because the hill is steeper.
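The same effect viewed numerically: how quickly f decays with the distance ||x - l|| for a few values of σ (a quick sketch; the distances are arbitrary):

```python
import numpy as np

d = np.array([0.0, 1.0, 2.0, 3.0])           # distance ||x - l|| from the landmark
for sigma in [0.5, 1.0, 2.0]:
    f = np.exp(-d ** 2 / (2 * sigma ** 2))
    print(sigma, np.round(f, 3))
# small sigma -> f falls off quickly (thin, steep hill)
# large sigma -> f falls off slowly (fat, flat hill)
```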
Kernels (Part 2)
How do we choose our landmarks?
In practice , we treat each example x in the training set as a landmark l.
Given a training set with m examples:
we will have m landmarks.
l1 = x1, l2 = x2,….lm = xm
For each example x^(i) in the training set , we can compute the new feature vector f associated with this particular x^(i):
f1 = kernel(x^(i) , l1) = exp( -||x^(i) - l1||^2 / (2σ^2) ) –> real value between 0 and 1
f2 = kernel(x^(i) , l2) = exp( -||x^(i) - l2||^2 / (2σ^2) ) –> real value between 0 and 1
f3 = kernel(x^(i) , l3) = exp( -||x^(i) - l3||^2 / (2σ^2) ) –> real value between 0 and 1
.
fi = kernel(x^(i) , li) = exp( -||x^(i) - li||^2 / (2σ^2) ) –> since x^(i) == li , the similarity value is 1 because the euclidean distance is 0
.
.
.
fm = kernel(x^(i) , lm) = exp( -||x^(i) - lm||^2 / (2σ^2) ) –> real value between 0 and 1
Therefore f is an (m+1)-dimensional vector with f0 = 1
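As a sketch, computing every example's new feature vector at once (each training example mapped against every landmark, i.e. every other example) could look like this in Python (my own names, not from the course):

```python
import numpy as np

def gaussian_feature_map(X, sigma):
    # Map each x^(i) to f^(i) = [1, k(x^(i), l^(1)), ..., k(x^(i), l^(m))],
    # where the landmarks l^(1)..l^(m) are the training examples themselves.
    m = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # m x m squared distances
    F = np.exp(-sq_dist / (2 * sigma ** 2))                          # diagonal is all 1s (x^(i) == l^(i))
    return np.hstack([np.ones((m, 1)), F])                           # prepend f_0 = 1 -> m x (m+1)
```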
Thoughts about what we are actually computing ?
Since l1 = x1, l2 = x2, …, lm = xm, let's say we take the first example x1 and compute f^(1), which is an (m+1)-dimensional vector. Each element in the vector f^(1) is a measure of how similar x1 is to each of the training examples. Sort of like:
x1 vs x1 -> f1^(1) , x1 vs x2 -> f2^(1) , x1 vs x3 -> f3^(1) , …… , x1 vs xm -> fm^(1)
x1 is then replaced with f^(1)
x2 is replaced with f^(2)
x3 is replaced with f^(3)
.
.
.
xm is replaced with f^(m)
How do we get the theta vector, which is (m+1)-dimensional ?
The SVM uses some learning algorithm to solve the minimization problem. We want the smallest (optimal) theta which can
1. give a 0 cost (if C is large) and
2. satisfy the 2 constraints
C * [ ∑_{i=1}^{m} y^(i) cost1(ΘTf^(i)) + (1 - y^(i)) cost0(ΘTf^(i)) ] + (1/2) * ∑_{j=1}^{m} Θj^2
A minor change is that instead of computing ΘTx , we are now computing ΘTf ; therefore the constraints become
ΘTf >= 1 whenever y is 1
ΘTf <= -1 whenever y is 0
And we now have m features (plus f0) and m+1 thetas. Do note that we still do not include theta 0 in the regularization term , thus j starts from 1 instead of 0.
Minimising the cost function can be done using libraries …Phew!!!
The effects of the choice of the SVM parameters are summarized in the pic.
Using an SVM
Things to note when using an SVM
1. choice of C
2. choice of kernel
If we are using the Gaussian kernel , we need to perform feature scaling
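In scikit-learn terms, a sketch would look like the following (sklearn's rbf kernel takes a gamma parameter, where gamma = 1 / (2σ^2) corresponds to the σ used above; the toy data is made up):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

sigma = 1.0
model = make_pipeline(
    StandardScaler(),                                   # feature scaling, as noted above
    SVC(kernel='rbf', C=1.0, gamma=1 / (2 * sigma ** 2)),
)
model.fit(X, y)
print(model.predict([[2, 2], [5, 4]]))
```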
Logistic regression vs SVMs
How do we choose which one to use for a classification problem ?
Last notes
The algorithms, parameters etc which we choose do matter ; however, what matters more are things like
1. The amount of data that we have.
2. How skilled we are at doing error analysis and debugging.
3. Figuring out what new features to add / remove to improve performance, etc.