Chapter 9: Machine Learning System Design

Prioritizing What to Work On

When we draw up a list of possible features for a machine learning problem such as training a spam classifier, we do not know in advance which features will be most useful, nor whether collecting more data will help keep the error low. It is hard to tell in advance which option we should try, so we need some form of analysis to guide the decision.

Since we do not know which features are useful or whether collecting more data will help, we can start with a simple, quick implementation of the algorithm and test it on the cross-validation (CV) set. Plotting learning curves then tells us whether more data or more features are likely to help.
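For instance, a minimal sketch of this workflow in Python (assuming scikit-learn is available; the data here is synthetic, standing in for a real spam dataset):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a spam dataset: 0 = ham, 1 = spam.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)   # simple, quick baseline

# Learning curve: train on growing subsets and record train / CV error.
for m in np.linspace(100, len(X_train), 8, dtype=int):
    clf.fit(X_train[:m], y_train[:m])
    j_train = 1 - clf.score(X_train[:m], y_train[:m])  # training error
    j_cv = 1 - clf.score(X_cv, y_cv)                    # cross-validation error
    print(m, round(j_train, 3), round(j_cv, 3))
# A large gap between j_train and j_cv suggests more data may help (high variance);
# two curves converging to a similar, high error suggest it will not (high bias).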

As the lecturer put it: “Avoid premature optimization and let evidence guide our decisions.”

Error Analysis

Manually examine the examples in the CV set that the classifier makes errors on, and check whether we can spot any systematic trend in the types of examples it misclassifies, so that we can improve the feature set.

The recommended approach is to perform error analysis on the CV set rather than the test set: if we develop new features by examining the test set, we may end up choosing features that work well specifically for the test set, and Jtest would no longer be a fair estimate of the generalization error.
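A rough sketch of how that tallying might look in Python (the emails, labels, and the categorize buckets below are all made up for illustration):

from collections import Counter

# Toy stand-ins: CV-set emails, their true labels, and the classifier's predictions.
emails_cv = ["cheap meds, unsubscribe now", "meeting at 3pm", "win $1000 today", "lunch tomorrow?"]
y_cv      = [1, 0, 1, 0]
y_pred    = [0, 0, 0, 0]          # pretend the classifier missed both spam emails

def categorize(email):
    # Illustrative buckets only; real categories come from manually reading the errors.
    if "unsubscribe" in email.lower():
        return "pharma / marketing spam"
    if "$" in email:
        return "money-related spam"
    return "other"

errors = [e for e, actual, pred in zip(emails_cv, y_cv, y_pred) if actual != pred]
counts = Counter(categorize(e) for e in errors)
print(counts.most_common())   # the biggest bucket hints at which features to add next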

Error Metrics for Skewed Classes

Skewed classes occur when one class has far more examples than the other, which makes it hard to rely on accuracy alone to measure performance. Imagine a classifier with 99% accuracy (1% error) on a data set where only 0.5% of the examples are negative: a trivial classifier that always predicts the positive class would already achieve 0.5% error, so the 1% error no longer looks impressive.
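A quick numeric sketch of why plain accuracy is misleading here, reusing the 0.5% figure from above:

import numpy as np

# 10,000 examples, only 0.5% of which are negative (y = 0).
y = np.ones(10_000, dtype=int)
y[:50] = 0

# A "classifier" that ignores the features and always predicts the majority class (1)
# already reaches 99.5% accuracy, so a reported 99% accuracy is not impressive.
always_one = np.ones_like(y)
print("accuracy of always predicting 1:", (always_one == y).mean())   # 0.995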

Therefore we can use the following table (the confusion matrix) to evaluate the performance of the learning algorithm.

Actual | Predicted | Result
1 | 1 | True positive
0 | 1 | False positive
0 | 0 | True negative
1 | 0 | False negative

The more true positives and true negatives we have, the higher the accuracy; from these counts we can also derive the following metrics, which remain informative even when the classes are skewed.

Precision (of all the examples predicted positive, how many are predicted correctly?)

= number of true positives / total number of predicted positives

= number of true positives / (number of true positives + number of false positives)

Recall (of all the actual positive examples, how many are predicted correctly?)

= number of true positives / total number of actual positive examples

= number of true positives / (number of true positives + number of false negatives)
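A minimal sketch of these two formulas in Python (the labels and predictions are toy values):

import numpy as np

y_actual = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred   = np.array([1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_actual == 1) & (y_pred == 1))   # true positives
fp = np.sum((y_actual == 0) & (y_pred == 1))   # false positives
fn = np.sum((y_actual == 1) & (y_pred == 0))   # false negatives

precision = tp / (tp + fp)   # of all predicted positives, how many are correct
recall    = tp / (tp + fn)   # of all actual positives, how many were found
print(precision, recall)     # 2/3 and 2/4 for these toy values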

Trading off precision and recall

In our logistic regression model, if h(x) >= 0.5 (the threshold) we predict 1, and 0 otherwise.

If we want higher precision, we can reduce the number of false positives by raising the threshold to, say, 0.7; this also means we will increase the number of false negatives.

Result of increasing threshold -> higher precision value but lower recall value

If we lower the threshold to 0.3, then we reduce the number of false negatives, since we predict y = 1 more often, but we increase the number of false positives.

Result of lowering threshold -> higher recall , lower precision
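A small sketch of sweeping the threshold (assuming scikit-learn; the data is synthetic and the threshold values 0.3 / 0.5 / 0.7 mirror the ones above):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]          # h(x), the estimated P(y = 1 | x)

for threshold in (0.3, 0.5, 0.7):
    y_pred = (probs >= threshold).astype(int)
    p = precision_score(y, y_pred)
    r = recall_score(y, y_pred)
    print(threshold, round(p, 3), round(r, 3))
# Raising the threshold tends to raise precision and lower recall; lowering it does the opposite.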

Problem: since we now have 2 metrics, we no longer have a single error value with which to evaluate the algorithm. How can we keep a single-number evaluation metric? And if we have 3 algorithms, each with different precision and recall values, how do we determine whether one is better than the others?

If we select the algorithm with the highest average (P + R) / 2, we keep a single-value metric, but it can be misleading, because we might end up choosing an algorithm with very high precision and very low recall, or vice versa. Ideally we want an algorithm with both high precision and high recall.

Solution :

F score = 2 * P * R / (P + R)

The algorithm with the highest F score will be chosen.

Why F score ?

If P = 1 and R = 0 , then F score = 0

If P = 0 and R =1, then F score = 0

These 2 cases prevent us from choosing an algorithm with an extreme precision or recall.

If P = 1  and R = 1  (perfect scores), then F score = 1

So each algorithm, based on its precision and recall values, gets an F score between 0 and 1.

The idea is that an algorithm needs both high P and high R to get a high F score.
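As a worked comparison with made-up precision/recall pairs, the plain average would favour the extreme algorithm while the F score would not:

def f_score(p, r):
    return 0 if p + r == 0 else 2 * p * r / (p + r)

algorithms = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}   # (precision, recall)

for name, (p, r) in algorithms.items():
    print(name, "average =", round((p + r) / 2, 3), "F score =", round(f_score(p, r), 3))
# The average would rank C (an always-predict-1 style algorithm) highest at 0.51,
# but its F score is only about 0.039; A wins on F score (0.444), matching the
# intuition that we want both P and R reasonably high.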

Data for Machine Learning

Previously we learned that blindly getting more data might not improve the performance of the algorithm; however, under certain conditions having more data helps significantly.

Large data rationale

Conditions which must hold true:

1. The feature vector x provides sufficient information to predict y accurately.

How do we verify this condition? One test is to ask a human expert in that problem domain to predict y given the same set of features. If the expert can predict y from x, then more data is likely to help; otherwise it probably will not.

Suppose we use a learning algorithm with many features (and hence many parameters). Then Jtrain will be small, but we will tend to overfit the data.

Now suppose we grow the original training set into one that is much larger than the number of features; then overfitting becomes unlikely.

So |Jtest - Jtrain| should be close to 0, and since Jtrain is small, Jtest will be small too.

By combining these 2 components (a complex algorithm plus a massive training set), we get a hypothesis with both low bias and low variance: the complex algorithm lets us fit non-linear functions and avoid high bias, while the massive training set avoids high variance.
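A rough sketch of this rationale (synthetic data and a flexible, low-bias model; the exact numbers are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=30, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0)   # flexible, low-bias model

for m in (200, 2000, len(X_train)):
    model.fit(X_train[:m], y_train[:m])
    j_train = 1 - model.score(X_train[:m], y_train[:m])
    j_test  = 1 - model.score(X_test, y_test)
    print(m, round(j_train, 3), round(j_test, 3), "gap:", round(abs(j_test - j_train), 3))
# With a small m the gap |Jtest - Jtrain| is large (overfitting); as m grows the gap
# shrinks, and since Jtrain stays low, Jtest ends up low as well.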
