Chapter 16: Photo OCR

Photo OCR is used as an example to demonstrate the general concepts behind designing a machine learning system. These include models to process data, ways to generate data, and analysis to decide which component to allocate resources to.

Problem Description and Pipeline

Suppose we want to design an ML system that learns to recognize words in a photo. How might we break the problem down into a series of modules?

We might define our process as follows: first detect the regions of the photo where text is found, then segment that text into individual characters, and finally classify each character.

[Figure: the Photo OCR pipeline — text detection → character segmentation → character recognition]

The idea of a pipeline is to define a series of smaller components that we can solve individually before solving the overall problem; the output of each module is passed as input to the next component. Each component in the pipeline may itself be a machine learning problem. For example, the three components of the Photo OCR pipeline (text detection, character segmentation, character recognition) are each substantial machine learning problems in their own right.
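The data flow described above can be sketched in a few lines. This is only an illustration of how stages chain together; the three stage functions below are hypothetical pass-through stand-ins, not real detectors or classifiers.

```python
def text_detection(image):
    # Return regions of the image likely to contain text.
    # Placeholder: pretend the whole image is one text region.
    return [image]

def character_segmentation(region):
    # Split a text region into individual character patches.
    # Placeholder: treat each element of the region as one patch.
    return list(region)

def character_recognition(patch):
    # Classify a single character patch.
    # Placeholder: echo the patch back as the "recognized" character.
    return patch

def photo_ocr(image):
    # Each component's output becomes the next component's input.
    words = []
    for region in text_detection(image):
        chars = [character_recognition(p)
                 for p in character_segmentation(region)]
        words.append("".join(chars))
    return words

print(photo_ocr("HELLO"))  # the stand-in stages just pass data through
```

In a real system each function would be replaced by its own trained model, but the wiring between them stays the same.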

Getting Lots of Data and Artificial Data

One method we can use to generate data is artificial data synthesis. With this method, we take an example and "magnify" it into many. In the picture below, a grid of 16 boxes shows the character "A" with a different distortion applied in each, giving 16 new examples.

[Figure: 16 distorted variants synthesized from one example of the character "A"]
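The distortion idea can be sketched as follows. As a crude stand-in for the warping distortions in the lecture, this assumed `synthesize` helper shifts a small binary character grid by random one-pixel offsets to produce new training examples.

```python
import random

def synthesize(grid, n, seed=0):
    """Generate n distorted copies of a binary character grid by
    shifting it by small random offsets (a simplified stand-in for
    the warping distortions shown in the lecture)."""
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    out = []
    for _ in range(n):
        dr, dc = rng.randint(-1, 1), rng.randint(-1, 1)
        shifted = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if 0 <= r + dr < rows and 0 <= c + dc < cols:
                    shifted[r + dr][c + dc] = grid[r][c]
        out.append(shifted)
    return out

# a tiny 4x4 "character" as a binary grid
base = [[0, 1, 1, 0],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 0, 1]]

examples = synthesize(base, 16)  # 16 new training examples from one
```

The key point is that the distortions are ones the classifier should be invariant to (shifts, warps), not arbitrary noise.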

Other examples were introduced in the videos, such as adding different kinds of background noise (a busy road, a bad connection, etc.) to an original audio track to "magnify" the example.

However, it usually does not help to add purely random or meaningless noise to our data, as illustrated in the picture below.

[Figure: distortions should represent variation found in real data; adding purely random noise usually does not help]

 

Ceiling Analysis: What Part of the Pipeline to Work on Next

Suppose we want to improve the performance of our machine learning system. As usual, it helps to have a single-number evaluation metric (F1 score, precision, etc.). But since the pipeline has several components, how do we know which one to work on? The idea of ceiling analysis is to manually provide 100% accurate output to the next component and see how well the overall system then performs.
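As an aside on the single-number metric mentioned above, F1 combines precision and recall into one value via their harmonic mean. A minimal sketch:

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall, giving a
    # single number with which to compare systems.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.5), 3))  # 0.615
```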

In our Photo OCR problem, we use accuracy (0–100%) as our evaluation metric. The "accuracy" in the picture below denotes the overall system accuracy.

[Figure: ceiling analysis table showing the overall accuracy as each pipeline component in turn is made perfect]

 

Suppose we manually mark out the regions containing text, i.e. we hand the "character segmentation" component 100% accurate output from "text detection". We then measure the overall accuracy we would get if we had a perfect text detection component.

We follow the same logic for the next component, "character segmentation": we manually segment each example into individual characters, give this "perfect" data set to character recognition, and again measure the overall accuracy.

From the picture, we see that with a perfect text detection component the overall accuracy improves by 17%, which tells us that allocating resources to improving "text detection" is a worthwhile investment. On the other hand, even a perfect "character segmentation" component improves the overall accuracy by only 1%, so allocating significant resources there would not pay off.

Note that the logic is sequential: when we do ceiling analysis on the "character segmentation" component, it means both "text detection" and "character segmentation" are perfect. This kind of analysis provides an upper bound on the maximum accuracy the system could achieve.
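The sequential bookkeeping above is simple enough to compute directly. The sketch below assumes illustrative accuracies consistent with the 17% and 1% gains mentioned above (the base overall accuracy and the final row are assumed values, not taken from this text).

```python
# Each row: (component made perfect up to this point, overall accuracy).
# The first row is the baseline system with no component perfected.
stages = [
    ("overall system",         0.72),  # assumed baseline
    ("text detection",         0.89),  # +17% as noted in the text
    ("character segmentation", 0.90),  # +1% as noted in the text
    ("character recognition",  1.00),  # assumed: perfect end-to-end
]

# The gain for each component is the jump from the previous row,
# since each row assumes all earlier components are already perfect.
gains = []
prev = stages[0][1]
for name, acc in stages[1:]:
    gains.append((name, acc - prev))
    prev = acc

for name, gain in gains:
    print(f"perfecting {name}: +{gain:.0%} potential improvement")
```

Reading off the largest gain tells you which component is the most worthwhile place to spend engineering effort.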

 
