And at this point, things become a little tricky.
Can crowdsourcing markets scale? MTurk can get a thousand images tagged within a few hours. But what happens if we place one million images in the market? Will there be enough labor to handle all of the posted tasks? How long will the tasks take? And what will the cost be?
Scaling by combining machine learning with crowdsourcing
Unless you can come up with ingenious ideas, acquiring data comes at a cost. To reduce cost, we need to reduce the need for humans to label data. To reduce the need for humans, we need to automate the process. To automate the process, we need to build machine learning models. To build machine learning models, we need humans to label data... An infinite loop? Yes and no.
The basic idea is to use crowdsourcing in conjunction with machine learning. In particular, we leverage ideas from active learning: use humans only for the uncertain cases, not for everything. Machine learning can take care of the simple cases, and ask humans to help with the most important and ambiguous ones.
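To make this concrete, here is a minimal sketch of the routing decision, using plain uncertainty sampling; the scikit-learn classifier, the 0.8 threshold, and the variable names are illustrative assumptions, not the exact setup of any production system.

```python
# Uncertainty sampling in its simplest form: keep the confident predictions,
# send the uncertain ones to human labelers.
from sklearn.linear_model import LogisticRegression

def split_by_confidence(model, X_unlabeled, threshold=0.8):
    """Return (indices the model handles alone, indices that need humans)."""
    confidence = model.predict_proba(X_unlabeled).max(axis=1)  # top-class probability
    auto_idx = [i for i, c in enumerate(confidence) if c >= threshold]
    human_idx = [i for i, c in enumerate(confidence) if c < threshold]
    return auto_idx, human_idx

# Usage (hypothetical data): train on the crowd-labeled seed set, then route.
# model = LogisticRegression().fit(X_seed, y_seed)
# auto, ask_humans = split_by_confidence(model, X_new)
```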
We also need to keep one extra thing in mind: crowdsourcing generates noisy training data, as opposed to the perfect labels that most active learning algorithms expect from humans. So we need to perform active learning not only to identify the cases that are ambiguous for the model, but also to figure out which human labels are likely to be noisy, and fix them. And we need to be proactive in estimating the quality of the workers.
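As a rough illustration of the quality-control side, the sketch below aggregates redundant labels with a majority vote and scores each worker against the consensus. Our actual approach (and the KDD paper) uses an EM-style estimation of worker quality, so treat this as a simplified stand-in; the data structures here are assumptions.

```python
# Majority vote over redundant labels, plus a crude per-worker accuracy score.
from collections import Counter, defaultdict

def majority_vote(labels):
    """labels: {item_id: [(worker_id, label), ...]} -> {item_id: consensus label}"""
    return {item: Counter(lab for _, lab in votes).most_common(1)[0][0]
            for item, votes in labels.items()}

def worker_accuracy(labels, consensus):
    """Fraction of each worker's labels that agree with the consensus."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, votes in labels.items():
        for worker, label in votes:
            total[worker] += 1
            correct[worker] += (label == consensus[item])
    return {w: correct[w] / total[w] for w in total}
```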
In any case, once we have addressed the quality complications and collected enough data, we can use it to build a basic machine learning model. That model can then take care of the simple cases and free humans to handle the more ambiguous and difficult ones. Once we collect enough training data for the more difficult cases, we can build an even better model. The new model automates an even bigger fraction of the process, leaving humans to deal with only the harder cases. And we repeat the process.
This idea was at the core of our KDD 2008 paper, and since then we have significantly expanded these techniques to work with a wider variety of cases (see our current working paper, Repeated Labeling using Multiple Noisy Labelers).
Example: AdSafe Media.
Here is an example application, deployed in practice through AdSafe Media. Say that we want to build a classifier that recognizes porn pages. The overview below follows the process of our KDD paper (a condensed code sketch of the loop appears right after the list):
1. We get a few web pages labeled as porn or not.
2. We get multiple workers to label each page, to ensure quality.
3. We compute the quality of each labeler, fix biases, and get better labels for the pages.
4. We train a classifier that classifies pages as porn or not.
5. We classify incoming pages using the automatic classifier.
6. If the classifier is confident, we use its output.
7. If the classifier is not confident, the page is directed to humans for labeling (the more ambiguous the page, the more humans we need).
8. Once we get enough new training data, we go back to Step 4.
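For readers who like to see the loop as code, here is a condensed, pseudocode-style sketch of the steps above. Every helper (get_crowd_labels, aggregate_labels, train_classifier, and so on) is a hypothetical placeholder rather than an actual API, and the confidence threshold is equally illustrative.

```python
CONFIDENCE_THRESHOLD = 0.9   # illustrative value

def porn_classification_loop(seed_pages, incoming_pages):
    # Steps 1-3: crowd-label a seed set and clean up the noisy labels.
    training_data = aggregate_labels(get_crowd_labels(seed_pages))
    while True:                                   # the loop keeps running on new traffic
        model = train_classifier(training_data)   # step 4
        for page in incoming_pages:
            label, confidence = model.predict_with_confidence(page)  # step 5
            if confidence >= CONFIDENCE_THRESHOLD:
                use_label(page, label)            # step 6: trust the classifier
            else:
                # Step 7: the lower the confidence, the more workers we ask.
                votes = get_crowd_labels([page], workers=workers_needed(confidence))
                training_data += aggregate_labels(votes)
        # Step 8: with the new human labels added, we retrain (back to step 4).
```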
Another example: SpeakerText.
According to the press, SpeakerText is using (?) this idea: they use an automatic transcription package to generate a first, rough transcript, and then use humans to improve the transcription. The high-quality transcriptions can then be used to train a better model for automatic speech recognition. And the cycle continues.
Another example: Google Books.
The ReCAPTCHA technique is used as the crowdsourcing component for digitizing books for the Google Books project. As you may have imagined, Google actively uses optical character recognition (OCR) to digitize the scanned books and make them searchable. However, even the best OCR software will not be able to recognize some words from the scanned books.
ReCAPTCHA uses the millions of users on the Internet (most notably, the 500 million Facebook users) as transcribers that fix whatever OCR cannot capture. I guess that Google reuses the fixed words in order to improve their internal OCR system, so that they can reach their goal of digitizing 129,864,880 books a little bit faster.
The limit?
I guess the Google Books and ReCAPTCHA projects are really testing the scalability limits of this approach. The improvements in the accuracy of machine learning systems become marginal once we have enough training data, and we need orders of magnitude more training data to see noticeable improvements.
Of course, with 100 million books to digitize, even an "unnoticeable" improvement of 0.01% in accuracy corresponds to 1 billion more words being classified correctly (assuming 100K words per book), and results in 1 billion fewer ReCAPTCHAs needed. But I am not sure how many ReCAPTCHAs are needed in order to achieve this hypothetical 0.01% improvement. Luis, if you are reading, give us the numbers :-)
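For the skeptical, here is the back-of-the-envelope arithmetic behind that 1 billion figure, using only the 100-million-books and 100K-words-per-book assumptions from above:

```python
# Back-of-the-envelope check; both counts are the assumptions stated above.
books = 100_000_000              # ~100 million books to digitize
words_per_book = 100_000         # assumed average length
total_words = books * words_per_book       # 10 trillion words overall
extra_correct = total_words * 0.0001       # a 0.01% accuracy improvement
print(f"{extra_correct:,.0f} extra words recognized correctly")
# -> 1,000,000,000 extra words, i.e., roughly 1 billion fewer ReCAPTCHAs
```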
But in any case, I think that 99.99% of the readers of this blog would be happy to hit this limit.