Wednesday, March 5, 2008

Mechanical Turk: The Foundations

Over the last year or so, I have been using Mechanical Turk as a tool for my research. While in practice it is very useful, there is high uncertainty about the quality of the answers that come back from such a system. Some of the Turkers will be lazy and submit random answers, some will have good intentions but still submit an incorrect answer, and some will do a good job. However, since we do not know the actual answers to the questions beforehand, we need methods for extracting the signal from the noise, evaluating the quality of the individual Turkers, and deciding how much effort to spend annotating our data.

So, here are a few questions, and pointers to related research:

Question 1

We have a set of labelers and we ask them to label a set of examples, using a predefined set of labels. We do not know the correct labels of the examples. Can we identify the correct labels of the examples and estimate the error rates for each of the labelers?

A seminal paper that answers this question is "Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm" by Dawid and Skene, published in 1979. The authors show how to use an Expectation Maximization (EM) algorithm to estimate the true labels of the examples and, at the same time, the "confusion matrix" of each labeler. Smyth et al. and Wiebe built on the work of Dawid and Skene to infer ground truth in such noisy environments.
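
To make the mechanics concrete, here is a rough Python sketch of the EM iteration described above. The interface, the variable names, the initialization, and the smoothing constants are my own choices, so treat this as an illustration of the idea rather than a faithful reproduction of the paper.

```python
import numpy as np

def dawid_skene(labels, num_classes, num_iters=50):
    """Illustrative EM in the spirit of Dawid & Skene (1979).
    labels: (num_examples, num_workers) array with values in {0..num_classes-1},
    or -1 where a worker did not label an example; every example is assumed
    to have at least one label."""
    num_examples, num_workers = labels.shape

    # Initialize the posterior over true labels with per-example vote fractions.
    T = np.zeros((num_examples, num_classes))
    for i in range(num_examples):
        for k in labels[i]:
            if k >= 0:
                T[i, k] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(num_iters):
        # M-step: re-estimate class priors and each worker's confusion matrix.
        priors = T.mean(axis=0) + 1e-6
        priors /= priors.sum()
        confusion = np.full((num_workers, num_classes, num_classes), 1e-6)
        for i in range(num_examples):
            for w in range(num_workers):
                k = labels[i, w]
                if k >= 0:
                    confusion[w, :, k] += T[i]
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: recompute the posterior over the true label of each example.
        log_T = np.tile(np.log(priors), (num_examples, 1))
        for i in range(num_examples):
            for w in range(num_workers):
                k = labels[i, w]
                if k >= 0:
                    log_T[i] += np.log(confusion[w, :, k])
        log_T -= log_T.max(axis=1, keepdims=True)
        T = np.exp(log_T)
        T /= T.sum(axis=1, keepdims=True)

    return T, confusion, priors

# Three hypothetical workers label four examples with classes {0, 1};
# -1 marks a missing label.
votes = np.array([[1, 1, 0],
                  [0, 0, 0],
                  [1, 1, 1],
                  [1, 0, -1]])
posteriors, confusion, priors = dawid_skene(votes, num_classes=2)
print(posteriors.argmax(axis=1))   # inferred "true" labels
print(confusion[2])                # estimated confusion matrix of the third worker
```

The posteriors give the inferred label for each example, and the confusion matrices tell us which labelers are noisy and in which direction they tend to err.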

Question 2

When the labels we get are noisy, there are two broad approaches for dealing with the noise: a "frequentist" approach, in which we use the "majority" of the votes (potentially weighted according to the noise level of each labeler), and a "Bayesian" approach, in which we treat the labels as inherently uncertain. In the first approach, we essentially treat the data as if they were noise-free and let existing learning algorithms work without any modification. If we take the more uncertainty-based approach, then we need to modify the learning algorithms to deal with the label uncertainty.
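
Here is a toy sketch of the two routes for a single binary-labeled example, assuming (purely for illustration) that we already know, or have estimated, the accuracy of each labeler and that labelers err independently; the function names are mine.

```python
import numpy as np

def hard_majority(votes, quality):
    """'Frequentist' route: collapse the votes into a single label via a
    quality-weighted majority, then train as if the data were noise-free."""
    score = sum(q if v == 1 else -q for v, q in zip(votes, quality))
    return 1 if score >= 0 else 0

def soft_posterior(votes, quality, prior_pos=0.5):
    """'Bayesian' route: keep the probability that the true label is positive,
    to be passed to a learner that accepts weighted or probabilistic labels."""
    log_odds = np.log(prior_pos / (1.0 - prior_pos))
    for v, q in zip(votes, quality):
        llr = np.log(q / (1.0 - q))    # log-likelihood ratio of a single vote
        log_odds += llr if v == 1 else -llr
    return 1.0 / (1.0 + np.exp(-log_odds))

votes = [1, 1, 0]            # three labelers vote on one example
quality = [0.9, 0.6, 0.6]    # assumed accuracy of each labeler
print(hard_majority(votes, quality))             # -> 1
print(round(soft_posterior(votes, quality), 2))  # -> 0.9, probability of '+'
```

The first function discards the residual uncertainty, which is exactly why the learning algorithm can stay untouched; the second keeps it, which is why the learning algorithm has to change.
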
Question 3

We know that the labeled examples that we get back are not perfect. We can compensate for the noise by asking the annotators to be more careful (potentially by paying them more for each example), we can simply get more training data hoping that the signal will emerge from the noise, or we can acquire multiple labels for the examples we already have, hoping to improve the quality of their labels.

Hal Daume discussed the quality vs. quantity tradeoff a while back, assuming that we can “tune” the individual annotators to be slower but produce higher-quality output, or faster but produce lower-quality output.

In Mechanical Turk, we can decide whether it makes sense to allocate the budget to labeling the same examples multiple times, or whether it is preferable to label more examples. If we decide to acquire multiple labels, should we do this uniformly and ask for N labels per example, or does it make sense to carefully select the examples for which we want to acquire more labels? This was a problem that we examined with Victor Sheng and Foster Provost, and we have now produced a first draft of our paper "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers" (originally titled "Noisy Multi-labeling for Data Mining"), which outlines our approach. (Update: The paper was accepted to KDD 2008, and the reviewers thought that using the term "multilabeling" would only confuse people, since the term is also used to describe cases where an example has multiple true labels, as opposed to multiple noisy labels with a single underlying true label.)

In short, for those of you who are bored of reading the paper: if there is noise in the labels, it is often preferable to acquire additional labels for existing examples rather than to acquire new examples. Furthermore, it pays off to selectively acquire extra labels for the examples that have a high degree of uncertainty. How can you measure the uncertainty? There are multiple approaches:

  • Entropy-based: The most intuitive way to measure uncertainty is to compute the entropy of the set of assigned labels. Intuitively, the label set {+,+,+} has lower entropy than {+,+,-,+,+,-,+,+,-}, so it seems to make sense to get more labels for the latter. An appealing approach, but it does not work. Why? Because in a noisy environment the entropy of the label set does not decrease as labels accumulate: in a high-noise environment the latter set is almost certain to be a “+” example, while the former is likely to be “+” but with comparatively lower confidence. (A toy numeric illustration follows this list.)
  • Label uncertainty: Based on the example above, we can instead estimate the confidence that we have identified the “true” label. Using a Bayesian estimation approach, we can get quite good results (see the paper for details).
  • Model uncertainty: Instead of using the label uncertainty, we can use active-learning-style approaches and acquire extra labels for the “hard” and “ambiguous” examples, as judged by the machine learning model trained on the generated data. Intuitively, this technique works well because it quickly identifies the examples for which the classifier cannot make confident decisions and which, in turn, are probably labeled incorrectly.
  • Label and Model uncertainty: The best approach. Model uncertainty directs us to examples that are likely to be incorrectly labeled, and label uncertainty picks the ones whose true labels we are least certain about. We have observed that this combination works best and consistently outperforms the simpler baselines.
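
As promised above, here is a toy numeric illustration of why the entropy of the label set is a poor signal, contrasted with a simple Bayesian estimate. The Beta(1, 1) prior, the 0.5 threshold, and the assumption that labelers are better than random are my own simplifications, in the spirit of the label uncertainty score in the paper but not necessarily its exact formulation.

```python
from math import log
from scipy.stats import beta

def label_set_entropy(pos, neg):
    """Entropy (in bits) of the empirical mix of '+' and '-' votes."""
    n = pos + neg
    return -sum((c / n) * log(c / n, 2) for c in (pos, neg) if c > 0)

def label_uncertainty(pos, neg):
    """Posterior probability that the true label disagrees with the majority,
    under a uniform Beta(1, 1) prior on the underlying chance of a '+' vote
    (assumes labelers are better than random)."""
    tail = beta.cdf(0.5, pos + 1, neg + 1)   # P(chance of '+' is below 0.5)
    return min(tail, 1.0 - tail)

for pos, neg in [(2, 1), (12, 6)]:
    print((pos, neg),
          round(label_set_entropy(pos, neg), 2),
          round(label_uncertainty(pos, neg), 2))
# (2, 1):  entropy 0.92, label uncertainty 0.31
# (12, 6): entropy 0.92, label uncertainty 0.08
```

The two label sets have exactly the same entropy, yet after 18 votes we are far more confident about the true label of the second example than we are about the first after only 3 votes; entropy would keep sending us back to examples whose labels are already settled.
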
Future Directions

I am very excited about this line of work. Of course, there are quite a few other questions and interesting directions for future research. Here are a few:
  • If we identify the quality of each labeler (e.g., using the framework of Dawid and Skene), can we improve the quality of the labeling by assigning the difficult examples to the most capable annotators?
  • What are the tradeoffs of quality vs. payment? Does it make sense to pay higher prices to increase quality? What if the annotators become strategic, knowing that lower-quality work will result in resubmissions of the same work for higher payoffs?
  • What if feature acquisition is also noisy? Can we build frameworks that unify noisy feature acquisition and noisy label acquisition, and take the cost and expected utility of the generated examples into account?
Applications for Conference and Journal Reviewing

I am also thinking that this line of research has applications to conference and journal reviewing. There, again, we have a finite set of noisy annotators, each with their own noise level, and we are trying to estimate the “true label” of a paper. How should we optimally manage the reviewing resources? The current SIGMOD mode of operation (2 reviewers initially, extra reviewers for potential acceptances) seems to come close to the “label and model uncertainty” approach that we discussed above. I guess we could model this process mathematically and see how to optimize it further. Of course, the (imho, classic, for its own reasons) "Reviewing the Reviewers" by Ken Church, combined with the references above, is a good starting point.