Sunday, October 21, 2012

New version of Get-Another-Label available

I am often asked what technique I use to evaluate the quality of workers on Mechanical Turk (or on oDesk, or ...). Do I use gold tests? Do I use redundancy?

Well, the answer is that I use both. In fact, I use the code "Get-Another-Label" that I have developed together with my PhD students and a few other developers. The code is publicly available on GitHub.

We recently updated the code to add some useful functionality, such as the ability to pass in (for evaluation purposes) the true answers for the different tasks, and to get back measurements of how accurate the estimates of the different algorithms are.
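
To make this concrete: if you know the correct answers for a sample of the tasks, you can compare the estimates of each algorithm against them. A minimal Python sketch of this kind of comparison (the function name and the data layout here are illustrative, not GAL's actual interface):

    def estimation_accuracy(estimates, truth):
        """Fraction of tasks where an algorithm's estimate matches the true answer.

        estimates, truth: dicts mapping a task id to a category.
        Illustrative sketch only; GAL reports its own evaluation metrics.
        """
        return sum(estimates[t] == truth[t] for t in truth) / len(truth)

    truth = {"c1": "spam", "c2": "legit", "c3": "legit"}
    em_estimates = {"c1": "spam", "c2": "legit", "c3": "spam"}
    print(estimation_accuracy(em_estimates, truth))  # ~0.667 (2 of 3 tasks correct)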

So, now, if you have a task where the answers are discrete (e.g., "is this comment spam or not?", or "how many people are in the photo? (a) none, (b) 1-2, (c) 3-5, (d) more than 5", etc.), then you can use the Get-Another-Label code, which supports the following:
  • Allows any number of discrete categories, not just binary
  • Allows the specification of arbitrary misclassification costs (e.g., "marking spam as legitimate has cost 1, marking legitimate content as spam has cost 5")
  • Allows for seamless mixing of gold labels and redundant labels for quality control
  • Estimates the quality of each worker who participates in your tasks. The metric is normalized to range from 0% for a worker who assigns completely random labels to 100% for a perfect worker (a sketch of one such normalization follows this list).
  • Estimates the quality of the data returned by the algorithm. The metric is normalized to be 0% for data with the same quality as unlabeled data and 100% for perfectly labeled data.
  • Allows the use of evaluation data to examine the accuracy of the quality control algorithms, both for the data labels and for the worker quality estimates.
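
To illustrate the worker quality metric, here is a minimal Python sketch of one plausible normalization consistent with the description above: score a worker by the expected misclassification cost of the labels she produces, relative to the cost of labels from a worker who labels at random. The function names and the exact normalization are my assumptions, not GAL's actual code:

    import numpy as np

    def expected_cost(soft_label, cost):
        """Expected misclassification cost of a soft label under a cost matrix.

        soft_label: probability vector over the K categories.
        cost[i][j]: cost of classifying an example of true category i as j.
        The example is assigned to the category that minimizes expected cost.
        """
        return min(float(soft_label @ cost[:, j]) for j in range(len(soft_label)))

    def worker_quality(confusion, prior, cost):
        """Quality score: 0.0 for a random worker, 1.0 for a perfect one.

        confusion[i][j]: probability that the worker labels category i as j.
        prior: prevalence of each category.
        One plausible normalization; GAL's internals may differ.
        """
        K = len(prior)
        worker_cost = 0.0
        for j in range(K):  # for each label the worker might emit...
            p_label = sum(prior[i] * confusion[i, j] for i in range(K))
            if p_label == 0.0:
                continue
            # ...compute the posterior that the label induces, and its cost
            posterior = np.array([prior[i] * confusion[i, j] / p_label
                                  for i in range(K)])
            worker_cost += p_label * expected_cost(posterior, cost)
        # A random worker's labels leave the posterior at the prior.
        random_cost = expected_cost(np.array(prior), cost)
        return 1.0 - worker_cost / random_cost

    prior = [0.5, 0.5]                    # half the comments are spam
    cost = np.array([[0.0, 5.0],          # legitimate marked as spam costs 5
                     [1.0, 0.0]])         # spam marked as legitimate costs 1
    good = np.array([[0.95, 0.05], [0.10, 0.90]])
    spammer = np.array([[0.50, 0.50], [0.50, 0.50]])
    print(round(worker_quality(good, prior, cost), 2))     # 0.65
    print(round(worker_quality(spammer, prior, cost), 2))  # 0.0

Under this scheme a perfect worker scores 1.0 and a worker who labels at random scores 0.0, matching the 0%-100% scale above.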
Currently, we support vanilla majority voting and the expectation-maximization algorithm for combining the labels assigned by the workers. We also support maximum-likelihood, minimum-cost, and "soft" classification schemes. In most cases, expectation-maximization together with minimum-cost classification tends to work best, but you can try the alternatives yourself.
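
The three schemes differ in how they turn a soft label (the posterior distribution over categories estimated for an example) into a final answer. A small illustrative Python sketch (again, not GAL's actual interface):

    import numpy as np

    def classify(soft_label, cost, scheme="min_cost"):
        """Turn a soft label (posterior over categories) into a final answer.

        "max_likelihood": the most probable category.
        "min_cost":       the category with the lowest expected cost.
        "soft":           the full posterior, unchanged.
        A sketch of the three schemes, not GAL's actual interface.
        """
        if scheme == "max_likelihood":
            return int(np.argmax(soft_label))
        if scheme == "min_cost":
            exp = [float(soft_label @ cost[:, j]) for j in range(len(soft_label))]
            return int(np.argmin(exp))
        return soft_label

    cost = np.array([[0.0, 5.0],   # legitimate marked as spam costs 5
                     [1.0, 0.0]])  # spam marked as legitimate costs 1
    posterior = np.array([0.4, 0.6])                    # slightly more likely spam
    print(classify(posterior, cost, "max_likelihood"))  # 1 (spam)
    print(classify(posterior, cost, "min_cost"))        # 0 (legitimate)

With asymmetric costs the two hard schemes can disagree: here the posterior slightly favors spam, but marking legitimate content as spam is five times more expensive, so the minimum-cost answer is "legitimate" (expected cost 0.6 versus 2.0).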

An important side-effect of reporting the estimated quality of the data is that you can then allocate further labeling resources to the data points with the highest expected cost. Jing has run plenty of experiments and has concluded that, in the absence of any other information (e.g., which worker will label the example next), it is always best to focus the labeling effort on the examples with the highest expected cost.
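
As a minimal sketch of this allocation rule, assuming each example carries a current soft label (the names below are illustrative, not Jing's actual experimental code):

    import numpy as np

    def expected_cost(soft_label, cost):
        """Expected cost of the cheapest (min-cost) classification of a soft label."""
        return min(float(soft_label @ cost[:, j]) for j in range(len(soft_label)))

    def next_to_label(soft_labels, cost):
        """Pick the example whose current soft label is most expensive to keep.

        soft_labels: dict of example id -> posterior over the categories.
        A sketch of the allocation heuristic, not Jing's actual code.
        """
        return max(soft_labels,
                   key=lambda ex: expected_cost(np.asarray(soft_labels[ex]), cost))

    zero_one = np.array([[0.0, 1.0],   # with 0/1 costs, the expected cost is
                         [1.0, 0.0]])  # simply the uncertainty of the label
    soft_labels = {"c1": [0.95, 0.05], "c2": [0.50, 0.50], "c3": [0.80, 0.20]}
    print(next_to_label(soft_labels, zero_one))  # "c2": the most uncertain example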

I expect this version of the code to be the last iteration of the GAL codebase. As our next step, we will turn GAL into a web service, allowing for streaming, real-time estimation of worker and data quality, support for continuous labels, quality-sensitive payment estimation, and many other tasks. Stay tuned: Project-Troia is just around the corner.