The two most common approaches for dealing with the problem are:
- Use "gold" data: Give to the workers questions for which we already know the answers and see how the workers are performing. (Crowdflower is using this approach.)
- Use multiple workers: Give the same answer to multiple workers and then use a latent class model ala Dawid & Skene to estimate the quality of the workers.
The nice thing about the two approaches is that they can be seamlessly combined into a unified algorithm.
What was not clear to me, though, was the relative importance of the two. How much better and faster can we estimate the quality of the workers if we use gold data? How quickly does the quality estimation improve as we add more gold data? Generating high-quality gold data is an expensive process, so having good answers to these questions is important.
Being a professor, the moment these questions came up, I knew what I had to do: Ask a PhD student to give me the answers! (Thanks Jing!) So, I am just the messenger here; Jing did all the work and the analysis. You know where to send your thanks.
The assumptions: We have examples belonging to 2 categories. The examples are equally distributed in the two categories (i.e., 50% in each). We created a set of workers with their quality picked uniformly at random from the range (55% correct) to (100% correct), for an average quality of ~77%. The workers assigned (noisy) labels to the examples, with each label being correct with probability equal to the worker's quality.
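To make the setup concrete, here is a minimal simulation sketch in Python. The function and parameter names are my own illustration, not the code Jing actually used; it just mirrors the assumptions above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_examples=1000, n_workers=30, workers_per_example=5):
    """Generate the synthetic labeling task described above (illustrative only)."""
    true_labels = rng.integers(0, 2, size=n_examples)        # two equally likely classes
    quality = rng.uniform(0.55, 1.0, size=n_workers)         # worker quality ~ U(55%, 100%)
    labels = {}                                               # (example, worker) -> noisy label
    for i, y in enumerate(true_labels):
        assigned = rng.choice(n_workers, size=workers_per_example, replace=False)
        for w in assigned:
            correct = rng.random() < quality[w]               # correct with prob = quality
            labels[(i, int(w))] = int(y) if correct else 1 - int(y)
    return true_labels, quality, labels

true_labels, quality, labels = simulate()
```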
We examined the performance of the Dawid & Skene algorithm, which we modified to take into consideration the existence of gold data. We measured two things:
- Classification error: How well the algorithm estimates the correct class of the examples
- Quality estimation error: How well the algorithm estimates the quality of the workers
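To show the kind of modification involved, here is a compact sketch of the standard EM formulation of Dawid & Skene for binary labels, reusing the `labels` dictionary from the simulation sketch above. The only change needed for gold data is to clamp the class posteriors of gold examples to their known answers in every iteration; the names and details are illustrative, not Jing's actual implementation.

```python
import numpy as np

def dawid_skene_with_gold(labels, n_examples, n_workers, gold=None, n_iter=50):
    """EM for binary Dawid & Skene; `gold` maps example index -> known class."""
    gold = gold or {}
    # Initialize the class posteriors with a (soft) majority vote.
    counts = np.zeros((n_examples, 2))
    for (i, w), l in labels.items():
        counts[i, l] += 1
    post = np.full((n_examples, 2), 0.5)
    labeled = counts.sum(axis=1) > 0
    post[labeled] = counts[labeled] / counts[labeled].sum(axis=1, keepdims=True)
    for i, y in gold.items():                                 # clamp gold examples
        post[i] = np.eye(2)[y]

    for _ in range(n_iter):
        # M-step: class prior and per-worker confusion matrices (Laplace-smoothed).
        prior = post.mean(axis=0)
        conf = np.ones((n_workers, 2, 2))
        for (i, w), l in labels.items():
            conf[w, :, l] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: recompute the posteriors, then clamp the gold examples again.
        log_post = np.tile(np.log(prior), (n_examples, 1))
        for (i, w), l in labels.items():
            log_post[i] += np.log(conf[w, :, l])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        for i, y in gold.items():
            post[i] = np.eye(2)[y]

    est_quality = conf[:, [0, 1], [0, 1]].mean(axis=1)        # per-worker accuracy estimate
    return post.argmax(axis=1), est_quality
```

From such a run, classification error is simply the fraction of examples whose estimated class disagrees with the true one.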
We experimented with having 0% gold examples in the data, 25% gold, 50% gold, and 75% gold.
Results on classification error
So, we measured the accuracy of estimating the correct class of each example. The results are listed below:
[Figures: classification error with 3, 5, and 10 workers per example]
One immediate observation is that the value of having gold data is limited when we have a significant number of workers per example. With 10 workers per example, the difference between having and not having gold data is minimal. Even with just 5 workers, the additional value of gold data is small.
The case where it makes sense to use gold data is when we have only a small number of workers per example. (Not an uncommon case!)
An interesting observation, though, is that we can achieve the same effect by simply forcing workers to work on more examples. Once a worker has given us 30 answers, the completely unsupervised algorithm can work almost as well as the algorithm that uses 75% gold data. This holds even when we have just 3 workers per example.
Of course, in an environment like Mechanical Turk, forcing workers to work on a large number of HITs may not be feasible. But we can always bundle multiple questions in a single HIT, achieving the same result.
OK, so gold data do not seem to be very useful for getting better accuracy in class estimation.
But, it should help in estimating the quality of workers, right?
Results on worker quality estimation
For the quality estimation, we also calculated the error when having 100% gold examples. (This is the lower bound for the estimation error, of course.)
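The post does not spell out the exact error measure, so here is one reasonable reading, continuing the sketches above: the mean absolute difference between the estimated and the true worker qualities. With 100% gold, each worker's quality is estimated directly as their observed accuracy on the known answers, so the only remaining error is sampling noise, which is why it serves as the lower bound.

```python
import numpy as np

def quality_estimation_error(est_quality, true_quality):
    # Illustrative metric: mean absolute deviation from the true worker quality.
    return np.abs(np.asarray(est_quality) - np.asarray(true_quality)).mean()

def quality_from_gold(labels, true_labels, n_workers):
    # The 100%-gold case: each worker is scored directly against the known answers.
    correct = np.zeros(n_workers)
    total = np.zeros(n_workers)
    for (i, w), l in labels.items():
        total[w] += 1
        correct[w] += (l == true_labels[i])
    return np.where(total > 0, correct / np.maximum(total, 1), 0.5)
```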
[Figures: worker quality estimation error with 3, 5, and 10 workers per example]
As expected, with 10 workers per example, we gain almost nothing in terms of quality estimation when we use gold data.
With 3-5 workers per example, having gold data improves the quality estimation in a rather consistent manner. Again, though, we observe that if we have each worker completing a large enough number of assignments, we can get most of the benefits of having gold data, without actually having gold data.
Why gold data then?
So, given the results above, one might ask: Why do we even need gold data? The unsupervised approach seems to work very well!
In reality, there are a few reasons why we may still need gold data:
- Imbalanced datasets: When we have very imbalanced datasets, the estimation becomes more challenging. For imbalanced data, we need to quickly and preemptively test workers using data from all categories, rather than waiting for the occasional object from the minority category to appear. To give an example: if we monitor a security camera trying to detect the presence of people in prohibited areas, we want to ensure that the workers are tested early on with images that have people in them. Otherwise it may take a long time to get a reliable estimate of their ability to correctly classify examples from the minority class.
- Very low quality of workers: When workers have very low quality, we need more workers per example and more labels per worker to replicate the results above. In this case, having gold data allows us to quickly get rid of the workers who do not meet quality standards. This is very useful in high-noise environments like the "unprotected" Amazon Mechanical Turk marketplace (by unprotected I mean without using qualification tests or other quality assurance mechanisms).
- Giving confidence to non-technical people: If you say that you test the workers with known examples, everyone understands the process. If you say that you rely on agreement between workers, or on latent class models ("what?") and on expectation maximization or Bayesian estimation ("come again?"), most people will start feeling uncomfortable. Everyone understands random tests; not everyone is willing to let unsupervised methods direct the quality assurance process. So, even if the gold examples do not help much, they are a very reassuring factor for people who just want to know that there is a familiar, understandable, and easy-to-explain quality control mechanism in place.
- Calibrating results and giving feedback: One of the final reasons for having gold data is to be able to calibrate workers and give them feedback about the expected coding standards. For example, when rating pages as porn or not, and by degree of severity, different people have different levels of sensitivity. If we have enough "sensitive" workers in the workforce, we may end up with results that are consistent but shifted upwards in terms of severity. (Or, vice versa, if the coders are more tolerant, the results may be shifted downwards.) This can, incorrectly, give the impression that all data collected through crowdsourcing are wrong. However, if the final user of the data provides a few gold examples as anchor points, the Dawid & Skene code gives back results that are more in line with expectations. At the same time, these gold data points can be used to give immediate feedback to the workers about their errors and implicitly direct them to use the expected rating guidelines and self-calibrate.
In practice, the last two reasons are often more important than the technical aspects of estimation. So, before starting any big crowdsourcing annotation project, spend some time and create some gold data. Or, alternatively, take a small dataset and label it using a large number of workers per example. Then verify the outcome, correct and calibrate some of the unexpected results, and run Dawid & Skene again. The generated data will be close enough to gold. Having such gold data will pay back the effort and cost multiple times during the overall process.