Saturday, September 3, 2011

Probabilities and MTurk Executives: A Troubled Story

On the Mechanical Turk blog, there is a blog post that describes the need to build custom qualifications for workers. (Update: the old link was removed after I posted this analysis, and the new version does not contain any of the problematic math analysis.) While the argument is correct, it is backed by some horrendous math analysis. For your viewing pleasure, here is the quote:

This difference in accuracy is magnified if you’re using plurality. Supposed you use plurality of 2 (asking 2 Workers the same question). With Masters, if 2 Workers with average accuracy (99%) agree on an answer there is a 98% probability that it’s the correct answer. With the broader group, if 2 Workers with average accuracy (90%) agree on an answer there is an 81% probability that it’s the correct answer. And if you happen to get 2 68% accurate Workers submitting assignments for the same HIT, the probability the answer is accurate is only 46%!

Dear Sharon:

We do appreciate your efforts to improve MTurk and to give correct advice.

But this analysis, which attempts to back a correct argument, is absolutely wrong. It is so wrong that it hurts. Just think about it at an intuitive level: how is it possible to ask two workers of a certain accuracy, see that they agree, and end up with a corroborated answer that is less accurate than each individual answer? This is simply not possible!

Here is the correct analysis:



Suppose you use a plurality of 2 (asking 2 Workers the same question). With Masters, if 2 Workers with average accuracy (99%) agree on an answer, then the probability that this answer is incorrect is

$Pr(\mathit{incorrect}|\mathit{agreement}) = \frac{Pr(\mathit{worker1\ incorrect\ and\ worker2\ incorrect})}{Pr(\mathit{agreement})}$.

Assuming (conditional) independence of the workers:

$Pr(\mathit{incorrect}|\mathit{agreement}) = \frac{Pr(\mathit{worker1\ incorrect}) \cdot Pr(\mathit{worker2\ incorrect})}{Pr(\mathit{agreement})}=\frac{(1-p)^2}{p^2+(1-p)^2}$,

where $p$ is the probability of a worker being correct.

With 99% accuracy, the probability of a worker being correct is $p=0.99$. So:

$Pr(\mathit{incorrect}|\mathit{agreement}) = \frac{0.01 \cdot 0.01}{0.01 \cdot 0.01 + 0.99 \cdot 0.99}$

$\Rightarrow Pr(\mathit{incorrect}|\mathit{agreement}) \approx 0.000102$.

Since $Pr(\mathit{correct}|\mathit{agreement}) = 1-Pr(\mathit{incorrect}|\mathit{agreement})$, with Masters, if 2 Workers with average accuracy (99%) agree on an answer, there is a $1-0.000102 \approx 99.99\%$ probability that it’s the correct answer.

With the broader group, if 2 Workers with average accuracy (90%) agree on an answer, there is a $1-\frac{0.1 \cdot 0.1}{0.1 \cdot 0.1 + 0.9 \cdot 0.9} \approx 98.78\%$ probability that it’s the correct answer. And if you happen to get two 68%-accurate workers submitting assignments for the same HIT (and they both agree), the probability the answer is accurate is only $1-\frac{0.32 \cdot 0.32}{0.32 \cdot 0.32 + 0.68 \cdot 0.68} \approx 81.87\%$!
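For anyone who wants to double-check these numbers, here is a quick Python sketch of the calculation (the helper name agreement_accuracy is mine; it assumes two conditionally independent workers with the same accuracy $p$, and that agreeing on an incorrect answer means giving the same wrong answer, as in the formula above):

```python
# Pr(correct | two independent workers with accuracy p agree)
# = p^2 / (p^2 + (1-p)^2)

def agreement_accuracy(p):
    """Probability that the agreed-upon answer is correct."""
    return p * p / (p * p + (1 - p) * (1 - p))

for p in (0.99, 0.90, 0.68):
    print(f"p = {p:.2f} -> Pr(correct | agreement) = {agreement_accuracy(p):.4f}")

# Approximate output:
# p = 0.99 -> Pr(correct | agreement) = 0.9999
# p = 0.90 -> Pr(correct | agreement) = 0.9878
# p = 0.68 -> Pr(correct | agreement) = 0.8187
```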



How did Sharon get confused? The analysis she presents does not compute the accuracy of the answer given that the two workers agree; instead, it computes how often the two workers will agree on the correct answer. Indeed, with workers that have 68% accuracy, we will observe agreement on the correct answer only about 46% of the time ($0.68^2 \approx 0.46$), and about 10% of the time they will agree on the incorrect answer ($0.32^2 \approx 0.10$). More importantly, though, roughly 44% of the time ($2 \cdot 0.68 \cdot 0.32 \approx 0.44$) the two workers will disagree, and we will need to bring in an extra worker, increasing the cost by 50%.
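The same kind of quick Python sketch shows the full breakdown of outcomes for two 68%-accurate workers (again assuming independence, and that both being wrong means agreeing on the same wrong answer):

```python
# What can happen when two independent workers with accuracy p
# answer the same question.

def outcome_breakdown(p):
    agree_correct = p * p            # both give the correct answer
    agree_wrong = (1 - p) * (1 - p)  # both give the same wrong answer
    disagree = 2 * p * (1 - p)       # one correct, one wrong
    return agree_correct, agree_wrong, disagree

agree_correct, agree_wrong, disagree = outcome_breakdown(0.68)
print(f"agree on the correct answer:    {agree_correct:.2%}")  # ~46%
print(f"agree on an incorrect answer:   {agree_wrong:.2%}")    # ~10%
print(f"disagree (need a third worker): {disagree:.2%}")       # ~44%
```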

Why did Sharon get confused? One explanation is that she is a victim of the conjunction fallacy, or that she does not understand conditional probabilities. However, I believe it is not that. I bet she was not puzzled by the results because the presented math confirmed another (correct) intuition she had about the market: redundancy is not cost-effective when you rely on low-quality workers.

Consider this: if you have 3 workers of 68% accuracy, the combination of the three (e.g., using majority vote) will result in an average accuracy of only about 76%. In other words, only about 3 out of 4 times will the majority generate the correct answer. To reach 90% accuracy, we need 11 workers with 68% accuracy each. And to reach 99% accuracy, we need 39 workers of 68% accuracy! (I will present the math in a later blog post.)

Even with "moderately high quality" workers, simulating a worker that is 99% accurate is an expensive proposition: we need five workers that are each 90% accurate to reach 99% accuracy.
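For those who do not want to wait for that post, here is a rough Python sketch of the majority-vote calculation behind the numbers in the last two paragraphs (the helper majority_accuracy is illustrative; it assumes an odd number of independent workers with identical accuracy):

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a simple majority of n (odd) independent workers,
    each correct with probability p, picks the correct answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(f"3 workers at 68%:  {majority_accuracy(0.68, 3):.3f}")   # roughly 0.76
print(f"11 workers at 68%: {majority_accuracy(0.68, 11):.3f}")  # roughly 0.90
print(f"39 workers at 68%: {majority_accuracy(0.68, 39):.3f}")  # roughly 0.99
print(f"5 workers at 90%:  {majority_accuracy(0.90, 5):.3f}")   # roughly 0.99
```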

So, yes, the high-quality Masters workers are worth their extra price. In fact, they are worth their weight in gold. Paying only 20% more to access a guaranteed pool of high-quality "Masters" workers is a great bargain, given the difference in quality compared to the general worker pool.

Actually, if I were a 99%-accurate worker, I would feel offended that I do not get at least double or triple the going wage of common workers. There is a great mispricing of the services provided by high-quality workers, and most requesters today exploit exactly this fact to keep wages down, while still managing to get high-quality results from the tested, reliable workers.