Sunday, July 4, 2010

Detecting Spammers on Mechanical Turk, Part II

In my previous post, I gave a brief overview of different techniques used to improve the quality of the results on Amazon Mechanical Turk. The main outcome of these techniques is a matrix that describes the error rate of each worker.

For example, consider the task of categorizing webpages as porn or not. We have three target categories:
  • G-rated: Pages appropriate for a general audience, children included.
  • PG-rated: Pages with adult themes but without any explicit sexual scenes, appropriate only for children above 13.
  • R-rated: Pages with content appropriate for adults only.
In this case, the confusion matrix of a worker, inferred using the techniques described in my previous post, would look like:

$ \left( \begin{array}{ccc} Pr[G \rightarrow G] & Pr[G \rightarrow PG] & Pr[G \rightarrow R] \\ Pr[PG \rightarrow G] & Pr[PG \rightarrow PG] & Pr[PG \rightarrow R] \\ Pr[R \rightarrow G] & Pr[R \rightarrow PG] & Pr[R \rightarrow R] \end{array} \right) $

where $Pr[X \rightarrow Y]$ is the probability that the worker will give the answer $Y$ when the correct answer is $X$. (The elements in each row should sum to 1.)
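To make the bookkeeping concrete, here is a minimal sketch in Python of how such a matrix might be stored (numpy and the example values are my own choices; any matrix whose rows sum to 1 would do):

```python
import numpy as np

LABELS = ["G", "PG", "R"]

# Rows are the true class, columns the worker's answer:
# confusion[i, j] = Pr[LABELS[i] -> LABELS[j]].
confusion = np.array([
    [0.5, 0.3, 0.2],   # true G
    [0.2, 0.6, 0.2],   # true PG
    [0.1, 0.1, 0.8],   # true R
])

# Sanity check: each row is a probability distribution over answers.
assert np.allclose(confusion.sum(axis=1), 1.0)
```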

And here is the question that seems trivially easy: Given the confusion matrix, how can we detect the spammers?

Computing the Error Rate

The simple answer is: Just sum up the elements off the diagonal! Since every off-diagonal element corresponds to an error, a worker with a high sum is a spammer. Of course, this ignores the fact that some categories appear much more often than others. So, instead of giving equal weight to each category, we weight the errors by the class priors (i.e., how often we expect to see each correct answer):

$Pr[G] \cdot (Pr[G \rightarrow PG] + Pr[G \rightarrow R]) + \\ Pr[PG] \cdot (Pr[PG \rightarrow G] + Pr[PG \rightarrow R]) + \\ Pr[R] \cdot (Pr[R \rightarrow G] + Pr[R \rightarrow PG])$

For example, if the confusion matrix is

$\left( \begin{array}{ccc} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{array} \right) $

and the class priors are 80% $G$, 15% $PG$, and 5% $R$, then the weighted error rate is

$80\% \cdot (0.3 + 0.2) + 15\% \cdot (0.2 + 0.2) + 5\% \cdot (0.1 + 0.1) = 0.47$

Notice that the errors in the first row, which corresponds to category $G$, were weighted more heavily.
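Here is a minimal sketch of this computation in Python (the function name weighted_error_rate and the numpy dependency are my own); it reproduces the 0.47 above:

```python
import numpy as np

def weighted_error_rate(confusion, priors):
    """Sum of off-diagonal (error) probabilities, weighted by class priors."""
    confusion = np.asarray(confusion, dtype=float)
    priors = np.asarray(priors, dtype=float)
    # Per-class error mass: row sum (= 1) minus the diagonal (correct) entry.
    errors_per_class = confusion.sum(axis=1) - np.diag(confusion)
    return float(priors @ errors_per_class)

confusion = [[0.5, 0.3, 0.2],
             [0.2, 0.6, 0.2],
             [0.1, 0.1, 0.8]]
priors = [0.80, 0.15, 0.05]

print(weighted_error_rate(confusion, priors))  # 0.47
```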

Unfortunately, this method does not work very well.

When we started using this technique, we ended up marking legitimate workers as spammers (false positives), and classifying spammers as legitimate workers (false negatives). Needless to say, both mistakes were hurting us. Legitimate workers were complaining and (understandably) badmouthing us, and spammers kept polluting the results.

Let me give some more details on how such errors appear.

False Negatives: Strategic Spammers and Uneven Class Priors

Spammers on Mechanical Turk are often smart and lazy. They will try to submit answers that seem legitimate but without spending too much time. (Otherwise, they may as well do the work :-)

In our case, we were categorizing sites as porn or not. Most of the time the sites were not porn; only 10%-20% of the sites fell into one of the porn categories. Some workers noticed this, and realized that they could keep their error rate low by simply classifying everything as not-porn.

Under the standard way of computing the error rate, these workers fared much better than legitimate workers who misclassified some of the not-porn sites.

Here is an illustration. With our three categories (G-, PG-, and R-rated), the confusion matrix for a spammer who always answers "G" looks like this:

$\left( \begin{array}{ccc} 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \end{array} \right) $

With class priors of 80% $G$, 15% $PG$, and 5% $R$, the weighted error rate of the spammer is:

$80\% \cdot (0.0 + 0.0) + 15\% \cdot (1.0 + 0.0) + 5\% \cdot (1.0 + 0.0) = 0.2$

Compare this with the confusion matrix of a legitimate worker with rather modest error rates:

$\left( \begin{array}{ccc} 0.8 & 0.2 & 0.0 \\ 0.1 & 0.8 & 0.1 \\ 0.0 & 0.25 & 0.75 \end{array} \right) $

The error rate for this legitimate worker is:

$80\% \cdot (0.2 + 0.0) + 15\% \cdot (0.1 + 0.1) + 5\% \cdot (0.25 + 0.0) = 0.2025$

Yes, the legitimate worker appears to be worse than the spammer!
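For what it's worth, the comparison is easy to reproduce with the weighted_error_rate sketch from above:

```python
spammer = [[1.0, 0.0,  0.0],
           [1.0, 0.0,  0.0],
           [1.0, 0.0,  0.0]]

legitimate = [[0.8, 0.2,  0.0],
              [0.1, 0.8,  0.1],
              [0.0, 0.25, 0.75]]

priors = [0.80, 0.15, 0.05]

print(weighted_error_rate(spammer, priors))     # 0.2
print(weighted_error_rate(legitimate, priors))  # 0.2025 -- "worse" than the spammer!
```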

False Positives: Biased Workers

The second type of error is when we classify honest workers as spammers. Interestingly enough, when we started evaluating workers, the top "spammers" ended up being members of the internal team. Take a look at the confusion matrix of this worker:

$\left( \begin{array}{ccc} 0.35 & 0.65 & 0.0 \\ 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 1.0 \end{array} \right) $

The error rate is:

$80\% \cdot (0.65 + 0.0) + 15\% \cdot (0.0 + 1.0) + 5\% \cdot (0.0 + 0.0) = 0.67$

This error rate would imply that the worker answers essentially at random: a clear case of a worker that should be banned.

After a careful inspection, though, you can see that this is not the case. This is the confusion matrix of a worker who is much more conservative than others and classifies 65% of the "G" pages as "PG". Similarly, all the pages that are in reality "PG" are classified as "R". (This worker was a parent with young children and was much stricter about what content would pass as "G" vs. "PG".)

In a sense, this is a pretty careful worker! Even though this worker does mix up R and PG pages, there is a very clear separation between G and PG/R pages. Still, the error rate alone would put this worker squarely in the spam category.

Solution: Examine Ambiguity, not Errors

You will notice that what separates spammers from legitimate workers is the information provided by their answers. A spammer who gives the same reply all the time gives us no information. In contrast, when the biased worker above answers "PG", we know that the page in reality belongs to the "G" class. Even though the answer is wrong, we can still recover the correct one!

So, by "reversing" the errors, we can measure how ambiguous a worker's answers are, and use that information to decide whether to reject the worker or not.
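To illustrate the idea (this is only a sketch of the "reversing" step, not the exact estimator from the paper; worker_cost, the quadratic soft-label cost, and the uniform misclassification costs are assumptions I am making here): for each possible answer, apply Bayes' rule to get a posterior "soft label" over the true classes, and charge the worker the expected cost of that soft label. A spammer's posterior never moves away from the priors, so it stays expensive; the biased worker's posterior is almost deterministic, so it is cheap.

```python
import numpy as np

def worker_cost(confusion, priors, costs=None):
    """Expected cost of a worker's answers after 'reversing' the errors."""
    confusion = np.asarray(confusion, dtype=float)
    priors = np.asarray(priors, dtype=float)
    k = len(priors)
    if costs is None:
        costs = 1.0 - np.eye(k)            # every misclassification costs 1
    p_answer = priors @ confusion          # marginal probability of each answer
    total = 0.0
    for j in range(k):                     # for every possible answer j...
        if p_answer[j] == 0.0:
            continue
        # Bayes' rule: Pr[true class | worker answered j]
        posterior = priors * confusion[:, j] / p_answer[j]
        # Expected cost when both the truth and our guess are drawn from
        # this soft label (zero when the posterior is deterministic).
        total += p_answer[j] * float(posterior @ costs @ posterior)
    return total

priors = [0.80, 0.15, 0.05]
spammer = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
biased  = [[0.35, 0.65, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

print(worker_cost(spammer, priors))  # ~0.335: as uninformative as the priors
print(worker_cost(biased, priors))   # ~0.075: biased, but highly informative
```

Under this measure, the conservative parent scores far better than the spammer, which matches the intuition: the only real ambiguity left in their answers is between PG and R when they answer "R".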

You can find more details about the process in our HCOMP 2010 paper "Quality Management on Amazon Mechanical Turk".

You can also find a demo of the algorithm at http://qmturk.appspot.com/, where you can plug in your own data to see how it works. The demo takes as input the responses of the workers, the misclassification costs, and the "gold" data points, if you have any. It returns the confusion matrix for each worker and the estimated "cost" of each worker. The output is plain text and kind of ugly, but you can find what you need.

The code is also open source and available at Google Code.

If you have any questions, just drop me an email!