Wednesday, June 2, 2010

Detecting Spammers on Mechanical Turk, Part I

Anyone who has tried to use Mechanical Turk for any large scale task knows what is the biggest problem: Spammers!

Many people try initially to post small tasks on Turk and are amazed to get back high-quality results of high quality, very quickly, and with very small cost. Then, they post similar, but much larger, tasks and they realize that now the results are pretty much crap. Spammers start giving random answers, and since they spend no effort, they can easily be the most prolific workers! For example, in the Snow et al. 2008 paper, almost 40% of the submissions came from spam workers, according to Bob Carpenter's analysis.

Unfortunately, this problem is widespread, and tends to discourage new requesters. People just want to get things done on Mechanical Turk, they do not want to spend their time fighting spammers.

Spam Protection Measures: Qualification tests and "best of 3" policy

Qualification tests: Some people, to avoid spammers, use qualification tests. While some measures, such as qualification tests help, they are not a panacea. Determined spammers simply pass the qualification test, and proceed with spamming. Nevertheless, it is a good idea to have a qualification test, as it also tends to filter out workers that come and complete only a couple of HITs. The main disadvantage is that you need programming or use of the command-line tools to create and require a qualification test (i.e., the web interface is not enough).

Best of Three: Other requesters use a "best of 3" (or "2+1") policy: if two workers agree on their answer, the answer is accepted. This technique is relatively widespread across seasoned requesters (and drives some legitimate workers crazy).

Why do legitimate workers hate this verification mechanism? Because every time a worker does not agree with the other two, the submission gets rejected. This drops the acceptance ratio, an important reputation characteristics on Mechanical Turk.

Interestingly, the "best of 3" strategy led to an interesting counter-strategy from spammers: they come in pairs and vote concurrently at the same task, giving the same votes. This defeats the simple "best of 3" approach. (Update: Josh in the comments suggests a good strategy to avoid this: Randomize the HITs by picking examples from a pool, instead of having "static" HITs. This is a generally good suggestion, even if someone uses the more sophisticated tools described below.)

While the "best of 3" approach ensures some basic quality of the data, it does not work well in the long run. Legitimate workers tend to avoid such HITs, as their approval rating can plummet if there are a couple of spammers lurking around. Also, legitimate workers get penalized too severely for honest mistakes.

Spam Protection Measures: Using "gold standard" data and/or unsupervised EM algorithm

A better approach for detecting the quality of the workers is to perform the analysis over a larger set of HITs, rather than operating on single HITs, as "best of 3" does.

Using gold standard: One approach, followed often, is to slip into the stream of HITs some tasks with known correct answers. By comparing the answer given by the worker with the correct answer, we can compute the "confusion matrix" for each worker. In other words, we can see how often a worker gives answer B when the correct answer is A, how often workers says B when correct answer is B, and so on.

Using unsupervised, expectation maximization algorithm: Using gold standard works well when we have in our disposal plenty of such "gold" cases. Unfortunately, this is not always possible, especially when we want to post new tasks frequently. In this case, it makes sense to use the iterative expectation-maximization algorithm proposed by Dawid and Skene:

  1. Initialize the error-rates of the workers (e.g., assume all workers are perfect)
  2. Initialize the most likely class for each example (e.g., using majority vote)
  3. Compute the confusion matrix (aka error rates) for each worker: Compare the answers given by the worker with the estimated correct answers.
  4. Compute the most likely class for each example: Use the answers given by the workers, and weight the answers using the confusion matrix for each worker.
  5. Go to step 3, until convergence
Interestingly enough, the supervised and unsupervised techniques can be naturally combined: In step 4, for all the examples for which we know the correct answer, we simply do not update the class label. Trivially easy. (In a dual fashion, we can also incorporate trusted workers by not updating the error rates in step 3.) By providing some known data, the algorithm can actually estimate with less data the quality of the workers and the correct labels for the unknown examples.

The outcome of these algorithms is:
  • The correct label for each example
  • The confusion matrix for each worker
If you want to try this algorithm, go to our demo at http://qmturk.appspot.com/

As the output of this process, we know for each worker its corresponding confusion matrix. So, for a task with 3 possible answers, A, B, C, the confusion matrix would look like:

$ \left( \begin{array}{ccc} Pr[A \rightarrow A] & Pr[A \rightarrow B] & Pr[A \rightarrow C] \\ Pr[B \rightarrow A] & Pr[B \rightarrow B] & Pr[B \rightarrow C] \\ Pr[C \rightarrow A] & Pr[B \rightarrow C] & Pr[C \rightarrow C] \end{array} \right) $

where $Pr[A \rightarrow B]$ is the probability that the worker will classify an object that belongs to class A into class B. Of course, a perfect worker will have a confusion matrix that is perfectly diagonal

$ \left( \begin{array}{ccc} Pr[A \rightarrow A] = 1 & Pr[A \rightarrow B] = 0 & Pr[A \rightarrow C] = 0 \\ Pr[B \rightarrow A] = 0 & Pr[B \rightarrow B] = 1 & Pr[B \rightarrow C] = 0 \\ Pr[C \rightarrow A] =0 & Pr[B \rightarrow C] = 0 & Pr[C \rightarrow C] = 1 \end{array} \right) $

Interestingly enough, the opposite question is not trivial to answer: How can we identify spammers, by examining their confusion matrices? The answer in the next blog post.

4 comments:

  1. I grabbed the paper when I saw your workshop announcement and really like the active learning application. Being able to do un-, semi- or fully-supervised training is just magical in these kinds of latent random effects models.

    Massimo Poesio and I just gave a tutorial at LREC on evaluating inter-annotator agreement and inferring accuracies and the gold standard. Here a page with links to our slides:

    LREC 2010 Tutorial: Modeling Data Annotation

    Basically, the simple model's a hierarchical Bayesian version of Dawid and Skene that also estimates the priors. The more complicated model does the same thing on a logistic scale in order to estimate an item difficulty effect as well as an annotator bias and accuracy effect.

    The tutorial links to the implementations in BUGS for full Gibbs sampling or Java for collapsed Gibbs sampling (fixed priors, integrating out multinomials).

    I'd love to hear about anyone who's extending these kinds of models to add multilevel components for annotator or item attributes or for replications at the task level.

    ReplyDelete
  2. I was just about to mention Bob Carpenter's work, but I see he beat me to it. I have to admit that we haven't actually tried out his system yet (we're planning on incorporating it in our next annotation project), but it looks very useful.

    For simple experiments, we typically use what you call the "gold standard," which is fairly standard practice in psychology experiments (on the Web or in the lab), since it's always good to prove that your participants are paying attention and understand the task (not always true even in the best of circumstances).

    As you say, for more complicated projects (like the large annotation projects we've been running lately on Turk), that doesn't work, particularly if you want the same worker to be able to do an unrestricted number of HITs (you'd need a lot of filler questions in order to avoid repetition -- and if you repeat, you're helping out the spammers).

    I think that there is a slightly more sophisticated version of the "best of 3" method than the one you've described. We feed every worker a randomly sampled set of utterances to tag. So no two workers see the same set. This prevents spammers from working in 2s, and it also spreads out the risk that a legitimate worker will have low agreement just by being paired with a bunch of spammers (as long as the proportion of spammers is low).

    ReplyDelete
  3. Heh, am I the only one amused by the irony of seeing a blog spam comment (above, "You could be making $2,000 to $5,000 of extra income every month!") on this post?

    ReplyDelete
  4. Josh, your suggestion to randomize the HITs is good and should be used in general, not only in "best of 3". For example, the EM algorithm assumes independence of the annotations, so having a few colluding spammers can screw things up if we keep redundancy low. Randomizing indeed helps.

    Greg, the spam is now deleted. They just managed to post the spam comments while I was on vacation :-)

    ReplyDelete