Thursday, December 16, 2010

Mechanical Turk: Now with 40.92% spam.

At this point, Amazon Mechanical Turk has reached the mainstream. Pretty much everyone knows about the concept. Post small tasks online, pay people cents, and get thousands of micro-tasks completed.

Unfortunately, this popularity has brought some troubling trends. Anyone who spends even a little time in the market will notice the tremendous number of spammy HITs. (HIT = a task posted for completion in the market; stands for Human Intelligence Task.) "Test if the ads in my website work". "Create a Twitter account and follow me". "Like my YouTube video". "Download this app". "Write a positive review on Yelp". A seemingly endless stream of spam HITs arrives in the market, mainly with the purpose of inflating "social media" metrics.

So, with Dahn Tamir and Priya Kanth (MS student at NYU), we decided to examine how big the problem is. How many spammers join the market? How many spam HITs are there?

Using the data from Mechanical Turk Tracker, we picked all the requesters that first joined the market in September 2010 and October 2010. Why new ones? Because we assumed that long-term requesters are not spammers. (But this remains to be verified.)

This process resulted in 1733 new requesters that first appeared in the marketplace in September and October 2010. We then took all the HITs that these requesters posted in the market. This was a total of 5842 HIT groups. The activity patterns of the new requesters were similar to those of the general requester population.
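The selection step can be sketched in a few lines. This is only an illustration, not the tracker's actual schema or code: the `(requester_id, posted_on)` record format and the function name `new_requesters` are hypothetical.

```python
from datetime import date

def new_requesters(hit_records, start, end):
    """Return requesters whose earliest observed HIT falls in [start, end].

    hit_records: iterable of (requester_id, posted_on) pairs (hypothetical format).
    """
    first_seen = {}
    for requester_id, posted_on in hit_records:
        # Track the earliest date on which each requester was observed.
        if requester_id not in first_seen or posted_on < first_seen[requester_id]:
            first_seen[requester_id] = posted_on
    return {r for r, d in first_seen.items() if start <= d <= end}

# Toy data: R1 was already active in August, so it does not count as new.
records = [
    ("R1", date(2010, 8, 20)),
    ("R1", date(2010, 9, 5)),
    ("R2", date(2010, 9, 12)),
    ("R3", date(2010, 10, 30)),
    ("R4", date(2010, 11, 2)),
]
selected = new_requesters(records, date(2010, 9, 1), date(2010, 10, 31))
print(sorted(selected))  # ['R2', 'R3']
```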

The next step was to post these HITs on Mechanical Turk and ask workers to classify them as spam or not, using the following guidelines:

Use the following guidelines to classify the HIT as SPAM:
  • SEO: Asks me to give a fake rating, vote, review, comment, or "like" on Facebook, YouTube, DIGG, etc., or to create fake mail or website accounts.
  • Fake accounts: Asks me to create an account on Twitter, Facebook, and then perform a likely spam action. 
  • Lead Gen: Asks me to go to a website and sign up for a trial, complete an offer, fill out a form requesting information, "test" a data-entry form, etc.
  • Fake clicks: Asks me to go to a website and click on ads.
  • Fake ads: Asks me to post an ad to Craigslist or other marketplace.
  • Personal Info: Asks me for my real name, phone number, full mailing address or email.
  • You can also use your intuition to classify the HIT
Please DO NOT classify as spam, HITs that are legitimate in nature but priced offensively low.

Interestingly enough, we got a ridiculous amount of spam from the worker side. Even with 99% approval rate and 1000 HITs as qualification, we got plenty of spammers giving us random data.

Since spam was a big problem, we posted the HIT using CrowdFlower and we used a set of 100 manually classified HITs as gold. (Without CrowdFlower, we had to manually kick out the spammers and repost the HITs. So, CrowdFlower saved the day.)

We asked 11 workers to classify each HIT, and we ignored votes from the untrusted workers (those who failed to answer at least 75% of the gold tests correctly). So, with 11 trusted workers working on each HIT, we were reasonably sure that the majority vote across these 11 votes resulted in an accurate HIT classification.
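For illustration, the gold filtering and the majority vote might look like the sketch below. It is not the CrowdFlower pipeline or the code used in the study; the record formats, the function names, and the toy workers are all hypothetical. Only the 75% threshold comes from the description above.

```python
from collections import Counter, defaultdict

def trusted_workers(gold_answers, gold_truth, min_accuracy=0.75):
    """Workers who answered at least min_accuracy of their gold HITs correctly.

    gold_answers: iterable of (worker_id, hit_id, label) on gold HITs.
    gold_truth: dict mapping hit_id to the correct label.
    """
    correct, seen = Counter(), Counter()
    for worker, hit, label in gold_answers:
        seen[worker] += 1
        if label == gold_truth[hit]:
            correct[worker] += 1
    return {w for w in seen if correct[w] / seen[w] >= min_accuracy}

def majority_vote(votes, trusted):
    """Majority label per HIT, counting only votes from trusted workers."""
    tallies = defaultdict(Counter)
    for worker, hit, label in votes:
        if worker in trusted:
            tallies[hit][label] += 1
    # With two labels, an odd number of trusted votes cannot tie.
    return {hit: tally.most_common(1)[0][0] for hit, tally in tallies.items()}

# Toy gold set: w1 and w4 are perfect, w2 gets 3 of 4, w3 gets 1 of 4.
gold_truth = {"g1": "spam", "g2": "ok", "g3": "spam", "g4": "ok"}
gold_answers = [
    ("w1", "g1", "spam"), ("w1", "g2", "ok"), ("w1", "g3", "spam"), ("w1", "g4", "ok"),
    ("w2", "g1", "spam"), ("w2", "g2", "ok"), ("w2", "g3", "ok"), ("w2", "g4", "ok"),
    ("w3", "g1", "ok"), ("w3", "g2", "ok"), ("w3", "g3", "ok"), ("w3", "g4", "spam"),
    ("w4", "g1", "spam"), ("w4", "g2", "ok"), ("w4", "g3", "spam"), ("w4", "g4", "ok"),
]
votes = [
    ("w1", "h1", "spam"), ("w2", "h1", "spam"), ("w3", "h1", "ok"), ("w4", "h1", "ok"),
    ("w1", "h2", "ok"), ("w2", "h2", "ok"), ("w3", "h2", "spam"), ("w4", "h2", "ok"),
]
trusted = trusted_workers(gold_answers, gold_truth)
labels = majority_vote(votes, trusted)
print(sorted(trusted))  # ['w1', 'w2', 'w4']
print(labels)           # {'h1': 'spam', 'h2': 'ok'}
```

Note that without the gold filter, h1 would be a 2-2 tie; dropping the untrusted worker w3 is what settles it.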

I also ran the "get another label" code and I noticed that all the workers were of reasonable quality. Since the results were similar to those of the majority vote, I decided to keep things simple and go with the majority vote as the correct answer.

The results

The results were disturbing. Out of the 5841 HITs, 2390 (40.92%) were marked as spam.

This is not good! 40% of the HITs from new requesters are spam!

Our next test was to examine whether there are accounts that post a mix of spam and not spam HITs. The analysis indicated that this is not the case. Very few accounts post both spam HITs and legitimate HITs:

The plot illustrates that 31.83% of the new requesters post only spam HITs.

In total, 757 out of the 1733 new requesters posted at least one spam HIT, and 552 accounts were posting only spam HITs. 56.46% of the new requesters post no spam HITs. This clean split indicates that it is easy to separate spam requesters from legitimate ones. There are not that many requesters that post both spam HITs and legitimate ones.
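From the counts above, 757 - 552 = 205 accounts posted a mix of spam and legitimate HITs. The grouping itself is easy to compute from per-HIT labels. The sketch below is an illustration on made-up data, with a hypothetical `(requester_id, is_spam)` input format; it is not the code behind the numbers in this post.

```python
from collections import defaultdict

def requester_breakdown(hit_labels):
    """Count requesters posting only spam, only legitimate, or a mix of HITs.

    hit_labels: iterable of (requester_id, is_spam) pairs, one per HIT.
    """
    spam = defaultdict(int)
    total = defaultdict(int)
    for requester, is_spam in hit_labels:
        total[requester] += 1
        spam[requester] += int(is_spam)
    counts = {"spam_only": 0, "mixed": 0, "legit_only": 0}
    for requester, n in total.items():
        if spam[requester] == n:
            counts["spam_only"] += 1
        elif spam[requester] == 0:
            counts["legit_only"] += 1
        else:
            counts["mixed"] += 1
    return counts

# Toy data: A posts only spam, C posts both, B and D post only legitimate HITs.
hit_labels = [
    ("A", True), ("A", True),
    ("B", False),
    ("C", True), ("C", False),
    ("D", False), ("D", False),
]
counts = requester_breakdown(hit_labels)
print(counts)  # {'spam_only': 1, 'mixed': 1, 'legit_only': 2}
```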

So, 31.8% of the new requesters are clear spammers, and 40.92% of the new HITs are spam-related! This is clearly a problem.

Spam HITs and pricing

So, what are the quantitative characteristics of the spam HITs?

First of all, they tend to contain far fewer "HITs available" than the legitimate HITs. 95% of the spam HITs contain just a single HIT, while only 75% of the legitimate HITs have one HIT available.

On the other hand, spammers tend to post HITs with higher rewards (perhaps because they do not pay?). Approximately 80% of the legitimate HITs are priced below one dollar, while only 60% of the spam HITs are priced below this threshold. Actually, many of the best paying HITs tend to be spam-related ones.

By combining the two charts above, we can plot the total value of the spam vs not spam HITs. 

Overall, the findings are not really surprising: Most of the spam HITs require a large number of workers to complete a task. They want 1000 users to click an ad, not a single user to click a thousand times on a single ad. Therefore, I suspect that most of these spam HITs have a very significant amount of redundancy (which unfortunately we cannot observe). This means that the total value of the posted spam HITs is most probably much higher than the total value of the legitimate HITs.
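As a rough sketch of the "total value" computation: for each HIT group, multiply the reward by the number of HITs available and sum per class. The tuple format and the toy numbers below are hypothetical, and rewards are kept in integer cents to avoid floating-point noise. Since the number of workers requested per HIT is not observable, a total computed this way is only a lower bound on the true value.

```python
def total_value(hit_groups):
    """Sum reward * hits_available per class, in cents.

    hit_groups: iterable of (is_spam, reward_cents, hits_available) tuples.
    """
    totals = {"spam": 0, "legit": 0}
    for is_spam, reward_cents, hits_available in hit_groups:
        key = "spam" if is_spam else "legit"
        totals[key] += reward_cents * hits_available
    return totals

# Toy data: two single-HIT spam groups and two legitimate groups.
hit_groups = [
    (True, 200, 1),
    (True, 150, 1),
    (False, 5, 100),
    (False, 50, 1),
]
totals = total_value(hit_groups)
print(totals)  # {'spam': 350, 'legit': 550}
```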

What to do?

These trends are very worrisome:
  • 40% of the HITs from new requesters are spam. 
  • 30% of the new requesters are clear spammers.
  • The spam HITs have bigger value than the legitimate ones. 
It is very clear that action should be taken against spam requesters.

According to our measurements, we see approximately 1500 new HITs arriving in the market every day (from all requesters), and approximately 30 new requester accounts join the market every day. It should be trivial to review all the HITs manually by posting them to MTurk for review. 

But even if this manual inspection is expensive, this is a task that can be very easily automated. In our current work, we realized that it is very easy to accurately classify HITs as spam or not. A simple SVM linear classifier that uses bag of words as features can achieve a 95% true positive and 95% true negative rate. With a moderately advanced scheme, it should be possible to have a strong system in place pretty quickly.
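To give a flavor of how little machinery this needs, here is a toy sketch of a linear classifier over bag-of-words features, trained by subgradient descent on the regularized hinge loss (the linear SVM objective). It is not the classifier from our experiments: the training texts, the hyperparameters (`epochs`, `lr`, `reg`), and all function names are assumptions made for illustration, and six examples say nothing about the 95% rates quoted above.

```python
from collections import Counter

def bag_of_words(text):
    """Map a HIT description to word counts."""
    return Counter(text.lower().split())

def score(weights, bias, features):
    """Linear decision value: positive means spam."""
    return bias + sum(weights.get(tok, 0.0) * n for tok, n in features.items())

def train_linear_svm(examples, epochs=20, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularized hinge loss.

    examples: list of (features, y) with y = +1 for spam, -1 for legitimate.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, y in examples:
            margin = y * score(weights, bias, features)
            # Regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * reg
            # Hinge-loss step: only examples inside the margin move the model.
            if margin < 1.0:
                for tok, n in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * n
                bias += lr * y
    return weights, bias

def is_spam(weights, bias, text):
    return score(weights, bias, bag_of_words(text)) > 0

# Toy training set (hypothetical, loosely based on the HIT titles quoted earlier).
training = [
    ("click ads on my website", +1),
    ("transcribe this audio recording", -1),
    ("like my youtube video", +1),
    ("label objects in this image", -1),
    ("create twitter account follow me", +1),
    ("categorize this product description", -1),
]
weights, bias = train_linear_svm([(bag_of_words(t), y) for t, y in training])
print(is_spam(weights, bias, "follow my twitter account"))
print(is_spam(weights, bias, "transcribe this image"))
```

In practice one would reach for an off-the-shelf SVM library and evaluate on held-out HITs; the point here is only that the features are plain word counts and the model is a single weight per word.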

For anyone who is interested, the data is available here

The disheartening part was the response of Amazon when we informed them about the issue. They pretty much assured us that everything is fine, and they believed there is no problem! For me, this was more problematic than the existence of spam.

Why does Amazon ignore spam?

To answer this question, I asked Amazon for access to the data to investigate further. Unfortunately, I was denied access. (It does not pay to criticize Amazon.) Interestingly enough, the MTurk guys share data with other academics.

The key piece for answering this question, which I cannot get from my data: Do spammers pay the workers? 

If the spam requesters do not pay the workers, then Amazon should be more proactive in battling spam requesters. Workers need to be protected! It is easy to see that it is a death spiral otherwise. The more spammers can get away with getting work done and not paying, the less the workers will trust new requesters. Legitimate new requesters will face a significant uphill battle to convince the workers of their intentions; they will abandon their plans and let the spammers prevail. We have a market for lemons in reverse.

If the spam requesters pay the workers, then there is a cynical explanation: Amazon does not take an active role in cleaning the market because they simply profit from the spam. And it is part of the growth. And nobody within the MTurk division would cut the growth rate in half at this point.

However, this would be an incredibly short-sighted approach. With the amount of spam on the worker side, and the amount of spam on the requester side, Mechanical Turk would slowly turn into a market where spammer requesters talk to spammer workers... Ah yes, and academics running experiments...