A Computer Scientist in a Business School

Friday, December 24, 2010

Amazon Reacts: Spammers Kicked Out of MTurk!

A comment on my previous blog post, about spam on MTurk, alerted me that Amazon seems to have taken the spam problem seriously.

Broken Turk wrote:

Have you seen the site? It looks as if MTurk liked your research?

(Link to top paying HITs, no spam!)

(Turker Nation post)

Indeed, I checked the available HITs and all the spam HITs seem to have magically disappeared! It seems that all the negative publicity convinced the guys at MTurk that spam IS a problem.

I consider the goal of that blog post achieved. Amazon listened!

Good job Amazon!

Thursday, December 16, 2010

Mechanical Turk: Now with 40.92% spam.

At this point, Amazon Mechanical Turk has reached the mainstream. Pretty much everyone knows about the concept. Post small tasks online, pay people cents, and get thousands of micro-tasks completed.

Unfortunately, this popularity has brought some unwelcome trends. Anyone who spends even a little time in the market will notice the tremendous number of spammy HITs. (HIT = a task posted for completion in the market; it stands for Human Intelligence Task.) "Test if the ads in my website work". "Create a Twitter account and follow me". "Like my YouTube video". "Download this app". "Write a positive review on Yelp". A seemingly endless stream of spam HITs comes to the market, mainly with the purpose of spamming "social media" metrics.

So, with Dahn Tamir and Priya Kanth (MS student at NYU), we decided to examine how big the problem is. How many spammers join the market? How many spam HITs are there?

Using the data from Mechanical Turk Tracker, we picked all the requesters that first joined the market in September 2010 and October 2010. Why new ones? Because we assumed that long-term requesters are not spammers. (But this remains to be verified.)

This process resulted in 1733 new requesters that first appeared in the marketplace in September and October 2010. We then took all the HITs that these requesters posted in the market. This was a total of 5842 HIT groups. The activity patterns of the new requesters were similar to those of the general requester population.

The next step was to post these HITs on Mechanical Turk and ask workers to classify them as spam or not, using the following guidelines:

Use the following guidelines to classify the HIT as SPAM:

SEO: Asks me to give a fake rating, vote, review, comment, or "like" on Facebook, YouTube, DIGG, etc., or to create fake mail or website accounts.

Fake accounts: Asks me to create an account on Twitter, Facebook, and then perform a likely spam action.

Lead Gen: Asks me to go to a website and sign up for a trial, complete an offer, fill out a form requesting information, "test" a data-entry form, etc.

Fake clicks: Asks me to go to a website and click on ads.

Fake ads: Asks me to post an ad to Craigslist or other marketplace.

Personal Info: Asks me for my real name, phone number, full mailing address or email.

You can also use your intuition to classify the HIT

Please DO NOT classify as spam, HITs that are legitimate in nature but priced offensively low.

Interestingly enough, we got a ridiculous amount of spam from the worker side. Even when requiring a 99% approval rate and 1000 HITs as qualifications, we got plenty of spammers giving us random data.

Since spam was a big problem, we posted the HIT using CrowdFlower and we used a set of 100 manually classified HITs as gold. (Without CrowdFlower, we had to manually kick out the spammers and repost the HITs. So, CrowdFlower saved the day.)

We asked 11 workers to classify each HIT, and we ignored votes from the untrusted workers (those who failed to answer at least 75% of the gold tests correctly). So, with 11 trusted workers working on each HIT, we were reasonably sure that the majority vote across these 11 votes resulted in an accurate HIT classification.

I also ran the "get another label" code and I noticed that all the workers were of reasonable quality. Since the results were similar to those of the majority vote, I decided to keep things simple and go with the majority vote as the correct answer.
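The filtering and voting steps described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the data layout, and the worker and HIT identifiers are hypothetical, and CrowdFlower's actual gold mechanism is more involved than this.

```python
from collections import Counter, defaultdict


def trusted_workers(gold_answers, gold_truth, min_accuracy=0.75):
    """Return the workers who answered at least `min_accuracy` of their
    gold HITs correctly.

    gold_answers: {worker_id: {hit_id: label}} (hypothetical layout)
    gold_truth:   {hit_id: label} for the manually classified gold HITs
    """
    trusted = set()
    for worker, answers in gold_answers.items():
        graded = [hit for hit in answers if hit in gold_truth]
        if not graded:
            continue  # a worker who saw no gold HITs cannot be trusted yet
        correct = sum(answers[hit] == gold_truth[hit] for hit in graded)
        if correct / len(graded) >= min_accuracy:
            trusted.add(worker)
    return trusted


def majority_vote(votes, trusted):
    """Pick the most frequent label per HIT, counting only trusted workers.

    votes: iterable of (worker_id, hit_id, label) triples.
    With an odd number of votes and two labels there are no ties;
    otherwise the label seen first among the tied ones wins.
    """
    tallies = defaultdict(Counter)
    for worker, hit, label in votes:
        if worker in trusted:
            tallies[hit][label] += 1
    return {hit: counts.most_common(1)[0][0] for hit, counts in tallies.items()}
```

The 0.75 default mirrors the 75% gold threshold mentioned above; everything else is a design choice of the sketch.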

The results

The results were disturbing. Out of the 5841 HITs, 2390, or 40.92%, were marked as spam.

This is not good! 40% of the HITs from new requesters are spam!

Our next test was to examine whether there are accounts that post a mix of spam and non-spam HITs. The analysis indicated that this is not the case. Very few accounts post both spam HITs and legitimate HITs:

The plot illustrates that 31.83% of the new requesters post only spam HITs.

In total, 757 out of the 1733 new requesters posted at least one spam HIT, and 552 accounts were posting only spam HITs. 56.46% of the new requesters post no spam HITs. This clean split indicates that it is easy to separate spam requesters from legitimate ones. There are not that many requesters that post both spam HITs and legitimate ones.
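For readers who want to repeat this breakdown on their own labeled data, the grouping is a simple count per requester. The sketch below assumes a list of (requester, is_spam) pairs, one per classified HIT; the requester IDs and the function name are made up for illustration, and the toy input is not our dataset.

```python
from collections import defaultdict


def requester_breakdown(hit_labels):
    """Count requesters that post only spam, a mix, or no spam at all.

    hit_labels: iterable of (requester_id, is_spam) pairs, one per HIT.
    """
    total_counts = defaultdict(int)
    spam_counts = defaultdict(int)
    for requester, is_spam in hit_labels:
        total_counts[requester] += 1
        spam_counts[requester] += int(is_spam)

    groups = {"only_spam": 0, "mixed": 0, "no_spam": 0}
    for requester, total in total_counts.items():
        spam = spam_counts[requester]
        if spam == total:
            groups["only_spam"] += 1
        elif spam == 0:
            groups["no_spam"] += 1
        else:
            groups["mixed"] += 1
    return groups


if __name__ == "__main__":
    # Hypothetical toy labels: r1 posts only spam, r3 posts a mix.
    toy = [("r1", True), ("r1", True), ("r2", False),
           ("r3", True), ("r3", False), ("r4", False), ("r4", False)]
    print(requester_breakdown(toy))
    # {'only_spam': 1, 'mixed': 1, 'no_spam': 2}
```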

So, 31.8% of the new requesters are clear spammers, and 40.92% of the new HITs are spam-related! This is clearly a problem.

Spam HITs and pricing

So, what are the quantitative characteristics of the spam HITs?

First of all, they tend to contain far fewer "HITs available" compared to the legitimate HITs. 95% of the spam HITs contain just a single HIT, while only 75% of the legitimate HITs have one HIT available.

On the other hand, spammers tend to post HITs with higher rewards (perhaps because they do not pay?). Approximately 80% of the legitimate HITs are priced below one dollar, while only 60% of the spam HITs are priced below this threshold. Actually, many of the best paying HITs tend to be spam-related ones.

By combining the two charts above, we can plot the total value of the spam vs not spam HITs.

Overall, the findings are not really surprising: most of the spam HITs require a large number of workers to complete a task. They want 1000 users to click an ad, not a single user to click a thousand times on a single ad. Therefore, I suspect that most of these spam HITs have a very significant amount of redundancy (which, unfortunately, we cannot observe). This means that the total value of the posted spam HITs is most probably much higher than the total value of the legitimate HITs.

What to do?

These trends are very worrisome:

40% of the HITs from new requesters are spam.
30% of the new requesters are clear spammers.
The spam HITs have bigger value than the legitimate ones.

It is very clear that action should be taken against spam requesters.

According to our measurements, we see approximately 1500 new HITs arriving in the market every day (from all requesters), and approximately 30 new requester accounts join the market every day. It should be trivial to review all the HITs manually by posting them to MTurk for review.

But even if this manual inspection is expensive, this is a task that can be very easily automated. In our current work, we realized that it is very easy to accurately classify HITs as spam or not. A simple linear SVM classifier that uses bag-of-words features can achieve a 95% true positive rate and a 95% true negative rate. With a moderately advanced scheme, it should be possible to have a strong system in place pretty quickly.

For whoever is interested, the data is available here
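To give a flavor of how little machinery such a classifier needs, here is a minimal pure-Python sketch of a linear SVM over bag-of-words counts, trained with hinge-loss stochastic gradient descent (in the spirit of Pegasos). It is not the code behind the 95% figures: the training texts are invented toy examples, the hyperparameters are arbitrary, there is no bias term, and the tokenizer just splits on whitespace. In practice, an established library implementation would be the better choice.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def score(weights, features):
    """Linear score w . x over the sparse bag-of-words features."""
    return sum(weights.get(tok, 0.0) * count for tok, count in features.items())


def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.001):
    """Hinge-loss SGD with L2 regularization (no bias term).

    examples: list of (text, label) with label +1 for spam, -1 for legitimate.
    Returns a {token: weight} dictionary.
    """
    data = [(bag_of_words(text), label) for text, label in examples]
    weights = {}
    for _ in range(epochs):
        for features, label in data:
            margin = label * score(weights, features)
            # L2 regularization: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge loss: update only when the margin is violated.
            if margin < 1.0:
                for tok, count in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * count
    return weights


def is_spam(weights, text):
    """Classify a HIT description as spam when its score is positive."""
    return score(weights, bag_of_words(text)) > 0


# Hypothetical toy training set, loosely modeled on the HIT titles quoted earlier.
TRAIN = [
    ("create a twitter account and follow me", +1),
    ("like my youtube video", +1),
    ("click the ads on my website", +1),
    ("write a positive review on yelp", +1),
    ("transcribe this audio recording", -1),
    ("label objects in this image", -1),
    ("categorize this product description", -1),
    ("find this business address", -1),
]

if __name__ == "__main__":
    model = train_linear_svm(TRAIN)
    print(is_spam(model, "follow me on twitter"))
    print(is_spam(model, "transcribe this image"))
```

On real HIT descriptions you would also want proper tokenization, a held-out test set, and the true positive and true negative rates measured separately, as in the numbers quoted above.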

The disheartening part was the response of Amazon when we informed them about the issue. They pretty much assured us that everything is fine, and they believed there is no problem! For me, this was more problematic than the existence of spam.

Why does Amazon ignore spam?

To answer this question, I asked Amazon for access to the data to investigate further. Unfortunately, I was denied access. (It does not pay to criticize Amazon.) Interestingly enough, the MTurk guys share data with other academics.

The key piece for answering this question, which I cannot get from my data: Do spammers pay the workers?

If the spam requesters do not pay the workers, then Amazon should be more proactive in battling spam requesters. Workers need to be protected! Otherwise, it is easy to see a death spiral. The more spammers get away with getting work done without paying, the less workers will trust new requesters. Legitimate new requesters will face a significant uphill battle to convince workers of their intentions; they will abandon their plans and let the spammers prevail. We have a market for lemons in reverse.

If the spam requesters pay the workers, then there is a cynical explanation: Amazon does not take an active role in cleaning the market because they simply profit from the spam. The spam is part of the growth, and nobody within the MTurk division would cut the growth rate in half at this point.

However, this would be an incredibly short-sighted approach. With the amount of spam on the worker side and the amount of spam on the requester side, Mechanical Turk would slowly turn into a market where spammer requesters talk to spammer workers... Ah yes, and academics running experiments...

Monday, December 13, 2010

Sharing code, APIs, and a Readability API

Yesterday, I received an email from a student who wanted access to some code that we used in our recent TKDE paper "Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics".

Specifically, the student wanted to estimate the readability test scores for the reviews. For those not familiar with readability tests, they are simple formulas that examine the text and estimate the level of education required to read and understand a particular piece of text.

I tried to send the code, but then I realized that it depended on some old libraries, which have been deprecated. At that point, I realized that it would be a pain to send the code to the student and then give instructions about all the dependencies etc. On the other hand, not sending the code is simply unacceptable.
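To show how simple these formulas are, here is a sketch of the SMOG grade in its standard published form: 1.0430 times the square root of (polysyllabic words × 30 / sentences), plus 3.1291. This is not the code from the paper. In particular, the syllable counter is a crude vowel-group heuristic that I am assuming for illustration, and the sentence splitter just looks for ".", "!" and "?".

```python
import math
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels.

    This is a heuristic assumption for the sketch; it over-counts
    silent vowels (e.g. the final "e" in "like").
    """
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_grade(text):
    """SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    words = re.findall(r"[A-Za-z']+", text)
    # Polysyllabic words are those with three or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    print(smog_grade("I do not like them in a box. I do not like them with a fox."))
```

A text with no words of three or more syllables gets the formula's constant term, 3.1291, as its score.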

Sharing code as an API

This got me thinking: How can we make the code robust to changes? How can we share the code in a way that it can be easily used by others? Given that all software packages today have web APIs, why not create APIs for our own (research) code?

Since I had never tried to do any serious web programming, I decided to spend a few hours familiarizing myself with the basics and turn my library into a set of RESTful API calls.

Apparently, it was not that difficult. I uploaded the code to the Google App Engine, and I wrote a small servlet that took the text as input and returned the readability metric of choice. Almost an assignment for a first-year student learning about programming.

Readability API

After a few hours of coding, I managed to generate a first version of the demo at http://ipeirotis-hrd.appspot.com/. I also created a basic API which can be easily used to estimate the readability scores of various texts.

I followed the example of bit.ly and I allowed the API calls to return plain txt format, so that it is possible to embed the Readability API calls in many places. For example, I really enjoy calling bit.ly within Excel or within R, in order to shorten URLs. Now, it is possible to do the same in order to compute readability scores.

For example, if you want to compute the SMOG score for the text "I do not like them in a box. I do not like them with a fox" and get back the score in simple text, you just need to call:

http://ipeirotis.appspot.com/readability/GetReadabilityScores?output=txt&metric=SMOG&text=I%20do%20not%20like%20them%20in%20a%20box.%20I%20do%20not%20like%20them%20with%20a%20fox.

The result is the SMOG score for the text, which in this case is 3.129. You can play with the demo and type whatever text you want, and see the documentation if you want to use the code. Of course, the source code is also available.
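The same call can be made from a script. The sketch below rebuilds the URL shown above with Python's standard library; the `output`, `metric`, and `text` parameters come from that example URL, while the function names are mine. The `fetch_score` helper assumes the service is still reachable and that the txt response is just the number, neither of which I can guarantee.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://ipeirotis.appspot.com/readability/GetReadabilityScores"


def readability_url(text, metric="SMOG", output="txt"):
    """Build the API URL, encoding spaces as %20 as in the example above."""
    query = urlencode(
        {"output": output, "metric": metric, "text": text},
        quote_via=quote,
    )
    return BASE_URL + "?" + query


def fetch_score(text, metric="SMOG", timeout=10):
    """Call the API and parse the plain-text reply as a number.

    Assumes the endpoint is online and returns only the score.
    """
    with urlopen(readability_url(text, metric), timeout=timeout) as response:
        return float(response.read().decode("utf-8").strip())


if __name__ == "__main__":
    # Only prints the URL; call fetch_score() yourself to hit the network.
    print(readability_url("I do not like them in a box. "
                          "I do not like them with a fox."))
```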

Future Plans

I actually like this idea and the result. I will be trying to port more of my code online, and make it available as an API. With the availability of sites such as Google App Engine, we do not have to worry about servers being taken down, or upgrades in OS, etc. The code can remain online and functioning. Now, let's see how easy it will be to port some non-trivial code.

Monday, December 6, 2010

Excerpts from "The Communist Manifesto"

... A class of laborers, who live only so long as they find work, and who find work only so long as their labor increases capital.

These laborers, who must sell themselves piecemeal, are a commodity, like every other article of commerce, and are consequently exposed to all the vicissitudes of competition, to all the fluctuations of the market.

Owing to the extensive use of machinery, and to the division of labor, the work of the proletarians has lost all individual character, and, consequently, all charm for the workman.

He becomes an appendage of the machine, and it is only the most simple, most monotonous, and most easily acquired knack, that is required of him.

Excerpts from "The Communist Manifesto", 1848

162 years later, the Communist Manifesto, by Marx and Engels, finds a new meaning in the online world of Amazon Mechanical Turk.