Friday, December 24, 2010

Amazon Reacts: Spammers Kicked Out of MTurk!

Judging from a notification in the comments, Amazon seems to have taken seriously the spam problem reported in my previous blog post.

Broken Turk wrote:

Have you seen the site?  It looks as if MTurk liked your research?

(Link to top paying HITs, no spam!)

(Turker Nation post)

Indeed, I checked the available HITs and all the spam HITs seem to have magically disappeared! It seems that all the negative publicity convinced the guys at MTurk that spam IS a problem.

I consider the goal of that blog post achieved. Amazon listened!

Good job Amazon!

Thursday, December 16, 2010

Mechanical Turk: Now with 40.92% spam.

At this point, Amazon Mechanical Turk has reached the mainstream. Pretty much everyone knows about the concept. Post small tasks online, pay people cents, and get thousands of micro-tasks completed.

Unfortunately, this popularity has brought some unwelcome trends. Anyone who spends even a little time in the market will notice the tremendous number of spammy HITs. (HIT stands for Human Intelligence Task: a task posted for completion in the market.) "Test if the ads in my website work". "Create a Twitter account and follow me". "Like my YouTube video". "Download this app". "Write a positive review on Yelp". A seemingly endless stream of spam HITs comes to the market, mainly with the purpose of inflating "social media" metrics.

So, together with Dahn Tamir and Priya Kanth (an MS student at NYU), we decided to examine how big the problem is. How many spammers join the market? How many spam HITs are there?

Using the data from Mechanical Turk Tracker, we picked all the requesters that first joined the market in September 2010 and October 2010. Why new ones? Because we assumed that long term requesters are not spammers. (But this remains to be verified.)

This process resulted in 1733 new requesters that first appeared in the marketplace in September and October 2010. We then took all the HITs that these requesters posted in the market. This was a total of 5842 HIT groups. The activity patterns of the new requesters were similar to those of the general requester population.

The next step was to post these HITs back on Mechanical Turk and ask workers to classify them as spam or not, using the following guidelines:

Use the following guidelines to classify the HIT as SPAM:
  • SEO: Asks me to give a fake rating, vote, review, comment, or "like" on Facebook, YouTube, DIGG, etc., or to create fake mail or website accounts.
  • Fake accounts: Asks me to create an account on Twitter, Facebook, and then perform a likely spam action. 
  • Lead Gen: Asks me to go to a website and sign up for a trial, complete an offer, fill out a form requesting information, "test" a data-entry form, etc.
  • Fake clicks: Asks me to go to a website and click on ads.
  • Fake ads: Asks me to post an ad to Craigslist or other marketplace.
  • Personal Info: Asks me for my real name, phone number, full mailing address or email.
  • You can also use your intuition to classify the HIT
Please DO NOT classify as spam HITs that are legitimate in nature but priced offensively low.

Interestingly enough, we got a ridiculous amount of spam from the worker side. Even requiring a 99% approval rate and 1000 completed HITs as qualifications, we got plenty of spammers giving us random data.

Since spam was a big problem, we posted the HIT using CrowdFlower, with a set of 100 manually classified HITs as gold. (Without CrowdFlower, we would have had to manually kick out the spammers and repost the HITs. So, CrowdFlower saved the day.)

We asked 11 workers to classify each HIT, and we ignored votes from untrusted workers (those who failed to answer at least 75% of the gold tests correctly). So, with 11 trusted workers working on each HIT, we were reasonably sure that the majority vote across these 11 votes resulted in an accurate HIT classification.

I also ran the "get another label" code and I noticed that all the workers were of reasonable quality. Since the results were similar to those of the majority vote, I decided to keep things simple and go with the majority vote as the correct answer.
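For concreteness, the gold-based filtering and majority vote described above can be sketched in a few lines of Python; the worker IDs, votes, and gold answers below are invented for illustration:

```python
from collections import Counter

def trusted(gold_answers, threshold=0.75):
    """Keep a worker only if they answered at least `threshold` of the
    gold (manually classified) HITs correctly."""
    correct = sum(1 for given, truth in gold_answers if given == truth)
    return correct / len(gold_answers) >= threshold

def majority_label(votes):
    """votes: list of (worker_id, label); return the most common label."""
    return Counter(label for _, label in votes).most_common(1)[0][0]

# Invented example: worker w3 fails the gold check, so their vote is dropped
gold = {
    "w1": [("spam", "spam"), ("ok", "ok"), ("spam", "spam"), ("ok", "ok")],
    "w2": [("spam", "spam"), ("ok", "ok"), ("ok", "ok"), ("spam", "spam")],
    "w3": [("ok", "spam"), ("spam", "ok"), ("ok", "ok"), ("ok", "spam")],
}
votes = [("w1", "spam"), ("w2", "spam"), ("w3", "ok")]
kept = [(w, label) for w, label in votes if trusted(gold[w])]
print(majority_label(kept))  # spam
```

With 11 trusted votes per HIT, ties are impossible and the majority is well defined.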

The results

The results were disturbing. Out of the 5841 HITs, 2390, or 40.92%, were marked as spam.

This is not good! 40% of the HITs from new requesters are spam!

Our next test was to examine whether there are accounts that post a mix of spam and not spam HITs. The analysis indicated that this is not the case. Very few accounts post both spam HITs and legitimate HITs:

The plot illustrates that 31.83% of the new requesters post only spam HITs.

In total, 757 out of the 1733 new requesters posted at least one spam HIT, and 552 accounts posted only spam HITs. 56.46% of the new requesters post no spam HITs. This nice separation indicates that it is easy to separate spam requesters from legitimate ones: there are not many requesters that post both spam HITs and legitimate ones.

So, 31.8% of the new requesters are clear spammers, and 40.92% of the new HITs are spam-related! This is clearly a problem.

Spam HITs and pricing

So, what are the quantitative characteristics of the spam HITs?

First of all, they tend to contain far fewer "HITs available" than the legitimate HITs: 95% of the spam HITs contain just a single HIT, while only 75% of the legitimate HITs have one HIT available.

On the other hand, spammers tend to post HITs with higher rewards (perhaps because they do not pay?). Approximately 80% of the legitimate HITs are priced below one dollar, while only 60% of the spam HITs are priced below this threshold. Actually, many of the best paying HITs tend to be spam-related ones.

By combining the two charts above, we can plot the total value of the spam vs not spam HITs. 

Overall, the findings are not really surprising: most of the spam HITs require a large number of workers to complete a task. They want 1000 users to click an ad, not a single user to click a thousand times on a single ad. Therefore, I suspect that most of these spam HITs have a very significant amount of redundancy (which, unfortunately, we cannot observe). This means that the total value of the posted spam HITs is most probably much higher than the total value of the legitimate HITs.

What to do?

These trends are very worrisome:
  • 40% of the HITs from new requesters are spam. 
  • 30% of the new requesters are clear spammers.
  • The spam HITs have bigger value than the legitimate ones. 
It is very clear that active measures should be taken against spam requesters.

According to our measurements, we see approximately 1500 new HITs arriving in the market every day (from all requesters), and approximately 30 new requester accounts join the market every day. It should be trivial to review all the HITs manually by posting them to MTurk for review. 

But even if this manual inspection is expensive, this is a task that can be very easily automated. In our current work, we realized that it is very easy to accurately classify HITs as spam or not. A simple SVM linear classifier that uses bag of words as features can achieve a 95% true positive and 95% true negative rate. With a moderately advanced scheme, it should be possible to have a strong system in place pretty quickly.
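As a rough sketch of how such a classifier works, here is a toy bag-of-words model in pure Python, with a simple perceptron standing in for the linear SVM and invented HIT titles as training data:

```python
import re
from collections import defaultdict

def bag_of_words(text):
    return re.findall(r"[a-z']+", text.lower())

def train_perceptron(examples, epochs=10):
    """examples: list of (text, label) with label +1 (spam) / -1 (legit).
    On each mistake, add the example's word counts to the weights."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, y in examples:
            score = sum(w[t] for t in bag_of_words(text))
            if y * score <= 0:  # misclassified: update the weights
                for t in bag_of_words(text):
                    w[t] += y
    return w

def predict(w, text):
    return 1 if sum(w[t] for t in bag_of_words(text)) > 0 else -1

# Invented toy training data, for illustration only
train = [
    ("like my youtube video", 1),
    ("follow me on twitter", 1),
    ("click the ads on my website", 1),
    ("write a positive review on yelp", 1),
    ("transcribe this audio clip", -1),
    ("categorize these product images", -1),
    ("translate a short paragraph", -1),
    ("label the sentiment of a tweet", -1),
]
w = train_perceptron(train)
print(predict(w, "follow my twitter account"))      # 1  (spam)
print(predict(w, "transcribe a short audio clip"))  # -1 (legit)
```

An actual system would of course train a properly regularized linear SVM on thousands of labeled HITs, but the featurization is the same: each word in the HIT title and description becomes a feature.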

For whoever is interested, the data is available here.

The disheartening part was the response of Amazon when we informed them about the issue. They pretty much assured us that everything was fine and that there was no problem! For me, this was more troubling than the existence of the spam itself.

Why does Amazon ignore spam?

To answer this question, I have asked Amazon for access to the data to investigate further. Unfortunately, I was denied access. (It does not pay to criticize Amazon.) Interestingly enough, the MTurk guys share data with other academics.

The key piece for answering this question, which I cannot get from my data: Do spammers pay the workers? 

If the spam requesters do not pay the workers, then Amazon should be more proactive in battling spammer requesters. Workers need to be protected! It is easy to see that it is a death spiral otherwise. The more spammers can get away with getting work done and not paying, the less the workers will trust new requesters. Legitimate new requesters will face a significant uphill battle to convince the workers of their intentions, they will abandon their plans, and the spammers will prevail. We get a market for lemons in reverse.

If the spam requesters pay the workers, then there is a cynical explanation: Amazon does not take an active role in cleaning the market because they simply profit from the spam. And it is part of the growth. And nobody within the MTurk division would cut in half the growth rate at this point.

However, this would be an incredibly short-sighted approach. With the amount of spam on the worker side and the amount of spam on the requester side, Mechanical Turk would slowly turn into a market where spammer requesters talk to spammer workers... Ah yes, and academics running experiments...

Monday, December 13, 2010

Sharing code, API's, and a Readability API

Yesterday, I received an email from a student that wanted to have access to some code that we used in our recent TKDE paper "Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics".

Specifically, the student wanted to estimate the readability test scores for the reviews. For those not familiar with readability tests, they are simple formulas that examine a text and estimate the level of education required to read and understand it.

I tried to send the code, but then I realized that it depended on some old libraries that have since been deprecated. At that point, I realized that it would be a pain to send the code to the student and then give instructions about all the dependencies, etc. On the other hand, not sending the code was simply unacceptable.

Sharing code as an API

This got me thinking: How can we make the code robust to changes? How can we share the code in a way that it can easily be used by others? Given that all software packages today have web APIs, why not create APIs for our own (research) code?

Since I had never tried to do any serious web programming in the past, I decided to spend a few hours familiarizing myself with the basics and exposing my library as a set of RESTful API calls.

Apparently, it was not that difficult. I uploaded the code to Google App Engine and wrote a small servlet that takes the text as input and returns the readability metric of choice. Almost an assignment for a first-year student learning about programming.

Readability API

After a few hours of coding, I managed to generate a first version of the demo. I also created a basic API, which can easily be used to estimate the readability scores of various texts.

Following the example of the URL-shortening services, I allowed the API calls to return a simple text format, so that it is possible to embed the Readability API calls in many places. For example, I really enjoy calling such services from within Excel or within R in order to shorten URLs. Now, it is possible to do the same to compute readability scores.

For example, if you want to compute the SMOG score for the text "I do not like them in a box. I do not like them with a fox" and get back the score in simple text, you just need to call:

The result is the SMOG score for the text, which in this case is 3.129. You can play with the demo and type whatever text you want, and see the documentation if you want to use the code. Of course, the source code is also available.
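For reference, the SMOG computation itself is a one-line formula; the sketch below uses a rough vowel-group heuristic to count syllables. On the Dr. Seuss text above it reproduces the 3.129 score, since the text contains no polysyllabic words:

```python
import math
import re

def smog(text):
    """SMOG grade: 3.1291 + 1.043 * sqrt(polysyllables * 30 / sentences).
    Syllables are counted with a rough vowel-group heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    def syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if syllables(w) >= 3)
    return 3.1291 + 1.043 * math.sqrt(polysyllables * 30.0 / len(sentences))

text = "I do not like them in a box. I do not like them with a fox"
print(round(smog(text), 3))  # 3.129: two sentences, zero polysyllabic words
```

The actual API uses a more careful syllable counter, but the formula is exactly this.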

Future Plans

I actually like this idea and the result. I will be trying to port more of my code online, and make it available as an API. With the availability of sites such as Google App Engine, we do not have to worry about servers being taken down, or upgrades in OS, etc. The code can remain online and functioning. Now, let's see how easy it will be to port some non-trivial code.

Monday, December 6, 2010

Excerpts from "The Communist Manifesto"

... A class of laborers, who live only so long as they find work, and who find work only so long as their labor increases capital.

These laborers, who must sell themselves piecemeal, are a commodity, like every other article of commerce, and are consequently exposed to all the vicissitudes of competition, to all the fluctuations of the market.

Owing to the extensive use of machinery, and to the division of labor, the work of the proletarians has lost all individual character, and, consequently, all charm for the workman.

He becomes an appendage of the machine, and it is only the most simple, most monotonous, and most easily acquired knack, that is required of him.

Excerpts from "The Communist Manifesto", 1848

162 years later, the Communist Manifesto, by Marx and Engels, finds a new meaning in the online world of Amazon Mechanical Turk.

Monday, November 29, 2010

Wisdom of the Crowds: When do we need Independence?

I have been thinking lately about the conditions and assumptions needed for the wisdom of crowds to work. Surowiecki, in his popular book, gave the following four conditions for the crowd to arrive at the correct decision.
  • Diversity of opinion: Each person should have private information even if it's just an eccentric interpretation of the known facts.
  • Independence: People's opinions aren't determined by the opinions of those around them.
  • Decentralization: People are able to specialize and draw on local knowledge.
  • Aggregation: Some mechanism exists for turning private judgments into a collective decision.
The part that puzzled me most is the independence assumption. Actually, I can support pretty much any thesis: I can argue that independence is necessary; I can argue that we do not really need independence so much; and I can argue that independence is evil. And I will do all these things below.

Independence is necessary

It is not difficult to understand why, in some cases, independence is necessary. If the contributions from the crowd are not independent, then we may easily observe herding behavior. Daniel Tunkelang discusses a nice, instructive example (from the book Networks, Crowds, and Markets, by David Easley and Jon Kleinberg) in which the influence of the crowd can often lead to incorrect decisions, while independence can easily avoid erroneous outcomes.

The paper "Limits for the Precision and Value of Information from Dependent Sources" by Clemen and Winkler shows that in the presence of positive correlation, when we aggregate information from multiple dependent sources, the resulting accuracy does not increase as we would expect.

The figure below shows in the x-axis the number of dependent sources, and in the y-axis the equivalent number of independent sources, for various correlation coefficients ρ.

Even at moderate levels of ρ, we see how strong the limitations are. With ρ=0.4 it is almost impossible to go above the equivalent of two independent sources. And if we have noisy input, we often need a large number of independent sources to separate signal from noise.

In other words, it is better to have a couple of independent opinions, rather than having thousands of correlated voices.
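A standard way to quantify this, consistent with the figure above, is the effective number of independent sources: averaging n sources with equal pairwise correlation ρ is worth about n / (1 + (n-1)ρ) independent sources, a quantity that approaches 1/ρ no matter how large n gets. A quick sketch:

```python
def equivalent_sources(n, rho):
    """Effective number of independent sources equivalent to n sources
    with equal pairwise correlation rho (rho = 0 recovers n)."""
    return n / (1 + (n - 1) * rho)

# With rho = 0.4, even a thousand correlated sources are worth
# only about two and a half independent ones:
print(round(equivalent_sources(1000, 0.4), 2))  # 2.5

for rho in (0.1, 0.2, 0.4):
    print(rho, round(equivalent_sources(1000, rho), 2))
```

The cap at 1/ρ is exactly why the curves in the figure flatten out so quickly.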

Lack of independence: Perhaps not so bad

We have examples where lack of independence is not always bad.

For example, according to the paper "Measuring the Crowd Within" by Vul and Pashler, even asking the same person a second time and averaging the two answers can lead to improved outcomes.

Or take the other poster-child application of the wisdom of crowds: prediction markets (or markets in general). In these markets, people trade based on their personal information. However, they can always see (and be influenced by?) the aggregated opinion of the crowd, as reflected in the market prices. And empirical evidence illustrates that (prediction) markets work surprisingly well, despite (or because of) the lack of independence. Prior work has even demonstrated that non-public information spreads quickly through the market (and the SEC checks for insider trading if they detect unusual activity before the public release of sensitive information).

Wikipedia is another example: People do see what everyone else has done so far, before adding the extra information.

One paper that I found to be of interest is "Naïve Learning in Social Networks and the Wisdom of Crowds" by Golub and Jackson. The authors address the following question: "for which social network structures will a society of agents who communicate and update naïvely come to aggregate decentralized information completely and correctly?". The results are based on the ideas of convergence for Markov chains. One of the basic results says that the PageRank-style score of a node in the network defines the weight of the node's influence on the final outcome.
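To see the flavor of this naive-updating model, here is a toy simulation (the trust matrix and initial signals below are invented): every agent repeatedly averages the beliefs of the agents they listen to, and all beliefs converge to a consensus in which each agent's weight is their influence score.

```python
# Toy DeGroot-style naive updating: T[i][j] is the weight agent i
# places on agent j's opinion (rows sum to 1); values are invented.
T = [
    [0.6, 0.2, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.1, 0.8],
]
beliefs = [1.0, 0.0, 0.5]  # initial private signals (invented)

# Repeated naive averaging of neighbors' beliefs
for _ in range(200):
    beliefs = [sum(T[i][j] * beliefs[j] for j in range(3)) for i in range(3)]

# All agents end up at the same consensus: a weighted average of the
# initial signals, weighted by the stationary distribution of T
print([round(b, 3) for b in beliefs])
```

Agent 2 listens mostly to itself (0.8 self-weight), so the consensus is pulled toward its initial signal; that self-reinforcing weight is exactly the PageRank-like influence score.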

In all these cases, the participants get information from the crowd, they do not just follow blindly. So, there is some benefit in interacting.

Independence is bad

Going even further, we have cases where complete independence of participants is bad!

This typically happens when participants know only parts of the overall information. Through communication, it is possible to identify the complete picture, but lack of communication leads to suboptimal outcomes. Consider the example in Proposition 2 from the paper "We can't disagree forever" by Geanakoplos and Polemarchakis:
  • We have a four-sided die, with mutually exclusive outcomes A, B, C, and D, each occurring with probability 0.25.
  • In reality, the die came up A. But nobody knows that. Instead, the knowledge of the players is:
    • Player 1 knows that the event "A or B" happened
    • Player 2 knows that the event "A or C" happened
  • Both players can bet on whether "A or D" happened.
So, look at what happens:
  • No independence: If player 1 can communicate directly with player 2, they can figure out that event A happened, and they are certain that "A or D" occurred with probability 1.0
  • Independence: If player 1 cannot communicate, then both players assign a probability of 0.5 to the event "A or D". This is despite the fact that they collectively own enough information to figure out that A happened, and there is a market to trade the event. In other words, the market fails to aggregate the available information.
So, we have a scenario where the inability to spread information actually results in a bad outcome. However, if we allowed the participants to be non-independent, we could have an improved outcome.
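The arithmetic of the example is easy to verify by direct enumeration:

```python
from fractions import Fraction

def posterior(bet, info):
    """P(bet | info) when each of the four outcomes has prior 1/4."""
    return Fraction(len(bet & info), len(info))

# Independence: each player conditions only on their private information
p1 = posterior({"A", "D"}, {"A", "B"})  # Player 1 knows "A or B"
p2 = posterior({"A", "D"}, {"A", "C"})  # Player 2 knows "A or C"
print(p1, p2)  # 1/2 1/2

# Communication: pooling the information narrows it down to {A}
pooled = {"A", "B"} & {"A", "C"}
print(posterior({"A", "D"}, pooled))  # 1
```

The intersection of the two private signals pins down the outcome, but neither signal alone moves either player away from the 0.5 prior on the bet.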

Influence vs Information Spread

So, we can see actual examples where the spread of information (and hence, lack of independence) can be both good and bad. Lack of independence can lead to groupthink, and the individual voices get drowned in a sea of correlated opinions. At the other extreme, lack of communication leads to suboptimal outcomes.

The paper by Plott and Sunder, "Rational Expectations and the Aggregation of Diverse Information in Laboratory Security Markets", discusses the issue in the context of security markets and examines how market design affects the information aggregation properties of markets. (Thanks to David Pennock for the pointer.)

The paper by Ostrovsky, "Information Aggregation in Dynamic Markets with Strategic Traders" (in EC'09, and I think also forthcoming in Econometrica), provides a rigorous theoretical framework for the conditions under which information gets aggregated in a market: essentially, there are "separable" securities for which all the available information can be aggregated, and non-separable ones that do not have this property. However, I do not have the necessary background to fully understand and present the ideas in the paper. And I cannot see how to connect this with the literature on information spreading in social networks.

In a more intuitive sense, it seems that we need information to spread and not just influence.

Unfortunately, I cannot grasp the full picture, despite the fact that I have tried to look at the problem from different angles. (Ironic, eh?)

I still do not fully understand the implications of the above for the design of processes that involve human input. Does it make sense to show people what others have contributed so far? Will we see anchoring effects? Or will we see the establishment of a common ground, getting people to coordinate better and understand each other's input?

How can we quantify and put in a common framework all the above?

Wednesday, November 24, 2010

Mechanical Turk, "Interesting Tasks," and Cognitive Dissonance

It is a well-known fact that the wages on Mechanical Turk are horribly low. We can have endless discussions about this, and my own belief is that it is due to the lack of a strong worker reputation system. Others believe that this is due to the global competition for unskilled labor. And others are agnostic, saying that everything is a matter of supply and demand.

Other people try to explain the low wages by looking at the motivation of the workers: Quite a few people find the tasks on Mechanical Turk to be interesting. Ergo, they are willing to work for less.

Perfectly normal right? The task is interesting, people are willing to do it for less money. Sounds reasonable. Right? RIGHT? Well, be careful: correlation does not imply causation!

Enter the realm of social psychology (thanks Konstantinos!): The theory of cognitive dissonance indicates that the causation may go in entirely the opposite direction: the wages are low, so people justify their participation by saying the work is interesting!

This surprising result is due to the paper "Cognitive Consequences of Forced Compliance" from Festinger and Carlsmith (1959). It is one of the classic papers in psychology.

What did Festinger and Carlsmith say?

That people who get paid little to do boring tasks will convince themselves that they do it because the task is interesting. Otherwise, the conflict in their minds would be just too big: why work on such a boring task when the payment is horrible?

In contrast, someone who gets paid well to do the same boring task will consider the task boring. These well-paid participants can easily justify that they do the work for the money (so it makes sense to do a boring job).

Amazingly enough, Festinger and Carlsmith verified this experimentally. Here is the experimental setup description from the Wikipedia entry that describes this intriguing experiment:

Students were asked to spend an hour on boring and tedious tasks (e.g., turning pegs a quarter turn, over and over again). The tasks were designed to generate a strong, negative attitude.

Once the subjects had done this, the experimenters asked some of them to do a simple favor. They were asked to talk to another subject (actually an actor) and persuade them that the tasks were interesting and engaging.

Some participants were paid $20 (inflation adjusted to 2010, this equates to $150) for this favor, another group was paid $1 (or $7.50 in "2010 dollars"), and a control group was not asked to perform the favor.

When asked to rate the boring tasks at the conclusion of the study (not in the presence of the other "subject"), those in the $1 group rated them more positively than those in the $20 and control groups.

The researchers theorized that people experienced dissonance between the conflicting cognitions, "I told someone that the task was interesting", and "I actually found it boring." When paid only $1, students were forced to internalize the attitude they were induced to express, because they had no other justification. Those in the $20 condition, however, had an obvious external justification for their behavior (i.e., high payment), and thus experienced less dissonance.

So, when you read surveys (mine included) that indicate that Mechanical Turk workers participate on the platform because they "find the tasks interesting", (and so it makes sense to pay low wages), please have this alternative explanation in mind:

Turkers convince themselves that the work is interesting; otherwise, they would be completely crazy to sit there doing mind-numbingly boring work just to earn a wage of a couple of bucks per hour.

Tuesday, November 23, 2010

NYC, I Love You(r Data)

Last year, I experimented with the NYC Data Mine repository as a source of data for our introductory course on information systems (for business students, mainly non-majors). The results of the assignment were great, so I repeated it this year.

The goal of the assignment was to teach them how to grab large datasets and run database queries against them. As part of the assignment, the students had to go to the NYC Data Mine repository, pick two datasets of their interest, join them in Access, and perform some analysis. The ultimate goal was to get them to work with real data and use it to perform an analysis of their own interest.

Last year, some students took the easy way out and joined the datasets manually(!) on the borough values (Manhattan, Bronx, Brooklyn, Queens, Staten Island). This year, I explicitly forbade them from doing so. Instead, I asked them to join only on attributes with a large number of distinct values.
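For the curious, the kind of join the assignment asks for can be sketched with in-memory SQL; the miniature datasets below are invented stand-ins for two NYC Data Mine files:

```python
import sqlite3

# Invented miniature datasets, standing in for two NYC Data Mine files
restaurants = [("10012", "Cafe A", 95), ("10012", "Cafe B", 80),
               ("11201", "Cafe C", 70)]
incomes = [("10012", 85000), ("11201", 52000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE restaurants (zip TEXT, name TEXT, score INT)")
con.execute("CREATE TABLE incomes (zip TEXT, median_income INT)")
con.executemany("INSERT INTO restaurants VALUES (?, ?, ?)", restaurants)
con.executemany("INSERT INTO incomes VALUES (?, ?)", incomes)

# Join on zip code -- an attribute with many distinct values -- and aggregate
rows = con.execute("""
    SELECT r.zip, i.median_income, AVG(r.score)
    FROM restaurants r JOIN incomes i ON r.zip = i.zip
    GROUP BY r.zip ORDER BY r.zip
""").fetchall()
print(rows)  # [('10012', 85000, 87.5), ('11201', 52000, 70.0)]
```

A zip-code join like this gives thousands of matching rows across real datasets, which is exactly what a five-value borough join cannot.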

The results are here, and most of them are well worth reading! The analyses below are almost like a tour guide to New York's data sights :-) The new generation of Nate Silvers is coming.

Enjoy the projects:
  • Academia and Concern for the Environment! Is there a correlation between how much you recycle and how well students perform in school? Are kids who are more involved in school activities more likely to recycle? Does school really teach us to be environmentally conscious? To find out the answers check out our site!
  • An Analysis of NYC Events: One of the greatest aspects of New York are the fun festivals, street fairs, and block parties where you can really take in the culture. Our charts demonstrate the best times to visit New York and the best boroughs for attending events. We suggest that tourists and residents check out our research. Organizers of events, and people who make their money from events, should also consult our analysis.
  • How are income and after school programs related?: This study is an analysis of how income levels are related to the number of after school programs in an area. The correlation between income and number of school programs was interesting to analyze across the boroughs because while they did follow a trend, the different environments of the boroughs also had an exogenous effect. This is most evident in Manhattan, which can be seen in the study.
  • Restaurant Cleanliness in Manhattan What are the cleanest and dirtiest restaurants in Manhattan? What are the most common restaurant code violations? We analyzed data on restaurant inspection results and found answers to these questions and more.
  • Ethnic Dissimilarity's Effect on New Business: This analysis focuses on the relationship between new businesses and specific ethnic regions. Do ethnically dominated zip codes deter or promote business owners of differing ethnicities to open up shop?
  • Does The Perception Of Safety In Manhattan Match With Reality? People’s perception of events and their surroundings influence their behavior and outlook, even though facts may present a different story. In this regard, we took a look at the reported perception of people’s safety within Manhattan and compared it to the actual crime rates reported by the NYPD. The purpose of our study was to evaluate the difference between the actual crime rate and perceived safety of citizens and measure any discrepancy.
  • Women's Organizations love food stores!: We have concluded that a large percentage of women's organizations are located near casual dining and takeout restaurants as well as personal and professional service establishments compared to what we originally believed would be shopping establishments.
  • Hispanics love electronics!: Our goal for this project is to analyze the relationship between electronic stores and demographics in a particular zip code. We conducted a ratio analysis instead of a count analysis to lessen the effects of population variability as to create an "apples to apples" comparison. From our analysis, it can be seen that there is a greater presence of electronic stores in zip codes with a higher proportion of Hispanics.
  • Political Contributions and Expenditures: A comprehensive analysis of the political contributions and expenditures during the 2009 elections. The breakdown of who, in what areas of Manhattan contribute as well as how candidates spend their money are particularly interesting!
  • How Dirty is Your Food? Our goal for this project is to analyze the various hygiene conditions of restaurants in New York City. We cross referenced the inspection scores of the restaurants with the cuisine they serve to find out if there was any correlation between these two sets of data. By ranking the average health score of the various cuisines, we can determine which kinds of cuisines were more likely to conform to health standards.
  • Want to Start a Laundromat? An Electronic Store? The best possible places to start a Laundromat and an electronic store. For Laundromats we gave the area that had the lowest per capita income, as we noticed a trend that Laundromats do better in poorer neighborhoods. For electronic stores we found the lowest saturated areas that have the highest per capita income.
  • Where to Let Your Children Loose During the Day in NYC: For this analysis, we wondered whether there was a correlation between how safe people felt in certain areas in New York and the availability of after-school programs in the different community boards.
  • Best Place to Live in Manhattan After Graduation: We analyzed what locations in Manhattan, classified by zip code, would be the best to live for a newly graduate. We used factors like shopping, nightlife, gyms, coffeehouses, and more! Visit the website to get the full analysis.
  • Political Contributions and Structures: Our report analyzes the correlation between political contributions and structures in New York in varying zip codes.
  • Best Places to Eat and Find Parking in New York City: Considering the dread of finding parking in New York City, our analysis is aimed at finding the restaurants with the largest number of parking spaces in their vicinities.
  • Are the Cleanest Restaurants Located in the Wealthiest Neighborhoods? Our analysis between property value and restaurant rating for the top and bottom ten rated restaurants by zip codes in New York City
  • Analysis of Popular Baby Names
  • Restaurant Sanitary Conditions: Our team was particularly interested in the various cuisines offered in various demographic neighborhoods, grouped by zip codes. We were especially curious about the sanitary level of various cuisines offered by restaurants. The questions we wanted to answer were:
    • What zip codes had the highest rated restaurants? What type of cuisines are found in these zip codes?
    • What zip codes had the lowest rated restaurants? What type of cuisines are found in these zip codes?
  • Does having more community facilities improve residents' satisfaction with city agencies? Does having more public and private community facilities in NYC such as schools, parks, libraries, public safety, special needs housing, health facilities, etc lead to greater satisfaction with city services? On intuition, the answer is a resounding YES! With more facilities, we would enjoy our neighborhood better and develop a better opinion of New York City services. But how accurate is this intuition? In this analysis, we put that to the test.
  • Housing Patterns in Manhattan: The objective of our analysis was to identify factors which play a role in determining vacancy rates in Manhattan’s community districts. We inferred that vacancy rates are representative of the population’s desire to live in a particular district. We examined determining factors of why people want to live in a particular district including: quality of education, health care services, crime control in each district, etc.
  • Analysis of Cultural Presence and Building Age by Zip Code: Manhattan is a hub for cultural organizations and opportunities for community involvement. But does the amount of "community presence" differ based on area that you live? Is there any relationship between the year that buildings in various areas were built, and the available programs for students and cultural organizations for the general public in that area? We analyzed whether a relationship existed between the number of cultural organizations and after school programs available in a zip code, and the average year that the buildings in the zip code were built. To further our analysis we looked at whether the age of buildings in areas with greatest "cultural presence" affected the sales price of the buildings.
  • Analysis of Baby Names across the Boroughs: We decided to analyze the Baby Names given in 2008 across the boroughs of Manhattan, the Bronx and Brooklyn. We found the most popular names in each Borough, along with top names specific to each borough that were unpopular in other Boroughs. We also found certain factors that could be a determining factor in the naming of these babies.
  • Analysis of New York City Street Complaints: We analyzed the different kinds of street complaints made in New York City, how the city tends to respond to them, and which streets have the most overall complaints when you also bring building complaints into the picture. This analysis taught us that Broadway has the most street complaints but it also piqued our interest in conducting even further analyses.
  • Campaign Contributions and Community Service Programs The goal of our analysis was to determine if there is a correlation between contributions by NYC residents to election candidates and community service programs. We wanted to see if people who are more financially invested in elections are also more inclined to be involved in their neighborhoods through community programs.
  • Public Libraries in Queens: We looked at how many public libraries there were in each zip code in Queens. We also looked at the number of people and racial composition in each zip code, to see if these factors are related.
  • Sidewalk Cafe Clustering: Our study’s goal is to understand where sidewalk cafes cluster and some potential reasons why they cluster. We start by looking at what areas of the city are most populated with sidewalk cafes. Then we look to see if there are any trends related to gender or race demographics. We finally look to see if there is any influence on property value on the abundance of sidewalk cafes.

The surprise this year: most students could not figure out what the "CSV" data file was. Many of them thought it was some plain text and did not try to use it. (Hence the prevalence of the electronic and laundromat analyses, which were based on datasets available in Excel format.) I guess next year I will need to explain that as well.

Friday, November 19, 2010

Introductory Research Course: Replicate a Paper

The transition to the happy life of a tenured professor meant that I get to be involved in the wonderful part of the job: getting to sit on school-wide committees.

Fortunately, I was assigned to an extremely interesting committee: we get to examine the school's PhD program, identify best practices, see what works and what does not, and try to reconcile everything into a set of recommendations for the faculty to examine. The double benefit for me is that I get to understand how the other departments in the school operate, something which, for a computer scientist in a business school, was still kind of a mystery to me.

Anyway, as part of this task, I learned about an interesting approach to teach starting PhD students about research:

A course in which students pick a paper and get to replicate it.

I think this is a great idea. First of all, I am a big fan of learning-by-doing.

For example, to understand how an algorithm works, you need to actually implement it. Not get the code and re-run the experiments. Implement everything, going as deeply as possible. In C, in Java, in Perl, in Python, in MatLab, in Maple, in Stata, it does not matter. For theory, the same thing: replicate the proofs. Do not skip the details. For data analysis, the same. Get your hands dirty.

During such a process, it is great to have someone serve as a sounding board. Ask questions about the basics. Why do we follow this rule of thumb? What is the assumption behind the use of this method? Asking these questions is much easier while replicating someone else's work than when working on your own research and trying to get a paper out.

Myself, I still write code for this very same reason. I need to see how the algorithm behaves. I need to see the small peculiarities in its behavior. This observation gets me to understand better not only the algorithm itself but also the other techniques the algorithm employs. I have been trying to understand econometrics a little more deeply over the last few months, and I do the same. Frustrating? Yes. Slow? Yes. Helpful? You bet!

So, at the end of the seminar, if the students can replicate the results of the paper, great: they learned what it takes to create a paper, and most probably understood a few other topics more deeply along the way.

If the results are different than in the original paper, then perhaps this is the beginning of a deeper investigation. Why are things different? Tuning? Settings? Bugs? Perhaps they uncovered something the authors did not see?

Even if the authors' data are not available, the students should be able to reproduce the work and get similar results, perhaps with different data sets. If the results with different data sets are qualitatively different, then the paper is essentially not reproducible. (And replicability is not reproducibility.)

And in any case, no matter if the students can replicate the results or not, no matter if the paper is reproducible or not, the lesson from such an exercise can be valuable.

Often the student who gets to understand the paper deeply falls in love with the topic, and goes on to learn more and more about the area. Following in someone's footsteps is often the first step towards finding your own path.

I think this seminar will make it to the final set of recommendations to the school. I am wondering how many other schools have such a course.

Update1: Needless to say, this is a class, not something that students try on their own. Therefore, the professor should pick a set of papers which are educational and useful to replicate. This can be either an easy "classic" paper, or an "important new" result, or even a paper that forces the students to use particular tools and data sources. The students choose from a predefined set, not from the wild.

Update2: Thanks to Jun, a commenter below, we now have a reference to the originator of the idea. Apparently, Gary King published a paper in 2006, titled "Publication, Publication", in "Political Science and Politics". From the abstract: "I show herein how to write a publishable paper by beginning with the replication of a published article. This strategy seems to work well for class projects in producing papers that ultimately get published, helping to professionalize students into the discipline, and teaching them the scientific norms of the free exchange of academic information. I begin by briefly revisiting the prominent debate on replication our discipline had a decade ago and some of the progress made in data sharing since."

Thursday, October 28, 2010

Cease and desist...

This was just too funny to resist posting.

Here is the background: As part of the core undergraduate introductory class "Information Technology in Business and Society", students have to create a website. To make things more interesting, I ask them to pick a few queries on Google, and try to create a website that will show up on the top of the results for these queries. Essentially it is a mix of technical skills with the ability to understand how pages are ranked and how to analyze the "competition" for these keywords.

So, a student of mine (John Cintolo) created a website about "Hit Club Music Summer 2010", with links to YouTube videos. No copyright infringement or anything illegal.

And one day later, he gets a "cease and desist" letter from HotNewClubSongs. It has so many gems that I will list it here, for your viewing pleasure.

To whom it may concern

It has come to my attention that your website "Hit Club Music Summer 2010" on this URL has potential to threaten my Alexa page ranking. As a consequence, this may cause our website to lose vital income which is generated from ad-space and it will not be tolerated. Due to the nature of your actions I am requesting a formal take-down of your website due to copyright infringement as the music posted on your "" links is not endorsed by the rightful authors, as counseled by my attorney. Considering that you are also going through the New York University server, your actions may cost you and your educational institution unless you cease the aforementioned copyright infringement. If you continue hosting your service I will be forced to file a civil suit in which you will be charged for any lost advertisement revenue, averaging $0.52 per day.

In addition, your html markup shows your ineptitude in online web design, making your website an inefficient option for visitors who truly care about the Club Songs Industry. The listing of the dates on your monthly playlists go in ascending order rather than descending. This is just one of the many flaws of your clearly haphazardly designed website. However, I will give you neither my website URL nor my constructive criticism, for you are clearly trying to make money in an industry which doesn’t have room for your lack of music and website design knowledge. My page viewers have complimented me numerous times on the layout and content of my page.

You may contact me at this e-mail for any further concerns, although it is clear there is not much more to say. Your carelessness, inefficiency, and utter incompetence have gotten you into this hole, and unless you find a way out by October 31st, when my ad-space revenue comes in, further action will be taken. Also, for legal purposes, when and where was this website created? In the chance that it was created before September 30th, 2010, a law suit will be filed for the obvious decrease in revenue from my ads last month, totaling $7.34.

Thank you for your time,

HotNewClubSongs- A Forerunner in the Club Music Industry

Needless to say, I congratulated the student for achieving the goals of the assignment, and offered to cover the damages :-)

Tuesday, October 26, 2010

Student websites

I am just posting this to provide links to the pages of my students, so that Google indexes their websites quickly. Feel free to browse, of course...

Can Crowdsourcing Scale? The Role of Active Learning

Nobody is denying the fact that crowdsourcing is becoming mainstream. People use Mechanical Turk for all sorts of applications. And many startups create business plans assuming that crowdsourcing markets will be able to provide enough labor to complete the tasks that will be posted in the market.

And at this point, things become a little tricky.

Can crowdsourcing markets scale? MTurk can tag a thousand images within a few hours. But what will happen if we place one million images in the market? Will there be enough labor to handle all of the posted tasks? How long will the task take? And what will be the cost?

Scaling by combining machine learning with crowdsourcing

Unless you can come up with ingenious ideas, the acquisition of data comes at a cost. To reduce cost, we need to reduce the need for humans to label data. To reduce the need for humans, we need to automate the process. To automate the process, we need to build machine learning models. To build machine learning models, we need humans to label data.... Infinite loop? Yes and no.

The basic idea is to use crowdsourcing in conjunction with machine learning. In particular, we leverage ideas from active learning: The idea behind active learning is to use humans only for the uncertain cases, and not for everything. Machine learning can take care of the simple cases, and ask humans to help for the most important and ambiguous cases.

We also need to have one extra thing in mind: Crowdsourcing generates noisy training data, as opposed to the perfect data that most active learning algorithms expect from humans. So, we need to perform active learning not only towards identifying the cases that are ambiguous for the model, but also figure out which human labels are more likely to be noisy, and fix them. And we also need to be proactive in identifying the quality of the workers.
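To make the noisy-label point concrete, here is a toy sketch of the alternating idea: infer a consensus label per item using quality-weighted voting, then re-score each worker by agreement with the consensus. This is a simplified illustration, not the actual algorithm from our paper; all names and data are invented.

```python
from collections import defaultdict

def estimate_quality_and_labels(votes, rounds=3):
    """votes: list of (worker_id, item_id, label) tuples, labels 0/1.
    Alternate between (1) inferring each item's label by a
    quality-weighted vote and (2) re-scoring each worker by how
    often they agree with the inferred labels."""
    workers = {w for w, _, _ in votes}
    items = {i for _, i, _ in votes}
    quality = {w: 1.0 for w in workers}   # start by trusting everyone equally
    labels = {}
    for _ in range(rounds):
        # weighted majority vote per item
        tally = defaultdict(float)
        for w, item, label in votes:
            tally[(item, label)] += quality[w]
        labels = {i: max((0, 1), key=lambda l: tally[(i, l)]) for i in items}
        # worker quality = fraction of their votes that match the consensus
        agree, total = defaultdict(int), defaultdict(int)
        for w, item, label in votes:
            total[w] += 1
            agree[w] += (label == labels[item])
        quality = {w: agree[w] / total[w] for w in workers}
    return labels, quality

# two reliable workers and one who always disagrees with them
votes = [("w1", "pageA", 1), ("w2", "pageA", 1), ("w3", "pageA", 0),
         ("w1", "pageB", 0), ("w2", "pageB", 0), ("w3", "pageB", 1)]
labels, quality = estimate_quality_and_labels(votes)
```

On this toy input the consensus labels follow the two reliable workers, and the quality score of the always-disagreeing worker drops to zero, so their future votes stop counting.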

In any case, after addressing the quality complications, and once we have enough data, we can use the acquired data to build basic machine learning models. The basic machine learning models can then take care of the simple cases, and free humans to handle the more ambiguous and difficult cases. Then, once we collect enough training data for the more difficult cases, we can then build an even better machine learning model. The new model will then automate an even bigger fraction of the process, leaving humans to deal with only the harder cases. And we repeat the process.

This idea was at the core of our KDD 2008 paper, and since then we have significantly expanded these techniques to work with a wider variety of cases (see our current working paper: Repeated Labeling using Multiple Noisy Labelers.)

Example: AdSafe Media.

Here is an example application, deployed in practice through AdSafe Media: Say that we want to build a classifier that recognizes porn pages. Here is an overview of the process, which follows the process of our KDD paper:
  1. We get a few web pages labeled as porn or not. 
  2. We get multiple workers to label each page, to ensure quality.
  3. We compute the quality of each labeler, fix biases, and get better labels for the pages.
  4. We train a classifier that classifies pages as porn or not.
  5. For incoming pages, we classify them using the automatic classifier.
    • If the classifier is confident, we use the outcome of the classifier
    • If the classifier is not confident, the page is directed to humans for labeling (the more ambiguous the page, the more humans we need)
  6. Once we get enough new training data, we move to Step 4 again.
Benefits: Once the classifier is robust enough, there is no need to use humans to handle the basic tasks. The classifier takes care of the majority of tasks, ensuring that the speed of classification is high and that the cost is low. (Even at 0.1 cents per page, humans are simply too expensive when we deal with billions of pages.) Humans are reserved for the pages that are difficult to classify. This ensures that for the difficult cases there is always someone to provide feedback, and this crowdsourced feedback ensures that the classifier improves over time.
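A minimal sketch of the routing rule in step 5, assuming a classifier that outputs a probability that the page is porn. The thresholds and the "more humans for more ambiguous pages" formula are purely illustrative, not AdSafe's actual values:

```python
def route(score, confident=0.9):
    """Route a page based on the classifier's score in [0, 1].
    Confident scores are handled by the machine alone; ambiguous
    scores (near 0.5) are sent to one or more human labelers."""
    if score >= confident or score <= 1 - confident:
        return ("machine", 0)               # confident: no human needed
    ambiguity = 1 - abs(score - 0.5) * 2    # 0 at the thresholds, 1 at 0.5
    return ("human", 1 + round(ambiguity * 4))  # request 1 to 5 workers

# clearly porn and clearly clean pages go to the machine;
# borderline pages go to humans, with the most labels at score 0.5
decisions = [route(s) for s in (0.99, 0.05, 0.5, 0.7)]
```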

Other example: SpeakerText. 

According to the press, SpeakerText is using (?) this idea: they use an automatic transcription package to generate a first rough transcript, and then use humans to improve the transcription. The high-quality transcriptions can then be used to train a better model for automatic speech recognition. And the cycle continues.

Another example: Google Books. 

The ReCAPTCHA technique is used as the crowdsourcing component for digitizing books for the Google Books project. As you may have imagined, Google is actively using optical character recognition (OCR) to digitize the scanned books and make them searchable. However, even the best OCR software will not be able to recognize some words from the scanned books.

ReCAPTCHA uses the millions of users on the Internet (most notably, the 500 million Facebook users) as transcribers that fix whatever OCR cannot capture. I guess that Google reuses the fixed words in order to improve their internal OCR system, so that they can reach their goal of digitizing 129,864,880 books a little bit faster.

The limit?

I guess the Google Books and ReCAPTCHA projects are really testing the scalability limits of this approach. The improvements in the accuracy of machine learning systems become marginal once we have enough training data, and we need orders of magnitude more training data to see noticeable improvements.

Of course, with 100 million books to digitize, even an "unnoticeable" improvement of 0.01% in accuracy corresponds to 1 billion more words being classified correctly (assuming 100K words per book), and results in 1 billion fewer ReCAPTCHAs needed. But I am not sure how many ReCAPTCHAs are needed to achieve this hypothetical 0.01% improvement. Luis, if you are reading, give us the numbers :-)
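The back-of-envelope arithmetic is easy to check (the 100K words per book is the assumption from the paragraph above):

```python
books = 100_000_000        # ~100 million books to digitize
words_per_book = 100_000   # assumed words per book

total_words = books * words_per_book   # 10 trillion words overall
extra_correct = total_words // 10_000  # a 0.01% accuracy improvement
# one billion more words recognized correctly,
# i.e., one billion fewer ReCAPTCHAs needed
```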

But in any case, I think that 99.99% of the readers of this blog would be happy to hit this limit.

Thursday, October 21, 2010

A Plea to Amazon: Fix Mechanical Turk!

It is now almost four years since I started experimenting with Mechanical Turk. Over these years I have been a great evangelist of the idea.

But as Mechanical Turk becomes mainstream, it is now time for the service to get the basic stuff right. The last few weeks I found myself repeating the same things again and again, so I realized that it is now time to write these things down...

Mechanical Turk, It is Time to Grow Up

The beta testing is over. If the platform wants to succeed, it needs to evolve. Many people want to build on top of MTurk, and the foundations are lacking important structural elements.

Since the beginning of September, I have met with at least 15 different startups describing their ideas and their problems in using and leveraging Mechanical Turk. And hearing their stories, one after the other, I realized: Every single requester has the same problems:
  • Scaling up
  • Managing the complex API
  • Managing execution time
  • Ensuring quality
These problems were identified years ago. And the problems were never addressed.

The current status quo simply cannot continue. It is not good for the requesters, it is not good for the workers, it is not good for even completing the tasks. Amazon, pay attention. These are not just feature requests. These are fundamental requirements for any marketplace to function.

Amazon likes to present the hands-off approach to Mechanical Turk as a strategic choice: in the same way that EC2, S3, and many other web services are targeted to developers, Mechanical Turk is a neutral clearinghouse of labor. It provides just the ability to match requesters and workers. Everything else is the responsibility of the two consenting parties.

Too bad that this hands-off approach cannot work for a marketplace. The badly needed fixes are exactly the four bullet points listed above. Below, I discuss these topics in more detail.

Requesters Need: A Better Interface To Post Tasks

A major task of a marketplace is to reduce overhead, friction, transaction costs, and search costs. The faster and easier it is to transact, the better the market. And MTurk fails miserably on that aspect.

I find it amazing that the last major change on Mechanical Turk for the requesters was the introduction of a UI to submit batch tasks. This was back in the summer of 2008. George Bush was the president, Lehman Brothers was an investment bank, Greece had one of the fastest-growing GDPs in Europe, Facebook had fewer than 100 million users, and Twitter was still a novelty. It would take 8 more months for FourSquare to launch.

It is high time to make it easier for requesters to post tasks. It is ridiculous to call the command-line tools user-friendly!

What is the benefit of having access to a workforce for microtasks if a requester needs to hire a full-time developer (costing at least $60K) just to deal with all the complexities? How many microtasks does someone need to execute to recoup the cost of development?

If every requester, in order to get good results, needs to: (a) build a quality assurance system from scratch, (b) ensure proper allocation of qualifications, (c) learn to break tasks properly into a workflow, (d) stratify workers according to quality, (e) [whatever else...], then the barrier is just too high. Only very serious requesters will devote the necessary time and effort.

What is the expected outcome of this barrier? We expect to see a few big requesters and a long tail of small requesters that are posting tiny tasks. (Oh wait, this is the case already.) In other words: It is very difficult for small guys to grow.

Since we are talking about allowing easy posting of tasks: Amazon, please take a look at TurKit. Buy it, copy it, do whatever, but please allow easy implementation of such workflows in the market. Very few requesters have simple, one-pass tasks. Most requesters want crowdsourced workflows. Give them the tools to build them easily.

MTurk is shooting itself in the foot by encouraging requesters to build their own interfaces and workflow systems from scratch! For many, many HITs, the only way to have a decent interface is to build it yourself in an iframe. What is the problem with that? By doing so, MTurk makes it extremely easy for the requester to switch labor channels. The requester who has built an iframe-powered HIT can easily get non-Turk workers to work on these HITs. (Hint: just use different workerid's for other labor channels and have the other workers visit the iframe html page directly to complete the task.) Yes, it is good for the requester in the long term not to be locked in, but I guess all requesters would be happier if they did not have to build the app from scratch.

Requesters Need: A True Reputation System for Workers

My other big complaint. The current reputation system on Mechanical Turk is simply bad. "Number of completed HITs" and "approval rate" are easy to game.

Requesters need a better reputation profile for workers. Why? A market without a reputation mechanism quickly turns into a market for lemons: when requesters cannot easily differentiate good workers from bad, they tend to assume that every worker is bad. This results in good workers getting paid the same amount as the bad ones. With such low wages, good workers leave the market. In the end, the only Turkers who remain in the market are the bad ones (or the crazy good ones willing to work for the same payment as the bad workers).

This in turn requires the same task to be completed by many workers, way too many times, to ensure quality. I am not against redundancy! (Quite the opposite!) But it should be a technique for turning moderate-quality input into high-quality output. A technique for capturing diverse points of view for the same HIT. Repeated labeling should NOT be the primary weapon against spam.

The lack of a strong reputation system hurts everyone, and hurts the marketplace! Does Amazon want to run a market for lemons? I am sure that the margins will not be high.

Here are a few suggestions on what a worker reputation mechanism should include.
  • Have more public qualification tests: Does the worker have proper English writing skills? Can the worker proofread? Most marketplaces (eLance, oDesk, vWorker, Guru) allow participants to pass certification tests to signal their quality and knowledge in different areas. The same should happen on MTurk. If Amazon does not want to build such tests, let requesters make their own qualification tests available to other requesters for a fee. Myself, I would pay to use the qualifications assigned by CastingWords and CrowdFlower. These requesters would serve as the certification authorities for MTurk, in the same way that universities certify abilities for the labor markets.
  • Keep track of working history: For which requester did the worker work in the past? How many HITs, for what payment? For how long? Long history of work with reputable requesters is a good sign. In the real world, working history matters. People list their work histories in their resumes. Why not on MTurk?
  • Allow rating of workers: What is the rating that the worker received for the submitted work? Please allow requesters to rate workers. We have it everywhere else. We rate films, books, electronics, we rate pretty much everything.
  • Disconnect payment from rating: Tying reputation to acceptance rate is simply wrong. Currently, we can either accept the work and pay, or reject the work and refuse to pay. This is just wrong. We do not rate restaurants based on how often the customers refused to pay for the food! I should not have to reject and not pay for the work, if the only thing that I want to say is that the quality was not perfect. Rejecting work should be an option reserved for spammers. It should never be used against honest workers that do not meet the expectations of the requester.
  • Separate HITs and ratings by type: What was the type of the submitted work? Transcription? Image tagging? Classification? Content generation? Twitter spam? Workers are not uniformly good in all types of tasks. Writing an article requires a very different set of skills from those required for transcription, which in turn are different than the skills for image tagging. Allow requesters to see the rating across these different categories. Almost as good as the public qualification tests.
  • And make all the above accessible from an API, for automatic hiring decisions.
It cannot be that hard to do the above! Amazon.com has run a huge marketplace with thousands of merchants for years. The guys at Amazon know how to design, maintain, and protect a reputation system for a much bigger marketplace. How hard can it be to port it to Mechanical Turk?

(Amazon's response about the reputation system... )

In a recent meeting, I asked this same question: Why not have a real reputation system?

The MTurk representative defended the current setup, with the following argument:
On the marketplace, the (large number of) buyers can rate the (small number of) merchants, but not vice versa. So, the same thing happens on MTurk. The (large number of) workers can rate the (small number of) requesters using TurkerNation and TurkOpticon. So the opposite should not happen: requesters should not rate workers.
I felt that the answer made sense: two-sided reputation systems indeed have deficiencies. They often lead to mutual-admiration schemes, so such systems end up being easy to hack (not that the current system is too hard to beat.) So, I was satisfied with the given answer... For approximately 10 minutes! Then I realized: Humbug!

There is no need for a reputation system for product buyers on Amazon.com's marketplace! It is not like eBay, where a buyer can win the auction and never pay! The reputation of the buyer on Amazon.com is irrelevant. On Amazon, when a buyer buys a product, as long as the credit card payment clears, the reputation of the buyer simply does not matter. There is no uncertainty, and no need to know anything about the buyer.

Now let's compare the product marketplace with MTurk: The uncertainty on MTurk is about the workers (who are the ones selling services of uncertain quality). The requester is the buyer in the MTurk market. So, indeed, there should not be a need for a reputation system for requesters, but the workers should be rated.

And at that point, people will protest: Why do we have the Hall of Fame/Shame on Turker Nation, why do we have TurkOpticon? Does Panos consider these efforts irrelevant and pointless?

And here is my reply: The very fact that we have such systems means that there is something very wrong with the MTurk marketplace. I expand below.

Workers Need: A Trustworthiness Guarantee for Requesters

Amazon should really learn from its own marketplace on Amazon.com. Indeed, on Amazon.com, it is not possible to rate buyers. Amazon simply ensures that when a buyer buys a product online, the buyer pays the merchant. So Amazon, as the marketplace owner, ensures the trustworthiness of at least one side of the market.

Unfortunately, MTurk does not really guarantee the trustworthiness of the requesters. Requesters are free to reject good work and not pay for work they get to keep. Requesters do not have to pay on time. In a sense, the requesters are serving as the slave masters. The only difference is that on MTurk the slaves can choose their master.

And so, Turker Nation and TurkOpticon were born for exactly this reason: To allow workers to learn more about their masters. To learn which requesters behave properly, which requesters abuse their power.

However, this generates a wrong dynamic in the market. Why? Let's see how things operate.

The Requester Initiation Process

When new requesters come to the market, they are treated with caution by the experienced, good workers. Legitimate workers will simply not complete many HITs from a new requester until they know that the requester is legitimate, pays promptly, and does not reject work unfairly. Most of the good workers will complete just a few HITs from the newcomer, and then wait and observe how the requester behaves.

Now, try to be on the requester's side.

If the requester posts small batches, things may work well. A few good workers do a little bit of good work, and the results come back like magic. The requester is happy, pays, everyone is happy. The small requester will come back after a while, post another small batch, and so on. This process generates a large number of happy small requesters.

However, what happens when the newcomers post big batches of HITs? Legitimate workers will do a little bit of work and then wait and see. Nobody wants to risk a mass rejection, which can be lethal for a worker's reputation. Given the above, who are the workers willing to work on the HITs of the new, unproven requester? You guessed right: spammers and inexperienced workers. Result? The requester gets low-quality results, gets disappointed, and wonders what went wrong.

In the best case, the new requesters will seek expert help, (if they can afford it). In the worst case, the new requesters leave the market and use more conventional solutions.

At this point, it should be clear that just having a subjective reputation system for requesters is simply not enough. We need a trustworthiness guarantee for the requesters. Workers should not be afraid of working for a particular requester.

Online merchants in the Amazon marketplace do not need to check the reputation of the people they sell to. Amazon ensures that the buyers are legitimate and not fraudsters. Can you imagine if every seller on Amazon had to check the credit score and the trustworthiness of every buyer they sell to? What did you say? It would be a disaster? That people would only sell to a few selected buyers? Well, witness the equivalent disaster on Mechanical Turk.

So, what is needed for the requesters? Since the requester is essentially the "buyer", there is no need for subjective ratings. The worker should see a set of objective characteristics of the requester, and decide whether to pick a specific posted HIT or not. Here are a few things that are objective:
  • Show speed of payment: The requester payment already goes into an Amazon-controlled "escrow" account. The worker should know how fast the requester typically releases payment.
  • Show the rejection rate for the requester: Is a particular requester litigious, frequently rejecting submitted work as spam?
  • Show the appeal rate for the requester: A particular requester may have high rejection rate just due to an attack from spammers. However, if the rejected workers appeal and win frequently, then there is something wrong with the requester.
  • Disallow the ability to reject work that is not spam: The requester should not be able to reject submitted work without paying. Rejection should be a last-resort mechanism, reserved only for obviously bad work. The worker should have the right to appeal (and potentially have the submitted work automatically reviewed by peers). This should remove a significant source of uncertainty in the market, allowing workers to be more confident about working with a new requester.
  • Show total volume of posted work: Workers want to know if the requester is going to come back to the market. The volume of posted work and the lifetime of the requester in the market are important characteristics: workers can use this information to decide whether it makes sense to invest the time to learn the tasks of the requester.
  • Make all the above accessible from an API: Let other people build worker-facing applications on top of MTurk.
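To make the list above concrete, here is a minimal sketch of what such a worker-facing requester-stats API might return. This is purely hypothetical: MTurk exposes no such endpoint, and every field name and threshold below is invented for illustration.

```python
# Hypothetical sketch of a worker-facing requester-stats record.
# MTurk offers no such API; all field names and thresholds are invented.
from dataclasses import dataclass

@dataclass
class RequesterStats:
    requester_id: str
    median_hours_to_payment: float  # speed of payment from the escrow account
    rejection_rate: float           # fraction of submissions rejected
    appeal_success_rate: float      # fraction of worker appeals that succeed
    total_hits_posted: int          # total volume of posted work
    days_active: int                # lifetime of the requester in the market

    def looks_risky(self) -> bool:
        """A crude screening rule a third-party worker tool could apply:
        flag requesters who reject often AND lose most of the appeals."""
        return self.rejection_rate > 0.10 and self.appeal_success_rate > 0.50

# A requester who rejects 15% of submissions and loses 70% of appeals:
stats = RequesterStats("A1B2C3", median_hours_to_payment=18.0,
                       rejection_rate=0.15, appeal_success_rate=0.70,
                       total_hits_posted=5000, days_active=400)
print(stats.looks_risky())  # True: high rejections, and appeals usually win
```

With such objective data exposed through an API, third parties could build worker-facing dashboards and filters on top of MTurk, exactly as the last bullet suggests.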
So, a major role of a marketplace is to instill a sense of trust. Requesters should trust the workers to complete the work, and workers should not have to worry about unreasonable behavior from the requesters. This minimizes the search costs associated with finding a trustworthy partner in the market.

Let's see the final part that is missing.

Workers Need: A Better User Interface

As mentioned earlier, beyond trust, the other important role of the market is to minimize transaction overhead and search costs as much as possible. The transacting parties should find each other as fast as possible, fulfill their goals, and move on. The marketplace should be almost invisible. In this market, where requesters post tasks and the tasks wait for the workers, it is important to make it as easy as possible for workers to find tasks they want to work on.

Current Problem: Unpredictable Completion Times

Unfortunately, the current interface severely restricts workers in their ability to find tasks. Workers cannot search for a requester, unless the requester puts their name in the keywords. Workers also have no way to navigate and browse through the available tasks to find things of interest.

In the end, workers rely on two main sorting mechanisms: see the most recent HITs, or see the HIT groups with the most HITs. This means that workers use priority queues to pick the tasks to work on.

What is the result when tasks are being completed following priorities? The completion times of the tasks follow a power-law! (For details on the analysis, see the preprint of the XRDS report "Analyzing the Amazon Mechanical Turk Marketplace".) What is the implication? It is effectively impossible to predict the completion time of the posted tasks. For the current marketplace (with a power-law exponent a=1.5), the distribution cannot even be used to predict the average waiting time: the theoretical average is infinite, i.e., in practice the mean completion time is expected to increase continuously as we observe the market for longer periods of time.
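The divergence of the mean is easy to illustrate. Assuming a=1.5 is the exponent of the power-law density (so the density falls off as x^(-1.5)), the corresponding Pareto tail exponent is alpha = a - 1 = 0.5, and a Pareto mean is finite only when alpha > 1. The sketch below (with invented parameter values, not the actual marketplace data) shows the theoretical formula and a simulation via inverse-CDF sampling:

```python
# Sketch: why completion times with density ~ x^(-1.5) have no finite mean.
# For a Pareto tail P(X > x) ~ x^(-alpha), the mean is alpha*x_min/(alpha-1)
# when alpha > 1, and diverges when alpha <= 1. A density exponent a = 1.5
# corresponds to tail exponent alpha = 0.5.
import math
import random

def pareto_mean(alpha: float, x_min: float = 1.0) -> float:
    """Theoretical mean of a Pareto(alpha, x_min); infinite for alpha <= 1."""
    if alpha <= 1:
        return math.inf
    return alpha * x_min / (alpha - 1)

def sample_mean(alpha: float, n: int, seed: int = 0) -> float:
    """Empirical mean of n Pareto draws via inverse CDF: X = U^(-1/alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = 1.0 - rng.random()  # u in (0, 1], avoids division by zero
        total += u ** (-1.0 / alpha)
    return total / n

print(pareto_mean(0.5))  # inf: no predictable average completion time
print(pareto_mean(3.0))  # 1.5: a lighter tail would have a finite mean
# With alpha = 0.5 the sample mean keeps drifting upward as n grows,
# dominated by a handful of enormous waiting times:
for n in (10**2, 10**4, 10**6):
    print(n, sample_mean(0.5, n))
```

Because every draw exceeds x_min, the sample mean is always finite for any finite sample, yet it never settles down; that is precisely why the observed "average completion time" keeps growing the longer one watches the market.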

The proposed solutions? Solutions so easy and so obvious that it hurts to propose them:
  • Have a browsing system with tasks posted under task categories. See, for example, the main page of oDesk, where tasks are posted under one or more categories. Is this really hard to do?

  • Improve the search engine. Seriously, how hard is it to include all the fields of a HIT into the search index? Ideally it would be better to have a faceted interface on top, but I would be happy to just see the basic things done right.
  • Use a recommender system to propose HITs to workers. For this suggestion, I have to credit a site on the Internet with some nifty functionality: it monitors your past buying and rating history, and then recommends products that you may enjoy. It is actually pretty nice and has helped that online store differentiate itself from its competitors. Trying to remember the name of the site... The recommendations look like this:

    It would be a good idea to have something like that on Amazon Mechanical Turk. Ah! I remembered! The name of the site with the nice recommendations is Amazon! Seriously. Amazon cannot build a good recommender system for its own market?

Competition awaits

Repeat after me: A labor marketplace is not the same thing as a computing service. Even if everything is an API, the design of the market still matters.

It is too risky to assume that MTurk can simply be a bare-bones clearinghouse for labor, in the same way that S3 can be a bare-bones provider of cloud storage. There is simply no sustainable advantage and no significant added value. Network effects are not strong (especially in the absence of reputation), and just clearing payments and dealing with Patriot Act and KYC requirements is not a significant added value.

Other marketplaces already do this: they build APIs and have better designs as well. It will not be difficult for them to get into the micro segment of the crowdsourcing market, and it may happen much faster than Amazon expects. Imho, oDesk and eLance are moving towards this space by building strong APIs for worker management and good reputation systems. Current MTurk requesters that create their HITs using iframes can very easily hire eLance and oDesk workers instead of using MTurk.

The recent surge of microcrowdsourcing services indicates that there are many who believe that the position of MTurk in the market is ready to be challenged.

Is it worth trying to challenge MTurk? Luis von Ahn, looking at an earlier post of mine, tweeted:

MTurk is TINY (total market size is on the order of $1M/year): Doesn't seem like it's worth all the attention.

I will reply with a prior tweet of mine:

Mechanical Turk is for crowdsourcing what AltaVista was for search engines. We now wait to see who will be the Google.