A Plea to Amazon: Fix Mechanical Turk!

It is now almost four years since I started experimenting with Mechanical Turk. Over these years I have been a great evangelist of the idea.

But as Mechanical Turk becomes mainstream, it is now time for the service to get the basic stuff right. Over the last few weeks I have found myself repeating the same things again and again, so I realized that it is now time to write these things down...

Mechanical Turk, It is Time to Grow Up

The beta testing is over. If the platform wants to succeed, it needs to evolve. Many people want to build on top of MTurk, and the foundations are lacking important structural elements.

Since the beginning of September, I have met with at least 15 different startups describing their ideas and their problems in using and leveraging Mechanical Turk. And hearing their stories, one after the other, I realized: Every single requester has the same problems:

Scaling up
Managing the complex API
Managing execution time
Ensuring quality

These problems were identified years ago. And the problems were never addressed.

The current status quo simply cannot continue. It is not good for the requesters, it is not good for the workers, it is not good for even completing the tasks. Amazon, pay attention. These are not just feature requests. These are fundamental requirements for any marketplace to function.

Amazon likes to present the hands-off approach to Mechanical Turk as a strategic choice: Just as EC2, S3, and many other web services are targeted at developers, Mechanical Turk is a neutral clearinghouse for labor. It provides just the ability to match requesters and workers. Everything else is the responsibility of the two consenting parties.

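Too bad that this hands-off approach cannot work for a marketplace. The badly needed aspects can be easily summarized in four bullet points:

A better interface for requesters to post tasks
A true reputation system for workers
A trustworthiness guarantee for requesters
A better user interface for workers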

Below, I discuss these topics in more detail.

Requesters Need: A Better Interface To Post Tasks

A major task of a marketplace is to reduce overhead, friction, transaction costs, and search costs. The faster and easier it is to transact, the better the market. And MTurk fails miserably on that aspect.

I find it amazing that the last major change on Mechanical Turk for the requesters was the introduction of a UI to submit batch tasks. This was back in the summer of 2008. George Bush was the president, Lehman Brothers was an investment bank, Greece had one of the fastest-growing GDPs in Europe, Facebook had fewer than 100 million users, and Twitter was still a novelty. It would take 8 more months for Foursquare to launch.

It is high time to make it easier for requesters to post tasks. It is ridiculous to call the command-line tools user-friendly!

What is the benefit of having access to a workforce for microtasks, if a requester needs to hire a full-time developer (costing at least $60K) just to deal with all the complexities? How many microtasks must a requester run to recoup the cost of development?

If every requester, in order to get good results, needs to: (a) build a quality assurance system from scratch, (b) ensure proper allocation of qualifications, (c) learn to break tasks properly into a workflow, (d) stratify workers according to quality, (e) [whatever else...], then the barrier is just too high. Only very serious requesters will devote the necessary time and effort.

What is the expected outcome of this barrier? We expect to see a few big requesters and a long tail of small requesters that are posting tiny tasks. (Oh wait, this is the case already.) In other words: It is very difficult for small guys to grow.

Since we are talking about allowing easy posting of tasks: Amazon, please take a look at TurKit. Buy it, copy it, do whatever, but please allow easy implementation of such workflows in the market. Very few requesters have simple, one-pass tasks. Most requesters want to have crowdsourced workflows. Give them the tools to do so easily.
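To make "crowdsourced workflow" concrete, here is a minimal sketch of an iterative improve-and-vote loop, one common multi-pass pattern. Everything in it is hypothetical: `improve` and `vote` stand in for HITs that would be posted to workers, and they are simulated here with plain Python functions. No real MTurk or TurKit call is used.

```python
from typing import Callable, List


def iterative_workflow(
    draft: str,
    improve: Callable[[str], str],
    vote: Callable[[str, str], bool],
    rounds: int = 3,
    voters: int = 3,
) -> str:
    """Run `rounds` improve-then-vote passes over a draft.

    improve(text)  -> candidate text (stands in for one "improve" HIT)
    vote(old, new) -> True if the voter prefers `new` (one "vote" HIT)
    """
    current = draft
    for _ in range(rounds):
        candidate = improve(current)
        # Ask several voters; keep the candidate only on a strict majority.
        ballots: List[bool] = [vote(current, candidate) for _ in range(voters)]
        if sum(ballots) * 2 > voters:
            current = candidate
    return current


# Simulated workers (hypothetical): the improver appends one word,
# and every voter prefers the longer text.
def fake_improve(text: str) -> str:
    return text + " more"


def fake_vote(old: str, new: str) -> bool:
    return len(new) > len(old)


result = iterative_workflow("draft", fake_improve, fake_vote, rounds=2)
print(result)
```

The point of the sketch is how little logic the workflow itself needs. The hard part today is everything around it: posting the HITs, collecting the answers, and paying the workers.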

MTurk is shooting itself in the foot by encouraging requesters to build their own interfaces and their own workflow systems from scratch! For many, many HITs, the only way to have a decent interface is to build it yourself in an iframe. What is the problem with the iframe? By pushing requesters that way, MTurk makes it extremely easy for them to switch labor channels. A requester who has built an iframe-powered HIT can easily get non-Turk workers to work on these HITs. (Hint: just use different worker IDs for the other labor channels and have those workers visit the iframe HTML page directly to complete the task.) Yes, it is good for the requester in the long term not to be locked in, but I guess all requesters would be happier if they did not have to build the app from scratch.

Requesters Need: A True Reputation System for Workers

My other big complaint. The current reputation system on Mechanical Turk is simply bad. "Number of completed HITs" and "approval rate" are easy to game.

Requesters need a better reputation profile for workers. Why? A market without a reputation mechanism quickly turns into a market for lemons: When requesters cannot easily tell good workers from bad ones, they tend to assume that every worker is bad. This results in good workers getting paid the same amount as the bad ones. With wages so low, good workers leave the market. In the end, the only Turkers that remain in the market are the bad ones (or the crazy good ones willing to work for the same payment as the bad workers).

This in turn requires the same task to be completed by many workers, far more times than should be necessary, just to ensure quality. I am not against redundancy! (Quite the opposite!) But it should be a technique for turning moderate-quality input into high-quality output, and a technique for capturing diverse points of view for the same HIT. Repeated labeling should NOT be the primary weapon against spam.
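A small worked example shows why redundancy suits moderate-quality workers but not spam. It is a sketch under simplifying assumptions that are mine, not measurements from MTurk: the task is binary, workers answer independently, and every worker is correct with the same probability `p`.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent workers, each correct
    with probability p, gives the right answer on a binary task (n odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Moderate-quality workers: redundancy helps.
print(majority_accuracy(0.7, 3))  # about 0.784
print(majority_accuracy(0.7, 5))  # about 0.837

# Random-clicking spammers: redundancy buys nothing.
print(majority_accuracy(0.5, 5))  # 0.5
```

Under these assumptions, three workers who are each right 70% of the time yield about 78% accuracy, and five yield about 84%. Workers who answer at random stay at 50% no matter how many times the HIT is repeated.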

The lack of a strong reputation system hurts everyone, and hurts the marketplace! Does Amazon want to run a market for lemons? I am sure that the margins will not be high.

Here are a few suggestions on what a worker reputation mechanism should include.

Have more public qualification tests: Does the worker have the proper English writing skills? Can the worker proofread? Most marketplaces (eLance, oDesk, vWorker, Guru) allow participants to pass certification tests to signal their quality and knowledge in different areas. The same should happen on Turk. If Amazon does not want to build such tests, let requesters make their own qualification tests available to other requesters for a fee. Myself, I would pay to use the qualifications assigned by CastingWords and CrowdFlower. These requesters would serve as the certification authorities for MTurk, in the same way that universities certify abilities for the labor markets.
Keep track of working history: For which requester did the worker work in the past? How many HITs, for what payment? For how long? Long history of work with reputable requesters is a good sign. In the real world, working history matters. People list their work histories in their resumes. Why not on MTurk?
Allow rating of workers: What is the rating that the worker received for the submitted work? Please allow requesters to rate workers. We have it everywhere else. We rate films, books, electronics, we rate pretty much everything.
Disconnect payment from rating: Tying reputation to acceptance rate is simply wrong. Currently, we can either accept the work and pay, or reject the work and refuse to pay. This is just wrong. We do not rate restaurants based on how often the customers refused to pay for the food! I should not have to reject and not pay for the work, if the only thing that I want to say is that the quality was not perfect. Rejecting work should be an option reserved for spammers. It should never be used against honest workers that do not meet the expectations of the requester.
Separate HITs and ratings by type: What was the type of the submitted work? Transcription? Image tagging? Classification? Content generation? Twitter spam? Workers are not uniformly good in all types of tasks. Writing an article requires a very different set of skills from those required for transcription, which in turn are different than the skills for image tagging. Allow requesters to see the rating across these different categories. Almost as good as the public qualification tests.
And make all the above accessible from an API, for automatic hiring decisions.
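The suggestions above can be sketched as a small data structure. This is an illustration only: the class name, the fields, and the 1-to-5 rating scale are assumptions of mine, not an existing MTurk API. The key design choice is that a rating is recorded separately from the pay/reject decision and is grouped by task type.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class WorkerProfile:
    """Hypothetical reputation profile for one worker."""

    worker_id: str
    qualifications: Set[str] = field(default_factory=set)
    # task type -> list of ratings on an assumed 1..5 scale
    ratings: Dict[str, List[int]] = field(default_factory=lambda: defaultdict(list))
    paid_hits: int = 0
    rejected_hits: int = 0

    def record(self, task_type: str, rating: int, paid: bool = True) -> None:
        """Store a rating; payment is tracked independently of the rating."""
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.ratings[task_type].append(rating)
        if paid:
            self.paid_hits += 1
        else:
            self.rejected_hits += 1

    def average_rating(self, task_type: str) -> Optional[float]:
        """Average rating for one task type, or None if there is no history."""
        scores = self.ratings.get(task_type)
        if not scores:
            return None
        return sum(scores) / len(scores)


worker = WorkerProfile("W1", qualifications={"english-proofreading"})
worker.record("transcription", 5)
worker.record("transcription", 3)
worker.record("image tagging", 2)  # mediocre work, but still paid

print(worker.average_rating("transcription"))  # 4.0
print(worker.average_rating("image tagging"))  # 2.0
print(worker.average_rating("article writing"))  # None
```

A requester could then hire automatically on something like `average_rating("transcription") >= 4` without ever rejecting honest but imperfect work.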

It cannot be that hard to do the above! Amazon.com has been running a huge marketplace with thousands of merchants for years. The guys at Amazon know how to design, maintain, and protect a reputation system for a much bigger marketplace. How hard can it be to port it to Mechanical Turk?

(Amazon's response about the reputation system... )

In a recent meeting, I asked this same question: Why not have a real reputation system?

The MTurk representative defended the current setup, with the following argument:

On the Amazon.com marketplace, the (large number of) buyers can rate the (small number of) merchants, but not vice versa. So, the same thing happens on MTurk. The (large number of) workers can rate the (small number of) requesters using TurkerNation and TurkOpticon. So the opposite should not happen: requesters should not rate workers.

I felt that the answer made sense: two-sided reputation systems indeed have deficiencies. They often lead to mutual-admiration schemes, so such systems end up being easy to hack (not that the current system is too hard to beat). So, I was satisfied with the given answer... For approximately 10 minutes! Then I realized: Humbug!

There is no need for a reputation system for product buyers on Amazon.com's marketplace! It is not like eBay, where a buyer can win the auction and never pay! The reputation of the buyer on Amazon.com is irrelevant. On Amazon, when a buyer buys a product, as long as the credit card payment clears, the reputation of the buyer simply does not matter. There is no uncertainty, and no need to know anything about the buyer.

Now let's compare the Amazon.com product marketplace with MTurk: The uncertainty on MTurk is about the workers (who are the ones selling services of uncertain quality). The requester is the buyer in the MTurk market. So, indeed, there should not be a need for a reputation system for requesters, but the workers should be rated.

And at that point, people will protest: Why do we have the Hall of Fame/Shame on Turker Nation, why do we have TurkOpticon? Does Panos consider these efforts irrelevant and pointless?

And here is my reply: The very fact that we have such systems means that there is something very wrong with the MTurk marketplace. I expand below.

Workers Need: A Trustworthiness Guarantee for Requesters

Amazon should really learn from its own marketplace on Amazon.com. Indeed, on Amazon.com, it is not possible to rate buyers. Amazon simply ensures that when a buyer buys a product online, the buyer pays the merchant. So, Amazon, as the marketplace owner, ensures the trustworthiness of at least one side of the market.

Unfortunately, MTurk does not really guarantee the trustworthiness of the requesters. Requesters are free to reject good work and not pay for work they get to keep. Requesters do not have to pay on time. In a sense, the requesters are serving as the slave masters. The only difference is that on MTurk the slaves can choose their master.

And so, Turker Nation and TurkOpticon were born for exactly this reason: To allow workers to learn more about their masters. To learn which requesters behave properly, which requesters abuse their power.

However, this generates a wrong dynamic in the market. Why? Let's see how things operate.

The Requester Initiation Process

When new requesters come to the market, they are treated with caution by the experienced, good workers. Legitimate workers will simply not complete many HITs of a new requester, until the workers know that the requester is legitimate, pays promptly, and does not reject work unfairly. Most of the good workers will complete just a few HITs of the newcomer, and then wait and observe how the requester behaves.

Now, try to be on the requester's side.

If the requester posts small batches, things may work well. A few good workers do a little bit of good work, and the results come back like magic. The requester is happy, pays, everyone is happy. The small requester will come back after a while, post another small batch, and so on. This process generates a large number of happy small requesters.

However, what happens when the newcomers post big batches of HITs? Legitimate workers will do a little bit of work and then wait and see. Nobody wants to risk a mass rejection, which can be lethal for the reputation of the worker. Given the above, who are the workers who will be willing to work on HITs of the new, unproven requester? You guessed right: Spammers and inexperienced workers. Result? The requester gets low quality results, gets disappointed and wonders what went wrong.

In the best case, the new requesters will seek expert help (if they can afford it). In the worst case, the new requesters leave the market and use more conventional solutions.

At this point, it should be clear that just having a subjective reputation system for requesters is simply not enough. We need a trustworthiness guarantee for the requesters. Workers should not be afraid of working for a particular requester.

Online merchants in the Amazon marketplace do not need to check the reputation of the people they sell to. Amazon ensures that the buyers are legitimate and not fraudsters. Can you imagine if every seller on Amazon had to check the credit score and the trustworthiness of every buyer they sell to? What did you say? It would be a disaster? That people would only sell to a few selected buyers? Well, witness the equivalent disaster on Mechanical Turk.

So, what is needed for the requesters? Since the requester is essentially the "buyer", there is no need to have subjective ratings. The worker should see a set of objective characteristics of the requester, and decide whether to pick a specific posted HIT or not. Here are a few things that are objective:

Show speed of payment: The requester payment already goes into an Amazon-controlled "escrow" account. The worker should know how fast the requester typically releases payment.
Show the rejection rate for the requester: Is a particular requester litigious, frequently reporting the work of the workers as spam?
Show the appeal rate for the requester: A particular requester may have a high rejection rate just due to an attack from spammers. However, if the rejected workers appeal and win frequently, then there is something wrong with the requester.
Disallow the ability to reject work that is not spam: The requester should not be able to reject submitted work without paying. Rejection should be a last-resort mechanism, reserved only for obviously bad work. The worker should have the right to appeal (and potentially have the submitted work automatically reviewed by peers). This should take out a significant uncertainty in the market, allowing workers to be more confident to work with a new requester.
Show total volume of posted work: Workers want to know if the requester is going to come back to the market. Volume of posted work and the lifetime of the requester in the market are important characteristics: workers can use this information to decide whether it makes sense to invest the time to learn the tasks of the requester.
Make all the above accessible from an API: Let other people build worker-facing applications on top of MTurk.
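The objective characteristics listed above are simple to compute once the marketplace records them. The following is a minimal sketch under assumed data: the `Assignment` record and its fields are hypothetical and do not correspond to any existing MTurk API object.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Assignment:
    """Hypothetical record of one submitted assignment."""

    rejected: bool
    hours_to_payment: Optional[float] = None  # set only for paid work
    appealed: bool = False
    appeal_won: bool = False


def requester_stats(history: List[Assignment]) -> Dict[str, Optional[float]]:
    """Objective, non-subjective statistics a worker could inspect."""
    total = len(history)
    rejected = [a for a in history if a.rejected]
    appealed = [a for a in rejected if a.appealed]
    paid_hours = [
        a.hours_to_payment
        for a in history
        if not a.rejected and a.hours_to_payment is not None
    ]
    return {
        "total_assignments": total,
        "median_hours_to_payment": median(paid_hours) if paid_hours else None,
        "rejection_rate": len(rejected) / total if total else None,
        "appeal_win_rate": (
            sum(a.appeal_won for a in appealed) / len(appealed) if appealed else None
        ),
    }


history = [
    Assignment(rejected=False, hours_to_payment=2.0),
    Assignment(rejected=False, hours_to_payment=4.0),
    Assignment(rejected=False, hours_to_payment=6.0),
    Assignment(rejected=True, appealed=True, appeal_won=True),
]
stats = requester_stats(history)
print(stats)
```

In this toy history the requester pays in a median of 4 hours and rejects 25% of submissions, and every appealed rejection was overturned. That last number is exactly the kind of signal a worker would want to see before accepting a HIT.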

So, a major role of a marketplace is to instill a sense of trust. Requesters should trust the workers to complete the work, and workers should not have to worry about unreasonable behavior of the requesters. This minimizes the search costs associated with finding a trustworthy partner in the market.

Let's see the final part that is missing.

Workers Need: A Better User Interface

As mentioned earlier, beyond trust, the other important role of the market is to minimize transaction overhead and search costs as much as possible. The transacting parties should find each other as fast as possible, fulfill their goals, and move on. The marketplace should be almost invisible. In this market, where requesters post tasks and the tasks wait for the workers, it is important to make it as easy as possible for workers to find tasks the workers want to work on.

Current Problem: Unpredictable Completion Times

Unfortunately, workers are currently highly restricted by the interface in their ability to find tasks. Workers cannot search for a requester, unless the requester has put their name in the keywords. Also, workers have no way to navigate and browse through the available tasks to find things of interest.

In the end, workers mainly use two sorting mechanisms: See the most recent HITs, or see the HIT groups with the most HITs. This means that workers effectively use priority queues to pick the tasks to work on.

What is the result when tasks are being completed following priorities? The completion times of the tasks follow a power-law! (For details on the analysis, see the preprint of the XRDS report "Analyzing the Amazon Mechanical Turk Marketplace".) What is the implication? It is effectively impossible to predict the completion time of the posted tasks. For the current marketplace (with a power-law exponent a=1.5), the distribution cannot even be used to predict the average waiting time: the theoretical average is infinite, i.e., in practice the mean completion time is expected to increase continuously as we observe the market for longer periods of time.
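To see why the average is infinite, here is a short worked calculation. It rests on assumptions of mine: that a=1.5 is the exponent of the probability density, p(t) proportional to t^(-a), and that completion times start at a minimum of 1 time unit. The integral of t * t^(-a) diverges whenever a <= 2, so the only mean we can ever measure is the mean over tasks that finished within our observation window, and that mean keeps growing with the window.

```python
def mean_within_horizon(horizon: float, alpha: float = 1.5, t_min: float = 1.0) -> float:
    """Mean completion time among tasks that finish within `horizon`.

    Assumes density p(t) = (alpha - 1) * t_min**(alpha - 1) * t**(-alpha)
    for t >= t_min, with 1 < alpha < 2 (the regime where the mean is infinite).
    """
    if not 1.0 < alpha < 2.0:
        raise ValueError("this sketch covers only 1 < alpha < 2")
    if horizon <= t_min:
        raise ValueError("horizon must exceed t_min")
    # Partial expectation: integral of t * p(t) from t_min to horizon.
    partial = (
        (alpha - 1)
        * t_min ** (alpha - 1)
        * (horizon ** (2 - alpha) - t_min ** (2 - alpha))
        / (2 - alpha)
    )
    # Probability that a task finishes within the horizon.
    finished = 1.0 - (t_min / horizon) ** (alpha - 1)
    return partial / finished


for horizon in (100.0, 10_000.0, 1_000_000.0):
    print(horizon, mean_within_horizon(horizon))
```

With a=1.5 and a minimum time of 1, the observed mean works out to the square root of the observation window: about 10 after watching for 100 time units, 100 after 10,000, and 1,000 after a million. It never settles.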

The proposed solutions? They are so easy and so obvious that it even hurts to propose them:

Have a browsing system with tasks being posted under task categories. See for example, the main page for oDesk, where tasks are being posted under one or more categories. Is this really hard to do?
Improve the search engine. Seriously, how hard is it to include all the fields of a HIT into the search index? Ideally it would be better to have a faceted interface on top, but I would be happy to just see the basic things done right.
Use a recommender system to propose HITs to workers. For this suggestion, I have to credit a site on the Internet, with some nifty functionality: it monitors your past buying and rating history, and then recommends products that you may enjoy. It is actually pretty nice and helped that online store to differentiate itself from its competitors. Trying to remember the name of the site... The recommendations look like this:

It would be a good idea to have something like that on the Amazon Mechanical Turk. Ah! I remembered! The name of the site with the nice recommendations is Amazon! Seriously. Amazon cannot have a good recommender system for its own market?
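Even a very basic recommender would be an improvement. Below is a minimal sketch of the "workers who did this also did that" idea, using simple co-occurrence counts. The data and function name are made up for illustration, and this is certainly not how Amazon's own recommender works.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(history: Dict[str, Set[str]], worker: str, top_n: int = 3) -> List[str]:
    """Suggest task types the worker has not tried, weighted by how much
    the worker's history overlaps with that of other workers."""
    done = history.get(worker, set())
    scores: Counter = Counter()
    for other, tasks in history.items():
        if other == worker:
            continue
        overlap = len(done & tasks)
        if overlap == 0:
            continue  # no shared history, so no evidence of similar taste
        for task in tasks - done:
            scores[task] += overlap
    # Highest score first; break ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [task for task, _ in ranked[:top_n]]


# Hypothetical working histories: worker -> task types completed.
history = {
    "alice": {"transcription", "image tagging"},
    "bob": {"transcription", "image tagging", "sentiment"},
    "carol": {"transcription", "translation"},
    "dave": {"surveys"},
}
suggestions = recommend(history, "alice")
print(suggestions)  # ['sentiment', 'translation']
```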

Competition waits

Repeat after me: A labor marketplace is not the same thing as a computing service. Even if everything is an API, the design of the market still matters.

It is too risky to assume that MTurk can simply be a bare-bones clearinghouse for labor, in the same way that S3 can be a bare-bones provider of cloud storage. There is simply no sustainable advantage and no significant added value. Network effects are not strong (especially in the absence of reputation), and just clearing payments and dealing with Patriot Act and KYC is not a significant added value.

Other marketplaces already do that, build APIs, and have better design as well. It will not be difficult for them to get to the micro segment of the crowdsourcing market, and it may happen much faster than Amazon expects. IMHO, oDesk and eLance are moving towards the space by having strong APIs for worker management, and good reputation systems. Current MTurk requesters that create their HITs using iframes can very easily hire eLance and oDesk workers instead of using MTurk.

The recent surge of microcrowdsourcing services indicates that there are many who believe that the position of MTurk in the market is ready to be challenged.

Is it worth trying to challenge MTurk? Luis von Ahn, looking at an earlier post of mine, tweeted:

MTurk is TINY (total market size is on the order of $1M/year): Doesn't seem like it's worth all the attention.

I will reply with a prior tweet of mine:

Mechanical Turk is for crowdsourcing what AltaVista was for search engines. We now wait to see who will be the Google.

A Computer Scientist in a Business School

Thursday, October 21, 2010
