Saturday, February 18, 2012

Mechanical Turk vs oDesk: My experiences

[Necessary disclaimer: I work with the oDesk Research team as the "academic-in-residence." The experiences that I describe in this blog post are the reason that I started working with oDesk. I am not writing this because I started working with oDesk. And at the end of the day, I doubt that oDesk needs my blog posts to get visibility :-)]

A question that I receive often is how to structure tasks on Mechanical Turk when workers need to go through training before doing the task. My common answer to most such questions is that Mechanical Turk is not the ideal environment for such tasks: when training and frequent interaction are required, an employer is typically better off using a site such as oDesk to hire people for the long term.

Mechanical Turk: The choice for short-term, bursty tasks

Mechanical Turk tends to shine in cases where demand is super bursty. A task appears out of nowhere, requires 1000 people to work on it for 2-3 hours each, and needs to get done within a couple of days. Then the task disappears, and everyone moves on. For such scenarios, I cannot think of a better alternative than Mechanical Turk.

The blessing and curse of the long tail

Why does Mechanical Turk allow easy scaling to a large number of workers? Because you can reach a large number of workers quickly. Admittedly, most people will just come, do a few tasks, and then disappear. The old saying "80% of the work gets done by 20% of the workers" typically translates on MTurk into "80% of the work gets done by 2% of the workers". But even the people who work on just a few tasks can contribute a significant amount of work in the aggregate.

But this is also a problem: workers who complete just a few tasks cannot be evaluated by any reasonable method of statistical quality control. To get a confident measurement of the true performance of a worker, it is not uncommon to require 500 tasks or more, and it is highly unclear how you can convince a Turker to stick around for so long.
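To make the "500 tasks" figure concrete, here is a back-of-the-envelope sketch (in Python; the 80% accuracy and the task counts are made-up numbers for illustration, not data): the width of a simple binomial confidence interval around a worker's measured accuracy shrinks slowly as the number of completed tasks grows, and only around a few hundred tasks does the estimate become tight enough to act on.

    # Back-of-the-envelope: how tight is our estimate of a worker's accuracy
    # after n scored tasks? (Normal approximation to the binomial; the 0.8
    # accuracy and the task counts are illustrative assumptions only.)
    from math import sqrt

    accuracy = 0.8  # hypothetical "true" accuracy of the worker
    for n in (20, 100, 500):
        half_width = 1.96 * sqrt(accuracy * (1 - accuracy) / n)  # 95% CI half-width
        print("n=%4d: measured accuracy = %.2f +/- %.3f" % (n, accuracy, half_width))

With 20 tasks the interval is roughly +/- 0.18; with 500 tasks it narrows to about +/- 0.035, which is when you can start trusting the measurement.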

The task listing interface interferes with task completion times

Since workers tend to sort tasks either by "most recent" or by "most HITs available", the allocation of visibility varies significantly across tasks. If a task gets buried in the 5th or 6th page of the results, it is effectively dead: nobody looks at it anymore, and the rate of completion drops close to zero. Such tasks are effectively abandoned and will never finish. You need to "refresh" the task by posting some extra HITs within it, by taking the task down and reposting it, or by playing other tricks to get people to look at it again. Needless to say, this is completely unnecessary overhead, a pure result of bad design.

The curse of simplicity

Unfortunately, the ability to scale on demand has some additional drawbacks that are more subtle but, in the end, more important. The key problem: the need for simplicity.

When you suddenly need 1000 new people to work on your task, it is advisable to structure the task as if planning for the worst-case scenario. This means that every worker is treated as a first grader; the tasks should be described in the simplest way possible. This often necessitates workflows that chop the task into tiny, easily digestible pieces, effectively embedding "training" in the process.

As an example, consider the case of classifying a page as containing "hate speech". Since it is not possible to get the workers to watch a 1-hour tutorial on what exactly is considered hate speech, the task on Mechanical Turk ends up being a loooong list of questions, such as "Do you see any racist jokes?", "Do you see any mention of male chauvinism?", "Do you see any call for violence against a specific ethnic group?", etc. Such brain-dead-simple workflows can ensure quality even when the workers are completely inexperienced. They also make it easy to defend against potential attacks from scammers who may try to submit junk, hoping to get paid for sub-par work.
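To illustrate, here is a minimal sketch of what such a decomposed workflow looks like on the requester side (the question list and the "any yes means hate speech" aggregation rule are hypothetical simplifications, not the actual AdSafe workflow):

    # Hypothetical sketch of a decomposed "hate speech" checklist workflow.
    # Workers answer simple yes/no sub-questions; the page is flagged if any
    # sub-question gets a "yes". The questions and the rule are illustrative.
    SUB_QUESTIONS = [
        "Do you see any racist jokes?",
        "Do you see any mention of male chauvinism?",
        "Do you see any call for violence against a specific ethnic group?",
    ]

    def label_page(answers):
        """answers: dict mapping a sub-question to the worker's True/False reply."""
        return "hate speech" if any(answers.get(q, False) for q in SUB_QUESTIONS) else "ok"

    # Example: the worker saw a call for violence but nothing else.
    print(label_page({SUB_QUESTIONS[2]: True}))  # -> hate speech

The point is that each individual question requires no training at all; the expertise lives in the list of questions and in the aggregation rule, not in the worker.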

However, there is a catch: such micro-task workflows start getting in the way once workers become more experienced. A worker who has spent a few hours examining pages for hate speech already has all these questions in their head and can process a page much faster. The clickety-click approach with simple, easy-to-chew questions worked early on, to train the worker, but now it is a tedious micromanager embedded in the system.

oDesk: The choice for long-term tasks

When the tasks are expected to last for many days, weeks, or months, Mechanical Turk is often a suboptimal choice. The continuous need to fight scammer workers, the inability to interact easily with the workers, and so on, make it much easier to just go on oDesk and hire a few people there to work on the task.

How I learned about oDesk as a platform for "micro"-work

While I knew about oDesk as an alternative to Rent-A-Coder and eLance, I had never thought of oDesk as a platform for completing tasks similar to the ones done on Mechanical Turk. At HCOMP 2010, though, I learned about the efforts of Metaweb, which used oDesk and paid workers on an hourly basis, as opposed to paying piecemeal. This allowed them to get workers to focus on the hard cases; on MTurk, people have the incentive to skip the hard cases and perform only the easy tasks that can be done very quickly.

I had seen this problem with the AdSafe tasks that we were running on Mechanical Turk: workers were doing a decent job classifying pages in the easy cases, but if a page was hard to classify (e.g., if you had to read the text to understand its true content, as opposed to just looking at the images), workers were simply skipping it or giving a random answer. To fight this problem, I decided to give it a shot and hired a team of approximately 30 workers from oDesk to annotate web pages.

Migrating from Mechanical Turk to oDesk

Although migrating a task from MTurk to oDesk sounds tedious, it is often pretty simple, and this is due to a design flaw (?) of Mechanical Turk. What is this flaw? If you use the Mechanical Turk capabilities for building a HIT, you are very restricted in terms of what HTML you can use, and what subset of JavaScript. The solution for anyone who wants to do anything moderately complicated is to build a bespoke HTML interface and host it within an iframe on the MTurk website. This "iframe-based MTurk HIT" is effectively a custom web application, and it is trivially easy to adapt to handle workers from any platform. Instead of logging in using the MTurk worker id, workers from other platforms can log in directly to your website. The added bonus? The workers can use the full screen real estate.
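To make this concrete, here is a minimal sketch of such a dual entry point (using Flask; the route names, session handling, and user store are my own conventions, not anything MTurk- or oDesk-specific). The only MTurk-specific piece is that an iframe/ExternalQuestion HIT receives the workerId and assignmentId as query-string parameters; everyone else simply logs in with credentials you give them:

    # Sketch of a task website serving both MTurk and non-MTurk workers.
    # Flask is used for illustration; routes and the USERS store are assumptions.
    from flask import Flask, request, session, redirect

    app = Flask(__name__)
    app.secret_key = "change-me"

    USERS = {"alice": "secret"}  # placeholder store for oDesk hires

    @app.route("/mturk")
    def mturk_entry():
        # MTurk appends workerId/assignmentId to the external HIT URL.
        session["worker_id"] = request.args.get("workerId")
        session["assignment_id"] = request.args.get("assignmentId")
        return redirect("/task")

    @app.route("/login", methods=["POST"])
    def odesk_entry():
        # oDesk hires log in with the credentials they received by email.
        username, password = request.form["username"], request.form["password"]
        if USERS.get(username) == password:
            session["worker_id"] = "odesk:" + username
            return redirect("/task")
        return "Invalid login", 401

    @app.route("/task")
    def task():
        return "Annotation UI for worker %s goes here" % session.get("worker_id")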

When using oDesk, I tend to hire people with minimal checking, and as part of the welcome message, the workers receive an email with a username and password for my website that hosts the MTurk HITs. I noticed lately that oDesk has an API as well, which can be used to further automate the process. But even hiring workers manually, I could rather easily handle a team of 30-50 workers, who then effectively become permanent employees, working only on my tasks and getting paid hourly.
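For what it is worth, a sketch of that onboarding step looks like the following (the credential scheme, the mail setup, and all addresses are my own placeholders; none of this is an oDesk API call):

    # Sketch of onboarding a newly hired oDesk worker: create an account on
    # the task website and email the credentials in the welcome message.
    # Addresses, the mail relay, and the user store are placeholder assumptions.
    import secrets, smtplib
    from email.message import EmailMessage

    def onboard(worker_email, username, user_store):
        password = secrets.token_urlsafe(8)       # random initial password
        user_store[username] = password           # persist however you like
        msg = EmailMessage()
        msg["Subject"] = "Welcome! Your login for the annotation site"
        msg["From"] = "me@example.com"
        msg["To"] = worker_email
        msg.set_content(
            "Welcome aboard!\n\n"
            "Task site: https://tasks.example.com\n"
            "Username: %s\nPassword: %s\n" % (username, password)
        )
        with smtplib.SMTP("localhost") as server:  # assumes a local mail relay
            server.send_message(msg)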

One of the things that I want to learn to do more effectively is to use the oDesk API to open job slots and hire people. While oDesk does not provide direct capabilities for building a UI for task handling and execution, I do not use the MTurk UI in any case, so this is functionality that I do not really miss.

Providing training and interacting with oDesk workers

When the need for human labor is long-term, it makes sense to ask the oDesk workers to first spend some time familiarizing themselves with the task, watching some training videos, etc. Even asking them to spend 5-6 hours on training is not an unusual request, and most oDeskers will happily oblige: they know that there is plenty of work coming up, so they do not mind spending their "first day at work" familiarizing themselves with the task that you give them. They prefer to keep a stable job, instead of having to continuously look around for new projects.

A neat trick that I learned at oDesk is the following: ask your workers to join a common Skype chatroom (or some other chat room of your choice). Using this chatroom, you can communicate with your workers in real time, informing them about system issues, directing them to work on specific tasks, giving clarifications, and so on. I personally find that setting quite amazing, and it makes me feel like a modern-day factory owner :-). I drop by to say hello to the workers, I ask for feedback, workers welcome the new members and provide clarifications and training, etc. In general, a very happy atmosphere :-)

Lessons on quality control from MTurk, being applied to oDesk

I have to admit, though, that the MTurk experience makes working with oDesk workers much more effective. When working with MTurk tasks, all requesters tend to develop various schemes of quality control to measure the performance of each worker. These metrics make life much easier when managing big teams on oDesk: effectively, you get automatic measurements of performance that allow easy discovery of problematic workers.
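As a minimal sketch of what I mean (the data layout and the thresholds are made-up illustrations, not my actual pipeline): score every worker against gold-labeled tasks and flag anyone whose measured accuracy drops below a cutoff, provided they have completed enough gold tasks for the estimate to be meaningful.

    # Sketch of MTurk-style quality control applied to an oDesk team.
    # The tuple layout, the 0.75 cutoff, and min_gold=20 are illustrative.
    from collections import defaultdict

    def worker_scores(answers, gold):
        """answers: list of (worker_id, task_id, label); gold: dict task_id -> correct label."""
        correct, total = defaultdict(int), defaultdict(int)
        for worker_id, task_id, label in answers:
            if task_id in gold:                        # only score gold-labeled tasks
                total[worker_id] += 1
                correct[worker_id] += (label == gold[task_id])
        return {w: (correct[w] / total[w], total[w]) for w in total}

    def flag_problematic(answers, gold, threshold=0.75, min_gold=20):
        # Flag a worker only if enough gold tasks were completed for the
        # accuracy estimate to mean something.
        return [w for w, (acc, n) in worker_scores(answers, gold).items()
                if n >= min_gold and acc < threshold]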

I have had experiences in the past with workers who were very articulate, very enthusiastic, very energetic, and ... completely sucked at the task at hand. In a regular work environment, such workers may never be identified as problematic cases. They are the life of the company; they bring the vibe and the energy. But the quality management schemes, developed to deal with the quality challenges of Mechanical Turk tasks, become useful on oDesk as well.

The extras

Extra bonus 1? On oDesk, I have never had to deal with a scammer, and nobody has attempted to game the system. oDesk runs a pretty strong identity verification scheme, which ties each worker to a real-world identity, as opposed to the disposable MTurk workerIDs. (I will explain in a later post how easy it is to bypass the identity verification step on MTurk.) The very fact that there is a basic reputation system (with its own flaws, but that is a topic for another post) makes a huge difference in how workers approach the posted tasks.

Extra bonus 2? The hired oDeskers work only on your tasks! You do not have to worry about a task being buried in the 12th page of the results, and there is no need to play SEO-style tricks to get visibility. You allocate a workforce to your task, and you proceed without worrying about the minute-by-minute competition from other requesters.

The increased cost of oDesk

A "disadvantage" of oDesk is that most of the work ends up being more "expensive" than on Mechanical Turk. However, this only holds if you substitute a Turker with an oDesker on a one-to-one basis, which is a very short-sighted approach. Given the higher average quality of the oDeskers, it is often possible to reduce the overhead of quality assurance: fewer gold tests and lower redundancy can significantly decrease the cost of a task. So, where we would run a task on MTurk with a redundancy of 5 or 7, we can reach the same level of quality with just a couple of oDesk workers, or even a single one.
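A back-of-the-envelope comparison makes the point (every number below is a made-up illustration, not an actual MTurk or oDesk rate): even at a noticeably higher hourly rate, dropping the redundancy from 5 to 1 and shrinking the gold-testing overhead can bring the per-item cost to or below the MTurk level.

    # Illustrative cost per labeled item; all rates and overheads are assumptions.
    seconds_per_item = 30.0

    # MTurk: cheap labor, but redundancy of 5 plus heavier gold-testing overhead.
    mturk_hourly, mturk_redundancy, mturk_gold_overhead = 2.0, 5, 0.20
    mturk_cost = mturk_hourly / 3600 * seconds_per_item * mturk_redundancy * (1 + mturk_gold_overhead)

    # oDesk: pricier per hour, but a single trusted worker and light gold testing.
    odesk_hourly, odesk_redundancy, odesk_gold_overhead = 6.0, 1, 0.05
    odesk_cost = odesk_hourly / 3600 * seconds_per_item * odesk_redundancy * (1 + odesk_gold_overhead)

    print("MTurk: $%.4f per item" % mturk_cost)  # ~ $0.10
    print("oDesk: $%.4f per item" % odesk_cost)  # ~ $0.05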

What I miss in oDesk, part I: Quick access to many workers

What I tend to miss in oDesk is the ability to get a very large number of workers working on my tasks within minutes of posting. Of course, this is not surprising: on oDesk, people are looking for reasonably long tasks, worth at least a few dollars. (On MTurk, we also get very few people who will stay with a task for long.) I am trying to assess objectively what I actually miss, though: while I get the pleasant feeling that my task started very quickly, the flip side is that for reasonably big tasks, the initial speed is never indicative of the overall speed of completion.

I am trying to think how it would be possible to build a real-time platform with people willing to work on tasks for long periods. I am looking forward to reading more ideas from Rob Miller, Jeff Bigham, Michael Bernstein, and Greg Little on how to accomplish this in cases where we want people accessible in real time and also want the workers to keep working on my tasks for longish periods of time.

What I miss in oDesk, part II: The mentality of a computer-mediated interaction

The last issue with oDesk is that it is fundamentally designed for human-to-human interaction. Workers do not expect to interact with an automatic process for being hired, assigned a task, and evaluated. I am thinking that perhaps oDesk should have a flag indicating that a particular task is "crowdsourced", meaning that there is no interview process for being hired; instead, hiring is mediated by a computing process. While I would love oDesk to allow for such a shift, I am not sure how easy it is to take a live system with hundreds of thousands of participants and introduce such (rather drastic) changes. Perhaps oDesk can create some "oLabs" products (a la Google Labs) to test such ideas...

Conclusion and future questions

While my research has focused on MTurk for quite a few years, I reached a point where I got tired of fighting scammers just to get basic results back. The oDesk environment allowed me to actually test the quality control algorithm without worrying about adversarial players trying to bypass all measures of quality control.

The fact that I am happy with hiring people through oDesk does not mean that I am fully satisfied with the capabilities of the platform. (You would not expect me to be fully satisfied, would you?)

Here are a few of the things that I want to see:
  • Simpler API. The current one can be used to post tasks and hire people automatically, but it was never designed for this purpose; it was designed mainly to allow people to use oDesk through their own interfaces/systems, as opposed to the main oDesk website. A tutorial taking newcomers through all the steps would be a nice addition. (I miss the tutorials that came with Java 1.0...)
  • Better access to the test scores and qualifications of the contractors. This would allow for better algorithms for automatic hiring and automatic salary negotiation. ("Oh, you have a top-1% score on the Java test; this deserves a 20% increase in salary.") I see that as a very interesting research direction as well, since I expect labor to be increasingly mediated by computing processes in the future.
  • Support for real-time task execution by having pools of people waiting on demand. This introduces some nice research questions on how best to structure the pricing and incentives for workers who are waiting for tasks to be assigned to them. The papers published over the last year by the MIT et al. crowd provide interesting glimpses of the applications to expect.
  • Support for shifts and scheduling. This is a heretical direction for crowdsourcing in my mind, but a very real need. For many tasks we have a rough idea of how demand fluctuates over time. Being proactive and scheduling the crowd to appear when needed can lead to real production systems, which cannot rely on the whims of the crowd.
  • [?] Standardized tasks. With John Horton, we wrote in the past a brief position paper describing the need for standardization in crowdsourcing. Although I would love to see this vision materialize, I am not fully convinced myself that it is a realistic goal. Given the very high degree of vertical expertise necessary for even the most basic tasks, I cannot see how any vendor would be willing to let others use the specialized interfaces and workflows required to accomplish a standard task. As a researcher, I would love to see this vision happen, but I am pessimistic about the incentives that people have to adopt this direction.
  • [?] Support for task building and quality control. I am not fully convinced that this is something a labor platform needs to support, but it is definitely on my wish list. This is, of course, something that I would like to see on MTurk as well. On the other hand, I see that most experienced employers use their own bespoke, customized, and optimized workflows, along with their own bespoke quality control systems. So, I am not fully convinced that providing basic workflow tools and basic quality control would be a solution for anybody: too basic for the advanced users, too complex for the beginners. Again, as a researcher I would love to see this happen, but I am pessimistic about the practical aspects.
Any other ideas and suggestions?