Thursday, February 23, 2012

Crowdsourcing and the end of job interviews

When you discuss crowdsourcing solutions with people who have not heard of the concept before, they tend to ask: "Why is crowdsourcing so much cheaper than existing solutions that depend on 'classic' outsourcing?"

Interestingly enough, this is not a phenomenon that appears only in crowdsourcing. The Sunday edition of the New York Times has an article titled Why Are Harvard Graduates in the Mailroom?. The article discusses the job-search strategy in some fields (e.g., Hollywood, academia, etc.), where talented young applicants are willing to start with jobs that pay well below what their skills deserve, in exchange for the chance to make it big later:

[This is] the model lottery industry. For most companies in the business, it doesn’t make economic sense to, as Google does, put promising young applicants through a series of tests and then hire only the small number who pass. Instead, it’s cheaper for talent agencies and studios to hire a lot of young workers and run them through a few years of low-paying drudgery.... This occupational centrifuge allows workers to effectively sort themselves out based on skill and drive. Over time, some will lose their commitment; others will realize that they don’t have the right talent set; others will find that they’re better at something else.

Interestingly enough, this occupational centrifuge is very close to the model of employment in crowdsourcing.

In crowdsourcing, there is very little friction in entering and leaving a job. In fact, this is the crucial difference from traditional modes of employment: there is no interview and the employment is truly at will. You want to work on a task? Start working. You are bored? Stop working. No friction from an interviewing and hiring process, and no friction if the worker decides to stop working.

As in the case of Hollywood and academia, the evaluation is done on the job. While the model is currently applied mainly to small tasks, there is nothing that fundamentally prevents it from being applied to any other form of employment. With the Udacity and Coursera model, we start seeing the concept applied to education. Later on, we may see other jobs adapting this model for their purposes (stock trading, anyone?).

What you observe in such settings is that the distribution of participation and engagement is heavy-tailed, tending to follow a power law: a few participants provide a significant amount of input, while there is a long tail of participants who come, do a few things (complete HITs on MTurk, write Wikipedia articles, watch lectures and do homework on Coursera, trade stocks, pick your task...), and then leave.
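To make this concrete, here is a minimal sketch (with simulated, purely illustrative participation counts) of how one could check how concentrated the work is among the most active participants:

```python
import numpy as np

# Hypothetical example: simulate a heavy-tailed participation log and
# measure how much of the total work the most active workers contribute.
rng = np.random.default_rng(0)

# Number of tasks completed per worker, drawn from a Pareto distribution
# (an illustrative stand-in for real MTurk / Wikipedia / Coursera logs).
tasks_per_worker = rng.pareto(a=1.2, size=10_000) + 1

tasks_per_worker = np.sort(tasks_per_worker)[::-1]    # most active first
total_work = tasks_per_worker.sum()

for top_share in (0.02, 0.20):                        # top 2%, top 20% of workers
    k = int(top_share * len(tasks_per_worker))
    done = tasks_per_worker[:k].sum() / total_work
    print(f"Top {top_share:.0%} of workers do {done:.0%} of the work")
```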

What does it mean to have a power law distribution of participation in crowdsourced projects?

It means that the long tail of occasional participants is just not naturally attracted to the task. The persistent few are the good matches for it. This is self-selection at its best.

No interview needed, and only the people that are truly interested stick around.

Crowdsourcing is the new interview. 

The selection of the best participants happens naturally, without the artificial introduction of a selection process mediated through an interview. The interview is an artificial process. It tries to keep unqualified participants away from the task and to identify the ones that are the best. It is an imperfect filter, with both false positives and false negatives. Many people are hired with great hopes, only to be proven later to be ill-suited for the task (false positives). And many good people never get the chance to work on a task just because they do not look good on paper (false negatives; I am dying to make a Jeremy Lin joke here...).

Think now of an environment where everyone gets a shot at working on something they are interested in. No friction in getting hired and getting fired. You get the benefit of the best people working on the tasks that they are best at. [You ask: what if there are fewer dream jobs than available labor? What to do when training on the job is not possible (cough, doctors, cough)? Let me dream for now, and let's sweep under the carpet the millions of details that need to be addressed before this mode of operation has a shot at becoming reality.]

To answer the question posed at the beginning of the post ("Why is crowdsourcing so much cheaper than existing solutions that depend on 'classic' outsourcing?"): the process of self-selection in matching workers and tasks is the key reason why crowdsourcing is typically cheaper than the traditional process of directly assigning tasks to people. The easier it is for the crowd to find jobs they like, the more efficient the matching and the execution.

When you effectively have the most interested, self-selected people working on a given task, the productivity of the team is much higher than that of a team consisting of people who may simply be bored or not very interested in the task. Just consider the productivity of five programmers who are dedicated and enthusiastic about what they are building, compared to a similar team of five programmers who were assigned the task by someone else and simply have to implement it.

At oDesk, there is a significant effort to improve the matching of projects and contractors, by showing contractors the best projects for them and employers the best contractors for a task. My own dream is to eliminate the friction of interviewing and make the process of finding a job and working as seamless as possible.

Sunday, February 19, 2012

The Need for Standardization in Crowdsourcing

[This is the blog version of a brief position paper that John Horton and I wrote last year on the advantages of standardization for crowdsourcing. Edited for brevity. Random pictures added for fun.]

Crowdsourcing has shown itself to be well-suited for the accomplishment of various tasks. Yet many crowdsourceable tasks still require extensive structuring and managerial effort to make crowdsourcing feasible. This overhead could be substantially reduced via standardization. In the same way that task standardization enabled the mass production of physical goods, standardization of basic “building block” tasks would make crowdsourcing more scalable. Standardization would make it easier to set prices, spread best practices, build meaningful reputation systems and track quality.

Why standardize?

Crowdsourcing has emerged over the last few years as a promising solution for a variety of problems. What most problems have in common is one or more sub-problems that cannot be fully automated, and require human labor. This labor demand is being met by workers recruited from online labor markets such as Amazon Mechanical Turk, Microtask, oDesk and Elance or from casual participants recruited by intermediaries like CrowdFlower and CloudCrowd.

In labor markets, buyers and sellers have great flexibility in the tasks they propose and in the making and accepting of offers. The flexibility of online labor markets is similar to that of traditional labor markets. In both, buyers and sellers are free to trade almost any kind of labor on almost any terms. However, an important distinction between online and offline is that once a worker is hired in an offline, traditional market, they are not allocated to tasks via a spot market. Workers within firms are employees who have been screened, trained for their jobs, and given incentives for good performance—at a minimum, poor performance can cause them to lose their jobs. Furthermore, for many jobs—particularly those focusing on the production of physical goods—good performance is very well defined, in that workers must adhere to a standard set of instructions.

This standardization of tasks is the essential feature of modern production. The question is how to apply this idea in crowdsourcing.

Crowdsourcing, the dream: The assembly line for knowledge work

With task standardization, innovators like Henry Ford could ensure that hired workers—after suitable training—could complete tasks easily, predictably, and in a way that made training easy to replicate for new workers. To return to paid crowdsourcing, most high-demand crowdsourcing tasks are relatively low-skilled and require workers to closely and consistently adhere to the instructions for a particular, standardized task.

Crowdsourcing, the reality: The bazaar of knowledge work

As it currently stands, existing crowdsourcing platforms bear little resemblance to Henry Ford’s car plants. In crowdsourcing markets, the factory would be more like an open bazaar where workers come and go as they please, receiving or making offers on tasks that differ in difficulty and skill requirements (“install engines!”, “add windshields!”, “design a new chassis!”) for different rates of pay—and with different pricing structures (fixed payment, hourly wages, incentives, etc.). Some buyers would be offering work on buses, some on cars, some on lawnmowers. Reputations would be weak and easily subverted. Among both buyers and sellers one can find scammers; some buyers are simply recruiting accomplices for nefarious activities.

The upside of such a disorganized market is that workers and buyers have lots of flexibility. There are good reasons for not wanting to simply recreate the online equivalent of the single-firm factory. However, we do not think it is an “either-or” proposition. In this post, we discuss ways to bring more structure to a marketplace platform without undermining its key advantages. In particular, we believe that greater task standardization, a curated-garden approach to work pools, and a market-making work-allocation mechanism to help arrive at prices could help us build scalable human-powered systems that meet real-world needs.

Current status

Despite the excitement and apparent industry maturation, there has been relatively little innovation—at least at the micro-work level—in the technology of how workers are allocated to tasks, how reputation is managed, how tasks are presented, and so on. As innovative as MTurk is, it is basically unchanged since its launch. The criticisms of MTurk—the difficulty of pricing work, the difficulty of predicting completion times and of guaranteeing quality, the inadequacy of the way workers can search for tasks—are recurrent and still unanswered. Would-be users of crowdsourcing often fumble, with even technically savvy users getting mixed results. Best practices feel more like folk wisdom than an emerging consensus. Even more troubling, there is some evidence that at least some markets are becoming inundated with spammers.

uTest: An example of verticalized crowdsourcing
One part of the crowdsourcing ecosystem that appears to be thriving is the “curated garden” approach used by companies like uTest (testing software), MicroTask (quality assurance for data entry), CloudCrowd (proofreading and translation), and LiveOps (call centers). These firms recruit and train workers for their standardized tasks and set prices on both sides of the market. Because the task is relatively narrow, it is easier to build meaningful, informative feedback and to verify ex ante that workers can do the task, rather than try to screen bad work out ex post. While this kind of control is not free, practitioners gain the scalability and cost savings of crowdsourcing without the confusion of the open market. The downside of these walled gardens is that access, as both a buyer and a seller, is limited. One of the great virtues of more market-like platforms is that they are democratic and easy to experiment on. The natural question is whether it is possible to create labor pools that look more like curated gardens—with well-defined, standardized tasks—and yet are still relatively open to both new buyers and new sellers.

Standardizing basic work units

Currently, the labor markets operate in a completely uncoordinated manner. Every employer generates its own work request, prices the request independently, and evaluates the answers separately from everyone else. Although this approach has some intuitive appeal in terms of worker and employer flexibility, it is fundamentally inefficient:
  • Every employer has to implement from scratch the “best practices” for each type of work. For example, there are multiple UIs for labeling images or for transcribing audio. Long-term employers learn from their mistakes and fix the design problems, while newcomers have to learn the lessons of bad design the hard way.
  • Every employer needs to price its work unit without knowing the conditions of the market and this price cannot fluctuate without removing and reposting the tasks.
  • Workers need to learn the intricacies of the interface for each separate employer.
  • Workers need to adapt to the different quality requirements of each employer.
The efficiency of the market would increase tremendously if there were at least some basic standardization of the common types of (micro-)work that are posted on online labor markets.

So, what are these common types of (micro-)work that we can standardize? Amazon Mechanical Turk lists a set of basic templates, which give a good idea of what tasks are good candidates to standardize first. The analysis of the Mechanical Turk marketplace also indicates a set of tasks that are very frequent on Mechanical Turk and are also good candidates to standardize.

Simple Machines, the standardized units for mechanics. Can we create corresponding simple machines for labor?

We can draw a parallel with engineering: in mechanics, we have a set of “simple machines,” such as screws, levers, the wheel and axle, and so on. These simple machines are standardized and serve as components for larger, significantly more complicated creations. Analogously, in crowdsourcing we can define a set of simple tasks, standardize them, and then build, if necessary, more complicated tasks on top. What are the advantages of standardizing the simple tasks, if we only need them as components?
  • Reusability: First of all, as mentioned above, there is no need for requesters to think about how to create the user interfaces and best practices for such simple tasks. These standardized tasks can, of course, be revised over time to reflect our knowledge of how best to accomplish them.
  • Trading commodities: Second, and potentially more important, these simple tasks can be traded in the market in the same way that stocks and commodities are currently traded in financial markets. In stock markets, the buyer does not need to know who the seller is, or whether the order was fulfilled by a single seller or by multiple ones: it is the job of the market maker to match and fulfill buy and sell orders. In the same way, we can have a queue of standardized tasks that need to be completed, and workers can complete them at any time, without having to think about the reputation of the requester or to refamiliarize themselves with the task. This should lead to much more efficient task execution.
  • True market pricing: A third advantage of standardized work units is that pricing becomes significantly simpler. Instead of “testing the market” to see what price point leads to an optimal setting, we can instead have a very “liquid” market with a large number of offered tasks and a large number of workers working on these tasks. This can lead to stock-market-like pricing. The tasks get completed by the workers in priority order according to the offered price for the work unit: the highest-paying units get completed first. So, if requesters want to prioritize their own tasks, they can simply price them above the current market price. This corresponds to an increase in demand, which moves the market price up. On the other hand, if no requesters post tasks, then once the tasks with the highest prices are completed, we automatically move to the tasks with lower prices. This corresponds to the case where the supply of work exceeds the demand, and the market price for the work unit moves down.
In cases where there is not enough “liquidity” in the market (i.e., when workers are not willing to work for the posted prices), we can employ automated market makers, such as the ones currently used by prediction markets. The process would then operate like this: workers identify the price at which they are willing to work. The automated market maker then takes into consideration the “ask” (the worker quote) and the “bid” (the price of the task), and can perform the trade by “bridging” the difference. Essentially, such automated market makers provide a subsidy in order for the transactions to happen. We should note that a market owner can typically benefit even in scenarios where they need to subsidize the market through an automated market maker: the fee from a transaction that happens can cover the subsidy consumed by the automated market maker.
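To illustrate the mechanics, here is a minimal sketch of a price-ordered queue for a single standardized work unit, with an automated market maker that can bridge small bid-ask gaps (the class name, prices, and subsidy cap are all made up for illustration):

```python
import heapq

# Hypothetical sketch: a market for one standardized work unit.
# Posted tasks (bids) are served highest-price-first; if a worker's ask
# exceeds the best bid, a market maker can bridge the gap with a subsidy.
class WorkUnitMarket:
    def __init__(self, max_subsidy=0.0):
        self._bids = []                 # max-heap of posted prices (stored negated)
        self.max_subsidy = max_subsidy

    def post_task(self, price):
        heapq.heappush(self._bids, -price)

    def request_work(self, worker_ask):
        """A worker asks for `worker_ask` per unit; return the matched price or None."""
        if not self._bids:
            return None
        best_bid = -self._bids[0]
        subsidy_needed = max(0.0, worker_ask - best_bid)
        if subsidy_needed <= self.max_subsidy:
            heapq.heappop(self._bids)
            return best_bid + subsidy_needed   # worker receives the bid plus the subsidy
        return None                            # not enough liquidity to trade

market = WorkUnitMarket(max_subsidy=0.02)
for p in (0.05, 0.08, 0.03):                   # three requesters post units
    market.post_task(p)
print(market.request_work(worker_ask=0.09))    # ~0.09: the 0.08 bid plus a 0.01 subsidy
print(market.request_work(worker_ask=0.10))    # None: the gap exceeds the subsidy cap
```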

Can we trade and price standardized crowdsourced tasks as we trade and price securities?

Having basic, standardized work units traded in highly liquid, high-volume markets can serve as a catalyst for companies to adopt crowdsourcing. Standardization can strengthen network effects, provide the basis for better reputation systems, facilitate pricing, and make it easier to develop more complicated tasks that consist of arbitrary combinations of small work units.

Constructing and Pricing Composite Tasks

Once we have some basic work units in place, we can start combining multiple such units to create tasks that cannot be achieved using the basic units alone. Again we can draw the analogy from mechanical engineering: the “simple machines” (screws, levers, wheel and axle, and so on) can be assembled together to build machines of arbitrary complexity. Similarly, in crowdsourcing we can take this standardized set of “simple work units” and assemble them to generate tasks of arbitrary complexity.

Quality Assurance

Assume that we have a basic work unit for a task such as comment moderation that guarantees an accuracy of 80% or higher (e.g., by continuously screening and testing the workers who can complete these tasks). If we want a work unit with stronger quality guarantees, we can generate a composite unit that combines multiple, redundant work units and relies on, say, majority vote to reach higher accuracy.
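As a back-of-the-envelope check of how much redundancy buys, here is the standard majority-vote calculation, under the (strong) assumption that workers err independently and are each correct with probability 0.8:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent workers (each correct
    with probability p) returns the correct answer; n is assumed odd."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# A single 80%-accurate work unit vs. composite units with redundancy 3 and 5.
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.8, n), 3))
# 1 0.8
# 3 0.896
# 5 0.942
```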

Pricing Workflows

There is already work available on how to create and control the quality of workflows in crowdsourced environments. We also have a set of design patterns for workflows in general. If we have a crowdsourced workflow that consists of standardized work units, we can also accurately price the overall workflow.

Pricing complex, workflow-based tasks becomes significantly easier when the basic execution units in the workflow are standardized and priced by the market.

We do not even have to reinvent the wheel: there is a significant amount of work on pricing combinatorial contracts in prediction markets. (An example of a combinatorial contract: “Obama will win the 2012 election and will win Ohio” or “Obama will win the 2012 election given that he will win Ohio”.) A workflow can be expressed as a combinatorial expression of the underlying simple work units. Since we know the price of standard units, we can easily leverage work from prediction markets to price tasks of almost arbitrary complexity. The successful deployment of Predictalot by Yahoo! during the 2010 soccer World Cup, with the extensive real-time pricing of complicated combinatorial contracts, gives us the confidence that such a pricing mechanism is also possible for online labor markets.
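In its simplest form, ignoring the richer combinatorial pricing machinery, a workflow built from priced units can be costed by composition. A sketch, with made-up unit prices:

```python
# Hypothetical market prices for standardized work units (dollars per unit).
unit_price = {"transcribe_segment": 0.10, "verify_transcript": 0.04}

# A simple workflow: one transcription unit, verified redundantly by 3 workers.
workflow = [("transcribe_segment", 1), ("verify_transcript", 3)]

total = sum(unit_price[unit] * count for unit, count in workflow)
print(f"Estimated workflow price: ${total:.2f}")   # $0.22
```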

Timing and Optimizing Workflows

There is already a significant amount of work in distributed computing on optimizing the execution of task workflows in MapReduce-like environments. This research should be directly applicable in an environment where the basic computation is performed not by computers but by humans. Also, since the work units will be completed through easy-to-model waiting queues, we can leverage results from queueing theory to estimate how long a task will remain within the system: by identifying the critical parts of the execution, we can also identify potential bottlenecks and increase the offered prices only for the work units that critically affect the completion time of the overall task.
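For example, if we model the pool of workers pulling standardized units from a single queue as a textbook M/M/1 system, the expected time a unit spends in the system follows directly from the arrival and completion rates (the rates below are illustrative):

```python
# Rough sketch: model a queue of standardized work units as an M/M/1 system.
# lam = arrival rate of work units, mu = completion rate of the worker pool
# (both per hour; the numbers are illustrative).
def mm1_time_in_system(lam, mu):
    if lam >= mu:
        raise ValueError("queue is unstable: arrivals outpace completions")
    return 1.0 / (mu - lam)   # expected hours a unit spends waiting plus in service

print(mm1_time_in_system(lam=80, mu=100))   # 0.05 hours, i.e., about 3 minutes per unit
```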

Role of platforms

One helpful way to think about the role and incentives of online labor platforms is to consider that they are analogous to a commerce-promoting government in a traditional labor market. Most platforms levy an ad valorem charge and thus they have an incentive to increase the size of the total wage bill. While there are many steps these markets can take, their efforts fall into two categories:
  1. remedying externalities, and
  2. setting enforceable standards and rules, i.e., their “weights and measures” function.
Remedying Externalities

An externality is created whenever the costs and benefits from some activity are not solely internalized by the person choosing to engage in that activity. A negative example is pollution—the factory owner gets the goods, others get the smoke—while a positive example is yard beautification (the gardener works and buys the plants, others get to enjoy the scenery). Because the parties making the decision do not fully internalize the costs and benefits, activities producing negative externalities are (inefficiently) over-provided, and activities producing positive externalities are (inefficiently) under-provided. In such cases, “government” intervention can improve efficiency.

Road traffic is an example of a product with negative externalities.
Negative examples are easy to find in online labor markets—fraud is one example. Not only is fraud unjust, it also makes everyone else more distrustful, lowering the volume and value of trade. Removing bad actors helps ameliorate the market-killing problem of information asymmetry, as uncertainty about the quality of some good or service is often just the probability that the other trading partner is a fraud.

A positive example is honest feedback after a trade. Giving feedback is costly to both buyers and sellers: it takes time, and giving negative feedback invites retaliation or scares off future trading partners. In the negative case, the platform needs to fight fraud—not simply fraud directed at itself but fraud directed at others on the platform, which has a negative second-order effect on the platform creator. In the positive case, the firm can make offering feedback more attractive by offering rewards, making it mandatory, making it easier, changing rules to prevent retaliation, etc.

There are lots of options in both the positive and the negative case—the important point is that platform creators recognize externalities and act to encourage the positive ones and eliminate the negative ones. Individual participants do not have the incentives (or even the ability) to fix the negative externalities for all other market participants. For example, no employer has an incentive to publish his own evaluations of the workers who work for him, as this is a signal acquired at significant cost to the employer. This is a case where the market owner can provide the appropriate incentives and designs for the necessary transparency.

Setting Enforceable Standards

Task standardization will probably require buy-in from online labor markets and intermediaries. Setting cross-platform standards is likely to be a contentious process, as the introduction of standards gives different incentives to different firms, depending upon their business model and market share. However, at least within a particular platform, and ignoring competitors, there is a powerful incentive to create standards, as they raise the value of paid crowdsourcing and promote efficiency. For example, the market for SMS messages took off in the US only when the big carriers agreed on a common interoperable standard for sending and receiving SMS messages across carriers' networks.

Standardized units of measure facilitate transactions and give us the flexibility to create more complex units on top. Can we achieve the same standardization for labor?

In traditional markets, market-wide agreement about basic units of measure facilitates trade. In commodity markets, agreements about quality standards serve a similar role, in that buyers know what they are getting and sellers know what they are supposed to provide. (For example, electricity producers are required to produce electricity adhering to some minimum standards before being able to connect to the grid and sell to other parties.) It should be clear that having public standards makes quality assurance easier for the platform: enforcing standards on standardized units of work can be done much more easily than enforcing quality standards across a wide variety of ad hoc tasks. With such standards, it is easier to imagine platform owners more willingly taking on the role of testing for and enforcing quality standards for the participants that provide labor.

If we define weights and measures more broadly to include the verification of claims, the platform role becomes even wider. Platforms can verify credentials, test scores, work and payment histories, reputation scores, and every other piece of information that individuals cannot credibly report themselves. Individuals are also not able to credibly report the quality of their work, but at least with an objective standard, validating those claims is possible. (For example, one of the main innovations made by oDesk was that they logged a worker's time spent on a task, enabling truthful hourly billing.)

Conclusion

As our knowledge increases and platforms and practices mature, more work will be outsourced to remote workers. On the whole, we think this is a positive development, particularly because paid crowdsourcing gives people in poor countries access to buyers in rich countries, enabling a kind of virtual migration. 

At the same time, access to an on-demand, inexpensive labor force, more often than not, enables the creation of products and services that were not possible before: once you solve a problem that was deemed too costly to solve before, people start looking for the next thing to fix. This in turn generates more positions, more demand, and so on. It is a virtuous cycle, not Armageddon.

Saturday, February 18, 2012

Mechanical Turk vs oDesk: My experiences

[Necessary disclaimer: I work with the oDesk Research team as the "academic-in-residence." The experiences that I describe in this blog post are the reason that I started working with oDesk. I am not writing this because I started working with oDesk. And at the end of the day, I doubt that oDesk needs my blog posts to get visibility :-)]

A question that I receive often is how to structure tasks on Mechanical Turk when workers need to go through training before doing the task. My common answer to most such questions is that Mechanical Turk is not the ideal environment for them: when training and frequent interaction are required, an employer is typically better off using a site such as oDesk to hire people for the long term.

Mechanical Turk: The choice for short-term, bursty tasks

Mechanical Turk tends to shine in cases where demand is super bursty. A task appears out of nowhere, requires 1000 people to work on it for 2-3 hours each, and needs to get done within a couple of days. Then the task disappears, and everyone moves on. For such scenarios, I cannot think of a better alternative than Mechanical Turk.

The blessing and curse of the long tail

Why does Mechanical Turk allow easy scaling to a large number of workers? Because you can reach a large number of workers quickly. Admittedly, most people will just come, do a few tasks, and then disappear. The old saying "80% of the work gets done by 20% of the workers" typically translates on MTurk into "80% of the work gets done by 2% of the workers". But even the people who work on just a few tasks can contribute a significant amount of work in the aggregate.

But this is also a problem: workers who complete just a few tasks cannot be evaluated by any reasonable method of statistical quality control. To get a confident measurement of the true performance of a worker, it is not uncommon to require 500 tasks or more. It is highly unclear how you can convince a Turker to stick around for that long.
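A quick sanity check on why the number is so large: to estimate a worker's accuracy within a few percentage points, the usual binomial confidence-interval calculation (assuming independent tasks) already calls for several hundred judgments:

```python
from math import ceil

def tasks_needed(margin, z=1.96, p=0.5):
    """Tasks needed to estimate a worker's accuracy within +/- `margin`
    at ~95% confidence, using the worst-case binomial variance at p=0.5."""
    return ceil(z**2 * p * (1 - p) / margin**2)

print(tasks_needed(0.05))   # ~385 tasks for +/-5%
print(tasks_needed(0.04))   # ~601 tasks for +/-4%
```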

The task listing interface interferes with task completion times

Since workers tend to rank tasks either by "most recent" or by "most HITs available", the allocation of visibility varies significantly across tasks. If a task gets buried on the 5th or 6th page of the results, it is effectively dead. Nobody looks at it anymore, and the rate of completion gets pretty close to zero. Such tasks are effectively abandoned and will never finish. You need to "refresh" the task by posting some extra HITs within it, take the task down and repost it, or play other tricks to get people to look at your task again. Needless to say, this is completely unnecessary overhead, a pure result of bad design.

The curse of simplicity

Unfortunately, the ability to scale on demand has some additional drawbacks that are more subtle but, in the end, more important. The key problem: the need for simplicity.

When you suddenly require 1000 new people to work on your task, it is advisable to structure the task as if planning for the worst-case scenario. This means that every worker is treated as a first grader; the tasks should be described in the simplest way possible. This often necessitates workflows that chop the tasks into tiny, easily digestible pieces, effectively embedding "training" in the process.

As an example, consider the case of classifying a page as containing "hate speech". Since it is not possible to get the workers to watch a 1-hour tutorial on what exactly is considered hate speech, the task on Mechanical Turk ends up being a loooong list of questions, such as "Do you see any racist jokes?", "Do you see any mention of male chauvinism?", "Do you see any call for violence against a specific ethnic group?" etc. Such brain-dead-simple workflows can ensure quality even when the workers are completely inexperienced. With such workflows it is also easy to defend against potential attacks from scammers who may try to submit junk, hoping to get paid for sub-par work.
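A sketch of the idea; the questions and the aggregation rule below are simplified placeholders, not the actual AdSafe workflow:

```python
# Hypothetical decomposition of "does this page contain hate speech?" into
# simple yes/no checks that an untrained worker can answer reliably.
CHECKLIST = [
    "Do you see any racist jokes?",
    "Do you see any call for violence against a specific ethnic group?",
    "Do you see any slurs targeting a religious group?",
]

def label_page(answers):
    """answers: list of booleans, one per checklist question.
    A single 'yes' flags the page (placeholder rule, for illustration only)."""
    return "hate_speech" if any(answers) else "clean"

print(label_page([False, True, False]))   # hate_speech
```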

However, there is a catch: such micro-task workflows start getting in the way once workers become more experienced. A worker who has spent a few hours examining pages for hate speech has all these questions in his brain and can process a page much faster. The clickety-click approach with simple, easy-to-chew questions worked early on, to train the worker, but now it is a tedious micromanager embedded in the system.

oDesk: The choice for long-term tasks

When the tasks are expected to last for many days, weeks, or months, Mechanical Turk is often a suboptimal choice. The continuous need to fight scammer workers, the inability to interact easily with the workers, and so on, make it much easier to just go on oDesk and hire a few people there to work on the task.

How I learned about oDesk as a platform for "micro"-work

While I knew about oDesk as an alternative to Rent-A-Coder and eLance, I never thought of oDesk as a platform for completing tasks similar to the ones done on Mechanical Turk. At HCOMP 2010, though, I learned about the efforts of Metaweb, which used oDesk and paid workers on an hourly basis, as opposed to paying piecemeal. This allowed them to get workers to focus on the hard cases; on MTurk, people have the incentive to skip the hard cases and perform only the easy tasks that can be done very quickly.

I had seen this problem with the AdSafe tasks that we were running on Mechanical Turk: workers were doing a decent job classifying pages in the easy cases, but if a page was hard to classify (e.g., if you had to read the text to understand its true content, as opposed to looking at the images), workers would just skip it or give a random answer. To fight this problem, I decided to give it a shot and hire a team of approximately 30 workers from oDesk to annotate web pages.

Migrating from Mechanical Turk to oDesk

Although the migration of a task from MTurk to oDesk sounds tedious, it is often pretty simple, and this is due to a design flaw (?) of Mechanical Turk. What is this flaw? If you use the Mechanical Turk capabilities for building a HIT, you are very restricted in terms of what HTML you can use, and what subset of JavaScript. The solution for anyone who wants to do anything moderately complicated is to build a bespoke HTML interface and host it within an iframe on the MTurk website. This "iframe-based MTurk HIT" is effectively a custom web application. Such a web application is trivially easy to adapt to handle workers from any platform: instead of logging in using the MTurk worker id, workers from other platforms can log in directly on your website. The added bonus? The workers can use the full screen real estate.
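A minimal sketch of what I mean, assuming a small Flask app: the iframe-based HIT receives the MTurk worker id as a query parameter, while workers hired elsewhere (e.g., on oDesk) log in with credentials you send them. The route names, credential store, and session handling are made up for illustration:

```python
from flask import Flask, request, session, redirect

app = Flask(__name__)
app.secret_key = "replace-me"          # placeholder

# Hypothetical credentials for workers hired on oDesk (sent in the welcome email).
ODESK_ACCOUNTS = {"worker42": "s3cret"}

@app.route("/task")
def task():
    # MTurk iframe HITs append the workerId as a query parameter.
    mturk_worker = request.args.get("workerId")
    if mturk_worker:
        session["worker"] = f"mturk:{mturk_worker}"
    if "worker" not in session:
        return redirect("/login")      # non-MTurk workers log in with their own credentials
    return f"Serving the labeling interface to {session['worker']}"

@app.route("/login", methods=["GET", "POST"])
def login():
    if request.method == "POST":
        user, pwd = request.form["user"], request.form["password"]
        if ODESK_ACCOUNTS.get(user) == pwd:
            session["worker"] = f"odesk:{user}"
            return redirect("/task")
    return ('<form method="post"><input name="user">'
            '<input name="password" type="password"><button>Log in</button></form>')
```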

When I am using oDesk, I tend to hire people with minimal checking, and as part of the welcome message, the workers receive an email with their username and password for my website that hosts the MTurk HITs. I noticed lately that oDesk has an API as well, which can be used to further automate the process. But even hiring workers manually, I could rather easily handle hiring 30-50 workers, who then effectively become permanent employees, working only on my tasks and getting paid hourly.

One of the things that I want to learn to do more effectively is to use the oDesk API to open job slots and hire people. While oDesk does not provide direct capabilities for creating a UI for task handling and execution, I do not use the MTurk UI in any case. So, this is a functionality that I do not really miss.

Providing training and interacting with oDesk workers

When the need for human labor is long-term, it makes sense to ask the oDesk workers to first spend some time familiarizing themselves with the task, watching some training videos, etc. Even asking them to spend 5-6 hours training is not an unusual request, and most oDeskers will happily oblige: they know that there is plenty of work coming up, so they do not mind spending their "first day at work" familiarizing themselves with the task that you give them. They prefer to keep a stable job instead of having to continuously look around for new projects.

A neat trick that I learned at oDesk is the following: ask your workers to join a common Skype chatroom (or some other chatroom of your choice). Using this chatroom, you can communicate with your workers in real time, informing them about system issues, directing them to work on specific tasks, giving clarifications, etc. I personally find that setting quite amazing, and it makes me feel like a modern-day factory owner :-). I drop by to say hello to the workers, I ask for feedback, workers welcome the new members and provide clarifications and training, and so on. In general, a very happy atmosphere :-)

Lessons on quality control from MTurk, being applied to oDesk

I have to admit, though, that the MTurk experience makes working with oDesk workers much more effective. When working on MTurk tasks, all requesters tend to develop various schemes of quality control to measure the performance of each worker. These metrics make life much easier when managing big teams on oDesk. Effectively, you get automatic measurements of performance that allow easy discovery of problematic workers.
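A stripped-down version of the kind of bookkeeping I mean; the gold labels and thresholds are placeholders:

```python
from collections import defaultdict

# Minimal gold-based quality tracking: compare each worker's answers on
# "gold" items (items with known answers) and flag low-accuracy workers.
GOLD = {"page_17": "clean", "page_42": "hate_speech"}   # hypothetical gold labels

stats = defaultdict(lambda: {"correct": 0, "total": 0})

def record(worker, item, answer):
    if item in GOLD:
        stats[worker]["total"] += 1
        stats[worker]["correct"] += int(answer == GOLD[item])

def flagged_workers(min_accuracy=0.8, min_gold=20):
    return [w for w, s in stats.items()
            if s["total"] >= min_gold and s["correct"] / s["total"] < min_accuracy]
```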

I have had experiences in the past with workers who were very articulate, very enthusiastic, very energetic, and... completely sucked at the task at hand. In a regular work environment, such workers may never be identified as problematic cases. They are the life of the company; they bring the vibe and the energy. But the quality management schemes, developed for the quality challenges of handling Mechanical Turk tasks, become useful on oDesk as well.

The extras

Extra bonus 1? On oDesk, I never had to deal with a scammer, and nobody attempted to game the system. oDesk runs a pretty strong identity verification scheme, which ties each worker to a real-world identity, as opposed to the disposable MTurk worker IDs. (I will explain in a later post how easy it is to bypass the identity verification step on MTurk.) The very fact that there is a basic reputation system (with its own flaws, but that is a topic for another post) makes a huge difference in how workers approach the posted tasks.

Extra bonus 2? The hired oDeskers work only on your tasks! You do not have to worry about a task being buried on the 12th page of the results, and there is no need to play SEO-style tricks to get visibility. You allocate a workforce to your task, and you proceed without worrying about the minute-by-minute competition from other requesters.

The increased cost of oDesk

A "disadvantage" of oDesk is that most of the work ends up being more "expensive" than Mechanical Turk. However, this only holds when you substitute a Turker with an oDesker in an one-to-one basis. This is, however, a very short-sighted approach. Given the higher average quality of the oDeskers, it is often possible to reduce the overhead of quality assurance: Fewer gold test and lower redundancy can decrease significantly the cost of a task. Therefore, when we would run a task on MTurk with a redundancy of 5 or 7, we can reach the same level of quality with just a couple (or just one) oDesk workers.

What I miss in oDesk, part I: Quick access to many workers

What I tend to miss in oDesk is the ability to get a very large number of workers working on my tasks within minutes of posting. Of course, this is not surprising: on oDesk people are looking for reasonably long tasks, worth at least a few dollars. On MTurk we also get very few people who will stay with a task for long. I am trying to assess objectively what I miss, though. While I get this pleasant feeling that my task started very quickly, this nice fuzzy feeling has the downside that, for reasonably big tasks, the initial speed is never indicative of the overall speed of completion.

I am trying to think about how it would be possible to build a real-time platform with people willing to work on tasks for long periods. I am looking forward to reading more ideas from Rob Miller, Jeff Bigham, Michael Bernstein, and Greg Little on how to accomplish this in cases where we want people accessible in real time and also want the workers to keep working on our tasks for long-ish periods of time.

What I miss in oDesk, part II: The mentality of a computer-mediated interaction

The last issue with oDesk is that it is fundamentally designed for human-to-human interaction. Workers do not expect to interact with an automated process for being hired, assigned a task, and evaluated. I am thinking that perhaps oDesk should have a flag indicating that a particular task is "crowdsourced", meaning that there is no interview process for being hired; instead, hiring is mediated by a computing process. While I would love oDesk to allow for such a shift, I am not sure how easy it is to take a live system with hundreds of thousands of participants and introduce such (rather drastic) changes. Perhaps oDesk can create some "oLabs" products (a la Google Labs) to test such ideas...

Conclusion and future questions

While my research has focused on MTurk for quite a few years, I reached a point where I got tired of fighting scammers just to get basic results back. The oDesk environment allowed me to actually test the quality control algorithm without worrying about adversarial players trying to bypass all measures of quality control.

The fact that I am happy with hiring people through oDesk does not mean that I am fully satisfied with the capabilities of the platform. (You would not expect me to be fully satisfied, would you?)

Here are a few of the things that I want to see:
  • Simpler API. The current one can be used to post tasks and hire people automatically, but it was never designed for this purpose; it was designed mainly to allow people to use oDesk through their own interfaces/systems, as opposed to using the main oDesk website. A nice tutorial taking newcomers through all the steps would be a welcome add-on. (I miss the tutorials that came with Java 1.0...)
  • Better access to the test scores and qualifications of the contractors. This will allow for better algorithms for automatic hiring and automatic salary negotiation. ("Oh you have a top-1% on the Java test, this deserves a 20% increase in salary.") I see that part as a very interesting research direction as well, as I expect labor to be increasingly mediated by computing processes in the future.
  • Support for real-time task execution by having pools of people waiting on demand. This introduces some nice research questions on how to best structure the pricing and incentives for workers to be waiting for tasks to be assigned to them. The papers published over the last year by the MIT et al crowd provide interesting glimpses of what applications to expect.
  • Support for shifts and scheduling. This is a heretic direction for crowdsourcing in my mind, but a very real need. For many tasks we have a rough idea of demand fluctuations over time. Being proactive and scheduling the crowd to appear when needed can lead to the implementation of real production systems that cannot rely on the whims of the crowd.
  • [?] Standardized tasks. With John Horton, we wrote in the past a brief position paper describing the need for standardization in crowdsourcing. Although I would love to see this vision materialize, I am not fully convinced myself that it is a realistic goal. Given the very high degree of vertical expertise necessary for even the most basic tasks, I cannot see how any vendor would be willing to let others use the specialized interfaces and workflows required to accomplish a standard task. As a researcher, I would love to see this vision happen, but I am pessimistic about the incentives that people have to adopt this direction.
  • [?] Support for task building and quality control. I am not fully convinced that this is something that a labor platform needs to support, but it is definitely on my wish list. This is, of course, something that I would like to see on MTurk as well. On the other hand, I see that most experienced employers use their own bespoke, customized, and optimized workflows; they also have their own bespoke quality control systems. So, I am not fully convinced that providing basic workflow tools and basic quality control would be a solution for anybody: too basic for the advanced users, too complex for the beginners. Again, I would love to see this happen as a researcher, but I am pessimistic about the practical aspects.
Any other ideas and suggestions?

Tuesday, February 7, 2012

ACM EC 2012: Some early statistics

This year, Kevin Leyton-Brown and I are co-chairing ACM EC 2012, the 13th ACM Conference on Electronic Commerce, which will be held in Valencia, Spain, from June 4th to June 8th.

Today was the submission deadline, and honestly I was a little bit worried about the number of submissions. 11 hours before the deadline we had just 119 submissions, a number significantly lower than for most of the recent EC conferences.

My worry did not last long. After observing the number of new papers per hour and extrapolating quickly, I realized that we were going to get a large number of additional submissions. The extrapolation from the regression showed that we should expect about 210 submissions, maybe a little lower if the submission rate slowed closer to the deadline. The answers on Twitter indicated that most probably the opposite would happen. In fact, here are the submissions over time:


Yes, most of the papers were submitted just a few hours before the deadline.
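For the record, the extrapolation was nothing fancy; something along these lines, where only the 119 submissions at 11 hours out is a real number and the later hourly counts are illustrative:

```python
import numpy as np

# Cumulative submission counts in the final hours before the deadline.
# Only the 119 at 11 hours out is real; the later counts are illustrative.
hours_before_deadline = np.array([11, 10, 9, 8, 7, 6])
cumulative_submissions = np.array([119, 127, 135, 144, 152, 160])

# Fit a line to the recent trend and extrapolate to the deadline (hour 0).
slope, intercept = np.polyfit(-hours_before_deadline, cumulative_submissions, deg=1)
print(round(intercept))   # ~210 with these numbers; the actual total was 225
```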

By the deadline, we had a total of 225 papers uploaded, an all-time high number of submissions. Given that this is the first time that EC will be held outside a major city in the US (and that such moves typically mean a lower number of submissions and attendance), we are more than happy with the number of submissions.

This year we also instituted the concept of tracks, to guarantee to the authors that their papers will be reviewed by reviewers in the same area. (A common perception is that EC is dominated by theorists who are hostile to empirical and applied work, so this separation should alleviate this concern.) Here is the approximate breakdown across tracks:
  • 50% Theory & Foundations
  • 15% Artificial Intelligence
  • 15% Empirical & Applications
  • 10% Theory+AI
  • 5% Theory+Empirical
  • 5% AI+ Empirical
  • 0% in all three
We will also introduce a new concept at EC this year: anyone who has a paper related to EC, published or accepted for publication in another venue over the last year, will be able to come and present the work as a poster. We hope that this will allow the conference to serve as a meeting place for exchanging ideas about the field, in addition to being a venue where novel research is presented for the first time. We will post the details soon on the official website of EC'12.

Looking forward to seeing you in Valencia!