The Need for Standardization in Crowdsourcing

[This is the blog version of a brief position paper that John Horton and I wrote last year on the advantages of standardization for crowdsourcing. Edited for brevity. Random pictures added for fun.]

Crowdsourcing has shown itself to be well-suited for the accomplishment of various tasks. Yet many crowdsourceable tasks still require extensive structuring and managerial effort to make crowdsourcing feasible. This overhead could be substantially reduced via standardization. In the same way that task standardization enabled the mass production of physical goods, standardization of basic “building block” tasks would make crowdsourcing more scalable. Standardization would make it easier to set prices, spread best practices, build meaningful reputation systems and track quality.

Why standardize?

Crowdsourcing has emerged over the last few years as a promising solution for a variety of problems. What most of these problems have in common is one or more sub-problems that cannot be fully automated and require human labor. This labor demand is being met by workers recruited from online labor markets such as Amazon Mechanical Turk, MicroTask, oDesk and Elance, or by casual participants recruited by intermediaries like CrowdFlower and CloudCrowd.

In labor markets, buyers and sellers have great flexibility in the tasks they propose and in making and accepting offers. The flexibility of online labor markets is similar to the flexibility of traditional labor markets. In both markets, buyers and sellers are free to trade almost any kind of labor on almost any terms. However, an important distinction between online and offline is that once a worker is hired in a traditional, offline market, they are not allocated to tasks via a spot market. Workers within firms are employees who have been screened and trained for their jobs and who have incentives for good performance: at a minimum, poor performance can cost them their jobs. Furthermore, for many jobs, particularly those focusing on the production of physical goods, good performance is very well defined, in that workers must adhere to a standard set of instructions.

This standardization of tasks is the essential feature of modern production. The question is how to apply this idea in crowdsourcing.

Crowdsourcing, the dream: The assembly line for knowledge work

With task standardization, innovators like Henry Ford could ensure that hired workers, after suitable training, could complete those tasks easily and predictably, and that the training was easy to replicate for new workers. To return to paid crowdsourcing, most of the high-demand crowdsourcing tasks are relatively low-skilled and require workers to closely and consistently adhere to instructions for a particular, standardized task.

Crowdsourcing, the reality: The bazaar of knowledge work

As it currently stands, existing crowdsourcing platforms bear little resemblance to Henry Ford’s car plants. In crowdsourcing markets, the factory would be more like an open bazaar where workers could come and go as they pleased, receiving or making offers on tasks that differ in their difficulty and skill requirements (“install engines!”, “add windshields!”, “design a new chassis!”), for different rates of pay and with different pricing structures (fixed payment, hourly wages, incentives, etc.). Some buyers would be offering work on buses, some on cars, some on lawnmowers. Reputations would be weak and easily subverted. Among both buyers and sellers, one can find scammers; some buyers are simply recruiting accomplices for nefarious activities.

The upside of such a disorganized market is that workers and buyers have lots of flexibility. There are good reasons for not wanting to just recreate the on-line equivalent of a single-firm factory. However, we do not think it is an “either-or” proposition. In this paper, we discuss ways that we can have more structure on a marketplace platform without undermining its key advantages. In particular, we believe that greater task standardization, a curated-garden approach to work pools, and a market-making work allocation mechanism that helps arrive at prices could help us build scalable human-powered systems that meet real-world needs.

Current status

Despite the excitement and apparent industry maturation, there has been relatively little innovation, at least at the micro-work level, in the technology of how workers are allocated tasks, how reputation is managed, how tasks are presented, etc. As innovative as MTurk is, it is basically unchanged since its launch. The criticisms of MTurk (the difficulty of pricing work, the difficulty of predicting completion times and ensuring quality, the inadequacy of the way that workers can search for tasks) are recurrent and still unanswered. Would-be users of crowdsourcing often fumble, with even technically savvy users getting mixed results. Best practices feel more like folk wisdom than an emerging consensus. Even more troubling, there is some evidence that at least some markets are becoming inundated with spammers.

uTest: An example of verticalized crowdsourcing

One part of the crowdsourcing ecosystem that appears to be thriving is the “curated garden” approach used by companies like uTest (testing software), MicroTask (quality assurance for data entry), CloudCrowd (proofreading and translation), and LiveOps (call centers). These firms recruit and train workers for their standardized tasks, and they set prices on both sides of the market. Because the task is relatively narrow, it is easier to build meaningful, informative feedback and to verify ex ante that workers can do the task, rather than try to screen out bad work ex post. While this kind of control is not free, practitioners gain the scalability and cost savings of crowdsourcing without the confusion of the open market. The downside of these walled gardens is that access as both a buyer and seller is limited. One of the great virtues of more market-like platforms is that they are democratic and easy to experiment on. The natural question is whether it is possible to create labor pools that look more like curated gardens, with well-defined, standardized tasks, and yet are still relatively open to both new buyers and new sellers.

Standardizing basic work units

Currently, the labor markets operate in a completely uncoordinated manner. Every employer generates its own work request, prices the request independently, and evaluates the answers separately from everyone else. Although this approach has some intuitive appeal in terms of worker and employer flexibility, it is fundamentally inefficient:

Every employer has to implement from scratch the “best practices” for each type of work. For example, there are multiple UIs for labeling images, or for transcribing audio. Long-term employers learn from their mistakes and fix the design problems, while newcomers have to learn the lessons of bad design the hard way.
Every employer needs to price its work unit without knowing the conditions of the market, and this price cannot fluctuate without removing and reposting the tasks.
Workers need to learn the intricacies of the interface for each separate employer.
Workers need to adapt to the different quality requirements of each employer.

The efficiency of the market can increase tremendously if there is at least some basic standardization of the common types of (micro-)work that is being posted on online labor markets.

So, what are these common types of (micro-)work that we can standardize? Amazon Mechanical Turk lists a set of basic templates, which give a good idea of what tasks are good candidates to standardize first. The analysis of the Mechanical Turk marketplace also indicates a set of tasks that are very frequent on Mechanical Turk and are also good candidates to standardize.

Simple Machines, the standardized units for mechanics. Can we create corresponding simple machines for labor?

We can draw a parallel with engineering: In mechanics, we have a set of “simple machines,” such as screws, levers, wheel and axle, and so on. These simple machines are typically standardized and serve as components for larger, significantly more complicated creations. Analogously, in crowdsourcing, we can define a set of such simple tasks, standardize them, and then build, if necessary, more complicated tasks on top. What are the advantages of standardizing the simple tasks, if we only need them as components?

Reusability: First of all, as mentioned above, there is no need for requesters to think about how to create the user interfaces and best practices for such simple tasks. These standardized tasks can, of course, be revised over time to reflect our knowledge of how best to accomplish them.
Trading commodities: Second, and potentially more important, these simple tasks can be traded in the market in the same way that stocks and commodities are currently traded in financial markets. In stock markets, the buyer does not need to know who is the seller, or whether the order was fulfilled by a single seller or multiple ones: it is the task of the market maker to match and fulfill buy and sell orders. In the same way, we can have a queue of standardized tasks that need to be completed, and workers can complete them at any time, without having to think about the reputation of the requester or to refamiliarize themselves with the task. This should lead to much more efficient task execution.
True market pricing: A third advantage of standardized work units is that pricing becomes significantly simpler. Instead of “testing the market” to see which price point leads to an optimal setting, we can have a very “liquid” market with a large number of offered tasks and a large number of workers working on them. This can lead to stock-market-like pricing. Workers complete the tasks in priority order according to the offered price for the work unit: the highest-paying units get completed first. So, if requesters want to prioritize their own tasks, they can simply price them higher than the current market price. This corresponds to an increase in demand, which moves the market price up. On the other hand, if no requesters post new tasks, then once the tasks with the highest prices are completed, workers automatically move on to the tasks with lower prices. This corresponds to the case where the supply of work is higher than the demand, and market prices for the work unit move down.
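
As a rough illustration of the price-priority idea, here is a minimal Python sketch of a queue of standardized work units in which workers always pick up the highest-paying open unit. The names (WorkUnitQueue, post, next_unit, market_price) are hypothetical and do not correspond to any existing platform API. Prices are in integer cents, ties are broken by posting order, and the "market price" is taken to be the best open offer; all three are assumptions made only for this sketch.

```python
import heapq
import itertools


class WorkUnitQueue:
    """Queue of standardized work units, served highest offered price first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier posts first

    def post(self, requester, price_cents):
        # heapq is a min-heap, so store the negated price
        heapq.heappush(self._heap, (-price_cents, next(self._counter), requester))

    def next_unit(self):
        """Return (requester, price_cents) of the best-paying open unit, or None."""
        if not self._heap:
            return None
        neg_price, _, requester = heapq.heappop(self._heap)
        return requester, -neg_price

    def market_price(self):
        """Price of the unit a worker would pick up next, or None if empty."""
        return -self._heap[0][0] if self._heap else None


queue = WorkUnitQueue()
queue.post("requester_a", 5)
queue.post("requester_b", 3)
queue.post("requester_c", 8)  # prices above the others to be served first

# Workers drain the queue in descending price order.
served = [queue.next_unit() for _ in range(3)]
```

In this toy run requester_c is served first despite posting last, and once the high-priced units are gone the next worker sees the lower-priced ones, which is the price movement described above.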

In cases where there is not enough “liquidity” in the market (i.e., when the workers are not willing to work for the posted prices), we can employ automated market makers, such as the ones currently used by prediction markets. The process would then operate like this: The workers identify the price for which they are willing to work. Then, the automated market maker takes into consideration the “ask” (the worker quote) and the “bid” (the price of the task), and can perform the trade by “bridging” the difference. Essentially, such automated market makers provide a subsidy in order for the transactions to happen. We should note that a market owner can typically benefit even in scenarios where they need to subsidize the market through an automated market maker: the fee from each completed transaction can cover the subsidy consumed by the automated market maker.
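
To make the “bridging” step concrete, here is a deliberately simplified Python sketch. It is not the scoring-rule style of market maker that prediction markets use; it only illustrates the bid/ask/subsidy arithmetic. The function name bridge_trade, the percentage fee charged on top of the bid, and the rule that a trade closes only when the fee covers the subsidy are all assumptions made for this example.

```python
def bridge_trade(bid_cents, ask_cents, fee_percent):
    """Try to close a trade between a requester's bid and a worker's ask.

    Hypothetical rule for this sketch: the market maker pays the gap between
    ask and bid only when the platform fee on the trade covers that gap.
    Returns a summary dict, or None when no trade happens.
    """
    fee = bid_cents * fee_percent / 100      # ad valorem fee on the bid (assumed)
    subsidy = max(0, ask_cents - bid_cents)  # gap the market maker bridges
    if subsidy > fee:
        return None                          # fee cannot cover the subsidy
    return {
        "worker_paid": bid_cents + subsidy,
        "subsidy": subsidy,
        "platform_net": fee - subsidy,
    }


covered = bridge_trade(bid_cents=100, ask_cents=104, fee_percent=10)
too_wide = bridge_trade(bid_cents=100, ask_cents=130, fee_percent=10)
no_gap = bridge_trade(bid_cents=100, ask_cents=90, fee_percent=10)
```

With a 10% fee, a 4-cent gap is bridged and the platform still nets 6 cents, while a 30-cent gap is left unmatched. A real market maker would likely bound its total loss across many trades instead of checking each trade in isolation.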

Can we trade and price standardized crowdsourced tasks as we trade and price securities?

Having basic, standardized work units with highly liquid, high-volume markets can serve as a catalyst for companies to adopt crowdsourcing. Standardization can strengthen the network effects, can provide the basis for better reputation systems, can facilitate pricing, and can lead to the easier development of more complicated tasks that consist of an arbitrary combination of small work units.

Constructing and pricing composite tasks

Once we have some basic work units in place, we can start combining multiple such units into tasks that cannot be accomplished with basic units alone. Again we can draw an analogy with mechanical engineering: the “simple machines” (screws, levers, wheel and axle, and so on) can be assembled together to generate machines of arbitrary complexity. Similarly, in crowdsourcing we can use this standardized set of “simple work units,” which can later be assembled to generate tasks of arbitrary complexity.

Quality Assurance

Assume that we have a basic work unit for a task such as comment moderation that guarantees an accuracy of 80% or higher (e.g., by continuously screening and testing the workers who can complete these tasks). If we want a work unit with higher quality guarantees, we can generate a composite unit that uses multiple, redundant work units and relies on, say, majority vote to reach the higher quality level.
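
How much does redundancy buy? A small Python sketch can estimate it, under assumptions that go beyond the text: the decision is binary, each unit is correct independently with the same probability, and the number of redundant units is odd so that a majority always exists. The function name majority_vote_accuracy is made up for this example.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent units is correct.

    p: probability that a single work unit gives the right answer
    n: number of redundant units (odd, so ties cannot occur)
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of units to avoid ties")
    needed = n // 2 + 1  # smallest number of correct answers that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.8, n), 5))
```

Under these assumptions, three 80%-accurate units voting together reach about 89.6% accuracy and five reach about 94.2%. If worker errors are correlated (for example, everyone misreads the same ambiguous comment), the real gain would be smaller.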

Pricing Workflows

There is already work available on how to create and control the quality of workflows in crowdsourced environments. We also have a set of design patterns for workflows in general. If we have a crowdsourced workflow that consists of standardized work units, we can also accurately price the overall workflow.
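
In the simplest case, where a workflow is just a fixed list of steps, its price is the sum of the market prices of the units it consumes. The Python sketch below shows only that additive case; the unit names, the prices, and the (unit type, items, redundancy) step format are invented for illustration.

```python
def workflow_price(steps, unit_prices):
    """Total price in cents of a workflow built from standardized units.

    steps: list of (unit_type, items, redundancy) tuples
    unit_prices: current market price in cents for each unit type
    """
    return sum(
        unit_prices[unit] * items * redundancy
        for unit, items, redundancy in steps
    )


unit_prices = {"transcribe_audio": 10, "moderate_comment": 2}  # made-up prices
steps = [
    ("transcribe_audio", 100, 1),  # 100 clips, one worker each
    ("moderate_comment", 100, 3),  # 100 transcripts, 3 redundant votes each
]
total_cents = workflow_price(steps, unit_prices)
print(total_cents)
```

Workflows with conditional branches or loops would need expected counts per step in place of fixed ones.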

Pricing complex, workflow-based tasks becomes significantly easier when the basic execution units in the workflow are standardized and priced by the market.

We do not even have to reinvent the wheel: there is a significant amount of work on pricing combinatorial contracts in prediction markets. (An example of a combinatorial contract: “Obama will win the 2012 election and will win Ohio” or “Obama will win the 2012 election given that he will win Ohio”.) A workflow can be expressed as a combinatorial expression of the underlying simple work units. Since we know the price of standard units, we can easily leverage work from prediction markets to price tasks of almost arbitrary complexity. The successful deployment of Predictalot by Yahoo! during the 2010 soccer World Cup, with the extensive real-time pricing of complicated combinatorial contracts, gives us the confidence that such a pricing mechanism is also possible for online labor markets.

Timing and Optimizing Workflows

There is already a significant amount of work in distributed computing on optimizing the execution of task workflows in MapReduce-like environments. This research should be directly applicable in an environment where the basic computation is performed not by computers but by humans. Also, since the work units will be completed through easy-to-model waiting queues, we can leverage work from queuing theory to estimate how long a task will remain within the system: by identifying the critical parts of the execution, we can also identify potential bottlenecks and increase the offered prices for only the work units that critically affect the completion time of the overall task.
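
One way to find the work units that “critically affect the completion time” is a standard critical-path computation over the workflow graph. The Python sketch below assumes that an expected duration for each unit is already available (for instance, from a queueing estimate) and that the workflow is a directed acyclic graph; the step names and durations are invented. It uses graphlib, which ships with Python 3.9 and later.

```python
from graphlib import TopologicalSorter


def critical_path(durations, depends_on):
    """Return (total_time, path) for the longest chain of dependent units.

    durations: unit -> expected completion time in minutes
    depends_on: unit -> set of units that must finish first
    """
    graph = {unit: set(depends_on.get(unit, ())) for unit in durations}
    finish = {}    # earliest finish time of each unit
    previous = {}  # predecessor that determines each unit's start time
    for unit in TopologicalSorter(graph).static_order():
        start, previous[unit] = 0, None
        for pred in graph[unit]:
            if finish[pred] > start:
                start, previous[unit] = finish[pred], pred
        finish[unit] = start + durations[unit]

    # Walk back from the unit that finishes last.
    node = max(finish, key=finish.get)
    total, path = finish[node], []
    while node is not None:
        path.append(node)
        node = previous[node]
    return total, path[::-1]


durations = {"transcribe": 30, "translate": 45, "proofread": 20, "merge": 10}
depends_on = {
    "translate": {"transcribe"},
    "proofread": {"transcribe"},
    "merge": {"translate", "proofread"},
}
total_minutes, path = critical_path(durations, depends_on)
print(total_minutes, path)
```

In this toy workflow, raising the price of the proofreading units would not shorten the overall task, because proofreading is not on the critical path; raising the price of the translation units might.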

Role of platforms

One helpful way to think about the role and incentives of online labor platforms is to consider that they are analogous to a commerce-promoting government in a traditional labor market. Most platforms levy an ad valorem charge and thus they have an incentive to increase the size of the total wage bill. While there are many steps these markets can take, their efforts fall into two categories:

remedying externalities, and
setting enforceable standards and rules, i.e., their “weights and measures” function.

Remedying Externalities

An externality is created whenever the costs and benefits from some activity are not solely internalized by the person choosing to engage in that activity. A negative example is pollution—the factory owner gets the goods, others get the smoke—while a positive example is yard beautification (the gardener works and buys the plants, others get to enjoy the scenery). Because the parties making the decision do not fully internalize the costs and benefits, activities producing negative externalities are (inefficiently) over-provided, and activities producing positive externalities are (inefficiently) under-provided. In such cases, “government” intervention can improve efficiency.

Road traffic is an example of a product with negative externalities.

Negative examples are easy to find in on-line labor markets; fraud is one. Not only is fraud unjust, it also makes everyone else more distrustful, lowering the volume and value of trade. Removing bad actors helps ameliorate the market-killing problem of information asymmetry, as uncertainty about the quality of some good or service is often just the probability that the other trading partner is a fraud.

A positive example is honest feedback after a trade. Giving feedback is costly to both buyers and sellers: it takes time, and giving negative feedback invites retaliation or scares off future trading partners. In the negative case, the platform needs to fight fraud: not simply fraud directed at itself but also fraud directed at others on the platform, which has a negative second-order effect on the platform creator. In the positive case, the firm can make offering feedback more attractive by offering rewards, making it mandatory, making it easier, changing rules to prevent retaliation, etc.

There are lots of options in both the positive and negative case. The important point is that platform creators recognize externalities and act to encourage the positive ones and eliminate the negative ones. Individual participants do not have the incentives (or even the ability) to fix the negative externalities for all other market participants. For example, no employer has the incentive to publish their own evaluation of the workers who work for them, as this is a signal acquired at a significant cost to the employer. This is a case where the market owner can provide the appropriate incentives and designs for the necessary transparency.

Setting Enforceable Standards

Task standardization will probably require buy-in from online labor markets and intermediaries. Setting cross-platform standards is likely to be a contentious process, as the introduction of standards gives different incentives to different firms, depending upon their business model and market share. However, at least within a particular platform and ignoring its competitors, there is a powerful incentive to create standards, as they raise the value of paid crowdsourcing and promote efficiency. For example, the market for SMS messages took off in the US only when the big carriers agreed on a common interoperable standard for sending and receiving SMS messages across carriers’ networks.

Standardizing units of measure facilitates transactions and gives us the flexibility to create more complex units on top. Can we achieve the same standardization for labor?

In traditional markets, market-wide agreement about basic units of measure facilitates trade. In commodity markets, agreements about quality standards serve a similar role, in that buyers know what they are getting and sellers know what they are supposed to provide. (For example, electricity producers are required to produce electricity adhering to some minimum standards before being able to connect to the grid and sell to other parties.) It should be clear that having public standards makes quality assurance easier for the platform: enforcing standards on standardized units of work is much easier than enforcing quality standards across a wide variety of ad hoc tasks. With such standards, it is easier to imagine platform owners more willingly taking on the role of testing for and enforcing quality standards for the participants who provide labor.

If we define weights and measures more broadly to include verification of claims, the platform role becomes even wider. They can verify credentials, test scores, work and payment histories, reputation scores and every other piece of information that individuals cannot credibly report themselves. Individuals are also not able to credibly report the quality of their work, but at least with an objective standard, validating those claims is possible. (For example, one of the main innovations made by oDesk was that they logged a worker’s time spent on a task, enabling truthful hourly billing.)

Conclusion

As our knowledge increases and platforms and practices mature, more work will be outsourced to remote workers. On the whole, we think this is a positive development, particularly because paid crowdsourcing gives people in poor countries access to buyers in rich countries, enabling a kind of virtual migration.

At the same time, access to an on-demand, inexpensive labor force, more often than not, enables the creation of products and services that were not possible before: once you solve a problem that was deemed too costly to solve before, people start looking for the next thing to fix. This in turn generates more positions, more demand, and so on. It is a virtuous cycle, not Armageddon.

A Computer Scientist in a Business School

Sunday, February 19, 2012
