Wednesday, July 31, 2013

Online labor markets: Why they can't scale and the crowdsourcing solution.

I am a big proponent of outsourcing work using online labor markets. Over the last decade, I have outsourced hundreds of projects, ranging from simple data entry to big, complex software products. I learned how to create project specs, how to manage contractors, and how to keep projects moving forward. In general, I consider myself competent in managing distributed teams and projects.

I have also met and talked with many people who share my passion for this style of work. We discuss strategies for hiring, for managing short- and long-term projects, for pricing, for handling legal risks, and other topics of interest. After many such discussions, I reached a striking conclusion: everyone has a completely different style of managing this process.

This plurality of "best practices" is a bad thing. Having too many best practices means that there are no best practices. The lack of consensus makes it impossible to effectively teach a newcomer how to handle the process.

The problem with manual hiring in online labor markets

People that want to use contractors for their projects face the following problems:
  • Few people know what they want: Just for fun, go and check random projects on oDesk, eLance, and Freelancer. You will find an endless list of poorly described projects, requests for a "clone of Facebook" for $500, and a lot of related crap. It is no surprise that many of these projects remain open forever.
  • Few people know how to hire: Ask any startup CEO how easy it is to hire an employee. It is a pain. The art and craft of inferring the match between an individual and a given task is a very hard problem, and few people know how to do it right. Even within Google and Microsoft, with their legendary interviewing processes, interviewing is seen by many as a hard, time-consuming, and unrewarding experience.
  • Few people know how to manage a project: Even fewer people know how to manage a project. The harrowing fact is that most people believe they can. Most people hire someone hoping that the employee will get inside their head, will understand what the vague specifications mean, will know everything that is not documented in the project, and will do a great job. Very few people realize that outsourcing a project means spending a significant amount of time managing it.
The result of the combination of these factors? Online labor does not scale through manual hiring. (Of course, this is not unique to online outsourcing; offline hiring has the same problem.) There are simply not enough employers who can hire effectively to generate the demand for jobs that online labor markets need in order to continue growing.

Online hiring vs online shopping

The counter-argument is that labor markets were always like that. Since the market for labor operates "manually" today, the transition to electronic hiring leaves plenty of room for growth: in the same way that people who were initially afraid of shopping online eventually started buying things online, they are going to switch to hiring online.

I do not buy this argument. When people buy an item online, they buy a standardized product. They are not ordering a bespoke item, created according to the customer's specifications. Customization is typically limited and allowed only along a specific set of dimensions. You can customize your Mac to have a better processor, more memory, and a larger hard disk. But you cannot order a laptop with a 19-inch screen, and you cannot ask for 96GB of memory.

But in online labor markets, this is exactly what happens. A random customer comes and asks for a web application ("just the functionality of the X website"), and wants this app to be built for $500. It is as if someone went to a computer store and asked for a laptop with a 19-inch screen, 128GB of memory, and a 10TB disk. And, since 1GB of memory costs 7 dollars, it is reasonable to pay just $1,000 for 128GB, right?

Lessons from online shopping

Based on the experience of the transition of shopping from offline to online, let's see how online labor can move forward.
  1. Standardize and productize: Currently, in online markets, most people ask for a specific set of tasks: content generation, website authoring, transcriptions, translations, etc. Many of these can be "productized" and offered as standardized packages, perhaps with a few pre-set customizations available. (Instead of "select the hard disk size," you have "select the blog post length.") This vertical-oriented strategy is followed by many crowdsourcing companies and offers the client a clean separation from the process of hiring and managing a task. It works well for creating small offerings, but it is not clear whether there is sufficient demand within each vertical to fuel the growth expected of a startup. This is a topic for a new blog post.
  2. Productize the project creation/management: When a standardized offering is not sufficient, the client is directed to hire a product manager who will spec out the requirements, examine whether there is a sufficient supply of skills in the market, hire individual contractors, manage the overall process, etc. This is similar to renovating a house: the delivered product is often completely customized, but the client does not separately hire electricians, carpenters, painters, etc. Instead, the owner hires a "general contractor" who creates the master plan for the renovation, procures the materials, hires subcontractors, and so on. While it eases some of the problems, this process is suitable only for reasonably big projects.
  3. Become a staffing agency: A problem with all existing marketplaces is that they act not as employers but only as matching agents. Few, if any, marketplaces guarantee quality; every transaction is a transaction between "consenting adults." Unfortunately, very few potential employers understand that, and they hire with the implicit assumption that the marketplace places a guarantee on the quality of the contractors. So, if the contractor ends up being unqualified for the task, there is very little recourse. When the marketplace guarantees quality, the employer (who is the one spending the money) gets some minimum level of assurance about the deliverable. Unfortunately, providing such quality guarantees is easier said than done.
  4. Let contractors build offerings: Observing the emergence of marketplaces like Etsy, you can see that people are becoming more comfortable ordering semi-bespoke, handcrafted items online, about which they have little information. A potential route is to allow the contractors in online labor markets to build such "labor products" and price them themselves, in the same way that Etsy sellers put up their handcrafted goods online.
All these approaches are fine, and I expect most current marketplaces to adopt one or more of these strategies over time. However, all of them rely on the same assumption: that hiring, like shopping, will remain a human activity.

What happens, though, if we stop assuming that hiring is a human-mediated effort?

Crowdsourcing practices to the rescue

I will not pretend that the current state of the crowdsourcing industry offers concrete solutions to the problems listed above. But today's efforts in crowdsourcing are moving us towards an algorithmically mediated work environment.

Of course, as with all automated solutions, the initial experience is much worse than with "traditional" approaches. We see that in all the growing pains of Mechanical Turk. It is often easier to just hire a couple of trusted virtual assistants from oDesk to do the job, instead of trying to implement the full solution stack needed to get things done properly on MTurk.

However, the initial learning curve starts paying off later. Production environments that rely on a "crowd" need to automate the hiring and management of workers as much as possible. This automation makes the tasks much more scalable than traditional hiring and project management: high startup costs, then lower marginal costs for adding workers to a process.

This leads to easier scalability. Of course, the moment the benefits of easier scalability become obvious, it will be too late for players that rely on manual hiring to catch up. This is one of the reasons I believe that Mechanical Turk has the potential to become the major labor platform, even if this seems a laughable proposition at this point.

I will make a prediction: crowdsourcing is currently at the forefront of defining the methods and practices of the workplace for the next few decades. Assembly lines and the integration of machines into the work environment led to the mass-production revolution of the 20th century. Today's crowdsourcing practices will define how the majority of people work on knowledge tasks in the future. A computer process will monitor and manage the work, and manual hiring will soon be a thing of the past for many "basic" knowledge tasks.

Some will find this prospect frightening. I do not find it any more frightening than having traffic lights regulate traffic at intersections, or having the autopilot take care of my flight.

Sunday, July 28, 2013

Crowdsourcing and information theory: The disconnect

In crowdsourcing, redundancy is a common approach to ensuring quality. One of the questions that arises in this setting is the question of equivalence. Let's assume that a worker has a known probability $q$ of giving the correct answer when presented with a choice of $n$ possible answers. If we want to simulate one high-quality worker of quality $q$, how many workers of quality $q' < q$ do we need?

Information Theory and the Noisy Channel Theorem

Information theory, and the noisy channel theorem in particular, can give an answer to the problem: treat each worker as a noisy channel, and measure the "capacity" of each worker. Then, the sum of the capacities of the different workers should give us the equivalent capacity of a single high-quality worker.

We have that the capacity $C(q,n)$ of a worker with quality $q$, who returns the correct answer with probability $q$, when presented with $n$ choices, is:

$C(q,n) = H(\frac{1}{n}, n) - H(q, n)$

where $H(q,n) = -q \cdot \log(q) - (1-q) \cdot \log(\frac{1-q}{n-1})$ is the entropy (aka uncertainty) of the worker, under the assumption that errors are spread uniformly over the $n-1$ incorrect answers. All logarithms are base 2, so the quantities are measured in bits.
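For concreteness, here is a minimal Python sketch of these formulas (the function names are my own; logarithms are base 2, so everything is in bits):

```python
import math

def entropy(q, n):
    """Entropy H(q, n) of a worker who answers an n-choice question
    correctly with probability q, with errors spread uniformly over
    the n - 1 incorrect choices. Logarithms are base 2 (bits)."""
    if q == 1.0:
        return 0.0  # a perfect worker has zero uncertainty
    return -q * math.log2(q) - (1 - q) * math.log2((1 - q) / (n - 1))

def capacity(q, n):
    """Capacity C(q, n) = H(1/n, n) - H(q, n) = log2(n) - H(q, n)."""
    return math.log2(n) - entropy(q, n)

print(capacity(1.0, 2))   # 1.0 bit:  perfect worker, 2 choices
print(capacity(1.0, 4))   # 2.0 bits: perfect worker, 4 choices
print(capacity(0.25, 4))  # 0.0 bits: a random guesser carries no information
```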

Examples

The value $H(\frac{1}{n}, n) = \log(n)$ is the initial entropy that we have for the question, when no answers are given. Intuitively, when $n=2$, the initial uncertainty is equal to $\log(2)=1$ bit, since we need one bit to describe the correct answer out of the 2 available. When $n=4$, the uncertainty is equal to $\log(4)=2$ bits, as we need 2 bits to describe the correct answer out of the 4 available.

Therefore, a perfect worker with quality $q=1$ has entropy $H(1,n)=0$, and the capacity of a perfect worker is therefore $\log(n)$.

Can Two Imperfect Workers Simulate a Perfect One?

Now, here comes the puzzle. Assume that we have $n=4$, so the workers have to choose among 4 possible answers. We also have two workers with $q=0.85$, who select the correct answer out of the 4 available with 85% probability. Each of these workers has a capacity of $C(0.85, 4) = 1.15$ bits. At the same time, a perfect worker with $q=1$ has a capacity of $C(1,4)=2$ bits. So, in principle, the two noisy workers are sufficient to simulate a perfect worker (and would leave a remaining 0.3 bits to use :-)
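Reusing the capacity function from the sketch above, we can verify these numbers:

```python
print(capacity(0.85, 4))      # ~1.15 bits per answer
print(capacity(1.00, 4))      # 2.0 bits per answer
print(2 * capacity(0.85, 4))  # ~2.30 bits: the two noisy workers exceed
                              # the 2 bits needed, with ~0.3 bits to spare
```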

What am I missing?

My problem is that I do not see how to reach this theoretical limit. I cannot figure out how to use these two workers with $q=0.85$ to reconstruct the correct answer. Asking the two workers to work in parallel will not cut it (it is still possible for both workers to agree and be incorrect). Sequential processing (have the first worker narrow the four answers down to two, then have the second worker pick the correct answer out of the remaining two) seems more powerful, but again I do not understand how to operationalize it.

According to information theory, these two $q=0.85$ workers are equivalent, on average, to one perfect $q=1.0$ worker. (Actually, they seem to carry more information.) And even if we give up on perfection and set the target quality at $q=0.99$, we have $C(0.99,4)=1.9$ bits. I still cannot see how to combine two workers with 85% accuracy to simulate a 99%-accurate worker.
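To see concretely why the naive parallel scheme fails, here is a small Monte Carlo sketch (my own toy model, with the usual assumption that errors are spread uniformly over the wrong answers):

```python
import random

def two_parallel_workers(q=0.85, n=4, trials=100_000):
    """Two workers answer the same n-choice question in parallel.
    If they agree, take their common answer; if they disagree,
    pick one of their two answers at random. Returns the accuracy."""
    correct = 0
    for _ in range(trials):
        answers = []
        for _ in range(2):
            if random.random() < q:
                answers.append(0)  # label 0 stands for the correct choice
            else:
                answers.append(random.randint(1, n - 1))  # uniform wrong choice
        final = answers[0] if answers[0] == answers[1] else random.choice(answers)
        if final == 0:
            correct += 1
    return correct / trials

print(two_parallel_workers())  # ~0.85: no better than a single q=0.85 worker
```

The agreement-plus-random-tie-breaking scheme lands right back at 85% accuracy: the probability of a correct final answer is $q^2 + q(1-q) = q$, so this form of redundancy buys nothing at all.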

  • Update 1 (thanks to Michael Nielsen): Information theory operates over a large amount of transmitted information, so posing the question as "answering a single question" makes it sound more impossible than it should.

    We need 2 bits of information to transfer the answer for a multiple-choice question with $n=4$ choices. Say that we have a total of N such questions. Then we need 2N bits to transfer all the answers perfectly. If we have perfect workers, with $q=1$, we have $C(1,4)=2$, and we need 2N bits / 2 bits/answer = N answers from these workers.

    Now, let's say that we have workers with $q'=0.85$. In that case, $C(0.85, 4) = 1.15$ bits per answer. Therefore, we need 2N bits / 1.15 bits/answer = 1.74N answers from these 85%-accurate workers in order to perfectly reconstruct the answers for the N questions.

    So, if we get a total of 100 answers from these 85% workers (each answer being correct with probability 85%), we should be able to reconstruct the 100% correct answers for ~57 (=100/1.74) questions.

    Of course, we have to be intelligent about what exactly we ask in order to collect these 100 answers. (A quick numeric check of the arithmetic above is sketched below.)
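A quick back-of-the-envelope check of this arithmetic (a self-contained snippet restating the capacity formula from above):

```python
import math

# C(q, n) = log2(n) - H(q, n), restated from the formulas above
def capacity(q, n):
    return math.log2(n) + q * math.log2(q) + (1 - q) * math.log2((1 - q) / (n - 1))

answers_per_question = 2.0 / capacity(0.85, 4)  # 2 bits needed per 4-choice question
print(answers_per_question)        # ~1.74 answers needed per question
print(100 / answers_per_question)  # ~57 questions recoverable from 100 answers
```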

I see in the Wikipedia article about the noisy channel theorem that "Simple schemes such as 'send the message 3 times and use a best 2 out of 3 voting scheme if the copies differ' are inefficient error-correction methods" and that "Advanced techniques such as Reed–Solomon codes and, more recently, turbo codes come much closer to reaching the theoretical Shannon limit". Unfortunately, my familiarity with such coding schemes is minimal (i.e., I have no clue), so I cannot judge their applicability in a crowdsourcing setting.

So, here is my question: What coding schemes should we use in crowdsourcing in order to get closer to the theoretical limits given by Shannon? Or what is the fundamental thing that I am missing? Because I do feel that I am missing something...

Any help would be appreciated.

  • Update 2 (thanks to the comments by stucchio and syrnik): Information theory predicts that we can always recover the perfect answers from noisy workers, given sufficient worker capacity. To anyone who has worked in crowdsourcing, this sounds very strange and practically infeasible. The problem does not seem to lie in the assumptions of the analysis; instead, it seems to lie in the feasibility of implementing a proper encoding scheme on top of human computation.

    The key concept in information theory is the coding scheme used to encode the information, making the transmission robust to errors. Information theory itself does not say how we can recover the perfect information over a noisy channel. Over time, researchers came up with appropriate encoding schemes that come very close to the theoretical maximum (see above: Reed-Solomon codes, turbo codes, etc.). However, it is not clear whether these schemes are translatable into a human computation setting.

    Consider this gross simplification (which, I think, is good enough to illustrate the concept): In telecommunications, we send a "checksum" together with each message, to capture cases of incorrect transmission. When the message gets transmitted erroneously, the checksum does not match the message content. This may be the result of corruption in the message content, or of corruption in the checksum (or both). In such cases, we re-transmit the message. Based on the noise characteristics of the channel, we can decide how long the message should be, how long the checksum should be, etc., to achieve maximum communication efficiency.

    For example, consider using a parity bit, the simplest possible checksum. We count the number of 1 bits in the message: if the number of 1's is odd, we set the parity bit to 1; if the number of 1's is even, we set the parity bit to 0. The extra parity bit increases the size of the message, but it can be used to detect errors when the message gets transmitted over a noisy channel, and thus reduce the error rate. By increasing the number of parity bits, we can reduce the error rate to arbitrarily low levels. (A toy sketch of the parity-bit idea appears at the end of this post.)

    In a human computation setting, computing such a checksum is highly non-trivial. Since we do not actually know the original message, we cannot compute an error-free checksum at the source. We can, of course, try to create "meta"-questions that compute the "checksum," or even modify all the original questions to have an error-correcting component built into them.

    Now see the key difference: in information theory, the message to be transmitted, together with its built-in error correction, is computed error-free at the source. Consider the same implementation in a human computation setting: we ask a user to inspect the previous k questions and report some statistics about the previously submitted answers. The user now operates on the noisy message (i.e., the given, noisy answers); therefore, even a flawlessly executed checksum computation will produce a noisy checksum, defeating the purpose of an error-correcting code.

    Alternatively, we can take the original questions and try to ask them in a way that enforces some error-correcting capability. However, it is not clear that these "meta-questions" will have the same level of complexity for the humans, even if, in the information-theoretic sense, they carry the same amount of information.

    It seems that in a human computation setting we have noisy computation, as opposed to noisy transmission. Since the computation itself is noisy, there is a good chance that the errors in computing these "checksums" will be correlated with the original errors. Therefore, it is not clear whether we can actually implement the proper encoding schemes on top of human computers to achieve the theoretical maximums predicted by information theory.

    Or, at least, this seems like a very interesting, and challenging research problem.
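Finally, to make the parity-bit discussion above concrete, here is a toy sketch of the classic scheme (my own minimal example, not a production error-correcting code). The crucial point is that the sender computes the parity over a clean message, which is exactly what a human-computation "sender" cannot do:

```python
def parity(bits):
    """Parity bit: 1 if the number of 1s in the message is odd, else 0."""
    return sum(bits) % 2

message = [1, 0, 1, 1, 0, 1, 0, 0]         # the sender knows the true message
transmitted = message + [parity(message)]  # checksum computed on clean data

received = transmitted[:]
received[3] ^= 1                           # the noisy channel flips one bit

body, check = received[:-1], received[-1]
if parity(body) != check:
    print("error detected -> request retransmission")
else:
    print("message accepted")
```

In the crowdsourcing analogue, the "parity question" would itself be answered by a noisy worker looking at noisy answers, which is exactly the correlated-noise problem described above.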