## Thursday, October 17, 2013

### Badges and the Lake Wobegon effect

For those not familiar with the term, the Lake Wobegon effect is the case when all or nearly all of a group claim to be above average. The name comes from the fictional town where "all the women are strong, all the men are good looking, and all the children are above average."

Interestingly enough, as Wikipedia states, this effect of the majority of the group thinking that they are performing above-average "has been observed among drivers, CEOs, hedge fund managers, presidents, coaches, radio show hosts, late night comedians, stock market analysts, college students, parents, and state education officials, among others."

So, a natural question was whether this effect also appears in an online labor setting. We took some data from an online certification company, similar to Smarterer, where people take tests to show how well they know a particular skill (e.g., Excel, Audio Editing, etc.). The tests are not pass/fail but more like a GRE/SAT score: there is no "passing" score, only a percentile indicator that shows what percentage of other participants have a lower score.

Interestingly enough, we noticed a Lake Wobegon effect there as well: most of the workers who displayed the badge of achievement have scores above average, giving yet another data point for the Lake Wobegon effect.

Of course, this does not mean that all users that took the test performed above average. Test takers have the choice to make their final score public to the world, or keep it private. Given that the user's profile is also used in a site where employers look for potential hires, there is some form of strategic choice in whether the test score is visible or not. Having a low score is often worse than having no score at all.

So, we wanted to see what scores make users comfortable with their performance and incentivize them to display their badge of achievement. Marios analyzed the data and compared the distribution of scores for workers who decided to keep their score private with that of workers who made their performance public. Here is the outcome:

It becomes clear that scores below 50% are not posted often, while scores that exceed 60% have significantly higher odds of being posted online for the world to see. This becomes more clear if we take the log-odds of a worker deciding to make the score public, given the achieved percentile:
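For readers less familiar with log-odds, here is a minimal sketch of the computation for one percentile bucket. The function name `log_odds` and the bucket counts are hypothetical and chosen only for illustration; they are not the data from the analysis above.

```python
import math

def log_odds(num_public: int, num_private: int) -> float:
    """Log-odds that a score in one percentile bucket is made public.

    Computed from the number of workers in the bucket who published their
    score and the number who kept it private.
    """
    if num_public <= 0 or num_private <= 0:
        raise ValueError("both counts must be positive")
    return math.log(num_public / num_private)

# Hypothetical (public, private) counts per percentile bucket.
# These are made-up numbers, NOT the data from the study.
buckets = {
    "0-20": (10, 90),
    "40-60": (50, 50),
    "80-100": (90, 10),
}

for bucket, (public, private) in buckets.items():
    print(bucket, round(log_odds(public, private), 2))
```

A log-odds of zero means that a worker in that bucket is equally likely to publish or hide the score; positive values mean publishing is more likely than hiding.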

So, in the world of online labor, if you ever hire someone who chose to display a certification, you know that there is a good chance that you picked a worker who is better than average, at least in the test. (We have some other results on the predictive power of tests in terms of work performance, but this is a topic that cannot fit into the margins of this blog post :-)

Needless to say, this effect illustrates a direction that will take crowdsourcing, and labor markets in general, out of the race-to-the-bottom, market-for-lemons-style pricing, where only price can separate the various workers. Just as educational history serves as a signal of an employee's potential quality in an offline setting, we are going to see more and more globally recognized certifications replacing educational history for many online workers.

## Wednesday, September 11, 2013

### CrowdScale workshop at HCOMP 2013

A public service announcement to advertise CrowdScale (http://www.crowdscale.org/), a cool workshop at HCOMP 2013 that focuses on the challenges that people face when applying crowdsourcing at scale.

A couple of interesting twists on the classic workshop recipe:
• The workshop invites submission of short (2-page) position papers which identify and motivate key problems or potential approaches for crowdsourcing at scale, even if there aren’t satisfactory solutions proposed. (Deadline: October 4)
• Second, there is a shared task challenge, which also carries a cool $1500 reward for the winner.

The CfP follows:

Crowdsourcing at a large scale raises a variety of open challenges:
• How do we programmatically measure, incentivize and improve the quality of work across thousands of workers answering millions of questions daily?
• As the volume, diversity and complexity of crowdsourcing tasks increase, how do we scale the hiring, training and evaluation of workers?
• How do we design effective elastic marketplaces for more skilled work?
• How do we adapt models for long-term, sustained contributions rather than ephemeral participation of workers?

We believe tackling such problems will be key to taking crowdsourcing to the next level – from its uptake by early adopters today, to its future as how the world’s work gets done. To advance the research and practice in crowdsourcing at scale, our workshop invites position papers tackling such issues of scale. In addition, we are organizing a shared task challenge regarding how to best aggregate crowd labels on large crowdsourcing datasets released by Google and CrowdFlower.

Twitter: #crowdscale @CrowdAtScale

Organizers

## Wednesday, July 31, 2013

### Online labor markets: Why they can't scale and the crowdsourcing solution.

I am a big proponent of outsourcing work using online labor markets. Over the last decade I outsourced hundreds of projects, ranging from simple data entry to big, complex software products. I learned to create project specs, learned how to manage contractors, and learned how to keep projects moving forward. In general, I consider myself competent in managing distributed teams and projects.

I also met and talked with many people that share my passion for this style of work. We discuss strategies for hiring, for managing the short- and long-term projects, for pricing, for handling legal risks, and other topics of interest.

After many such discussions, I reached a striking conclusion: Everyone has a completely different style of managing this process. This plurality of "best practices" is a bad thing. Having too many best practices means that there are no best practices. The lack of consensus makes it impossible to effectively teach a newcomer how to handle the process.

The problem with manual hiring in online labor markets

People who want to use contractors for their projects face the following problems:
• Few people know what they want: Just for fun, go and check random projects on oDesk, eLance, and Freelancer. You will find an endless list of poorly described projects, requests for a "clone of Facebook" for $500, and a lot of related crap. It is no surprise that many of these projects remain open forever.
• Few people know how to hire: Ask any startup CEO how easy it is to hire an employee. It is a pain. The art and craft of inferring the match of an individual to a given task is a very hard problem. Few people know how to do it right. Even within Google and Microsoft, with their legendary interviewing processes, interviewing is seen by many as a hard, time-consuming, and unrewarding experience.
• Few people know how to manage a project: Even fewer people know how to manage a project. The harrowing fact is that most people believe that they can. Most people hire someone, hoping that the employee will be in their head, will understand what these vague specifications mean, will know everything that is not documented in a project, and will be able to do a great job. Very few people realize that outsourcing a project means that you will need to spend a significant amount of time managing the project.
The result of the combination of these factors? Online labor does not scale through manual hiring. (Of course, this is not unique to online outsourcing. Offline hiring has the same problem.) There are simply not enough qualified employers who can hire effectively and create enough demand for jobs for the online labor markets to continue to grow.

Online hiring vs online shopping

The counter-argument is that labor was always like that. Since the market for labor operates "manually," the transition to electronic hiring will allow for growth. In the same way that people were initially afraid of shopping online but eventually started buying things online, they are going to switch to hiring online.

I do not buy this argument. When people buy an item online, they buy a standardized product. They are not ordering a bespoke item, which is created according to the customer specifications. Customization is typically limited and allowed on a specific set of dimensions. You can customize your Mac to have a better processor, more memory, and a larger hard disk. But you cannot order a laptop with a 19-inch screen, and you cannot ask for 96 GB of memory.

But in online labor markets this is what happens. The random customer comes and asks for a web application ("just the functionality of the X website"), and wants this app to be built for $500. It is the same as if someone goes to a computer store and asks for a laptop with a 19-inch screen, 128 GB of memory, and a 10 TB disk. And, since 1 GB of memory costs 7 dollars, it is reasonable to just pay $1000 for 128 GB, right?

Lessons from online shopping

Based on the experience from the transition of shopping from offline to online, let's see how online labor can move forward.
1. Standardize and productize: Currently, in online markets, most people ask for a specific set of tasks: content generation, website authoring, transcriptions, translations, etc. Many of these can be "productized" and be offered as standardized packages, perhaps with a few pre-set customizations available. (Instead of "select the hard disk size", you have "select blog post length".) This vertical-oriented strategy is followed by many crowdsourcing companies and offers to the client a clean separation from the process of hiring and managing a task. This vertical strategy works well to create small offerings, but it is not clear if there is sufficient demand within each vertical to fuel the growth expected for a startup. This is a topic for a new blog post.
2. Productize the project creation/management: When a standardized offering is not sufficient, the client is directed into hiring a product manager that will spec the requirements, examine if there is sufficient supply of skills in the market, hire individual contractors, manage the overall process, etc. This is similar to renovating a house. The delivered product is often completely customized, but the client does not seek to separately hire electricians, carpenters, painters, etc. Instead, the owner hires a "general contractor" who creates the master plan for the renovation, procures the materials, hires subcontractors, etc. While it eases some of the problems, this is a process suitable only for reasonably big projects.
3. Become a staffing agency: A problem with all existing marketplaces is that they are not acting as employers, but only as matching agents. Few, if any, marketplaces are guaranteeing quality. Every transaction is a transaction between "consenting adults." Unfortunately, very few potential employers understand that, and hire with the implicit assumption that the marketplace is placing a guarantee on the quality of the contractors. So, if the contractor ends up being unqualified for the task, there is very little recourse. When the marketplace guarantees quality, the employer (who is the one spending the money) gets some minimum level of assurance about the deliverable. Unfortunately, providing such quality guarantees is easier said than done.
4. Let contractors build offerings: By observing the emergence of marketplaces like Etsy, you can see that people are becoming more comfortable with ordering semi-bespoke, handcrafted items online, for which they have little information. A potential route is to allow the contractors in online markets to build such "labor products" and price them themselves, in the same way that Etsy sellers are putting up their handcrafted stuff online.
All these approaches are fine, and I expect most current marketplaces to adopt one or more of these strategies over time. However, all of them rely on the same assumption: that hiring, like shopping, will be a human activity.

What happens, though, if we stop assuming that hiring is a human-mediated effort?

Crowdsourcing practices to the rescue

I will not pretend that the current state of the crowdsourcing industry offers concrete solutions to the problems listed above. But today's efforts in crowdsourcing move us towards an algorithmically-mediated work environment.

Of course, like all automatic solutions, the initial environment is much worse than "traditional" approaches. We see that in all the growing pains of Mechanical Turk. It is often easier to just hire a couple of trusted virtual assistants from oDesk to do the job, instead of trying to implement the full solution stack to get things done properly on MTurk.

However, the initial learning curve starts paying off later. Production environments that rely on a "crowd" need to automate the hiring and management of workers as much as possible. This automation makes the tasks much more scalable than traditional hiring and project management: the startup costs are high, but the marginal cost of adding workers to a process is lower.

This leads to easier scalability. Of course, the moment the benefits of easier scalability start becoming obvious, it will be too late for players that rely on manual hiring to catch up. It is one of the reasons that I believe that Mechanical Turk has the potential to be the major labor platform, even if this seems a laughable proposition at this point.

I will make a prediction: Crowdsourcing is currently at the forefront of defining the methods and practices in the workplace for the next few decades. Assembly lines and integration of machines in the work environment led to the mass production revolution of the 20th century. The current crowdsourcing practices will define how the majority of people are going to work on knowledge tasks in the future. A computer process will monitor and manage the working process, and hiring manually will soon be a thing of the past for many "basic" knowledge tasks.

Some will find this prospect frightening. I do not find it any more frightening than having traffic lights regulate traffic at intersections, or having the auto-pilot take care of my flight.

## Sunday, July 28, 2013

### Crowdsourcing and information theory: The disconnect

In crowdsourcing, redundancy is a common approach to ensure quality. One of the questions that arises in this setting is the question of equivalence. Let's assume that a worker has a known probability $q$ of giving a correct answer, when presented with a choice of $n$ possible answers. If we want to simulate one high-quality worker of quality $q$, how many workers of quality $q' < q$ do we need?

Information Theory and the Noisy Channel Theorem

Information theory, and the noisy channel theorem, can give an answer to the problem: Treat each worker as a noisy channel, and measure the "capacity" of each user. Then, the sum of the capacities of the different workers should give us the equivalent capacity of a high-quality worker.

We have that the capacity $C(q,n)$ of a worker with quality $q$, who returns the correct answer with probability $q$, when presented with $n$ choices, is:

$C(q,n) = H(\frac{1}{n}, n) - H(q, n)$

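where $H(q,n) = -q \cdot \log(q) - (1-q) \cdot \log(\frac{1-q}{n-1})$ is the entropy (aka uncertainty) of the worker. Logarithms are taken base 2, so entropy and capacity are measured in bits. (The formula assumes that a wrong answer is equally likely to be any of the $n-1$ incorrect choices.)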

Examples

The value $H(\frac{1}{n}, n) = \log(n)$ is the initial entropy that we have for the question, when no answers are given. Intuitively, when $n=2$, the initial uncertainty is equal to $\log(2)=1$ bit, since we need one bit to describe the correct answer out of the 2 available. When $n=4$, the uncertainty is equal to $\log(4)=2$ bits, as we need 2 bits to describe the correct answer out of the 4 available.

Therefore, a perfect worker, with quality $q=1$ will have $H(1,n)=0$ entropy, and therefore the capacity of a perfect worker is $\log(n)$.
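The definitions above translate directly into a few lines of code. This is a minimal sketch: the function names `entropy` and `capacity` are my own labels for $H(q,n)$ and $C(q,n)$, and the code assumes base-2 logarithms and errors spread uniformly over the $n-1$ wrong answers, as in the formula.

```python
import math

def entropy(q: float, n: int) -> float:
    """H(q, n) in bits for a worker who is correct with probability q.

    Assumes a wrong answer is equally likely to be any of the n-1
    incorrect choices. Terms with zero probability contribute nothing.
    """
    h = 0.0
    if q > 0:
        h -= q * math.log2(q)
    if q < 1:
        h -= (1 - q) * math.log2((1 - q) / (n - 1))
    return h

def capacity(q: float, n: int) -> float:
    """C(q, n) = H(1/n, n) - H(q, n), in bits per answer."""
    return math.log2(n) - entropy(q, n)

# A perfect worker carries log2(n) bits per answer;
# a worker who guesses uniformly at random carries none.
for n in (2, 4):
    print(n, capacity(1.0, n), capacity(1.0 / n, n))
```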

Can Two Imperfect Workers Simulate a Perfect One?

Now, here comes the puzzle. Assume that we have $n=4$, and the workers have to choose among 4 possible answers. We also have two workers with $q=0.85$, who select the correct answer out of the 4 available with 85% probability. Each of these workers has capacity equal to $C(0.85, 4) = 1.15$ bits. At the same time, we have one perfect worker with $q=1$. This worker has a capacity of $C(1,4)=2$ bits. So, in principle, the two noisy workers are sufficient to simulate a perfect worker (and would leave a remaining 0.3 bits to use :-)

What am I missing?

My problem is that I do not get how to reach this theoretical limit. I cannot figure out how to use these two workers with $q=0.85$ in order to reconstruct the correct answer. Asking two workers to work in parallel will not cut it (it is still possible for both workers to agree and be incorrect). Sequential processing (first get one worker to select two out of the four answers, then have the second one pick the correct one out of the two) seems more powerful, but again I do not understand how to operationalize this.
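To put numbers on the parallel case, here is a minimal sketch that enumerates what can happen when two workers answer the same question. It assumes the two workers err independently and that a wrong answer is equally likely to be any of the $n-1$ incorrect choices; the function name `two_worker_outcomes` is hypothetical.

```python
def two_worker_outcomes(q: float, n: int) -> dict:
    """Outcome probabilities for two independent workers of quality q.

    Assumes errors are spread uniformly over the n-1 wrong answers.
    """
    wrong_each = (1 - q) / (n - 1)           # prob. of one specific wrong answer
    agree_correct = q * q                     # both pick the correct answer
    agree_wrong = (n - 1) * wrong_each ** 2   # both pick the same wrong answer
    disagree = 1.0 - agree_correct - agree_wrong
    return {
        "agree_correct": agree_correct,
        "agree_wrong": agree_wrong,
        "disagree": disagree,
    }

outcomes = two_worker_outcomes(q=0.85, n=4)
for name, prob in outcomes.items():
    print(f"{name}: {prob:.4f}")
```

Under these assumptions, two 85% workers agree on the same wrong answer in only about 0.75% of the questions, but they disagree in about 27% of them, and in those cases two answers alone offer no obvious way to break the tie.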

According to information theory, these two $q=0.85$ workers are equivalent, on average, to one perfect $q=1.0$ worker. (Actually, they seem to carry more information.) And even if we avoid perfection and set the target quality at $q=0.99$, we have $C(0.99,4)=1.9$. I still cannot see how I can combine two workers with 85% accuracy to simulate a 99% accurate worker.

• Update 1 (thanks to Michael Nielsen): Information theory operates over a large amount of transmitted information, so posing the question as "answering a single question" makes it sound more impossible than it should.

We need 2 bits of information to transfer the answer for a multiple choice question with $n=4$ choices. Say that we have a total of N such questions. So, we need 2N bits to perfectly transfer all the answers. If we have perfect workers, with $q=1$, we have that $C(1,4)=2$, and we need 2N bits / 2 bits/answer = N answers from these workers.

Now, let's say that we have workers with $q'=0.85$. In that case $C(0.85, 4) = 1.15$ bits per answer. Therefore, we need 2N bits / 1.15 bits/answer = 1.74N answers from these  85% accurate workers in order to perfectly reconstruct the answers for these N questions.

So, if we get from these 85% workers a total of 100 answers (each one 85% correct), we should be able to reconstruct the 100% correct answer for ~57 (=100/1.74) questions.
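The arithmetic in this update can be reproduced with a short sketch. The helper names `answers_per_question` and `recoverable_questions` are hypothetical. Note that this is only the theoretical bound under the capacity formula above; it says nothing about how to design the questions that would achieve it.

```python
import math

def capacity(q: float, n: int) -> float:
    """C(q, n) in bits per answer, using the entropy formula defined earlier."""
    entropy = 0.0
    if q > 0:
        entropy -= q * math.log2(q)
    if q < 1:
        entropy -= (1 - q) * math.log2((1 - q) / (n - 1))
    return math.log2(n) - entropy

def answers_per_question(q: float, n: int) -> float:
    """Noisy answers needed, in theory, per perfectly recovered question."""
    return math.log2(n) / capacity(q, n)

def recoverable_questions(num_answers: int, q: float, n: int) -> int:
    """Theoretical number of questions recoverable from num_answers answers."""
    return math.floor(num_answers / answers_per_question(q, n))

print(round(answers_per_question(0.85, 4), 2))  # answers per question
print(recoverable_questions(100, 0.85, 4))      # questions from 100 answers
```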

Of course, we should be intelligent about what exactly to ask in order to get these 100 answers.

I see in Wikipedia, in the article about the noisy channel theorem, that "Simple schemes such as 'send the message 3 times and use a best 2 out of 3 voting scheme if the copies differ' are inefficient error-correction methods" and that "Advanced techniques such as Reed–Solomon codes and, more recently, turbo codes come much closer to reaching the theoretical Shannon limit". Unfortunately, my familiarity with such coding schemes is minimal (i.e., I have no clue), so I cannot understand their applicability in a crowdsourcing setting.
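For reference, the "send the message 3 times" scheme from the quote can be sketched in a few lines. This is an illustration under simplifying assumptions of my own: the message is a list of bits and each transmitted bit is flipped independently with probability $p$ (a binary symmetric channel). The function names are hypothetical.

```python
def encode(bits: list) -> list:
    """Repeat every bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(received: list) -> list:
    """Decode each group of three copies by best-2-out-of-3 voting."""
    return [
        1 if sum(received[i:i + 3]) >= 2 else 0
        for i in range(0, len(received), 3)
    ]

def majority_of_three_error(p: float) -> float:
    """Probability that voting decodes a bit wrongly.

    Assumes each copy is flipped independently with probability p,
    so decoding fails when at least two of the three copies flip.
    """
    return 3 * p ** 2 * (1 - p) + p ** 3

message = [1, 0, 1, 1]
sent = encode(message)
sent[0] ^= 1                      # corrupt one copy of the first bit
print(decode(sent) == message)    # a single flip per group is corrected
print(majority_of_three_error(0.15))
```

The scheme pays for this with three transmissions per bit, and the residual error rate is still far from zero, which is the sense in which the quote calls it inefficient.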

So, here is my question: What coding schemes should we use in crowdsourcing in order to get closer to the theoretical limits given by Shannon? Or what is the fundamental thing that I miss? Because I do feel that I am missing something...

Any help would be appreciated.

• Update 2 (thanks to the comments by stucchio and syrnik): Information theory predicts that we can always recover the perfect answers from noisy workers, given sufficient worker capacity. For anyone that has worked in crowdsourcing, this sounds very strange, and seems practically infeasible. The problem does not seem to be in the assumptions of the analysis; instead it seems to lie in the feasibility of implementing a proper encoding scheme on top of human computation.

The key concept in information theory is the coding scheme that is used to encode the information, to make the transmission of information robust to errors. Information theory does not say how we can recover this perfect information using a noisy channel. Over time, researchers came up with appropriate encoding schemes that approach very closely the theoretical maximum (see above, Reed-Solomon codes, turbo codes, etc). However, it is not clear whether these schemes are translatable into a human computation setting.

Consider this gross simplification (which, I think, is good enough to illustrate the concept): In telecommunications, we put a "checksum" together with each message, to capture cases of incorrect information transmission. When the message gets transmitted erroneously, the checksum does not match the message content. This may be the result of corruption in the message content, or the result of corruption in the checksum (or both). In such cases, we re-transmit the message. Based on the noise characteristics of the channel, we can decide how long the message should be, how long the checksum should be, etc., to achieve maximum communication efficiency.

For example, consider using a parity bit, the simplest possible checksum computation. We count the number of 1 bits in the message: if the number of 1's is odd, we set the parity bit to 1, if the number of 1's is even, we set the parity bit to 0. The extra parity bit increases the size of the message but can be used to detect errors when the message gets transmitted over a noisy channel, and reduce the error rate. By increasing the number of parity bits we can reduce the error rate to arbitrarily low levels.
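The parity-bit rule described above can be sketched as follows, assuming the message is represented as a list of 0/1 integers. The function names are hypothetical.

```python
def parity_bit(message: list) -> int:
    """Return 1 if the number of 1 bits is odd, 0 if it is even."""
    return sum(message) % 2

def add_parity(message: list) -> list:
    """Append the parity bit, so the total number of 1 bits becomes even."""
    return message + [parity_bit(message)]

def passes_check(received: list) -> bool:
    """True if the received bits (message plus parity bit) look consistent."""
    return sum(received) % 2 == 0

sent = add_parity([1, 0, 1, 1])   # three 1 bits, so the parity bit is 1
print(sent, passes_check(sent))

corrupted = sent.copy()
corrupted[2] ^= 1                 # flip a single bit in transit
print(corrupted, passes_check(corrupted))
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, and it cannot say which bit was corrupted; that is why, as described above, the message is re-transmitted when the check fails.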

In a human computation setting, computing such a checksum is highly non-trivial. Since we do not really know the original message, we cannot compute at the source an error-free checksum. We can of course try to create "meta"-questions that will try to compute the "checksum" or even try to modify all the original questions to have an error-correcting component in them.

See now the key difference: In information theory, the message to be transmitted is computed error-free, with built-in error correction. Consider now the same implementation in a human computation setting: We ask the user to inspect the previous k questions, and report some statistics about the previously submitted answers. The user now operates on the noisy message (i.e., the given, noisy answers); therefore, even an error-free computation of the checksum is going to produce a noisy result, defeating the purpose of an error-correcting code.

Alternatively, we can try to take the original questions, and try to ask them in a way that enforces some error-correcting capability. However, it is not clear that these "meta-questions" are going to have the same level of complexity for the humans, even if in the information-theoretic sense, they carry the same amount of information.

It seems that in a human computation setting we have noisy computation, as opposed to noisy transmission. Since the computation is noisy, there is a good chance that the computation of these "checksums" is going to be correlated with the original errors. Therefore, it is not clear whether we can actually implement the proper encoding schemes on top of human computers, to achieve the theoretical maximums predicted by information theory.

Or, at least, this seems like a very interesting, and challenging research problem.