Tuesday, November 15, 2011

Does lack of reputation help the crowdsourcing industry?

Can the lack of a public reputation system on Amazon Mechanical Turk be the reason behind the success of current crowdsourcing companies? I present an analysis that points in this direction. Unfortunately, this "feature" also leads to a stagnating crowdsourcing market with limited potential for growth.

Low salaries and market for lemons

A contentious issue about crowdsourcing, and specifically about Amazon Mechanical Turk, is that wages are very low. It is not uncommon to see effective wages of \$1/hr, or even lower. Why is that?

I have argued in the past that Mechanical Turk is an example of a "market for lemons". Good workers are drowning in the anonymity of the crowd. Since the good workers cannot differentiate themselves from bad workers before working on a task, they are doomed to receive the same level of compensation as the bad workers.

This is not a fault of the employers: when a new employer joins the market, it is almost necessary for the employer to test the incoming workers to ensure the quality of the work. During this testing period, high-quality workers are completing the tasks side-by-side with low-quality workers, and everyone receives a low salary.

The counter-argument that I often hear is: "But the market, in the long run, should see an increase in salaries, as good workers demonstrate their quality to employers". Of course, in the long run we are all dead. But even in the long run, and even after we are all dead, the market does not seem to be on a path of convergence to fair salaries.

Why? Here is the brief summary:

  • High-quality workers are much more valuable than low-quality ones
  • Lack of a shared reputation system depresses salaries, pushing all wages close to the level appropriate for low-quality workers
  • Employers build their own, private reputation systems, learning the quality of the workers
  • With the private quality information, employers can retain good workers by paying higher wages compared to the low-quality workers, but still lower than their "fair" quality-adjusted wage.
  • New employers cannot compete with incumbents since they do not have access to the privately built reputation systems and have to face the cost of learning the quality of the workers, while incumbents enjoy their advantage of already knowing who the good workers are
  • Incumbents can enjoy a strong cost advantage, effectively blocking newcomers from entering the industry
Below I expand on these arguments in a bit more detail.

Quality equivalence of low- and high-quality workers

First, let's examine the differences in payment between high- and low-quality workers. Let's take a very simple setting: Suppose that you have workers performing a task with two possible answers, Yes or No. The low-quality workers are accurate $lq$% of the time. The high-quality workers are accurate $hq$% of the time. How many workers of low quality do we need to emulate one worker of high quality?

Working in the simplest possible case, assume that we have $k$ low-quality workers, and each gives the correct answer with probability $q$. We take the majority vote to be the aggregate answer. What is the probability $P(q,k)$ that the majority will be correct? We have that:

$P(q,k) = \sum_{i = \lceil \frac{k+1}{2} \rceil}^k \binom{k}{i} \cdot q^i \cdot(1-q)^{k-i}$

(Assume, for the sake of simplicity that $k$ is odd. Otherwise, we need to add the term
$\frac{1}{2}\cdot \left( \lceil \frac{k+1}{2} \rceil - \lceil \frac{k}{2} \rceil \right) \cdot \binom{k}{k/2}\cdot q^{k/2}\cdot (1-q)^{k/2}$ in the above equation, to allocate ties appropriately)

Given the above, we can find how many low-quality workers of quality $lq$ we need to emulate a single high-quality worker of quality $hq$: We just need to solve the equation:

$P(lq, k) = P(hq, 1)$

Here are a few indicative pairs: To reach the 95% quality level we need:
  • 3 workers of quality 90%.
  • 7 workers of quality 80%.
  • 9 workers of quality 75%.
  • 15 workers of quality 70%.
  • 67 workers of quality 60%.
  • 269 workers of quality 55%.
If our goal is to reach the 99% quality level, we need:
  • 3 workers of quality 95%
  • 5 workers of quality 90%
  • 13 workers of quality 80%
  • 31 workers of quality 70%
This means that the fair wage of a single worker who is accurate at the 95% quality level should be ~9 times higher than the wage of a worker who is 75% accurate. A worker who is 99% accurate should demand a 13x higher salary than someone who is 80% accurate. Notice that, as the quality of the low-quality workers drops, the difference in fair wages between high- and low-quality workers increases at a very fast rate.

Employers learning the quality of workers

Suppose that we have an employer called PanosLabs that has worked for a long period of time with workers. At this point, PanosLabs has a long track record for many workers, and the quality estimates for each worker are pretty solid.

Now, this knowledge of worker quality allows PanosLabs to pay the good workers higher salaries. Let's assume that PanosLabs decided to be very "generous". For the high-quality 99%-accurate workers, PanosLabs quadruples the salary, compared to the general pool. Similarly, for workers that are 95%-accurate, PanosLabs triples the salary compared to the general pool.

Assuming that the general pool of workers is at the 80% accuracy level, PanosLabs gets the following bargain: It is now possible to cut costs significantly, while maintaining the same quality level.

Initially, PanosLabs was hiring 13 workers per case, paying each \$1/hr; this is an effective cost of \$13/hr for reaching the 99% quality level. Now, PanosLabs can reach the 99% quality level by employing just a single 99%-accurate worker, at a cost of \$4/hr. This is a cost reduction of about 70%!

Great bargain eh? This is the benefit of knowing thy worker...

Increasing the barriers to entry

Now let's assume that a new employer, called RotisLabs arrives at the market. The high-quality workers are now happily employed at PanosLabs, receiving a salary that is 4X the running market salary for their task.

RotisLabs, coming to the crowdsourcing market, is in a pickle. RotisLabs has no way of identifying and attracting the high-quality workers without getting the workers to work for RotisLabs first. Why?
  • There is no history of employment. In the "real world" knowing that an engineer worked at, say, Google gives some signal of quality. In our setting RotisLabs cannot check if a worker has worked for PanosLabs.
  • It is not possible to check how much the workers get paid for other tasks. In the "real world" prices serve as signals: an employee who gets a high salary also signals to other employers that he or she is a high performer. However, RotisLabs cannot check the prices that workers receive.
Now consider the situation of RotisLabs: The competitor, PanosLabs, generates 99%-accurate work at a cost of \$4/hr. What are the options of RotisLabs?
  • First option: RotisLabs can pay \$1/hr. This option attracts the following workers: the low-quality, 80%-accurate workers that did not get raises from PanosLabs, and, if lucky, some new 99%-accurate workers that just arrived in the market. However, this pay rate does not attract the high-quality workers who stick with PanosLabs, severely limiting the pool of good workers accessible to RotisLabs. Notice that, at this pay level, RotisLabs has a cost of \$13/hr to reach the 99%-quality level, while competing with PanosLabs, which has a 70% lower cost of production, i.e., \$4/hr. If RotisLabs has enough cash and patience, it will stick with the market until it learns the quality of the workers. In most cases, though, RotisLabs will just realize that it is not possible to compete.
  • Second option: RotisLabs can pay \$4/hr. This option may attract the 99%-accurate workers that work for PanosLabs. But it will also attract the 80% workers! Our dear friend, RotisLabs, cannot separate the two. Therefore, to ensure the 99%-quality level, RotisLabs still needs to hire 13 workers per case, to account for the cases where many 80% workers work on an example. This increases the overall cost of production to \$52/hr. Oops! PanosLabs can reach the same level of quality at a cost of just \$4/hr.
You can see that knowing the quality of the workers gives a tremendous advantage to the incumbent players that invest in learning it; the back-of-the-envelope comparison below makes the cost gap explicit.
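To make the comparison concrete, here is a tiny sketch (my own, using only the fictional numbers from the example above) of the hourly cost of producing 99%-quality answers under each scenario:

```python
# Back-of-the-envelope cost per task-hour to reach the 99% quality level,
# assuming (as in the example above) that 13 workers at the 80% accuracy
# level are needed to emulate one 99%-accurate worker.
workers_needed_at_80 = 13

scenarios = {
    "PanosLabs: one known 99% worker at $4/hr": 1 * 4.00,
    "RotisLabs, option 1: 13 unknown workers at $1/hr": workers_needed_at_80 * 1.00,
    "RotisLabs, option 2: 13 unknown workers at $4/hr": workers_needed_at_80 * 4.00,
}

for name, cost in scenarios.items():
    print(f"{name}: ${cost:.2f}/hr")
```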

Interestingly enough, due to the depressed salaries that are a direct consequence of the lack of reputation systems, the established employer effectively passes the search costs on to the workers: While learning the quality of the workers, the employer pays salaries corresponding to the lowest expected level of quality. It is up to the workers to carry the burden of low salaries until they prove themselves (again and again, for every single employer...)

Lack of shared reputation system: The foundation of the crowdsourcing industry?

The lack of a (shared) reputation system is a godsend for companies that enjoy a first-mover advantage. They can keep their costs down, while keeping their own workers happy, in a relative sense ("can't you see how much better I am paying you compared to the general pool?").

The anonymity generates the conditions for "market for lemons" salaries, which keep the costs down. At the same time, the smart and established employers can find and reach out to the high-quality workers. By paying these workers "generously", the smart employers can lock the workers into "golden cages": they offer salaries that are higher than those for the general population, but still much, much lower than the fair wages for the quality levels produced.

When even the 3x and 4x (admittedly unrealistic and fictional) salary increases mentioned in the example above are great bargains, you can imagine the margins that crowdsourcing companies can command.

In a very perverse manner, the anonymity imposed by Mechanical Turk is now effectively serving as the foundation of the current crowdsourcing industry. The anonymity keeps worker costs down, allowing most companies to offer solutions that are very cost competitive compared to alternatives. At the same time, this policy is hurting the Amazon MTurk marketplace by effectively generating huge barriers to entry for newcomer employers, and depressing the salaries of newcomer employees. (The Masters qualification is a step in the right direction, but too crude to serve as an effective signalling mechanism.)

The future?

Let's see who will manage to generate the appropriate market for crowdsourcing that will resolve these issues. One thing is clear: the direction towards improving crowdsourcing markets requires salaries to increase significantly. Interestingly enough, this is expected to lower the overall cost of production as well, as the cost of quality control will be significantly lower.

As I said in the crowdsourcing panel at the WWW2011 conference last Spring:
  • It is not about the cost!
  • It is not about the crowd!
  • It is not about simple tasks!
  • Crowdsourcing is best for “parallel, scalable, automatic interviews” and for quickly finding good workers
  • Find the best trained workers, fast, pay them well, and keep them!

Tuesday, October 11, 2011

Collective Intelligence 2012: Deadline November 4, 2011


For all those of you interested in crowdsourcing, I would like to bring your attention to a new conference, named Collective Intelligence 2012, being organized at MIT this spring (April 18-20, 2012) by Tom Malone and Luis von Ahn. The conference is expected to have a set of 15-20 invited speakers (disclaimer: I am one of them), and also accepts papers submitted for publication. The deadline is November 4th, 2011, so if you have something that you would be willing to share with a wide audience interested in collective intelligence, this may be a place to consider.

The call for papers follows:

Overview

Collective intelligence has existed at least as long as humans have, because families, armies, countries, and companies have all--at least sometimes--acted collectively in ways that seem intelligent. But in the last decade or so a new kind of collective intelligence has emerged: groups of people and computers, connected by the Internet, collectively doing intelligent things. For example, Google technology harvests knowledge generated by millions of people creating and linking web pages and then uses this knowledge to answer queries in ways that often seem amazingly intelligent. Or in Wikipedia, thousands of people around the world have collectively created a very large and high quality intellectual product with almost no centralized control, and almost all as volunteers!

These early examples of Internet-enabled collective intelligence are not the end of the story but just the beginning. And in order to understand the possibilities and constraints of these new kinds of intelligence, we need a new interdisciplinary field. Forming such a field is one of the goals of this conference.

We seek papers about behavior that is both collective and intelligent. By collective, we mean groups of individual actors, including, for example, people, computational agents, and organizations. By intelligent, we mean that the collective behavior of the group exhibits characteristics such as, for example, perception, learning, judgment, or problem solving.

Topics of interest include but are not limited to:

  • human computation
  • social computing
  • crowdsourcing
  • wisdom of crowds (e.g., prediction markets)
  • group memory and problem-solving
  • deliberative democracy
  • animal collective behavior
  • organizational design
  • public policy design (e.g., regulatory reform)
  • ethics of collective intelligence (e.g., "digital sweatshops") 
  • computational models of group search and optimization
  • emergence and evolution of intelligence
  • new technologies for making groups smarter

For a more complete description of the scope, please click here. For any questions, please email contact@ci2012.org.

Dates and Location

The conference will be held April 18-20, 2012 on the MIT campus in Cambridge, MA.  Accommodations in nearby hotels will be available for conference attendees.

Format

The conference will consist of:

  • invited talks from prominent researchers in different areas related to collective intelligence
  • oral paper presentations
  • poster sessions


Submission

Papers of three types are invited:

  • Reports of original research results
  • Reviews of previous research in one or more fields relevant to collective intelligence
  • Position papers about research agendas for the field of collective intelligence

Some of the papers submitted will be invited for oral presentation, others for presentation as posters.

Papers may be up to 8 pages in length. The deadline for submission is November 4, 2011. Download the submission format. Papers shall be submitted by email to submissions@ci2012.org.

Important Dates

  • Paper submission deadline: November 4, 2011
  • Notification of paper acceptance / rejection: January 15, 2012
  • Camera-ready papers due: February 15, 2012
  • Conference dates: April 18-20, 2012

Saturday, September 3, 2011

Probabilities and MTurk Executives: A Troubled Story

On the Mechanical Turk blog, there is a blog post that describes the need to build custom qualifications for workers. (Update: the old link was removed after I posted this analysis, and the new version does not contain any of the problematic math analysis.) While the argument is correct, it is backed by some horrendous math analysis. For your viewing pleasure, here is the quote:

This difference in accuracy is magnified if you’re using plurality. Supposed you use plurality of 2 (asking 2 Workers the same question). With Masters, if 2 Workers with average accuracy (99%) agree on an answer there is a 98% probability that it’s the correct answer. With the broader group, if 2 Workers with average accuracy (90%) agree on an answer there is an 81% probability that it’s the correct answer. And if you happen to get 2 68% accurate Workers submitting assignments for the same HIT, the probability the answer is accurate is only 46%!

Dear Sharon:

We do appreciate your efforts on improving MTurk and on giving correct advice.

But this analysis, which attempts to back a correct argument, is absolutely wrong. It is so wrong that it hurts. Just think at a very intuitive level: how is it possible to ask two workers of a certain accuracy, see that they agree, and expect the accuracy of the corroborated answer to be lower? This is simply not possible!

Here is the correct analysis:



Suppose you use plurality of 2 (asking 2 Workers the same question). With Masters, if 2 Workers with average accuracy (99%) agree on an answer, then the probability that this answer is incorrect is

$Pr(\mathit{incorrect}|\mathit{agreement}) = \frac{Pr(\mathit{worker1\ incorrect\ and\ worker2\ incorrect})}{Pr(\mathit{agreement})}$.

Assuming (conditional) independence of the workers:

$Pr(\mathit{incorrect}|\mathit{agreement}) = $

$\frac{Pr(\mathit{worker1\ incorrect}) \cdot Pr(\mathit{worker2\ incorrect})}{Pr(\mathit{agreement})}=\frac{(1-p)^2}{p^2+(1-p)^2}$.

where $p$ is the probability of a worker being correct.

With 99% accuracy, the probability of a worker being correct is $p=0.99$. So:

$Pr(\mathit{incorrect}|\mathit{agreement}) = \frac{0.01 \cdot 0.01}{0.01 \cdot 0.01 + 0.99 \cdot 0.99}$

$\Rightarrow Pr(\mathit{incorrect}|\mathit{agreement}) = 0.000101$.

Since $Pr(\mathit{correct}|\mathit{agreement}) = 1-Pr(\mathit{incorrect}|\mathit{agreement})$, therefore, with Masters, if 2 Workers with average accuracy (99%) agree on an answer, there is a $1-0.000101 \approx 99.99\%$ probability that it’s the correct answer.

With the broader group, if 2 Workers with average accuracy (90%) agree on an answer there is an $1-\frac{ 0.1 \cdot 0.1}{0.1 \cdot 0.1 + 0.9 \cdot 0.9} \approx  98.78\%$ probability that it’s the correct answer. And if you happen to get two 68%-accurate workers submitting assignments for the same HIT (and they both agree), the probability the answer is accurate is only $1-\frac{ 0.32 \cdot 0.32}{0.32 \cdot 0.32 + 0.68 \cdot 0.68} \approx  81.87\%$!
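For completeness, here is a minimal Python sketch (my own, not from the original post) that reproduces the numbers above, and also reports how often two workers agree or disagree in the first place:

```python
def agreement_stats(p):
    """For two independent workers, each correct with probability p, return
    (Pr[correct | they agree], Pr[agree on correct], Pr[agree on incorrect], Pr[disagree])."""
    agree_correct = p * p
    agree_incorrect = (1 - p) * (1 - p)
    agree = agree_correct + agree_incorrect
    return agree_correct / agree, agree_correct, agree_incorrect, 1 - agree

for p in (0.99, 0.90, 0.68):
    posterior, ac, ai, dis = agreement_stats(p)
    print(f"p={p}: Pr(correct|agree)={posterior:.4f}, "
          f"agree on correct={ac:.2f}, agree on incorrect={ai:.2f}, disagree={dis:.2f}")
```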



How did Sharon get confused? The analysis that she presents calculates not the accuracy of the answer when the workers agree, but instead how often the two workers will agree and agree on the correct answer. Indeed, with workers that have 68% accuracy, we will observe agreement on the correct answer only 46% of the time. (And, 10% of the time, they will agree on the incorrect answer.) More importantly, though, roughly 44% of the time the two workers will disagree, and we will need to bring in an extra worker, increasing the cost by 50%.

Why did Sharon get confused? One explanation is that she is a victim of the conjunction fallacy, or that she does not understand conditional probabilities. However, I believe it is not that. I bet that she did not get puzzled by the results because the presented math confirmed another (correct) intuition that she had about the market: redundancy when relying on low-quality workers is not cost-effective.

Consider this: if you have 3 workers of 68% accuracy, the combination of the three (e.g., using majority vote) will result in an average accuracy of only about 76%. In other words, the majority will generate the correct answer only about 3 out of 4 times. To reach 90% accuracy, we need 11 workers with 68% accuracy each. And to reach 99% accuracy, we need 39 workers of 68% accuracy! (I will present the math in a later blog post.)

Even using "moderately high quality" workers, simulating a worker that is 99% accurate tends to be an expensive proposition. We need five workers that are 90% accurate to get 99% accuracy.

So, yes, the high-quality Masters workers are worth their extra price. In fact, they are worth their weight in gold. Paying only 20% more to access a guaranteed pool of high-quality "Masters" workers is a great bargain, given the quality differences with the general worker pool.

Actually, if I were a 99% accurate worker I would feel offended that I do not get at least double or triple the running wage for the common workers. There is a great mispricing of the services provided by high-quality workers, and most requesters today exploit just this fact to keep the wages down, while still managing to get high quality results from the tested, reliable workers.

Sunday, August 28, 2011

The impact of online reviews: An annotated bibliography

A few weeks back, I received some questions about online consumer reviews, their impact on sales, and other related topics. At that point, I realized that while I had a good grasp of the technical literature within Computer Science venues, my grasp of the overall empirical literature within Marketing and Information Systems venues was rather shaky, so I had to do a better job of preparing a literature review.

So, I did whatever a self-respecting professor would do in such a situation: I asked my PhD student, Beibei Li, to compile a list of such papers, write a brief summary of each, and send me the list. She had passed her qualification exam by studying exactly this area, so she was the resident expert in the topic.

Beibei did not disappoint me. A few hours later I had a very good list of papers in my mailbox, together with the descriptions. It was so good that I thought many other people would be interested in the list.

So, without further ado, I present you Beibei's annotated bibliography about online reviews and their business impact.



User behavior and online reviews

  • Nan Hu, Paul Pavlou and Jie Zhang, in their paper "Overcoming the J-shaped distribution of product reviews" have shown that product reviews have a J-shaped distribution: mostly 5-star ratings, some 1-star ratings, and hardly any ratings in between. What can explain this distribution? They attribute this rating distribution to two biases:
    • Purchasing bias: People that buy a product do not constitute a random sample of the population. People buy products that they believe they will enjoy. So, the reviews are written by people that are more likely to like the product. Since only people with higher product valuations purchase a product, those with lower valuations are less likely to purchase the product, and they will not write a (negative) product review. Purchasing bias causes the positive skewness in the distribution of product reviews and inflates the average.
    • Underreporting bias: Among people who purchased a product, those with extreme ratings (5-star or 1-star) are more likely to express their views to “brag or moan” than those with moderate views.
  • Xinxin Li and Lorin Hitt, in their 2008 paper "Self-Selection and Information Role of Online Product Reviews" have found that online reviews may be subject to a self-selection bias: products are not randomly assigned to reviewers. Rather, early buyers (buyers who also post the first reviews) self-select products that they believe they may enjoy, in the absence of any existing information. This is in contrast to other buyers who wait for more signals about the quality of a product to emerge before being convinced to buy, and who therefore have a lower prior expectation about the product quality. Because the preferences of early buyers systematically differ from those of the broader consumer population, the early reviews can be biased, either in a positive or a negative way. Such bias in reviews will affect sales and reduce consumer surplus, even if all reviews are truthful.
  • Wendy W. Moe and Michael Trusov in their paper "Measuring the Value of Social Dynamics in Online Product Ratings Forums", looked into how social influences affect the subsequent ratings and sales. They demonstrated that reviewer rating behavior is significantly affected by previous ratings. In other words, product reviews not only reflect the customers' experience with the product, but they also affect the ratings of later reviews as well. 
  • Chrysanthos Dellarocas, Guodong (Gordon) Gao, and Ritu Narayan in their paper "Are Consumers More Likely to Contribute Online Reviews for Hit or Niche Products?" show that consumers tend to prefer posting reviews for obscure movies, but also for hit movies that already have a large number of online reviews. The recommendation of the authors to owners of review websites is that the volume of previously posted reviews should be displayed less prominently, in order to encourage the posting of reviews for lesser-known products.
Online product reviews and product sales

  • Judy Chevalier and Dina Mayzlin, in their 2006 paper "The Effect of Word of Mouth on Sales: Online Book Reviews" have first demonstrated that online ratings have significant impact on book sales. The key trick was to monitor the sales of the same book in parallel on Amazon.com and on Barnes & Noble. Since the two sites were selling the same book, any external effect would be similar to both websites. However, effects specific to Amazon or on BN.com would influence sales only on the respective websites (e.g., customer preferences on Amazon, site-specific promotions, etc.). Through this "differences in differences" method, Chevalier and Mayzlin could isolate and measure the effect of product reviews, without worrying about other confounding factors. 
  • Yong Liu, in the 2006 paper "Word of Mouth for Movies: Its Dynamics and Impact on Box Office Revenue" looked at the same topic, but focused on the movie box office. In contrast to Chevalier and Mayzlin, his findings suggested that the valence of reviews does not matter for box office sales, but the review volume does.
  • Pradeep K. Chintagunta, Shyam Gopinath and Sriram Venkataraman, in their 2010 paper "The Effects of Online User Reviews on Movie Box Office Performance: Accounting for Sequential Rollout and Aggregation Across Local Markets" have further studied the impact (valence, volume, and variance) of online reviews by looking at the local geographic movie box office, rather than the national-level aggregate box office performance. After accounting for various potential complications in the analysis, they suggested that it is the valence that seems to matter and not the volume. 
  • Jonah Berger, Alan T. Sorensen and Scott J. Rasmussen, in their 2010 paper "Positive Effects of Negative Publicity: When Negative Reviews Increase Sales" found that negative reviews can boost sales for unknown books, but hurt sales for books with established authors. This happens because negative reviews bring visibility to unknown books. For authors who are already well known, publicity does not boost the awareness of their books; instead, the valence of the publicity becomes more important.
  • Chris Forman, Anindya Ghose and Batia Wiesenfeld, in their 2008 paper "Examining the Relationship Between Reviews and Sales: The Role of Reviewer Identity Disclosure in Electronic Markets" have looked at the role of reviewer identity disclosure (e.g., real name and location of the reviewer) in examining the relationship between Amazon book reviews and sales. They found that the prevalence of reviewer disclosure of identity information is associated with increases in helpfulness rating of the review and the subsequent online product sales. This is because community members more positively assess reviewers who disclose identity-descriptive information, and then use their assessment of reviewers as a heuristic shaping their evaluation of the product reviewed. 
  • Nikolay Archak, Anindya Ghose and Panagiotis G. Ipeirotis (yours truly), in the 2011 paper "Deriving the Pricing Power of Product Features by Mining Consumer Reviews", examine the idea that the textual content of the product reviews is an important determinant of consumers' choices, over and above the valence and volume of reviews. Using text mining tools, they incorporated review text by decomposing textual reviews into segments describing different product features. This work demonstrates how textual data can be used to learn consumers' relative preferences for different product features and also how text can be used for predictive modeling of future changes in sales. 
  • Anindya Ghose and Panagiotis G. Ipeirotis (yours truly, again), in the 2011 paper "Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics", explored the impact of online reviews on helpfulness and product sales, using multiple aspects of review text, such as subjectivity levels, various measures of readability, and the extent of spelling errors. The analysis revealed that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. See also the related blog post that I wrote in January 2010 (yes, even after acceptance, it took 1.5 years for the paper to appear in print).
  • Yubo Chen, Qi Wang and Jinhong Xie, in their paper "Online Social Interactions: A Natural Experiment on Word of Mouth Versus Observational Learning" studied how word-of-mouth (WOM, i.e., others’ opinions) differs from observational learning (i.e., others’ purchase actions) in influencing sales. They found that:
    • negative WOM is more influential than positive WOM;
    • positive observational learning information significantly increases sales, but negative information has no effect (e.g., reporting purchase statistics helps popular products, without hurting niche ones);
    • the sales impact of observational learning increases with WOM volume
  • Michael Luca, in his "job market paper" "Reviews, Reputation, and Revenue: The Case of Yelp.com" used a nice trick for estimating the causal effect of consumer reviews from Yelp.com on restaurant demand. Using revenue data from the state of Washington, he examined the effect of having an extra "half star" on Yelp. The key trick is to exploit the discontinuity in the way that Yelp assigns aggregate scores: A restaurant with a 3.76 average review rating gets a 4-star rating, while a restaurant with a 3.74 average review rating gets a 3.5-star rating. So, if there is a big gap in the revenues between restaurants with scores of 3.76 and 3.74, then this revenue gap (which actually exists) can be attributed to Yelp, and to its summary rating. (This blog post presents further analysis of the paper, and also mentions a similar use of this discontinuity trick to study the effect of sanitary scores in NYC: a restaurant may get an "A" score with $x$ penalty points, and another gets a "B" with $x+1$ penalty points). Luca found discontinuous jumps in restaurant sales that follow the discontinuous jumps in the ratings around the rounding thresholds. This finding strongly suggests that changes in ratings (e.g., from just below a rounding threshold to just above it) can have a significant causal impact on restaurant demand.

Online word of mouth and firms

  • Michael Trusov, Randolph E. Bucklin, and Koen Pauwels in their 2009 paper "Effects of Word-of-Mouth Versus Traditional Marketing: Findings from an Internet Social Networking Site" compared the effects of word-of-mouth marketing versus traditional marketing, as judged from the member growth at an Internet social networking site. They found that WOM referrals (i.e., invitations) not only produce a substantially higher short-term response, but also have substantially longer carryover effects in the long run than traditional marketing actions (e.g., promotion events, media appearances).
  • David Godes and Dina Mayzlin, in their 2009 paper "Firm-Created Word-of-Mouth Communication: Evidence from a Field Test" examined how a firm should try to create useful word-of-mouth. They looked at who creates WOM and what kind of WOM matters. They found that for a product with a low initial awareness level, the WOM that is most effective at driving sales is created by less loyal (not highly loyal) customers and occurs between acquaintances (not friends). They also found that although "opinion leadership" is useful in identifying potentially effective spreaders of WOM among very loyal customers, it is less useful for the sample of less loyal customers.
  • Jackie Y. Luan and Scott Neslin, in their paper "The Development and Impact of Consumer Word of Mouth in New Product Diffusion" focused on how word-of-mouth (WOM) influences new product adoption in the video game market. Specifically, they were able to measure how effectively firms' marketing efforts generate WOM (buzz) and to determine whether WOM influences product adoption primarily through an informative role (i.e., helping the consumer learn product quality) or a persuasive role (i.e., exerting a direct impact on sales, for example, by increasing awareness).



If you have any other papers that you think that should be included in the list, please add your recommendation in the comments, together with a brief description of the conceptual and methodological contribution of the paper.

Monday, July 25, 2011

Native vs Grapevine Reputation on MTurk

The Mechanical Turk blog has a new entry today, by Sharon (Chiarella), titled "Cooking with Sharon" & Tip #3 Manage Your Reputation.

In the article, Sharon encourages requesters to do the following:
  • Pay well - Don’t be fooled into underpaying Workers by comparing your HITs to low priced HITs that aren’t being completed.
  • Pay fairly – Don’t reject an Assignment unless you’re SURE it’s the Worker who is wrong.
  • Pay quickly – If you approve or reject Assignments once a week, Workers may do a few HITs and then wait to see if they are paid before doing more. This is especially true if you’re a new Requester and haven’t established your reputation yet.
Sharon then explains that workers do talk with each other in the forums, on Turkopticon, and so on, and collectively establish the reputation of the requester based on these factors. While there is nothing wrong with this "grapevine"-based reputation, it also illustrates some obvious features that the Mechanical Turk platform is missing.

Instead of outsourcing the task to third-party forums, Amazon should provide features that make the reputation of the requester more transparent, visible, and objective.

For example, each requester could have a profile, in which the workers can see:
  • The total number of HITs, and rewards posted by the requester
  • The rejection rate for the requester
  • The distribution of working time for the HITs of the requester
  • The effective hourly wage for the tasks completed for the requester
  • The payment lag from completion of the task until payment
These are all elements that workers would find useful. They are statistics that contribute to the transparency of the market, and their objective nature makes the establishment of reputation much faster. Such objective characteristics complement the more subjective features used in the grapevine-based reputation systems (Turker Nation, Turkopticon, etc.), where only a subset of workers contribute, and where they report personal perceptions (e.g., was this task "well-paid" or not?). Of course, subjective reputation systems will continue to play their role, providing information that cannot be easily quantified. But they should not be the only reputation signal for the market.
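To illustrate, here is a minimal sketch of how such requester-level statistics could be computed from per-assignment records. The record fields below are hypothetical and illustrative, not actual MTurk API names:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Assignment:
    """Hypothetical per-assignment record for one requester."""
    reward: float          # reward paid, in dollars
    work_seconds: float    # time the worker spent on the assignment
    rejected: bool
    submitted_at: datetime
    paid_at: datetime

def requester_profile(assignments: List[Assignment]) -> dict:
    """Aggregate the kind of requester-level statistics proposed above.
    Assumes at least one approved assignment."""
    approved = [a for a in assignments if not a.rejected]
    total_hours = sum(a.work_seconds for a in approved) / 3600
    payment_lags = sorted((a.paid_at - a.submitted_at).days for a in approved)
    return {
        "rejection_rate": sum(a.rejected for a in assignments) / len(assignments),
        "effective_hourly_wage": sum(a.reward for a in approved) / total_hours,
        "median_payment_lag_days": payment_lags[len(payment_lags) // 2],
    }
```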

Could there be side-effects if such a system is deployed? Yes. I can see some cases where this profile can introduce strange incentives in the market. (For example, it may be good to have a few of my tasks spammed and still pay immediately for the results, so that I can have high acceptance rate, HITs that require only a little bit of time to be completed, and show a high hourly wage.) But these are just details that can be addressed. There is no way that overall the market could suffer when such statistics become publicly available. (Sorry Mr \$0.23/hr-requester, you are not that valuable.)

Markets operate based on trust and are better with increased information efficiency. Any step towards this direction is a good step for the market participants and, by extension, for the market owner.

Friday, July 22, 2011

A tale about parking

The media attention to my prior blog post was really not something that I enjoyed. Not so much for the attention itself but for focusing on exactly the wrong issues. That post was NOT about me and my evaluation. This is not the main point. I thought that the salary issue was worth mentioning (apparently, it was not) but it was, indeed, a MINOR part of the issue.

In fact, after reflecting on this point, I realized the following: Even if I had received a \$1M bonus from NYU for my efforts, the basic problem would still be there: the teaching experience would degenerate into a witch hunt, focusing on cheating, instead of being about learning. And yes, I would still write the same blog post even if I were fully satisfied with my annual evaluation. In fact, the blog post had been sitting in my folder of draft posts for a few months, long before I received my annual evaluation.

If you want a parallel, consider this hypothetical story:



A tale about parking

Suppose that you live in a city with a huge traffic problem, and a resulting huge parking problem. Too many cars on the street.

People try to find parking, and they drive around, and around. A lot. Some drivers get frustrated and they double park. Some drivers are stupid enough to double park during rush hour, block the traffic, and leave the car unattended. As expected, the police arrive and give the offender a ticket, sometimes taking the car away as well. However, during quiet hours, when there is no traffic, many drivers double park, but they do not block the traffic, and nobody gives them a ticket.

Suddenly, in one neighborhood only, call it Redwich Village, a lone policeman starts assigning tickets for every parking violation. No matter if it is minor or major. No matter if the driver just stepped out, or if it is the first time that the driver double parked. Zero-tolerance policy.

By doing that, and being more vigilant, our lone policeman assigns 10 times more tickets than before. He also loses countless hours fighting with the offenders. This continuous fighting also annoys some other residents of the neighborhood, who want the policeman to focus on policing the neighborhood, not spend all his time giving parking tickets.

But even our lone policeman gets frustrated: he realizes that he did not become a policeman to give parking tickets. While it is part of his duties, he feels that it is just better not to be so aggressive. His boss also gets a report that many neighborhood residents are annoyed. His boss knows that the complaints are due to the zero-tolerance policy on parking tickets. Still, he says that he would like our lone policeman to continue this idiosyncratic zero-tolerance policy, which no other policeman enforces, and to be as diligent with his other duties as before.

Our lone policeman goes on and reflects on the overall experience. He realizes that he is fighting a losing battle. As the number of cars in the city increases, more people will park illegally.

So, our lone policeman suggests that we need to do something more fundamental about the parking problem: He suggests that people could carpool, use bicycles or mass transit, or simply walk. And he asks people to think of more such alternatives. If there are fewer cars in the city, the problem will be resolved.

He describes all his thoughts in his blog, in a long post, titled "Why I will never give parking tickets again." He describes the futility of parking tickets to fight the underlying problem, and vows never to be so vigilant about parking tickets. He will be as vigilant as all the other policemen, which is as vigilant as he was before.

His blog post goes viral. The media pick up fragments, and everyone reads whatever they want to read. Some headlines:
  • "Parking tickets in Redwich Village increase by 1000%. Is it impossible to park your car in Redwich?"
  • "Parking-related violations skyrocket in Redwich Village. Policeman punished for enforcing the rules."
  • "RedWich Village sucks. Only scumbags live in RedWich Village, what did you expect? Any lawful behavior?"
  • "Stupid city residents: We know that all people that live in cities are cheaters and park illegally"
  • "Why the government does not reward this honest policeman?"
  • "Why this policeman is vowing not to obey the law? Oh the society..."
Now, some of the business owners of Redwich Village are annoyed because people may not drive to Redwich, if they think it is impossible to find parking. Some residents are also annoyed because real estate prices may go down if people believe that Redwich is a place where you cannot park your car. After all, it is all a matter of reputation.

And in this brouhaha, nobody pays any attention to the underlying problem. Is increased vigilance the solution to the parking problem? Should we give more tickets? Should we install cameras? Or should we try to follow the suggestions of our lone policeman and think of other ways to reduce traffic, and therefore resolve the parking problem on a more fundamental level?

The blog post of our lone policeman is neither about the policeman nor about Redwich. It is about the fact that there is too much traffic in the whole city. Which in turn causes the parking problem. Parking scarcity is the symptom, not the real problem. And while he wrote about the traffic problem and suggested solutions, 99% of the coverage was about Redwich and about his own evaluation.



This is exactly how the discussion about cheating evolved in the media. Instead of focusing on how to make student evaluation objective and cheating-proof, the discussion focused on whether my salary went up sufficiently or not. This is not the main point. It is not even a minor point, on reflection. The real question is how we can best evaluate our students, and which evaluation strategies are robust to cheating, encourage creativity, and evaluate true learning.

And this is not a discussion that can be done while screaming.

Sunday, July 17, 2011

Why I will never pursue cheating again

The post is temporarily removed. I will restore it after ensuring that there are no legal liabilities for myself or my employer.

Until then, you can read my commentary in my new blog post: A tale about parking.

The discussion on Hacker News was good as well. Also see the response that I posted at the Business Insider website and the coverage at Inside Higher Education.

Sunday, June 26, 2011

Extreme value theory 101, or Newsweek researching minimum wage on Mechanical Turk

Last week, Newsweek published an article titled The Real Minimum Wage. The authors report that "in a weeks-long experiment, we posted simple, hourlong jobs (listening to audio recordings and counting instances of a specific keyword) and continually lowered our offer until we found the absolute bottom price that multiple people would accept, and then complete the task."

The results "showed" that Americans are the ones willing to accept the lowest possible salary for working on a task, compared even to people in India, Romania, Philippines, etc. In fact, they found the that there are Americans willing to work for 25 cents per hour, while they could not find anyone willing to work for less than \$1/hr in any other country. The conclusion of the article? Americans are more desperate than anyone else in the world.

What is the key problem of this study? There are many more US-based workers on Mechanical Turk compared to other nationalities. So, if you have a handful of workers from other countries, and hundreds of workers from the US, you are guaranteed to find more extreme findings for the US. Why? To put it simply, you are searching harder within the US to find small values, compared to the effort placed on other countries. (There are other issues as well, e.g., workers that would work on this task are not necessarily representative of the overall population; the same workers are exposed to multiple, decreasing salaries, issues of anchoring, issues of workers falsely reporting to be from the US, whether the authors checked IP geo-location, etc. While all these are valid concerns, they are secondary to the very basic statistical problem.)

Finding a Minimum Value: A Probabilistic Approach

On an abstract, statistical level, by testing workers from multiple countries to determine their minimum acceptable wage, we sample from multiple "minimum wage distributions," trying to find the smallest value within each one of them.

Each probability distribution corresponds to the minimum wages that workers from a different country are willing to accept. Let's call the CDFs of these distributions $F_i(x)$, with, say, $F_1(x)$ being the distribution of minimum wages for the US, $F_2(x)$ for India, $F_3(x)$ for the UK, etc.

As a simplifying example, assume that $F(x)$ is a uniform distribution, with a minimum value of \$0 and a maximum value of \$10, for an average acceptable minimum wage of \$5. This means that:
  • 10% of the population will accept a minimum wage below \$1, (i.e., $F(\$1)=0.1$)
  • 20% of the population will accept a minimum wage below \$2, (i.e., $F(\$2)=0.2$)
  • ...
  • 90% of the population will accept a minimum wage below \$9, (i.e., $F(\$9)=0.9$)
  • 100% of the population will accept a minimum wage below \$10, (i.e., $F(\$10)=1.0$)

Now, let's assume that we sample $n$ workers from one of the country-specific distributions. After running the experiment, we get back measurements $x_1, \ldots, x_n$, each one corresponding to the minimum acceptable wage of one of the workers who participated in the study, all coming from the country that we are measuring.

What is the probability of one of these wages being below, say, $z=\$0.25$? Here is the probability calculation:

$\begin{eqnarray}
Pr(\mathit{min~wage} < z) &=& 1 - Pr(\mathit{all~wages} \geq z)\\
& =& 1 - Pr(x_1 \geq z, \ldots, x_n \geq z)
\end{eqnarray}$

Assuming independence across the sampled values, we have:

$\begin{eqnarray}
Pr(\mathit{min~wage} < z) &=& 1 - \prod_{i=1}^n Pr(x_i \geq z) \\
& =& 1 - \left(1 - F(z) \right)^n
\end{eqnarray}$


So, if we sample $n$ workers, set the minimum wage at $z=\$0.25$, and assume a uniform distribution for $F$, then $F(\$0.25)=0.025$ and the probability that we will find at least one worker willing to work for 25 cents is:

$Pr(\mathit{min~wage} < z) = 1 - 0.975^n$

Plotting this, as a function of $n$, we have the following:


As we get more and more workers, it becomes more and more likely that we will find at least one value at or below 25 cents/hour.
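Here is a minimal Python sketch (my own) that evaluates this probability for a few sample sizes. The value of $F(z)$ is a single parameter, so any other assumed distribution (e.g., a lognormal) can be plugged in by changing that one number:

```python
# Probability of observing at least one worker willing to work below z = $0.25/hr,
# assuming (as in the example above) a uniform F with F($0.25) = 0.025,
# as a function of the number n of sampled workers from that country.
F_z = 0.025

for n in (10, 25, 50, 100, 200, 500):
    prob = 1 - (1 - F_z) ** n
    print(f"n={n:4d}: Pr(min observed wage < $0.25) = {prob:.2f}")
```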

So, how does this approach explain the findings of Newsweek?

We know that not all countries are equally represented on Mechanical Turk. Most workers are from the US (50% or so), followed by India (35% or so), and then by Canada (2%), the UK (2%), the Philippines (2%), and a variety of other countries with similarly small percentages. This means that in the study we expect to have more Americans participating, followed by Indians, and then a variety of other countries. So, even if the distribution of minimum wages were identical across all countries, we would expect to find the lowest wages in the country with the largest number of participants.

Since the majority of the workers on Mechanical Turk are from the US, followed by India, then Canada, the UK, etc., the illustration by Newsweek simply gives us the countries of origin of the workers, in reverse order of popularity!


At this point, someone may ask: what happens if the distribution is not uniform but, say, lognormal? (A much more plausible distribution for minimum acceptable wages.) For this specific question, as you can see from the analysis above, it does not make much of a difference: the only thing that we need to know is the value of $F(z)$ at the $z$ value of interest.

Going in depth: Extreme Value Theory

A more general question is: What is the maximum (or minimum) value that we expect to find when we sample from an arbitrary distribution? This is the topic of extreme value theory, a field of statistics that tries to predict the probability of extreme events (e.g., what is the biggest possible drop in the stock market? what is the biggest rainfall in this region?). Given the events in the financial markets in 2008, this theory has received significant attention in the last few years.

What is nice about this theory is that the fundamentals can be summarized very succinctly. The Fisher–Tippett–Gnedenko theorem states that, if we sample from a distribution, the maximum value that we expect to find is a random variable belonging to one of three distributions:
  • If the distribution from which we are sampling has a tail that decreases exponentially (e.g., normal distribution, exponential, Gamma, etc), then the maximum value is described by the (reversed) Gumbel distribution (aka "type I extreme value distribution")
  • If the distribution from which we are sampling has a tail that decreases as a polynomial (i.e., has a "long tail") (e.g., power-laws, Cauchy, Student-t, etc), then the maximum value is described by the Frechet distribution (aka "type II extreme value distribution")
  • If the distribution from which we are sampling has a tail that is finite (i.e., has a "short tail") (e.g., uniform, Beta, etc), then the maximum follows the (reversed) Weibull distribution (aka "type III extreme value distribution")

The three types of the distributions are all special cases of the generalized extreme value distribution.

This theory has significant applications not only when modeling risk (stock market, weather, earthquakes, etc.), but also when modeling human decision-making: often, we model humans as utility maximizers who make decisions that maximize their own well-being. This maximum-seeking behavior often results in the distributions described above. I will give a more detailed description in a later blog post.

Friday, June 24, 2011

Accepted papers for the 3rd Human Computation Workshop (HCOMP 2011)

We have posted online the schedule for the 3rd Human Computation Workshop (HCOMP 2011), which will be organized as part of AAAI 2011, in San Francisco, on August 8th. The registration fee for participating in the workshop is a pretty modest \$125 for graduate students, and \$155 for other participants. Just make sure to register before July 1st to get these rates, as afterwards the rates jump to \$165 and \$185. I should also mention that, following the tradition established in Paris in HCOMP 2009, we will have a group dinner for all the participants after the workshop to continue the discussions from the day...

We have a strong program, with 16 long papers accepted, and 16 papers being presented as demos and posters. Below you can find the titles of the papers and their abstracts. The PDF versions of the papers will be posted online by AAAI after the completion of the conference, and will then be available through the AAAI Digital Library. Until then, you can search Google, or just ask the authors for a pre-print. So, if you are interested in crowdsourcing and human computation, we hope to see you in San Francisco in August!



Long Papers

  • Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds
    Sudheendra Vijayanarasimhan, Kristen Grauman (UT Austin)

    Active learning and crowdsourcing are promising ways to efficiently build up training sets for object recognition, but thus far techniques are tested in artificially controlled settings. Typically the vision researcher has already determined the dataset's scope, the labels ``actively" obtained are in fact already known, and/or the crowd-sourced collection process is iteratively fine-tuned. We present an approach for *live learning* of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. To address the technical issues such a large-scale system entails, we introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with a hashing-based solution. We demonstrate the approach with experiments of unprecedented scale and autonomy, and show it successfully improves the state-of-the-art for the most challenging objects in the PASCAL benchmark. In addition, we show our detector competes well with popular nonlinear classifiers that are much more expensive to train.

  • Robust Active Learning using Crowdsourced Annotations for Activity Recognition
    Liyue Zhao, Gita Sukthankar (UCF); Rahul Sukthankar (Google Research/CMU)

    Recognizing human activities from wearable sensor data is an important problem, particularly for health and eldercare applications. However, collecting sufficient labeled training data is challenging, especially since interpreting IMU traces is difficult for human annotators. Recently, crowdsourcing through services such as Amazon's Mechanical Turk has emerged as a promising alternative for annotating such data, with active learning serving as a natural method for affordably selecting an appropriate subset of instances to label. Unfortunately, since most active learning strategies are greedy methods that select the most uncertain sample, they are very sensitive to annotation errors (which corrupt a significant fraction of crowdsourced labels). This paper proposes methods for robust active learning under these conditions. Specifically, we make three contributions: 1) we obtain better initial labels by asking labelers to solve a related task; 2) we propose a new principled method for selecting instances in active learning that is more robust to annotation noise; 3) we estimate confidence scores for labels acquired from MTurk and ask workers to relabel samples that receive low scores under this metric. The proposed method is shown to significantly outperform existing techniques both under controlled noise conditions and in real active learning scenarios. The resulting method trains classifiers that are close in accuracy to those trained using ground-truth data.

  • Beat the Machine: Challenging workers to find the unknown unknowns
    Josh Attenberg, Panos Ipeirotis, Foster Provost (NYU)

    This paper presents techniques for gathering data that expose errors of automatic classification models. Prior work has demonstrated the promise of having humans seek training data, as an alternative to active learning, in cases where there is extreme class imbalance. We now explore the direction where we ask humans to identify cases that will cause the classification system to fail. Such techniques are valuable in revealing problematic cases that do not reveal themselves during the normal operation of the system, and may include cases that are rare but catastrophic. We describe our approach for building a system to satisfy these requirements, trying to encourage humans to provide us with such data points. In particular, we reward a human when the provided example is difficult for the model to handle, and the reward is proportional to the magnitude of the error. In a sense, the humans are asked to ''Beat the Machine'' and find cases where the automatic model (''the machine'') is wrong. Our experimental data show that the density of the identified problems is an order of magnitude higher compared to alternative approaches, and that the proposed technique can identify quickly the ``big flaws'' that would typically remain uncovered.

  • Human Intelligence Needs Artificial Intelligence
    Daniel Weld, Mausam Mausam, Peng Dai (University of Washington)

    Crowdsourcing platforms, such as Amazon Mechanical Turk, have enabled the construction of scalable applications for tasks ranging from product categorization and photo tagging to audio transcription and translation. These vertical applications are typically realized with complex, self-managing workflows that guarantee quality results. But constructing such workflows is challenging, with a huge number of alternative decisions for the designer to consider. Artificial intelligence methods can greatly simplify the process of creating complex crowdsourced workflows. We argue this thesis by presenting the design of TurKontrol 2.0, which uses machine learning to continually refine models of worker performance and task difficulty. Using these models, TurKontrol 2.0 uses decision-theoretic optimization to 1) choose between alternative workflows, 2) optimize parameters for a workflow, 3) create personalized interfaces for individual workers, and 4) dynamically control the workflow. Preliminary experience suggests that these optimized workflows are significantly more economical than those generated by humans.

  • Worker Motivation in Crowdsourcing and Human Computation
    Nicolas Kaufmann, Thimo Schulze (University of Mannheim)

    Many human computation systems use crowdsourcing markets like Amazon Mechanical Turk to recruit human workers. The payment in these markets is usually very low, and yet the demographic data collected so far show that the participants are a very diverse group, including highly skilled full-time workers. Many existing studies of their motivation are rudimentary and not grounded in established motivation theory. Therefore, we adapt different models from classic motivation theory, work motivation theory, and Open Source Software Development to crowdsourcing markets. The model is tested with a survey of 431 workers on Mechanical Turk. We find that the extrinsic motivational categories (immediate payoffs, delayed payoffs, social motivation) have a strong effect on the time spent on the platform. For many workers, however, intrinsic motivation aspects are more important, especially the different facets of enjoyment-based motivation such as “task autonomy” and “skill variety”. Our contribution is a preliminary model, based on established theory, intended for the comparison of different crowdsourcing platforms.

  • Honesty in an Online Labor Market
    Winter Mason, Siddharth Suri, Daniel Goldstein (Yahoo! Research)

    The efficient functioning of markets and institutions assumes a certain degree of honesty from participants. In labor markets, for instance, employers benefit from employees who render meaningful work, and employees benefit from employers who pay the promised amount for services rendered. We use an established method for detecting dishonest behavior in a series of experiments conducted on Amazon Mechanical Turk, a popular online labor market. Our first experiment estimates a baseline amount of dishonesty for this task in the population sample. The second experiment tests the hypothesis that the level of dishonesty in the population is sensitive to the relative amount that can be gained by dishonest reporting, and the third experiment manipulates the degree to which dishonest reporting can be detected at the individual level. We conclude with a demographic and cross-cultural analysis of the predictors of dishonest reporting in this market.

  • Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection
    David Chen (UT Austin); William Dolan (Microsoft Research)

    Traditional methods of collecting translation and paraphrase data are prohibitively expensive, making the construction of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. We discuss a novel annotation task that uses videos as the stimulus, which discourages cheating. It also requires only monolingual speakers, making it easier to scale since more workers are qualified to contribute. Finally, we employed a multi-tiered payment system that helps retain good workers over the long term, resulting in a persistent, high-quality workforce. We present the results of one of the largest linguistic data collection efforts using Mechanical Turk, yielding 85K English sentences and more than 1K sentences for each of a dozen more languages.

  • CrowdSight: Rapidly Prototyping Intelligent Visual Processing Apps
    Mario Rodriguez (UCSC); James Davis

    We describe a framework for rapidly prototyping applications that require intelligent visual processing but for which reliable algorithms do not yet exist, or for which engineering those algorithms is too costly. The framework, CrowdSight, leverages the power of crowdsourcing to offload intelligent processing to humans, and enables new applications to be built quickly and cheaply, affording system builders the opportunity to validate a concept before committing significant time or capital. Our service accepts requests from users either via email or simple mobile applications, and handles all the communication with a backend human computation platform. We build redundant requests and data aggregation into the system, freeing the user from managing these requirements. We validate our framework by building several test applications and verifying that prototypes can be built more easily and quickly than would be the case without the framework.

  • Digitalkoot: Making Old Archives Accessible Using Crowdsourcing
    Otto Chrons, Sami Sundell (Microtask)

    In this paper, we present Digitalkoot, a system for fixing errors in the Optical Character Recognition (OCR) process of old texts through the use of human computation. By turning the work into simple games, we are able to attract a great number of volunteers to donate their time and cognitive capacity to the cause. Our analysis shows how untrained people can reach very high accuracy through the use of crowdsourcing. Furthermore, we analyze the effect of social media and gender on participation levels and the amount of work accomplished.

  • Error Detection and Correction in Human Computation: Lessons from the WPA
    David Alan Grier (GWU)

    Human computation is, of course, a very old field with a forgotten literature that treats many of the key problems, especially error detection and correction. The obvious method of error detection, duplicate calculation, has proven to be subject to Babbage's Rule: different workers using the same methods on the same data will tend to make the same errors. To avoid the consequences of this rule, early human computers developed a disciplined regimen to identify and correct mistakes. This paper reconstructs those methods, puts them in a modern context, and identifies their implications for the modern version of human computation.

  • Programmatic gold: targeted and scalable quality assurance in crowdsourcing
    Dave Oleson, Vaughn Hester, Alex Sorokin, Greg Laughlin, John Le, Lukas Biewald (CrowdFlower)

    Crowdsourcing is an effective tool for scalable data annotation in both research and enterprise contexts. Due to crowdsourcing's open participation model, quality assurance is critical to the success of any project. Present methods rely on EM-style post-processing or manual annotation of large gold standard sets. In this paper we present an automated quality assurance process that is inexpensive and scalable. Our novel process relies on programmatic gold creation to provide targeted training feedback to workers and to prevent common scamming scenarios. We find that it decreases the amount of manual work required to manage crowdsourced labor while improving the overall quality of the results.
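
    As a rough sketch of how gold-based quality control of this kind operates (the thresholds, data, and function names below are illustrative assumptions, not CrowdFlower's actual pipeline):

      import random
      from collections import defaultdict

      GOLD = {"unit_17": "cat", "unit_42": "dog", "unit_99": "cat"}   # units with known answers
      ACCURACY_THRESHOLD = 0.7                                        # illustrative cutoff

      def build_batch(work_units, gold_units, gold_fraction=0.2):
          """Intersperse gold units among real work units so workers cannot tell them apart."""
          n_gold = max(1, int(gold_fraction * len(work_units)))
          batch = list(work_units) + random.sample(list(gold_units), n_gold)
          random.shuffle(batch)
          return batch

      def score_workers(judgments):
          """judgments: list of (worker_id, unit_id, answer). Accuracy on gold units only."""
          hits, seen = defaultdict(int), defaultdict(int)
          for worker, unit, answer in judgments:
              if unit in GOLD:
                  seen[worker] += 1
                  hits[worker] += int(answer == GOLD[unit])
          return {w: hits[w] / seen[w] for w in seen}

      def flagged_workers(judgments):
          """Workers who would receive targeted feedback or be excluded."""
          return {w for w, acc in score_workers(judgments).items() if acc < ACCURACY_THRESHOLD}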

  • An Iterative Dual Pathway Structure for Speech-to-Text Transcription
    Beatrice Liem, Haoqi Zhang, Yiling Chen (Harvard University)

    In this paper, we develop a new human computation algorithm for speech-to-text transcription that can potentially achieve the high accuracy of professional transcription using only microtasks deployed via an online task market or a game. The algorithm partitions audio clips into short 10-second segments for independent processing and joins adjacent outputs to produce the full transcription. Each segment is sent through an iterative dual pathway structure that allows participants in either path to iteratively refine the transcriptions of others in their path while being rewarded based on transcriptions in the other path, eliminating the need to check transcripts in a separate process. Initial experiments with local subjects show that produced transcripts are on average 96.6% accurate.

  • An Extendable Toolkit for Managing Quality of Human-based Electronic Services
    David Bermbach, Robert Kern, Pascal Wichmann, Sandra Rath, Christian Zirpins (KIT)

    Micro-task markets like Amazon MTurk enable online workers to provide human intelligence as Web-based, on-demand services (so-called people services). Businesses facing large amounts of knowledge work can benefit from the increased flexibility and scalability of their workforce, but need to cope with reduced control over result quality. While this problem is well recognized, it is so far only rudimentarily addressed by existing platforms and tools. In this paper, we present a flexible research toolkit that enables experiments with advanced quality management mechanisms for generic micro-task markets. The toolkit enables control of the correctness and performance of task fulfillment by means of dynamic sampling, weighted majority voting, and worker pooling. We demonstrate its application and performance in an OCR scenario building on Amazon MTurk. The toolkit, however, enables the development of advanced quality management mechanisms for a large variety of people-service scenarios and platforms.
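
    Of the mechanisms listed, weighted majority voting is the easiest to illustrate; a minimal sketch, assuming worker accuracies have already been estimated somehow (the log-odds weighting is one standard choice, not necessarily the toolkit's):

      import math
      from collections import defaultdict

      def weighted_majority(labels, worker_accuracy):
          """labels: {worker_id: label}; worker_accuracy: {worker_id: estimated accuracy}.
          Each vote is weighted by the log-odds of that worker being correct."""
          scores = defaultdict(float)
          for worker, label in labels.items():
              acc = min(max(worker_accuracy.get(worker, 0.5), 1e-3), 1 - 1e-3)
              scores[label] += math.log(acc / (1 - acc))
          return max(scores, key=scores.get)

      # Two mediocre workers are outvoted by one highly accurate worker.
      print(weighted_majority({"w1": "yes", "w2": "yes", "w3": "no"},
                              {"w1": 0.55, "w2": 0.55, "w3": 0.95}))   # -> "no"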

  • What’s the Right Price? Pricing Tasks for Finishing on Time
    Siamak Faridani, Bjoern Hartmann (UC Berkeley); Panos Ipeirotis (NYU)

    Many practitioners currently use rules of thumb to price tasks on online labor markets. Incorrect pricing leads to task starvation or inefficient use of capital. Formal optimal pricing policies can address these challenges. In this paper we argue that an optimal pricing policy must be based on the tradeoff between price and desired completion time. We show how this duality can lead to a better pricing policy for tasks in online labor markets. This paper makes three contributions. First, we devise an algorithm for optimal job pricing using a survival analysis model. We then show that worker arrivals can be modeled as a non-homogeneous Poisson Process (NHPP). Finally, using NHPP for worker arrivals and discrete choice models, we present an abstract mathematical model that captures the dynamics of the market when full market information is presented to the task requester. This model can be used to predict completion times and optimal pricing policies for both public and private crowds.
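
    To make the price/completion-time tradeoff concrete, here is a toy calculation in the same spirit (the arrival-rate and acceptance functions below are invented placeholders, not the models fitted in the paper):

      import math

      def arrival_rate(t_hours):
          """Illustrative NHPP intensity: more workers online at some hours than others."""
          return 40 + 25 * math.sin(2 * math.pi * t_hours / 24)

      def accept_prob(price):
          """Illustrative discrete-choice-style probability that an arriving worker takes the task."""
          return 1 / (1 + math.exp(-(price - 0.05) / 0.02))

      def expected_completions(price, deadline_hours, dt=0.1):
          """E[# tasks completed by the deadline] = integral of arrival_rate(t) * accept_prob(price)."""
          steps = int(deadline_hours / dt)
          return sum(arrival_rate(i * dt) * dt for i in range(steps)) * accept_prob(price)

      def cheapest_price(n_tasks, deadline_hours):
          """Smallest price (in one-cent steps) whose expected throughput meets the deadline."""
          price = 0.01
          while expected_completions(price, deadline_hours) < n_tasks and price < 1.0:
              price += 0.01
          return round(price, 2)

      print(cheapest_price(n_tasks=500, deadline_hours=12))   # -> 0.08 with these toy functions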

  • Pricing Mechanisms for Online Labor Market
    Yaron Singer, Manas Mittal (UC Berkeley EECS)

    In online labor markets, determining the appropriate incentives is a difficult problem. In this paper, we present dynamic pricing mechanisms for determining the optimal prices for such tasks. In particular, the mechanisms are designed to handle the intricacies of markets like Mechanical Turk (workers arrive online, requesters have budgets, etc.). The mechanisms have desirable theoretical guarantees (incentive compatibility, budget feasibility, and competitive-ratio performance) and perform well in practice. Experiments demonstrate the effectiveness and feasibility of using such mechanisms in practice.

  • Labor Allocation in Paid Crowdsourcing: Experimental Evidence on Positioning, Nudges and Prices
    John Horton (ODesk); Dana Chandler (MIT)

    This paper reports the results of a natural field experiment where workers from a paid crowdsourcing environment self-select into tasks and are presumed to have limited attention. In our experiment, workers labeled any of six pictures from a 2 x 3 grid of thumbnail images. In the absence of any incentives, workers exhibit a strong default bias and tend to select images from the top-left ("focal") position; the bottom-right ("non-focal") position was the least preferred. We attempted to overcome this bias and increase the rate at which workers selected the least preferred task by using a combination of monetary and non-monetary incentives. We also varied the saliency of these incentives by placing them in either the focal or non-focal position. Although both incentive types caused workers to re-allocate their labor, monetary incentives were more effective. Most interestingly, both incentive types worked better when they were placed in the focal position and made more salient. In fact, salient non-monetary incentives worked about as well as non-salient monetary ones. Our evidence suggests that user interface and cognitive biases play an important role in online labor markets and that salience can be used by employers as a kind of "incentive multiplier".


Posters

  • Developing Scripts to Teach Social Skills: Can the Crowd Assist the Author?
    Fatima Boujarwah, Jennifer Kim, Gregory Abowd, Rosa Arriaga (Georgia Tech)

    The social world that most of us navigate effortlessly can prove to be a perplexing and disconcerting place for individuals with autism. Currently there are no models to assist non-expert authors as they create customized social script-based instructional modules for a particular child. We describe an approach to using human computation to develop complex models of social scripts for a plethora of complex and interesting social scenarios, possible obstacles that may arise in those scenarios, and potential solutions to those obstacles. Human input is the natural way to build these models, and in so doing create valuable assistance for those trying to navigate the intricacies of a social life.

  • CrowdLang - First Steps Towards Programmable Human Computers for General Computation
    Patrick Minder, Abraham Bernstein (University of Zurich)

    Crowdsourcing markets such as Amazon’s Mechanical Turk provide an enormous potential for accomplishing work by combining human and machine computation. Today crowdsourcing is mostly used for massively parallel information processing for a variety of tasks such as image labeling. However, as we move to more sophisticated problem-solving, there is little knowledge about managing dependencies between steps and a lack of tools for doing so. As the contribution of this paper, we present the concept of an executable, model-based programming language and a general-purpose framework for accomplishing more sophisticated problems. Our approach is inspired by coordination theory and an analysis of emergent collective intelligence. We illustrate the applicability of the proposed language by combining machine and human computation based on existing interaction patterns for several general computation problems.

  • Ranking Images on Semantic Attributes using CollaboRank
    Jeroen Janssens, Eric Postma, Jaap Van den Herik (Tilburg University)

    In this paper, we investigate to what extent a large group of human workers is able to collaboratively produce a global ranking of images based on a single semantic attribute. To this end, we developed CollaboRank, a method that formulates and distributes tasks to human workers and aggregates their personal rankings into a global ranking. Our results show that a relatively high consensus can be achieved, depending on the type of semantic attribute.
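
    The abstract does not spell out the aggregation step; a Borda-style count is one simple way to merge many partial personal rankings into a global one (illustrative only, not necessarily CollaboRank's actual rule):

      from collections import defaultdict

      def aggregate_rankings(personal_rankings):
          """personal_rankings: list of lists, each an ordered ranking (best first)
          over a subset of the images. Returns one global ranking, best first."""
          points, appearances = defaultdict(float), defaultdict(int)
          for ranking in personal_rankings:
              n = len(ranking)
              for position, image in enumerate(ranking):
                  points[image] += n - position     # Borda points within this task
                  appearances[image] += 1
          # Normalize so images that appeared in fewer tasks are not penalized.
          avg = {img: points[img] / appearances[img] for img in points}
          return sorted(avg, key=avg.get, reverse=True)

      print(aggregate_rankings([["a", "b", "c"], ["a", "b"], ["c", "a", "b"]]))   # -> ['a', 'c', 'b']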

  • Artificial Intelligence for Artificial Artificial Intelligence
    Peng Dai, Mausam, Daniel Weld (University of Washington)

    Crowdsourcing platforms such as Amazon Mechanical Turk have become popular for a wide variety of human intelligence tasks; however, quality control continues to be a significant challenge. Recently, Dai et al. (2010) proposed TurKontrol, a theoretical model based on POMDPs to optimize iterative, crowdsourced workflows. However, they neither describe how to learn the model parameters nor show its effectiveness in a real crowdsourced setting. Learning is challenging due to the scale of the model and noisy data: there are hundreds of thousands of workers with high-variance abilities. This paper presents an end-to-end system that first learns TurKontrol's POMDP parameters from real Mechanical Turk data, and then applies the model to dynamically optimize live tasks. We validate the model and use it to control a successive-improvement process on Mechanical Turk. By modeling worker accuracy and voting patterns, our system produces significantly superior artifacts compared to those generated through static workflows using the same amount of money.

  • One Step beyond Independent Agreement: A Tournament Selection Approach for Quality Assurance of Human Computation Tasks
    Yu-An Sun, Shourya Roy (Xerox); Greg Little (MIT CSAIL)

    Quality assurance remains a key topic in the human computation research field. Prior work indicates that independent agreement is effective for low difficulty tasks, but has limitations. This paper addresses this problem by proposing a tournament selection based quality control process. The experimental results from this paper show that humans are better at identifying the correct answers than generating them.
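
    A bare-bones sketch of tournament selection over candidate answers; the crowd_prefers callback stands in for a HIT that asks workers to pick the better of two answers (all details here are illustrative, not the paper's exact protocol):

      import random

      def tournament_select(candidates, crowd_prefers):
          """Pit pairs of candidate answers against each other; the crowd-preferred
          answer advances round by round until a single winner remains."""
          pool = list(candidates)
          random.shuffle(pool)
          while len(pool) > 1:
              next_round = []
              for i in range(0, len(pool) - 1, 2):
                  a, b = pool[i], pool[i + 1]
                  next_round.append(a if crowd_prefers(a, b) else b)
              if len(pool) % 2 == 1:              # odd one out advances automatically
                  next_round.append(pool[-1])
              pool = next_round
          return pool[0]

      # Toy usage: the "crowd" prefers longer transcriptions.
      print(tournament_select(["cat", "the cat", "the cat sat"],
                              crowd_prefers=lambda a, b: len(a) > len(b)))   # -> "the cat sat"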

  • Turkomatic: Automatic, Recursive Task and Workflow Design for Mechanical Turk
    Anand Kulkarni, Matthew Can, Bjoern Hartmann (UC Berkeley)

    On today’s human computation systems, designing tasks and workflows is a difficult and labor-intensive process. Can workers from the crowd be used to help plan workflows? We explore this question with Turkomatic, a new interface to microwork platforms that uses crowd workers to help plan workflows for complex tasks. Turkomatic uses a general-purpose divide-and-conquer algorithm to solve arbitrary natural-language requests posed by end users. The interface includes a novel real-time visual workflow editor that enables requesters to observe and edit workflows while the tasks are being completed. Crowd verification of work and the division of labor among members of the crowd can be handled automatically by Turkomatic, which substantially simplifies the process of using human computation systems. These features enable a novel means of interaction with crowds of online workers to support successful execution of complex work.

  • MuSweeper: Collect Mutual Exclusions with Extensive Game
    Tao-Hsuan Chang, Cheng-wei Chan, Jane Yung-jen Hsu (National Taiwan University)

    Mutual exclusions are important information for machine learning. Games With A Purpose (or GWAP) provide an effective way to get large amounts of data from web users. This research proposes MuSweeper, a minesweeper-like game, to collect mutual exclusions. By embedding game theory into the game mechanics, precision is guaranteed. Experiments showed that MuSweeper can efficiently collect mutual exclusions with high precision.

  • MobileWorks: A Mobile Crowdsourcing Platform for Workers at the Bottom of the Pyramid
    Prayag Narula, Philipp Gutheim, David Rolnitzky, Anand Kulkarni, Bjoern Hartmann (UC Berkeley)

    We present MobileWorks, a mobile phone-based crowdsourcing platform. MobileWorks targets workers in developing countries who live at the bottom of the economic pyramid. This population does not have access to desktop computers, so existing microtask labor markets are inaccessible to them. MobileWorks offers human OCR tasks that can be accomplished on low-end mobile phones; workers access it through their mobile web browser. To address the limited screen resolution available on low-end phones, MobileWorks segments documents into many small pieces and sends each piece to a different worker. A first pilot study with 10 users over a period of 2 months showed that it is feasible to do simple OCR tasks using a simple mobile web-based application. We found that, on average, workers complete 120 tasks per hour. With single entry, the accuracy of workers across the different documents is 89%. We propose a multiple-entry solution that increases the theoretical accuracy of the OCR to more than 99%.
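
    The gain from multiple entry can be sanity-checked with a quick majority-vote calculation, under the (optimistic) assumption that workers' errors are independent; only the 89% single-entry figure comes from the abstract:

      from math import comb

      def majority_vote_accuracy(q, k):
          """Probability that the majority of k independent entries, each correct
          with probability q, is correct (k odd)."""
          return sum(comb(k, i) * q**i * (1 - q)**(k - i)
                     for i in range((k + 1) // 2, k + 1))

      for k in (1, 3, 5, 7):
          print(k, round(majority_vote_accuracy(0.89, k), 4))
      # 1 -> 0.89, 3 -> ~0.966, 5 -> ~0.989, 7 -> ~0.996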

  • Towards Task Recommendation in Micro-Task Markets
    Vamsi Ambati, Stephan Vogel, Jaime Carbonell (CMU)

    As researchers embrace micro-task markets for eliciting human input, the nature of the posted tasks moves from those requiring simple mechanical labor to those requiring specific cognitive skills. At the same time, the number of such tasks and the user population in micro-task marketplaces are growing, requiring better search interfaces for productive user participation. In this paper we posit that understanding users' skill sets and presenting them with suitable tasks not only maximizes the overall quality of the output, but also attempts to maximize the benefit to the user in terms of more successfully completed tasks. We also implement a recommendation engine for suggesting tasks to users based on implicit modeling of skills and interests. We present results from a preliminary evaluation of our system using publicly available data gathered from a variety of human computation experiments recently conducted on Amazon's Mechanical Turk.

  • On Quality Control and Machine Learning in Crowdsourcing
    Matthew Lease (UT Austin)

    The advent of crowdsourcing has created a variety of new opportunities for improving upon traditional methods of data collection and annotation. This in turn has created intriguing new opportunities for data-driven machine learning (ML). Convenient access to crowd workers for simple data collection has further generalized to leveraging more arbitrary crowd-based human computation to supplement ML. While new potential applications of crowdsourcing continue to emerge, a variety of practical and sometimes unexpected obstacles have already limited the degree to which its promised potential can be actually realized in practice. This paper considers two particular aspects of crowdsourcing and their interplay, data quality control (QC) and ML, reflecting on where we have been, where we are, and where we might go from here.

  • CollabMap: Augmenting Maps using the Wisdom of Crowds
    Ruben Stranders, Sarvapali Ramchurn, Bing Shi, Nicholas Jennings (University of Southampton)

    In this paper we develop a novel model of geospatial data creation, called CollabMap, that relies on human computation. CollabMap is a crowdsourcing tool that contracts users, via Amazon Mechanical Turk or a similar service, to perform micro-tasks that involve augmenting existing maps (e.g., GoogleMaps or Ordnance Survey) by drawing evacuation routes, using satellite imagery from GoogleMaps and panoramic views from Google Street View. We use human computation to complete tasks that are hard for a computer vision algorithm to perform, or to generate training data that could be used by a computer vision algorithm to automatically define evacuation routes.

  • Improving Consensus Accuracy via Z-score and Weighted Voting
    Hyun Joon Jung, Matthew Lease (UT Austin)

    We describe a Z-score based outlier detection method for detecting and filtering inaccurate crowd workers. After filtering, we aggregate labels from the remaining workers via simple majority voting or feature-weighted voting. Both supervised and unsupervised features are used, individually and in combination, for both outlier detection and weighted voting. We evaluate on noisy judgments collected from Amazon Mechanical Turk which assess the Web search relevance of query/document pairs. We find that filtering in combination with multi-feature weighted voting achieves an 8.94% relative error reduction for graded accuracy (4.25% absolute) and 5.32% for binary accuracy (3.45% absolute).
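
    A stripped-down version of the filter-then-vote idea, using a single unsupervised feature (each worker's agreement with the per-item majority); the paper combines richer supervised and unsupervised features, so treat this purely as a sketch:

      import statistics
      from collections import Counter, defaultdict

      def filter_and_vote(judgments, z_cutoff=-1.5):
          """judgments: {item_id: {worker_id: label}}. Workers whose agreement rate with
          the naive per-item majority has a z-score below z_cutoff are dropped; the
          remaining labels are aggregated by majority vote."""
          agree, seen = defaultdict(int), defaultdict(int)
          for labels in judgments.values():
              majority = Counter(labels.values()).most_common(1)[0][0]
              for worker, label in labels.items():
                  seen[worker] += 1
                  agree[worker] += int(label == majority)
          rates = {w: agree[w] / seen[w] for w in seen}
          mu = statistics.mean(rates.values())
          sigma = statistics.pstdev(rates.values()) or 1.0
          keep = {w for w, r in rates.items() if (r - mu) / sigma > z_cutoff}
          result = {}
          for item, labels in judgments.items():
              kept = Counter(label for w, label in labels.items() if w in keep)
              result[item] = (kept or Counter(labels.values())).most_common(1)[0][0]
          return result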

  • Making Searchable Melodies: Human vs. Machine
    Mark Cartwright, Zafar Rafii, Jinyu Han, Bryan Pardo (Northwestern University)

    Systems that find music recordings based on hummed or sung melodic input are called Query-By-Humming (QBH) systems. Such systems employ search keys that are more similar to a cappella singing than the original recordings. Successful deployed systems use human computation to create these search keys: hand-entered MIDI melodies or recordings of a cappella singing. Tunebot is one such system. In this paper, we compare search results using keys built from two automated melody extraction systems to those gathered using two populations of humans: local paid singers and Amazon Mechanical Turk workers.

  • PulaCloud: Using Human Computation to Enable Development at the Bottom of the Economic Ladder
    Andrew Schriner (University of Cincinnati); Daniel Oerther (Missouri University of Science and Technology); James Uber (University of Cincinnati)

    This research aims to explore how Human Computation can be used to aid economic development in communities experiencing extreme poverty throughout the world. Work is ongoing with a community in rural Kenya to connect them to employment opportunities through a Human Computation system. A feasibility study has been conducted in the community using the 3D protein folding game Foldit and Amazon’s Mechanical Turk. Feasibility has been confirmed and obstacles identified. Current work includes a pilot study doing image analysis for two research projects and developing a GUI that is usable by workers with little computer literacy. Future work includes developing effective incentive systems that operate both at the individual level and the group level and integrating worker accuracy evaluation, worker compensation, and result-credibility evaluation.

  • Towards Large-Scale Processing of Simple Tasks with Mechanical Turk
    Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin, Hari Simons (Yelp, Inc.)

    Crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) provide inexpensive and scalable workforces for processing simple online tasks. Unfortunately, workers participating in crowdsourcing tend to supply work of inconsistent or low quality. We report on our experiences using AMT to verify hundreds of thousands of local business listings for the online directory Yelp.com. Using expert-verified changes, we evaluate the accuracy of our workforce and present the results of preliminary experiments that work towards filtering low-quality workers and correcting for worker bias. Our report seeks to inform the community of practical and financial constraints that are critical to understanding the problem of quality control in crowdsourcing systems.

  • Learning to Rank From a Noisy Crowd
    Abhimanu Kumar, Matthew Lease (UT Austin)

    We consider how to most effectively use crowd-based relevance assessors to produce training data for learning to rank. This integrates two lines of prior work: studies of unreliable crowd-based binary annotation for binary classification, and studies of aggregating graded relevance judgments from reliable experts for ranking. To model the varying performance of the crowd, we simulate annotation noise with varying magnitude and distributional properties. Evaluation on three LETOR test collections reveals a striking trend contrary to prior studies: single labeling outperforms consensus methods in maximizing learning rate (relative to annotator effort). We also see surprising consistency of the learning rate across noise distributions, as well as greater challenge with the adversarial case for multi-class labeling.
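
    A sketch of the kind of noise simulation described, perturbing expert graded relevance labels with a controllable error rate and an optional adversarial flavor (the label scale and parameters are illustrative, not those used in the paper):

      import random

      GRADES = [0, 1, 2, 3, 4]   # graded relevance scale (illustrative)

      def add_noise(true_grade, error_rate=0.3, adversarial=False, rng=random):
          """Return a noisy crowd label for one document.
          error_rate: probability that this particular label is corrupted.
          adversarial: corrupted labels mirror the true grade instead of drifting by one step."""
          if rng.random() > error_rate:
              return true_grade
          if adversarial:
              return GRADES[-1] - true_grade
          shift = rng.choice([-1, 1])
          return min(max(true_grade + shift, GRADES[0]), GRADES[-1])

      random.seed(0)
      print([add_noise(g, error_rate=0.3) for g in [4, 3, 0, 2, 1]])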