## Sunday, June 26, 2011

### Extreme value theory 101, or Newsweek researching minimum wage on Mechanical Turk

Last week, Newsweek published an article titled The Real Minimum Wage. The authors report that "in a weeks-long experiment, we posted simple, hourlong jobs (listening to audio recordings and counting instances of a specific keyword) and continually lowered our offer until we found the absolute bottom price that multiple people would accept, and then complete the task."

The results "showed" that Americans are the ones willing to accept the lowest possible salary for working on a task, compared even to people in India, Romania, Philippines, etc. In fact, they found the that there are Americans willing to work for 25 cents per hour, while they could not find anyone willing to work for less than \$1/hr in any other country. The conclusion of the article? Americans are more desperate than anyone else in the world. What is the key problem of this study? There are many more US-based workers on Mechanical Turk compared to other nationalities. So, if you have a handful of workers from other countries, and hundreds of workers from the US, you are guaranteed to find more extreme findings for the US. Why? To put it simply, you are searching harder within the US to find small values, compared to the effort placed on other countries. (There are other issues as well, e.g., workers that would work on this task are not necessarily representative of the overall population; the same workers are exposed to multiple, decreasing salaries, issues of anchoring, issues of workers falsely reporting to be from the US, whether the authors checked IP geo-location, etc. While all these are valid concerns, they are secondary to the very basic statistical problem.) Finding a Minimum Value: A Probabilistic Approach On an abstract, statistical level, by testing workers from multiple countries, to determine their minimum wage, we sample multiple "minimum wage distributions" trying to find the smallest value within each one of them. Each probability distribution corresponds to the minimum wages that workers from different countries are willing to accept. Let's call the CDF's of distributions$F_i(x)$, with, say,$F_1(x)$being the distribution for minimum wages for US,$F_2(x)$for India,$F_3(x)$for UK, etc etc. As an simplifying example, assume that$F(x)$is a uniform distribution, with minimum value \$0 and a maximum value \$10, for an average acceptable minimum wage of \$5. This means that:
• 10% of the population will accept a minimum wage below \$1, (i.e.,$F(\$1)=0.1$)
• 20% of the population will accept a minimum wage below \$2, (i.e.,$F(\$2)=0.2$)
• ...
• 90% of the population will accept a minimum wage below \$9, (i.e.,$F(\$9)=0.9$)
• 100% of the population will accept a minimum wage below \$10, (i.e.,$F(\$10)=1.0$)

Now, let's assume that we sample $n$ workers from one of the country-specific distributions. After running the experiment, we get back measurements $x_1, \ldots, x_n$, each one corresponding to the minimum wage for each of the workers that participated in the study, who comes from the country that we are measuring.

What is the probability of one of these wages being below, say, $z=\$0.25$? Here is the probability calculation:$\begin{eqnarray}
Pr(\mathit{min~wage} < z) &=& 1 - Pr(\mathit{all~wages} \geq z)\\
& =& 1 - Pr(x_1 \geq z, \ldots, x_n \geq z)
\end{eqnarray}$Assuming independence across the sampled values, we have:$\begin{eqnarray}
Pr(\mathit{min~wage} < z) &=& 1 - \prod_{i=1}^n Pr(x_i \geq z) \\
& =& 1 - \left(1 - F(z) \right)^n
\end{eqnarray}$So, if we sample$n$workers, set the minimum wage at$z=0.25$, and assume uniform distribution for$F$, then$F(\$0.25)=0.025$ and the probability that we will find at least one worker willing to work for 25 cents is:

$Pr(\mathit{min~wage} < z) = 1 - 0.975^n$

Plotting this, as a function of $n$, we have the following:

As we get more and more workers, the more likely it is to find a value that will be at or below 25 cents/hour.

So, how this approach explains the findings of Newsweek?

We know that all countries are not equally represented on Mechanical Turk. Most workers are from the US (50% or so), followed by India (35% or so), and then by Canada (2%), UK (2%), Philippines (2%), and a variety of other countries with similarly small percentages. This means that in the study, we expect to have more Americans participating, followed by Indians, and then a variety of other countries. So, even if the distribution of minimum wages was identical across all countries, we expect to find lower wages in the country with the largest number of participants.

Since the majority of the workers on Mechanical Turk are from US, followed by India, followed by Canada, and UK, etc, the illustration by Newsweek simply gives us the country of origin of the workers, in reverse order of popularity!

At this point, someone may ask: what happens if the distribution is not uniform but, say, lognormal? (A much more plausible distribution for minimum acceptable wages.) For this specific question, as you can see from the analysis above, this does not make much of a difference: The only thing that we need to know if the value of $F(z)$ for the $z$ value of interest.

Going in depth: Extreme Value Theory

A more general question is: What is the expected maximum (or minimum) value that we expect to find when we sample from an arbitrary distribution? This is the topic of extreme value theory, a field in statistics that tries to predict the probability of extreme events (e.g., what is the possible biggest possible drop in the stock market? what is the biggest rainfall in this region?) Given the events in the financial markets in 2008, this theory has received significant attention in the last few years.

What is nice about this theory is that the fundamentals can be summarized very succinctly. The Fisher–Tippett–Gnedenko theorem states that, if we sample from a distribution, the maximum values that we expect to find will be a random variable, belonging to one of the three distributions:
• If the distribution from which we are sampling has a tail that decreases exponentially (e.g., normal distribution, exponential, Gamma, etc), then the maximum value is described by the (reversed) Gumbel distribution (aka "type I extreme value distribution")
• If the distribution from which we are sampling has a tail that decreases as a polynomial (i.e., has a "long tail") (e.g., power-laws, Cauchy, Student-t, etc), then the maximum value is described by the Frechet distribution (aka "type II extreme value distribution")
• If the distribution from which we are sampling has a tail that is finite (i.e., has a "short tail") (e.g., uniform, Beta, etc), then the maximum follows the (reversed) Weibull distribution (aka "type III extreme value distribution")

The three types of the distributions are all special cases of the generalized extreme value distribution.

This theory has significant applications not only when modeling risk (stock market, weather, earthquakes, etc), but also when modeling decision-making for humans: Often, we model humans as utility maximizers, who are making decisions that maximize their own well-being. This maximum-seeking behavior results often in the distributions described above. I will give a more detailed description in a later blog post.

## Friday, June 24, 2011

### Accepted papers for the 3rd Human Computation Workshop (HCOMP 2011)

We have posted online the schedule for the 3rd Human Computation Workshop (HCOMP 2011), which will be organized as part of AAAI 2011, in San Francisco, on August 8th. The registration fee for participating in the workshop is a pretty modest \$125 for graduate students, and \$155 for other participants. Just make sure to register before July 1st to get these rates, as afterwards the rates jump to \$165 and \$185. I should also mention that, following the tradition established in Paris in HCOMP 2009, we will have a group dinner for all the participants after the workshop to continue the discussions from the day...

We have a strong program, with 16 long papers accepted, and 16 papers being presented as demos and posters. Below you can find the titles of the papers and their abstracts. The PDF versions of the papers will be posted online by AAAI, after the completion of the conference are available through the AAAI Digital Library. Until then, you can search Google, or just ask the authors for a pre-print. So, if you are interested in crowdsourcing and human computation, we hope to see you there in San Francisco in August!

Long Papers

• Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds
Sudheendra Vijayanarasimhan, Kristen Grauman (UT Austin)

Active learning and crowdsourcing are promising ways to efficiently build up training sets for object recognition, but thus far techniques are tested in artificially controlled settings. Typically the vision researcher has already determined the dataset's scope, the labels actively" obtained are in fact already known, and/or the crowd-sourced collection process is iteratively fine-tuned. We present an approach for *live learning* of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. To address the technical issues such a large-scale system entails, we introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with a hashing-based solution. We demonstrate the approach with experiments of unprecedented scale and autonomy, and show it successfully improves the state-of-the-art for the most challenging objects in the PASCAL benchmark. In addition, we show our detector competes well with popular nonlinear classifiers that are much more expensive to train.

• Robust Active Learning using Crowdsourced Annotations for Activity Recognition
Liyue Zhao, Gita Sukthankar (UCF); Rahul Sukthankar (Google Research/CMU)

Recognizing human activities from wearable sensor data is an important problem, particularly for health and eldercare applications. However, collecting sufficient labeled training data is challenging, especially since interpreting IMU traces is difficult for human annotators. Recently, crowdsourcing through services such as Amazon's Mechanical Turk has emerged as a promising alternative for annotating such data, with active learning serving as a natural method for affordably selecting an appropriate subset of instances to label. Unfortunately, since most active learning strategies are greedy methods that select the most uncertain sample, they are very sensitive to annotation errors (which corrupt a significant fraction of crowdsourced labels). This paper proposes methods for robust active learning under these conditions. Specifically, we make three contributions: 1) we obtain better initial labels by asking labelers to solve a related task; 2) we propose a new principled method for selecting instances in active learning that is more robust to annotation noise; 3) we estimate confidence scores for labels acquired from MTurk and ask workers to relabel samples that receive low scores under this metric. The proposed method is shown to significantly outperform existing techniques both under controlled noise conditions and in real active learning scenarios. The resulting method trains classifiers that are close in accuracy to those trained using ground-truth data.

• Beat the Machine: Challenging workers to find the unknown unknowns
Josh Attenberg, Panos Ipeirotis, Foster Provost (NYU)

This paper presents techniques for gathering data that expose errors of automatic classification models. Prior work has demonstrated the promise of having humans seek training data, as an alternative to active learning, in cases where there is extreme class imbalance. We now explore the direction where we ask humans to identify cases what will cause the classification system to fail. Such techniques are valuable in revealing problematic cases that do not reveal themselves during the normal operation of the system, and may include cases that are rare but catastrophic. We describe our approach for building a system to satisfy this requirements, trying to encourage humans to provide us with such data points. In particular, we reward a human when the provided example is difficult for the model to handle, and the reward is proportional to the magnitude of the error. In a sense, the humans are asked to ''Beat the Machine'' and find cases where the automatic model (''the machine'') is wrong. Our experimental data show that the density of the identified problems is an order of magnitude higher compared to alternative approaches, and that the proposed technique can identify quickly the big flaws'' that would typically remain uncovered.

• Human Intelligence Needs Artificial Intelligence
Daniel Weld, Mausam Mausam, Peng Dai (University of Washington)

Crowdsourcing platforms, such as Amazon Mechanical Turk, have enabled the construction of scalable applications for tasks ranging from product categorization and photo tagging to audio transcription and translation. These vertical applications are typically realized with complex, self-managing workflows that guarantee quality results. But constructing such workflows is challenging, with a huge number of alternative decisions for the designer to consider. Artificial intelligence methods can greatly simplify the process of creating complex crowdsourced workflows. We argue this thesis by presenting the design of TurKontrol 2.0, which uses machine learning to continually refine models of worker performance and task difficulty. Using these models, TurKontrol 2.0 uses decision-theoretic optimization to 1) choose between alternative workflows, 2) optimize parameters for a workflow, 3) create personalized interfaces for individual workers, and 4) dynamically control the workflow. Preliminary experience suggests that these optimized workflows are significantly more economical than those generated by humans.

• Worker Motivation in Crowdsourcing and Human Computation
Nicolas Kaufmann; Thimo Schulze (University of Mannheim)

Many human computation systems use crowdsourcing markets like Amazon Mechanical Turk to recruit human workers. The payment in these markets is usually very low, and still collected demographic data shows that the participants are a very diverse group including highly skilled full time workers. Many existing studies on their motivation are rudimental and not grounded on established motivation theory. Therefore, we adapt different models from classic motivation theory, work motivation theory and Open Source Software Development to crowdsourcing markets. The model is tested with a survey of 431 workers on Mechanical Turk. We find that the extrinsic motivational categories (immediate payoffs, delayed payoffs, social motivation) have a strong effect on the time spent on the platform. For many workers, however, intrinsic motivation aspects are more important, especially the different facets of enjoyment based motivation like “task autonomy” and “skill variety”. Our contribution is a preliminary model based on established theory intended for the comparison of different crowdsourcing platforms.

• Honesty in an Online Labor Market
Winter Mason, Siddharth Suri, Daniel Goldstein (Yahoo! Research)

The efficient functioning of markets and institutions assume a certain degree of honesty from participants. In labor markets, for instance, employers benefit from employees who will render meaningful work, and employees benefit from employers who will pay the promised amount for services rendered. We use an established method for detecting dishonest behavior in a series of experiments conducted on \amt, a popular online labor market. Our first experiment estimates a baseline amount of dishonesty for this task in the population sample. The second experiment tests the hypothesis that the level of dishonesty in the population will be sensitive to the relative amount that can be gained by dishonest reporting, and the third experiment, manipulates the degree to which dishonest reporting can be detected at the individual level. We conclude with a demographic and cross-cultural analysis of the predictors of dishonest reporting in this market.

• Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection
David Chen (UT Austin); William Dolan (Microsoft Research)

Traditional methods of collecting translation and paraphrase data are prohibitively expensive, making constructions of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. We discuss a novel annotation task that uses videos as the stimulus which discourages cheating. It also only requires monolingual speakers, thus making it easier to scale since more workers are qualified to contribute. Finally, we employed a multi-tiered payment system that helps retain good workers over the long-term, resulting in a persistent, high-quality workforce. We present the results of one of the largest linguistic data collection efforts using Mechanical Turk, yielding 85K English sentences and more than 1k sentences for each of a dozen more languages.

• CrowdSight: Rapidly Prototyping Intelligent Visual Processing Apps
Mario Rodriguez (UCSC); James Davis

We describe a framework for rapidly prototyping applications which require intelligent visual processing, but for which there does not yet exist reliable algorithms, or for which engineering those algorithms is too costly. The framework, CrowdSight, leverages the power of crowdsourcing to offload intelligent processing to humans, and enables new applications to be built quickly and cheaply, affording system builders the opportunity to validate a concept before committing significant time or capital. Our service accepts requests from users either via email or simple mobile applications, and handles all the communication with a backend human computation platform. We build redundant requests and data aggregation into the system freeing the user from managing these requirements. We validate our framework by building several test applications and verifying that prototypes can be built more easily and quickly than would be the case without the framework.

• Digitalkoot: Making Old Archives Accessible Using Crowdsourcing
Otto Chrons, Sami Sundell (Microtask)

In this paper, we present Digitalkoot, a system for fixing errors in the Optical Character Recognition (OCR) process of old texts through the use of human computation. By turning the work into simple games, we are able to attract a great number of volunteers to donate their time and cognitive capacity for the cause. Our analysis shows how untrained people can reach very high accuracy through the use of crowdsourcing. Furthermore we analyze the effect of social media and gender on participation levels and the amount of work accomplished.

• Error Detection and Correction in Human Computation: Lessons from the WPA
David Alan Grier (GWU)

Human Computation is, of course, a very old field with a forgotten literature that treats many of the key problems, especially error detection and correction. The obvious methods of error detection, duplicate calculation, have proven to be subject to Babbage's Rule: Different workers using the same methods on the same data will tend to make the same errors. To avoid the consequences of this rule, early human computers developed a disciplined regimen to identify and correct mistakes. This paper reconstructs those methods, puts them in a modern context and identifies their implications for the modern version of human computation.

• Programmatic gold: targeted and scalable quality assurance in crowdsourcing
Dave Oleson, Vaughn Hester, Alex Sorokin, Greg Laughlin, John Le, Lukas Biewald (CrowdFlower)

Crowdsourcing is an effective tool for scalable data annotation in both research and enterprise contexts. Due to crowdsourcing's open participation model, quality assurance is critical to the success of any project. Present methods rely on EM-style post-processing or manual annotation of large gold standard sets. In this paper we present an automated quality assurance process that is inexpensive and scalable. Our novel process relies on programmatic gold creation to provide targeted training feedback to workers and to prevent common scamming scenarios. We find that it decreases the amount of manual work required to manage crowdsourced labor while improving the overall quality of the results.

• An Iterative Dual Pathway Structure for Speech-to-Text Transcription
Beatrice Liem, Haoqi Zhang, Yiling Chen (Harvard University)

In this paper, we develop a new human computation algorithm for speech-to-text transcription that can potentially achieve the high accuracy of professional transcription using only microtasks deployed via an online task market or a game. The algorithm partitions audio clips into short 10-second segments for independent processing and joins adjacent outputs to produce the full transcription. Each segment is sent through an iterative dual pathway structure that allows participants in either path to iteratively refine the transcriptions of others in their path while being rewarded based on transcriptions in the other path, eliminating the need to check transcripts in a separate process. Initial experiments with local subjects show that produced transcripts are on average 96.6% accurate.

• An Extendable Toolkit for Managing Quality of Human-based Electronic Services
David Bermbach, Robert Kern, Pascal Wichmann, Sandra Rath, Christian Zirpins (KIT)

Micro-task markets like Amazon MTurk enable online workers to provide human intelligence as Web-based on demand services (so called people services). Businesses facing large amounts of knowledge work can benefit from increased flexibility and scalability of their workforce but need to cope with reduced control of result quality. While this problem is well recognized, it is so far only rudimentarily addressed by existing platforms and tools. In this paper, we present a flexible research toolkit which enables experiments with advanced quality management mechanisms for generic micro-task markets. The toolkit enables control of correctness and performance of task fulfillment by means of dynamic sampling, weighted majority voting and worker pooling. We demonstrate its application and performance for an OCR scenario building on Amazon MTurk. The toolkit however enables the development of advanced quality management mechanisms for a large variety of people service scenarios and platforms.

• What’s the Right Price? Pricing Tasks for Finishing on Time
Siamak Faridani, Bjoern Hartmann (UC Berkeley); Panos Ipeirotis (NYU)

Many practitioners currently use rules of thumb to price tasks on online labor markets. Incorrect pricing leads to task starvation or inefficient use of capital. Formal optimal pricing policies can address these challenges. In this paper we argue that an optimal pricing policy must be based on the tradeoff between price and desired completion time. We show how this duality can lead to a better pricing policy for tasks in online labor markets. This paper makes three contributions. First, we devise an algorithm for optimal job pricing using a survival analysis model. We then show that worker arrivals can be modeled as a non-homogenous Poisson Process (NHPP). Finally using NHPP for worker arrivals and discrete choice models we present an abstract mathematical model that captures the dynamics of the market when full market information is presented to the task requester. This model can be used to predict completion times and optimal pricing policies for both public and private crowds.

• Pricing Mechanisms for Online Labor Market
Yaron Singer, Manas Mittal (UC Berkeley EECS)

In online labor markets, determining the appropriate incentives is a difficult problem. In this paper, we present dynamic pricing mechanisms for determining the optimal prices for such tasks. In particular, the mechanisms are designed to handle the intricacies of the markets like mechanical turk (workers are coming online, requesters have budgets, etc.). The mechanisms have desirable theoretical guarantees (incentive compatibility, budget feasibility, and competitive ration performance) and perform well in practice. Experiments demonstrate the effectiveness and feasibility of using such mechanisms in practice.

• Labor Allocation in Paid Crowdsourcing: Experimental Evidence on Positioning, Nudges and Prices
John Horton (ODesk); Dana Chandler (MIT)

This paper reports the results of a natural field experiment where workers from a paid crowdsourcing environment self-select into tasks and are presumed to have limited attention. In our experiment, workers labeled any of six pictures from a 2 x 3 grid of thumbnail images. In the absence of any incentives, workers exhibit a strong default bias and tend to select images from the top-left (focal'') position; the bottom-right (non-focal'') position, was the least preferred. We attempted to overcome this bias and increase the rate at which workers selected the least preferred task, by using a combination of monetary and non-monetary incentives. We also varied the saliency of these incentives by placing them in either the focal or non-focal position. Although both incentive types caused workers to re-allocate their labor, monetary incentives were more effective. Most interestingly, both incentive types worked better when they were placed in the focal position and made more salient. In fact, salient non-monetary incentives worked about as well as non-salient monetary ones. Our evidence suggests that user interface and cognitive biases play an important role in online labor markets and that salience can be used by employers as a kind of incentive multiplier''.

Posters

• Developing Scripts to Teach Social Skills: Can the Crowd Assist the Author?
Fatima Boujarwah, Jennifer Kim, Gregory Abowd, Rosa Arriaga (Georgia Tech)

The social world that most of us navigate effortlessly can prove to be a perplexing and disconcerting place for individuals with autism. Currently there are no models to assist non-expert authors as they create customized social script-based instructional modules for a particular child. We describe an approach to using human computation to develop complex models of social scripts for a plethora of complex and interesting social scenarios, possible obstacles that may arise in those scenarios, and potential solutions to those obstacles. Human input is the natural way to build these models, and in so doing create valuable assistance for those trying to navigate the intricacies of a social life.

• CrowdLang - First Steps Towards Programmable Human Computers for General Computation
Patrick Minder, Abraham Bernstein (University of Zurich)

Crowdsourcing markets such as Amazon’s Mechanical Turk provide an enormous potential for accomplishing work by combining human and machine computation. Today crowdsourcing is mostly used for massive parallel information processing for a variety of tasks such as image labeling. However, as we move to more sophisticated problem-solving there is little knowledge about managing dependencies between steps and a lack of tools for doing so. As the contribution of this paper, we present a concept of an executable, model-based programming language and a general purpose framework for accomplishing more sophisticated problems. Our approach is inspired by coordination theory and an analysis of emergent collective intelligence. We illustrate the applicability of our proposed language by combining machine and human computation based on existing interaction patterns for several general computation problems.

• Ranking Images on Semantic Attributes using CollaboRank
Jeroen Janssens, Eric Postma, Jaap Van den Herik (Tilburg University)

In this paper, we investigate to what extent a large group of human workers is able to produce collaboratively a global ranking of images, based on a single semantic attribute. To this end, we developed CollaboRank, which is a method that formulates and distributes tasks to human workers, and aggregates their personal rankings into a global ranking. Our results show that a relatively high consensus can be achieved, depending on the type of the semantic attribute.

• Artificial Intelligence for Artificial Artificial Intelligence
Peng Dai, Mausam, Daniel Weld (University of Washington)

Crowdsourcing platforms such as Amazon Mechanical Turk have become popular for a wide variety of human intelligence tasks; however, quality control continues to be a significant challenge. Recently, Dai et al (2010) propose TurKontrol, a theoretical model based on POMDPs to optimize iterative, crowd-sourced workflows. However, they neither describe how to learn the model parameters, nor show its effectiveness in a real crowd-sourced setting. Learning is challenging due to the scale of the model and noisy data: there are hundreds of thousands of workers with high-variance abilities. This paper presents an end-to-end system that first learns TurKontrol's POMDP parameters from real Mechanical Turk data, and then applies the model to dynamically optimize live tasks. We validate the model and use it to control a successive-improvement process on Mechanical Turk. By modeling worker accuracy and voting patterns, our system produces significantly superior artifacts compared to those generated through static workflows using the same amount of money.

• One Step beyond Independent Agreement: A Tournament Selection Approach for Quality Assurance of Human Computation Tasks
Yu-An Sun, Shourya Roy (Xerox); Greg Little (MIT CSAIL)

Quality assurance remains a key topic in the human computation research field. Prior work indicates that independent agreement is effective for low difficulty tasks, but has limitations. This paper addresses this problem by proposing a tournament selection based quality control process. The experimental results from this paper show that humans are better at identifying the correct answers than generating them.

• Turkomatic: Automatic, Recursive Task and Workflow Design for Mechanical Turk
Anand Kulkarni, Matthew Can, Bjoern Hartmann (UC Berkeley)

On today’s human computation systems, designing tasks and workflows is a difficult and labor-intensive process. Can workers from the crowd be used to help plan workflows? We explore this question with Turkomatic, a new interface to microwork platforms that uses crowd workers to help plan workflows for complex tasks. Turkomatic uses a general-purpose divide-and-conquer algorithm to solve arbitrary natural-language requests posed by end users. The interface includes a novel real-time visual workflow editor that enables requesters to observe and edit workflows while the tasks are being completed. Crowd verification of work and the division of labor among members of the crowd can be handled automatically by Turkomatic, which substantially simplifies the process of using human computation systems. These features enable a novel means of interaction with crowds of online workers to support successful execution of complex work.

• MuSweeper: Collect Mutual Exclusions with Extensive Game
Tao-Hsuan Chang, Cheng-wei Chan, Jane Yung-jen Hsu (National Taiwan University)

Mutual exclusions are important information for machine learning. Games With A Purpose (or GWAP) provide an effective way to get large amount of data from web users. This research proposes MuSweeper, a minesweeper-like game, to collect mutual exclusions. By embedding game theory into game mechanism, the precision is guaranteed. Experiment showed MuSweeper can efficiently collect mutual exclusions with high precision.

• MobileWorks: A Mobile Crowdsourcing Platform for Workers at the Bottom of the Pyramid
Prayag Narula, Philipp Gutheim, David Rolnitzky, Anand Kulkarni, Bjoern Hartmann (UC Berkeley)

We present MobileWorks, a mobile phone-based crowdsourcing platform. MobileWorks targets workers in developing countries who live at the bottom of the economic pyramid. This population does not have access to desktop computers, so existing microtask labor markets are inaccessible to them. MobileWorks offers human OCR tasks that can be accomplished on low-end mobile phones; workers access it through their mobile web browser. To address the limited screen resolution available on low-end phones, MobileWorks segments documents into many small pieces, and sends each piece to a different worker. A first pilot study with 10 users over a period of 2 months revealed that it is feasible to do simple OCR tasks using simple Mobile Web based application. We found that on an average the workers do the tasks at 120 tasks per hour. Using single entry the accuracy of workers across the different documents is 89% . We propose a multiple entry solution which increases the theoretical accuracy of the OCR to more than 99%.

• Towards Task Recommendation in Micro-Task Markets
Vamsi Ambati, Stephan Vogel, Jaime Carbonell (CMU)

As researchers embrace micro-task markets for eliciting human input, the nature of the posted tasks moves from those requiring simple mechanical labor to requiring specific cognitive skills. On the other hand, increase is seen in the number of such tasks and the user population in micro-task market places requiring better search interfaces for productive user participation. In this paper we posit that understanding user skill sets and presenting them with suitable tasks not only maximizes the over quality of the output, but also attempts to maximize the benefit to the user in terms of more successfully completed tasks. We also implement a recommendation engine for suggesting tasks to users based on implicit modeling of skills and interests. We present results from a preliminary evaluation of our system using publicly available data gathered from a variety of human computation experiments recently conducted on Amazon's Mechanical Turk.

• On Quality Control and Machine Learning in Crowdsourcing
Matthew Lease (UT Austin)

The advent of crowdsourcing has created a variety of new opportunities for improving upon traditional methods of data collection and annotation. This in turn has created intriguing new opportunities for data-driven machine learning (ML). Convenient access to crowd workers for simple data collection has further generalized to leveraging more arbitrary crowd-based human computation to supplement ML. While new potential applications of crowdsourcing continue to emerge, a variety of practical and sometimes unexpected obstacles have already limited the degree to which its promised potential can be actually realized in practice. This paper considers two particular aspects of crowdsourcing and their interplay, data quality control (QC) and ML, reflecting on where we have been, where we are, and where we might go from here.

• CollabMap: Augmenting Maps using the Wisdom of Crowds
Ruben Stranders, Sarvapali Ramchurn, Bing Shi, Nicholas Jennings (University of Southampton)

In this paper we develop a novel model of geospatial data creation, called CollabMap, that relies on human computation. CollabMap is a crowdsourcing tool to get users contracted via Amazon Mechanical Turk or a similar service to perform micro-tasks that involve augmenting existing maps (e.g. GoogleMaps or Ordnance Survey) by drawing evacuation routes, using satellite imagery from GoogleMaps and panoramic views from Google Street View. We use human computation to complete tasks that are hard for a computer vision algorithm to perform or to generate training data that could be used by a computer vision algorithm to automatically define evacuation routes.

• Improving Consensus Accuracy via Z-score and Weighted Voting
Hyun Joon Jung, Matthew Lease (UT Austin)

We describe a Z-score based outlier detection method for detection and filtering of inaccurate crowd workers. After filtering, we aggregate labels from remaining workers via simple majority voting or feature-weighted voting. Both su-pervised and unsupervised features are used, individually and in combination, for both outlier detection and weighted voting. We evaluate on noisy judgments collected from Amazon Mechanical Turk which assess Websearch relevance of query/document pairs. We find that filtering in combination with multi-feature weighted voting achieves 8.94% relative error reduction for graded accuracy (4.25% absolute) and 5.32% for binary accuracy (3.45% absolute).

• Making Searchable Melodies: Human vs. Machine
Mark Cartwright, Zafar Rafii, Jinyu Han, Bryan Pardo (Northwestern University)

Systems that find music recordings based on hummed or sung, melodic input are called Query-By-Humming (QBH) systems. Such systems employ search keys that are more similar to a cappella singing than the original recordings. Successful deployed systems use human computation to create these search keys: hand-entered midi melodies or recordings of a cappella singing. Tunebot is one such system. In this paper, we compare search results using keys built from two automated melody extraction system to those gathered using two populations of humans: local paid sing-ers and Amazon Turk workers.

• PulaCloud: Using Human Computation to Enable Development at the Bottom of the Economic Ladder
Andrew Schriner (University of Cincinnati); Daniel Oerther (Missouri University of Science and Technology); James Uber (University of Cincinnati)

This research aims to explore how Human Computation can be used to aid economic development in communities experiencing extreme poverty throughout the world. Work is ongoing with a community in rural Kenya to connect them to employment opportunities through a Human Computation system. A feasibility study has been conducted in the community using the 3D protein folding game Foldit and Amazon’s Mechanical Turk. Feasibility has been confirmed and obstacles identified. Current work includes a pilot study doing image analysis for two research projects and developing a GUI that is usable by workers with little computer literacy. Future work includes developing effective incentive systems that operate both at the individual level and the group level and integrating worker accuracy evaluation, worker compensation, and result-credibility evaluation.

• Towards Large-Scale Processing of Simple Tasks with Mechanical Turk
Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin, Hari Simons (Yelp, inc.)

Crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) provide inexpensive and scalable workforces for processing simple online tasks. Unfortunately, workers participating in crowdsourcing tend to supply work of inconsistent or low quality. We report on our experiences using AMT to verify hundreds of thousands of local business listings for the online directory Yelp.com. Using expert-verified changes, we evaluate the accuracy of our workforce and present the results of preliminary experiments that work towards filtering low-quality workers and correcting for worker bias. Our report seeks to inform the community of practical and financial constraints that are critical to understanding the problem of quality control in crowdsourcing systems.

• Learning to Rank From a Noisy Crowd
Abhimanu Kumar, Matthew Lease (UT Austin)

We consider how to most effectively use crowd-based relevance assessors to produce training data for learning to rank. This integrates two lines of prior work: studies of unreliable crowd-based binary annotation for binary classification, and studies for aggregating graded relevance judgments from reliable experts for ranking. To model varying performance of the crowd, we simulate annotation noise with varying magnitude and distributional properties. Evaluation on three LETOR test collections reveals a striking trend contrary to prior studies: single labeling outperforms consensus methods in maximizing learner rate (relative to annotator effort). We also see surprising consistency of learning rate across noise distributions, as well as greater challenge with the adversarial case for multi-class labeling.

## Monday, June 20, 2011

### Crowdsourcing and the discovery of a hidden treasure

A few months back, I started advising Tagasauris, a company that provides media annotation services, using crowdsourcing.

This month, Tagasauris is featured in a Wired article, titled "Hidden Treasure". It is a story of rediscovering a "lost" set of photos, from the shooting of the movie "American Graffiti". You can see the article by clicking the image:

Since there are some interesting aspects of the story, which go beyond the simple "tag using MTurk" story, I would like to give a few more details that I consider interesting.

Magnum Photos

One of the clients of Tagasauris is Magnum Photos, a cooperative owned by its own photographer members, designated to handle the commercial aspect of their own work. The list of members of Magnum Photos include photographers such as Robert Capa, Henri Cartier-Bresson, David SeymourGeorge Rodger, Steve McCurry, and many others. (See their Wikipedia entry for further details.) A few photos in the Magnum Photos archive that you may recognize:

One of my favorite parts of the Magnum website is the Archival Calendar, where they have a set of photos showcasing various historic events. Beats Facebook browsing by a wide margin. But let's get back to the story.

The problem

So, what is the problem of Magnum Photos? The same problem that almost every single big media company faces: a very large number of media objects without useful, descriptive metadata. No keywords, no description, nothing to aid the discovery process. Just the image file and mechanical data about film number etc. (Well, my own photo archive looks very similar...)

This lack of metadata is the case not only for the archive but also for the new, incoming photos that arrive every day from its members. (To put it mildly, photographers are not exactly eager to sit, tag, and describe the hundreds of photos they shoot every day.) This means that a large fraction of the Magnum Photos archive, which contains millions of photos, is virtually unsearchable. The photos are effectively lost in the digital world, even though they are digitized and available on the Internet.

An example of such case of "lost" photos is a set of photos from the shooting of the movie "American Graffitti". People at Magnum Photos knew that one of their photographers, Dennis Stock who died in 2009, was on set during the production of the movie, and he had taken photos of the, then young and unknown, members of the team. Magnum Photos had no idea where these photos were. They knew they digitized the archive of Dennis Stock, they knew that the photos are in the archive, but nobody could locate the photos within the millions of other, untagged photos.

For those unfamiliar with the movie, American Graffiti is a 1973 film, by George Lucas (pre-Star Wars), with starring actors the, then unknowns, Richard DreyfussRon HowardPaul Le MatCharles Martin Smith,Cindy WilliamsCandy ClarkMackenzie Phillips and Harrison Ford. The latter shot to stardom of all the actors makes the movie almost a cult.

The Magnum Photos archive is a trove of similar "hidden treasures". Sitting there, waiting for some accidental, serendipitous discovery.

The tagging solution and the machine support

Magnum Photos had its own set of annotators. However, the annotators could not even catch up even with the volume of incoming photos. The task of going back and annotating the archive was an even more daunting task. This meant lost revenue for Magnum Photos, as if you cannot find a photo, you cannot license it, and you cannot sell it.

Tagasauris proposed to solve the problem using crowdsourcing. With hundreds of workers working in parallel, it became possible to tame the influx of untagged incoming photos, and start going backwards and tagging the archive.

Of course, vanilla photo tagging is not a solution. Workers type misspelled words (named entities are systematic offenders), try to get away with generic tags, etc. Following the lessons learned from ESP Game, and all the subsequent studies, Tagasauris built solutions for cleaning the tags, rewarding specificity, and, in general, clean up and ensure high-quality for the noisy tagging process.

A key component was the ability to match the tags entered by the workers with named entities, which themselves were then connected to Freebase entities.

The result? When workers were tagging the photos from Magnum Photos, they identified the actors in the shots, and the machine process in the background assigned "semantic tags" to the photos, such as [George Lucas], [Richard Dreyfuss], [Ron Howard], [Mackenzie Phillips], [Harrison Ford] and others.

Yes, humans + machines generate things that are better than the sum of the parts.

The machine support, cont.

So, how the workers discovered the photos from American Graffiti? As you may imagine, the workers had no idea that the photos that they were tagging were from the shooting of the film. They could identify the actors, but that was it.

Going from actor tagging to understanding the context of the photo shooting, is a task that cannot be required by layman, non-expert taggers. You need experts that can "connect the dots". Unfortunately, subject experts are expensive. And they tend not to be interested in tedious tasks, such as assigning tags to photos.

However, this "connecting the dots" is a task where machines are better than humans. We have recently seen how Watson, by having access to semantically connected ontologies (often generated by humans), could identify the correct answers to a wide variety of questions.

Tagasauris employed a similar strategy. Knowing the entities that appear in a set of photos, it is then possible to identify additional metadata. For example, look at the five actors that were identified in the photos (red boxes, with white background), and the associated semantic graph that links the different entities together:

Bingo! The entity that connects together the different entities is the entity "American Graffiti", which was not used by any worker.

At this point, you can understand how the story evolved. A graph activation/spreading algorithm suggests the tag, experts can verify it, and the rest is history.

Meagan Young looked at the stream of incoming photos, noticed the American Graffiti tag, realized that the "lost" photos were found, and she notified the others at Magnum Photos and Todd Carter, the CEO of Tagasauris. The "hidden treasure" was identified, and the Wired story was underway...

Crowdsourcing: It is not just about the humans

This is not a story to show how cool discovery based on linked entities is. This is old news for many people that work with such data. However, this is a simple example of using crowdsourcing in a more intelligent way that it is currently being used. Machines cannot do everything (in fact, they are especially bad in tasks that are "trivial" for humans) but when humans provide enough input, the machines can take it from there, and improve significantly the overall process.

Someone can even see the obvious next step: Use face recognition and allow tagging to be done collaboratively with humans and machines. Google and Facebook have very advanced algorithms for face recognition. Match them intelligently with humans, and you are way ahead of solutions that rely simply on humans to tag faces.

I think the lesson is clear: Let humans do what they do best, and let machines do what they do best. (And expect the balance to change as we move forward and machines can do more.) Undoing and ignoring decades of research in computer science, just because it is easier to use cheap labor, is a disservice not only to computer science. It is a disservice to the potential of crowdsourcing as well.