## Friday, June 24, 2011

### Accepted papers for the 3rd Human Computation Workshop (HCOMP 2011)

We have posted online the schedule for the 3rd Human Computation Workshop (HCOMP 2011), which will be organized as part of AAAI 2011, in San Francisco, on August 8th. The registration fee for participating in the workshop is a pretty modest \$125 for graduate students, and \$155 for other participants. Just make sure to register before July 1st to get these rates, as afterwards the rates jump to \$165 and \$185. I should also mention that, following the tradition established in Paris in HCOMP 2009, we will have a group dinner for all the participants after the workshop to continue the discussions from the day...

We have a strong program, with 16 long papers accepted, and 16 papers being presented as demos and posters. Below you can find the titles of the papers and their abstracts. The PDF versions of the papers will be posted online by AAAI, after the completion of the conference are available through the AAAI Digital Library. Until then, you can search Google, or just ask the authors for a pre-print. So, if you are interested in crowdsourcing and human computation, we hope to see you there in San Francisco in August!

Long Papers

• Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds
Sudheendra Vijayanarasimhan, Kristen Grauman (UT Austin)

Active learning and crowdsourcing are promising ways to efficiently build up training sets for object recognition, but thus far techniques are tested in artificially controlled settings. Typically the vision researcher has already determined the dataset's scope, the labels actively" obtained are in fact already known, and/or the crowd-sourced collection process is iteratively fine-tuned. We present an approach for *live learning* of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. To address the technical issues such a large-scale system entails, we introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with a hashing-based solution. We demonstrate the approach with experiments of unprecedented scale and autonomy, and show it successfully improves the state-of-the-art for the most challenging objects in the PASCAL benchmark. In addition, we show our detector competes well with popular nonlinear classifiers that are much more expensive to train.

• Robust Active Learning using Crowdsourced Annotations for Activity Recognition
Liyue Zhao, Gita Sukthankar (UCF); Rahul Sukthankar (Google Research/CMU)

Recognizing human activities from wearable sensor data is an important problem, particularly for health and eldercare applications. However, collecting sufficient labeled training data is challenging, especially since interpreting IMU traces is difficult for human annotators. Recently, crowdsourcing through services such as Amazon's Mechanical Turk has emerged as a promising alternative for annotating such data, with active learning serving as a natural method for affordably selecting an appropriate subset of instances to label. Unfortunately, since most active learning strategies are greedy methods that select the most uncertain sample, they are very sensitive to annotation errors (which corrupt a significant fraction of crowdsourced labels). This paper proposes methods for robust active learning under these conditions. Specifically, we make three contributions: 1) we obtain better initial labels by asking labelers to solve a related task; 2) we propose a new principled method for selecting instances in active learning that is more robust to annotation noise; 3) we estimate confidence scores for labels acquired from MTurk and ask workers to relabel samples that receive low scores under this metric. The proposed method is shown to significantly outperform existing techniques both under controlled noise conditions and in real active learning scenarios. The resulting method trains classifiers that are close in accuracy to those trained using ground-truth data.

• Beat the Machine: Challenging workers to find the unknown unknowns
Josh Attenberg, Panos Ipeirotis, Foster Provost (NYU)

This paper presents techniques for gathering data that expose errors of automatic classification models. Prior work has demonstrated the promise of having humans seek training data, as an alternative to active learning, in cases where there is extreme class imbalance. We now explore the direction where we ask humans to identify cases what will cause the classification system to fail. Such techniques are valuable in revealing problematic cases that do not reveal themselves during the normal operation of the system, and may include cases that are rare but catastrophic. We describe our approach for building a system to satisfy this requirements, trying to encourage humans to provide us with such data points. In particular, we reward a human when the provided example is difficult for the model to handle, and the reward is proportional to the magnitude of the error. In a sense, the humans are asked to ''Beat the Machine'' and find cases where the automatic model (''the machine'') is wrong. Our experimental data show that the density of the identified problems is an order of magnitude higher compared to alternative approaches, and that the proposed technique can identify quickly the big flaws'' that would typically remain uncovered.

• Human Intelligence Needs Artificial Intelligence
Daniel Weld, Mausam Mausam, Peng Dai (University of Washington)

Crowdsourcing platforms, such as Amazon Mechanical Turk, have enabled the construction of scalable applications for tasks ranging from product categorization and photo tagging to audio transcription and translation. These vertical applications are typically realized with complex, self-managing workflows that guarantee quality results. But constructing such workflows is challenging, with a huge number of alternative decisions for the designer to consider. Artificial intelligence methods can greatly simplify the process of creating complex crowdsourced workflows. We argue this thesis by presenting the design of TurKontrol 2.0, which uses machine learning to continually refine models of worker performance and task difficulty. Using these models, TurKontrol 2.0 uses decision-theoretic optimization to 1) choose between alternative workflows, 2) optimize parameters for a workflow, 3) create personalized interfaces for individual workers, and 4) dynamically control the workflow. Preliminary experience suggests that these optimized workflows are significantly more economical than those generated by humans.

• Worker Motivation in Crowdsourcing and Human Computation
Nicolas Kaufmann; Thimo Schulze (University of Mannheim)

Many human computation systems use crowdsourcing markets like Amazon Mechanical Turk to recruit human workers. The payment in these markets is usually very low, and still collected demographic data shows that the participants are a very diverse group including highly skilled full time workers. Many existing studies on their motivation are rudimental and not grounded on established motivation theory. Therefore, we adapt different models from classic motivation theory, work motivation theory and Open Source Software Development to crowdsourcing markets. The model is tested with a survey of 431 workers on Mechanical Turk. We find that the extrinsic motivational categories (immediate payoffs, delayed payoffs, social motivation) have a strong effect on the time spent on the platform. For many workers, however, intrinsic motivation aspects are more important, especially the different facets of enjoyment based motivation like “task autonomy” and “skill variety”. Our contribution is a preliminary model based on established theory intended for the comparison of different crowdsourcing platforms.

• Honesty in an Online Labor Market
Winter Mason, Siddharth Suri, Daniel Goldstein (Yahoo! Research)

The efficient functioning of markets and institutions assume a certain degree of honesty from participants. In labor markets, for instance, employers benefit from employees who will render meaningful work, and employees benefit from employers who will pay the promised amount for services rendered. We use an established method for detecting dishonest behavior in a series of experiments conducted on \amt, a popular online labor market. Our first experiment estimates a baseline amount of dishonesty for this task in the population sample. The second experiment tests the hypothesis that the level of dishonesty in the population will be sensitive to the relative amount that can be gained by dishonest reporting, and the third experiment, manipulates the degree to which dishonest reporting can be detected at the individual level. We conclude with a demographic and cross-cultural analysis of the predictors of dishonest reporting in this market.

• Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection
David Chen (UT Austin); William Dolan (Microsoft Research)

Traditional methods of collecting translation and paraphrase data are prohibitively expensive, making constructions of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. We discuss a novel annotation task that uses videos as the stimulus which discourages cheating. It also only requires monolingual speakers, thus making it easier to scale since more workers are qualified to contribute. Finally, we employed a multi-tiered payment system that helps retain good workers over the long-term, resulting in a persistent, high-quality workforce. We present the results of one of the largest linguistic data collection efforts using Mechanical Turk, yielding 85K English sentences and more than 1k sentences for each of a dozen more languages.

• CrowdSight: Rapidly Prototyping Intelligent Visual Processing Apps
Mario Rodriguez (UCSC); James Davis

We describe a framework for rapidly prototyping applications which require intelligent visual processing, but for which there does not yet exist reliable algorithms, or for which engineering those algorithms is too costly. The framework, CrowdSight, leverages the power of crowdsourcing to offload intelligent processing to humans, and enables new applications to be built quickly and cheaply, affording system builders the opportunity to validate a concept before committing significant time or capital. Our service accepts requests from users either via email or simple mobile applications, and handles all the communication with a backend human computation platform. We build redundant requests and data aggregation into the system freeing the user from managing these requirements. We validate our framework by building several test applications and verifying that prototypes can be built more easily and quickly than would be the case without the framework.

• Digitalkoot: Making Old Archives Accessible Using Crowdsourcing

In this paper, we present Digitalkoot, a system for fixing errors in the Optical Character Recognition (OCR) process of old texts through the use of human computation. By turning the work into simple games, we are able to attract a great number of volunteers to donate their time and cognitive capacity for the cause. Our analysis shows how untrained people can reach very high accuracy through the use of crowdsourcing. Furthermore we analyze the effect of social media and gender on participation levels and the amount of work accomplished.

• Error Detection and Correction in Human Computation: Lessons from the WPA
David Alan Grier (GWU)

Human Computation is, of course, a very old field with a forgotten literature that treats many of the key problems, especially error detection and correction. The obvious methods of error detection, duplicate calculation, have proven to be subject to Babbage's Rule: Different workers using the same methods on the same data will tend to make the same errors. To avoid the consequences of this rule, early human computers developed a disciplined regimen to identify and correct mistakes. This paper reconstructs those methods, puts them in a modern context and identifies their implications for the modern version of human computation.

• Programmatic gold: targeted and scalable quality assurance in crowdsourcing
Dave Oleson, Vaughn Hester, Alex Sorokin, Greg Laughlin, John Le, Lukas Biewald (CrowdFlower)

Crowdsourcing is an effective tool for scalable data annotation in both research and enterprise contexts. Due to crowdsourcing's open participation model, quality assurance is critical to the success of any project. Present methods rely on EM-style post-processing or manual annotation of large gold standard sets. In this paper we present an automated quality assurance process that is inexpensive and scalable. Our novel process relies on programmatic gold creation to provide targeted training feedback to workers and to prevent common scamming scenarios. We find that it decreases the amount of manual work required to manage crowdsourced labor while improving the overall quality of the results.

• An Iterative Dual Pathway Structure for Speech-to-Text Transcription
Beatrice Liem, Haoqi Zhang, Yiling Chen (Harvard University)

In this paper, we develop a new human computation algorithm for speech-to-text transcription that can potentially achieve the high accuracy of professional transcription using only microtasks deployed via an online task market or a game. The algorithm partitions audio clips into short 10-second segments for independent processing and joins adjacent outputs to produce the full transcription. Each segment is sent through an iterative dual pathway structure that allows participants in either path to iteratively refine the transcriptions of others in their path while being rewarded based on transcriptions in the other path, eliminating the need to check transcripts in a separate process. Initial experiments with local subjects show that produced transcripts are on average 96.6% accurate.

• An Extendable Toolkit for Managing Quality of Human-based Electronic Services
David Bermbach, Robert Kern, Pascal Wichmann, Sandra Rath, Christian Zirpins (KIT)

Micro-task markets like Amazon MTurk enable online workers to provide human intelligence as Web-based on demand services (so called people services). Businesses facing large amounts of knowledge work can benefit from increased flexibility and scalability of their workforce but need to cope with reduced control of result quality. While this problem is well recognized, it is so far only rudimentarily addressed by existing platforms and tools. In this paper, we present a flexible research toolkit which enables experiments with advanced quality management mechanisms for generic micro-task markets. The toolkit enables control of correctness and performance of task fulfillment by means of dynamic sampling, weighted majority voting and worker pooling. We demonstrate its application and performance for an OCR scenario building on Amazon MTurk. The toolkit however enables the development of advanced quality management mechanisms for a large variety of people service scenarios and platforms.

• What’s the Right Price? Pricing Tasks for Finishing on Time
Siamak Faridani, Bjoern Hartmann (UC Berkeley); Panos Ipeirotis (NYU)

Many practitioners currently use rules of thumb to price tasks on online labor markets. Incorrect pricing leads to task starvation or inefficient use of capital. Formal optimal pricing policies can address these challenges. In this paper we argue that an optimal pricing policy must be based on the tradeoff between price and desired completion time. We show how this duality can lead to a better pricing policy for tasks in online labor markets. This paper makes three contributions. First, we devise an algorithm for optimal job pricing using a survival analysis model. We then show that worker arrivals can be modeled as a non-homogenous Poisson Process (NHPP). Finally using NHPP for worker arrivals and discrete choice models we present an abstract mathematical model that captures the dynamics of the market when full market information is presented to the task requester. This model can be used to predict completion times and optimal pricing policies for both public and private crowds.

• Pricing Mechanisms for Online Labor Market
Yaron Singer, Manas Mittal (UC Berkeley EECS)

In online labor markets, determining the appropriate incentives is a difficult problem. In this paper, we present dynamic pricing mechanisms for determining the optimal prices for such tasks. In particular, the mechanisms are designed to handle the intricacies of the markets like mechanical turk (workers are coming online, requesters have budgets, etc.). The mechanisms have desirable theoretical guarantees (incentive compatibility, budget feasibility, and competitive ration performance) and perform well in practice. Experiments demonstrate the effectiveness and feasibility of using such mechanisms in practice.

• Labor Allocation in Paid Crowdsourcing: Experimental Evidence on Positioning, Nudges and Prices
John Horton (ODesk); Dana Chandler (MIT)

This paper reports the results of a natural field experiment where workers from a paid crowdsourcing environment self-select into tasks and are presumed to have limited attention. In our experiment, workers labeled any of six pictures from a 2 x 3 grid of thumbnail images. In the absence of any incentives, workers exhibit a strong default bias and tend to select images from the top-left (focal'') position; the bottom-right (non-focal'') position, was the least preferred. We attempted to overcome this bias and increase the rate at which workers selected the least preferred task, by using a combination of monetary and non-monetary incentives. We also varied the saliency of these incentives by placing them in either the focal or non-focal position. Although both incentive types caused workers to re-allocate their labor, monetary incentives were more effective. Most interestingly, both incentive types worked better when they were placed in the focal position and made more salient. In fact, salient non-monetary incentives worked about as well as non-salient monetary ones. Our evidence suggests that user interface and cognitive biases play an important role in online labor markets and that salience can be used by employers as a kind of incentive multiplier''.

Posters

• Developing Scripts to Teach Social Skills: Can the Crowd Assist the Author?
Fatima Boujarwah, Jennifer Kim, Gregory Abowd, Rosa Arriaga (Georgia Tech)

The social world that most of us navigate effortlessly can prove to be a perplexing and disconcerting place for individuals with autism. Currently there are no models to assist non-expert authors as they create customized social script-based instructional modules for a particular child. We describe an approach to using human computation to develop complex models of social scripts for a plethora of complex and interesting social scenarios, possible obstacles that may arise in those scenarios, and potential solutions to those obstacles. Human input is the natural way to build these models, and in so doing create valuable assistance for those trying to navigate the intricacies of a social life.

• CrowdLang - First Steps Towards Programmable Human Computers for General Computation
Patrick Minder, Abraham Bernstein (University of Zurich)

Crowdsourcing markets such as Amazon’s Mechanical Turk provide an enormous potential for accomplishing work by combining human and machine computation. Today crowdsourcing is mostly used for massive parallel information processing for a variety of tasks such as image labeling. However, as we move to more sophisticated problem-solving there is little knowledge about managing dependencies between steps and a lack of tools for doing so. As the contribution of this paper, we present a concept of an executable, model-based programming language and a general purpose framework for accomplishing more sophisticated problems. Our approach is inspired by coordination theory and an analysis of emergent collective intelligence. We illustrate the applicability of our proposed language by combining machine and human computation based on existing interaction patterns for several general computation problems.

• Ranking Images on Semantic Attributes using CollaboRank
Jeroen Janssens, Eric Postma, Jaap Van den Herik (Tilburg University)

In this paper, we investigate to what extent a large group of human workers is able to produce collaboratively a global ranking of images, based on a single semantic attribute. To this end, we developed CollaboRank, which is a method that formulates and distributes tasks to human workers, and aggregates their personal rankings into a global ranking. Our results show that a relatively high consensus can be achieved, depending on the type of the semantic attribute.

• Artificial Intelligence for Artificial Artificial Intelligence
Peng Dai, Mausam, Daniel Weld (University of Washington)

Crowdsourcing platforms such as Amazon Mechanical Turk have become popular for a wide variety of human intelligence tasks; however, quality control continues to be a significant challenge. Recently, Dai et al (2010) propose TurKontrol, a theoretical model based on POMDPs to optimize iterative, crowd-sourced workflows. However, they neither describe how to learn the model parameters, nor show its effectiveness in a real crowd-sourced setting. Learning is challenging due to the scale of the model and noisy data: there are hundreds of thousands of workers with high-variance abilities. This paper presents an end-to-end system that first learns TurKontrol's POMDP parameters from real Mechanical Turk data, and then applies the model to dynamically optimize live tasks. We validate the model and use it to control a successive-improvement process on Mechanical Turk. By modeling worker accuracy and voting patterns, our system produces significantly superior artifacts compared to those generated through static workflows using the same amount of money.

• One Step beyond Independent Agreement: A Tournament Selection Approach for Quality Assurance of Human Computation Tasks
Yu-An Sun, Shourya Roy (Xerox); Greg Little (MIT CSAIL)

Quality assurance remains a key topic in the human computation research field. Prior work indicates that independent agreement is effective for low difficulty tasks, but has limitations. This paper addresses this problem by proposing a tournament selection based quality control process. The experimental results from this paper show that humans are better at identifying the correct answers than generating them.

• Turkomatic: Automatic, Recursive Task and Workflow Design for Mechanical Turk
Anand Kulkarni, Matthew Can, Bjoern Hartmann (UC Berkeley)

On today’s human computation systems, designing tasks and workflows is a difficult and labor-intensive process. Can workers from the crowd be used to help plan workflows? We explore this question with Turkomatic, a new interface to microwork platforms that uses crowd workers to help plan workflows for complex tasks. Turkomatic uses a general-purpose divide-and-conquer algorithm to solve arbitrary natural-language requests posed by end users. The interface includes a novel real-time visual workflow editor that enables requesters to observe and edit workflows while the tasks are being completed. Crowd verification of work and the division of labor among members of the crowd can be handled automatically by Turkomatic, which substantially simplifies the process of using human computation systems. These features enable a novel means of interaction with crowds of online workers to support successful execution of complex work.

• MuSweeper: Collect Mutual Exclusions with Extensive Game
Tao-Hsuan Chang, Cheng-wei Chan, Jane Yung-jen Hsu (National Taiwan University)

Mutual exclusions are important information for machine learning. Games With A Purpose (or GWAP) provide an effective way to get large amount of data from web users. This research proposes MuSweeper, a minesweeper-like game, to collect mutual exclusions. By embedding game theory into game mechanism, the precision is guaranteed. Experiment showed MuSweeper can efficiently collect mutual exclusions with high precision.

• MobileWorks: A Mobile Crowdsourcing Platform for Workers at the Bottom of the Pyramid
Prayag Narula, Philipp Gutheim, David Rolnitzky, Anand Kulkarni, Bjoern Hartmann (UC Berkeley)

We present MobileWorks, a mobile phone-based crowdsourcing platform. MobileWorks targets workers in developing countries who live at the bottom of the economic pyramid. This population does not have access to desktop computers, so existing microtask labor markets are inaccessible to them. MobileWorks offers human OCR tasks that can be accomplished on low-end mobile phones; workers access it through their mobile web browser. To address the limited screen resolution available on low-end phones, MobileWorks segments documents into many small pieces, and sends each piece to a different worker. A first pilot study with 10 users over a period of 2 months revealed that it is feasible to do simple OCR tasks using simple Mobile Web based application. We found that on an average the workers do the tasks at 120 tasks per hour. Using single entry the accuracy of workers across the different documents is 89% . We propose a multiple entry solution which increases the theoretical accuracy of the OCR to more than 99%.

Vamsi Ambati, Stephan Vogel, Jaime Carbonell (CMU)

As researchers embrace micro-task markets for eliciting human input, the nature of the posted tasks moves from those requiring simple mechanical labor to requiring specific cognitive skills. On the other hand, increase is seen in the number of such tasks and the user population in micro-task market places requiring better search interfaces for productive user participation. In this paper we posit that understanding user skill sets and presenting them with suitable tasks not only maximizes the over quality of the output, but also attempts to maximize the benefit to the user in terms of more successfully completed tasks. We also implement a recommendation engine for suggesting tasks to users based on implicit modeling of skills and interests. We present results from a preliminary evaluation of our system using publicly available data gathered from a variety of human computation experiments recently conducted on Amazon's Mechanical Turk.

• On Quality Control and Machine Learning in Crowdsourcing
Matthew Lease (UT Austin)

The advent of crowdsourcing has created a variety of new opportunities for improving upon traditional methods of data collection and annotation. This in turn has created intriguing new opportunities for data-driven machine learning (ML). Convenient access to crowd workers for simple data collection has further generalized to leveraging more arbitrary crowd-based human computation to supplement ML. While new potential applications of crowdsourcing continue to emerge, a variety of practical and sometimes unexpected obstacles have already limited the degree to which its promised potential can be actually realized in practice. This paper considers two particular aspects of crowdsourcing and their interplay, data quality control (QC) and ML, reflecting on where we have been, where we are, and where we might go from here.

• CollabMap: Augmenting Maps using the Wisdom of Crowds
Ruben Stranders, Sarvapali Ramchurn, Bing Shi, Nicholas Jennings (University of Southampton)

In this paper we develop a novel model of geospatial data creation, called CollabMap, that relies on human computation. CollabMap is a crowdsourcing tool to get users contracted via Amazon Mechanical Turk or a similar service to perform micro-tasks that involve augmenting existing maps (e.g. GoogleMaps or Ordnance Survey) by drawing evacuation routes, using satellite imagery from GoogleMaps and panoramic views from Google Street View. We use human computation to complete tasks that are hard for a computer vision algorithm to perform or to generate training data that could be used by a computer vision algorithm to automatically define evacuation routes.

• Improving Consensus Accuracy via Z-score and Weighted Voting
Hyun Joon Jung, Matthew Lease (UT Austin)

We describe a Z-score based outlier detection method for detection and filtering of inaccurate crowd workers. After filtering, we aggregate labels from remaining workers via simple majority voting or feature-weighted voting. Both su-pervised and unsupervised features are used, individually and in combination, for both outlier detection and weighted voting. We evaluate on noisy judgments collected from Amazon Mechanical Turk which assess Websearch relevance of query/document pairs. We find that filtering in combination with multi-feature weighted voting achieves 8.94% relative error reduction for graded accuracy (4.25% absolute) and 5.32% for binary accuracy (3.45% absolute).

• Making Searchable Melodies: Human vs. Machine
Mark Cartwright, Zafar Rafii, Jinyu Han, Bryan Pardo (Northwestern University)

Systems that find music recordings based on hummed or sung, melodic input are called Query-By-Humming (QBH) systems. Such systems employ search keys that are more similar to a cappella singing than the original recordings. Successful deployed systems use human computation to create these search keys: hand-entered midi melodies or recordings of a cappella singing. Tunebot is one such system. In this paper, we compare search results using keys built from two automated melody extraction system to those gathered using two populations of humans: local paid sing-ers and Amazon Turk workers.

• PulaCloud: Using Human Computation to Enable Development at the Bottom of the Economic Ladder
Andrew Schriner (University of Cincinnati); Daniel Oerther (Missouri University of Science and Technology); James Uber (University of Cincinnati)

This research aims to explore how Human Computation can be used to aid economic development in communities experiencing extreme poverty throughout the world. Work is ongoing with a community in rural Kenya to connect them to employment opportunities through a Human Computation system. A feasibility study has been conducted in the community using the 3D protein folding game Foldit and Amazon’s Mechanical Turk. Feasibility has been confirmed and obstacles identified. Current work includes a pilot study doing image analysis for two research projects and developing a GUI that is usable by workers with little computer literacy. Future work includes developing effective incentive systems that operate both at the individual level and the group level and integrating worker accuracy evaluation, worker compensation, and result-credibility evaluation.

• Towards Large-Scale Processing of Simple Tasks with Mechanical Turk
Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin, Hari Simons (Yelp, inc.)

Crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) provide inexpensive and scalable workforces for processing simple online tasks. Unfortunately, workers participating in crowdsourcing tend to supply work of inconsistent or low quality. We report on our experiences using AMT to verify hundreds of thousands of local business listings for the online directory Yelp.com. Using expert-verified changes, we evaluate the accuracy of our workforce and present the results of preliminary experiments that work towards filtering low-quality workers and correcting for worker bias. Our report seeks to inform the community of practical and financial constraints that are critical to understanding the problem of quality control in crowdsourcing systems.

• Learning to Rank From a Noisy Crowd
Abhimanu Kumar, Matthew Lease (UT Austin)

We consider how to most effectively use crowd-based relevance assessors to produce training data for learning to rank. This integrates two lines of prior work: studies of unreliable crowd-based binary annotation for binary classification, and studies for aggregating graded relevance judgments from reliable experts for ranking. To model varying performance of the crowd, we simulate annotation noise with varying magnitude and distributional properties. Evaluation on three LETOR test collections reveals a striking trend contrary to prior studies: single labeling outperforms consensus methods in maximizing learner rate (relative to annotator effort). We also see surprising consistency of learning rate across noise distributions, as well as greater challenge with the adversarial case for multi-class labeling.