Unfortunately, because the workshop runs for only half a day, we had space for only 8 papers (4 long and 4 short). We had to make very hard decisions to fit as many papers as possible into the program. We reduced the length of the talks, "downgraded" some papers to short papers and others to posters, and in the end we even had to reject papers that had no reject ratings! Quite a few of the rejected papers were interesting, and I would love the opportunity to talk to the authors about their work. Let's hope that next time we manage to get a full day for the workshop to accommodate the demand. An acceptance rate of ~25% is simply too low for a workshop.
I expect to see a very exciting program! If you can be in DC on July 25th, you should make the effort to come to the workshop!
Special thanks go to Chandra for running the show this year!
Below, I list the accepted papers, with links to the unofficial versions that the authors posted on their websites. Once the camera-ready versions are available, I will replace them with the proper links.
Long Papers
- Toward Automatic Task Design: A Progress Report
Eric Huang, Haoqi Zhang, David Parkes, Krzysztof Gajos, Yiling Chen
Abstract: A central challenge in human computation is in understanding how to design task environments that effectively attract participants and coordinate the problem solving process. In this paper, we consider a common problem that requesters face on Amazon Mechanical Turk: how should a task be designed so as to induce good output from workers? In posting a task, a requester decides how to break down the task into unit tasks, how much to pay for each unit task, and how many workers to assign to a unit task. These design decisions affect the rate at which workers complete unit tasks, as well as the quality of the work that results. Using image labeling as an example task, we consider the problem of designing the task to maximize the number of quality tags received within given time and budget constraints. We consider two different measures of work quality, and construct models for predicting the rate and quality of work based on observations of output to various designs. Preliminary results show that simple models can accurately predict the quality of output per unit task, but are less accurate in predicting the rate at which unit tasks complete. At a fixed rate of pay, our models generate different designs depending on the quality metric, and optimized designs obtain significantly more quality tags than baseline comparisons.
- Exploring Iterative and Parallel Human Computation Processes
Greg Little, Lydia Chilton, Max Goldman, Robert Miller
Abstract: Services like Amazon's Mechanical Turk have opened the door for exploration of processes that outsource computation to humans. These human computation processes hold tremendous potential to solve a variety of problems in novel and interesting ways. However, we are only just beginning to understand how to design such processes. This paper explores two basic approaches: one where workers work alone in parallel and one where workers iteratively build on each other's work. We present a series of experiments exploring tradeoffs between each approach in several problem domains: writing, brainstorming, and transcription. In each of our experiments, iteration increases the average quality of responses. The increase is statistically significant in writing and brainstorming. However, in brainstorming and transcription, it is not clear that iteration is the best overall approach, in part because both of these tasks benefit from a high variability of responses, which is more prevalent in the parallel process. Also, poor guesses in the transcription task can lead subsequent workers astray.
- Task Search in a Human Computation Market
Lydia Chilton, John Horton, Robert Miller, Shiri Azenkot
Abstract: In order to understand how a labor market for human computation functions, it is important to know how workers search for tasks. This paper uses two complementary methods to gain insight into how workers search for tasks on Mechanical Turk. First, we perform a high frequency scrape of 36 pages of search results and analyze it by looking at the rate of disappearance of tasks across key ways Mechanical Turk allows workers to sort tasks. Second, we present the results of a survey in which we paid workers for self-reported information about how they search for tasks. Our main findings are that on a large scale, workers sort by which tasks are most recently posted and which have the largest number of tasks available. Furthermore, we find that workers look mostly at the first page of the most recently posted tasks and the first two pages of the tasks with the most available instances but in both categories the position on the result page is unimportant to workers. We observe that at least some employers try to manipulate the position of their task in the search results to exploit the tendency to search for recently posted tasks. On an individual level, we observed workers searching by almost all the possible categories and looking more than 10 pages deep. For a task we posted to Mechanical Turk, we confirmed that a favorable position in the search results does matter: our task with favorable positioning was completed 30 times faster and for less money than when its position was unfavorable.
- The Anatomy of a Large-Scale Human Computation Engine
Shailesh Kochhar, Stefano Mazzocchi, Praveen Paritosh
Abstract: In this paper we describe RABJ, an engine designed to simplify collecting human input. We have used RABJ to collect over 2.3 million human judgments to augment data mining, data entry, data validation and curation problems at Freebase over the course of a year. We illustrate several successful applications that have used RABJ to collect human judgment. We describe how the architecture and design decisions of RABJ are affected by the constraints of content agnosticity, data freshness, latency and visibility. We present work aimed at increasing the yield and reliability of human computation efforts. Finally, we discuss empirical observations and lessons learned in the course of a year of operating the service.
- Sellers' problems in human computation markets
M. Six Silberman, Joel Ross, Lilly Irani, Bill Tomlinson
Abstract: "Tools for human computers" is an underexplored design space in human computation research, which has focused on techniques for buyers of human computation rather than sellers. We characterize the sellers in one human computation market, Mechanical Turk, and describe some of the challenges they face. We list several projects developed to approach these problems, and conclude with a list of open questions relevant to sellers, buyers, and researchers.
- Sentence Recall Game: A Novel Tool for Collecting Data to Discover Language Usage Patterns
Jun Wang, Bei Yu
Abstract: Recently we ran a simple memory test experiment, called sentence recall, in which participants were asked to recall sentences that they had just seen on the screen. Many participants, especially non-native English speakers, made various deviations in their recalled sentences. Some deviations represent alternative ways to express the same meaning, but others suggest that there are missing pieces in the participants' language knowledge. The deviation data, on the one hand, can provide individual users valuable feedback on their language usage patterns that they may never notice, on the other hand, can be used as training data for automatically discovering language usage patterns in a subpopulation of language learners. This paper presents our attempts to create an enjoyable sentence recall game for collecting a large amount of deviation data. Our results show that the game is fun to play and the collected deviation data can reveal common language usage patterns among non-native speakers.
- Quality Management on Amazon Mechanical Turk
Panagiotis Ipeirotis, Jing Wang, Foster Provost
Abstract: Crowdsourcing services, such as Amazon Mechanical Turk, allow for easy distribution of small tasks to a large number of workers. Unfortunately, since manually verifying the quality of the submitted results is hard, malicious workers often take advantage of the verification difficulty and submit answers of low quality. Currently, most requesters rely on redundancy to identify the correct answers. However, redundancy is not a panacea. Massive redundancy is expensive, increasing significantly the cost of crowdsourced solutions. Therefore, we need techniques that will accurately estimate the quality of the workers, allowing for the rejection and blocking of the low-performing workers and spammers. However, existing techniques cannot separate the true (unrecoverable) error rate from the (recoverable) biases that some workers exhibit. This lack of separation leads to incorrect assessments of a worker's quality. We present algorithms that improve the existing state-of-the-art techniques, enabling the separation of bias and error. Our algorithm generates a scalar score representing the inherent quality of each worker. We illustrate how to incorporate cost-sensitive classification errors in the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring the quality of the workers. We present experimental results demonstrating the performance of the proposed algorithm under a variety of settings.
- Human Computation for Word Sense Disambiguation
Nitin Seemakurty, Jonathan Chu, Luis von Ahn, Anthony Tomasic
Abstract: One formidable problem in language technology is the word sense disambiguation (WSD) problem: disambiguating the true sense of a word as it occurs in a sentence (e.g., recognizing whether the word "bank" refers to a river bank or to a financial institution). This paper explores a strategy for harnessing the linguistic abilities of human beings to develop datasets that can be used to train machine learning algorithms for WSD. To create such datasets, we introduce a new interactive system: a fun game designed to produce valuable output by engaging human players in what they perceive to be a cooperative task of guessing the same word as another player. Our system makes a valuable contribution by tackling the knowledge acquisition bottleneck in the WSD problem domain. Rather than using conventional and costly techniques of paying lexicographers to generate training data for machine learning algorithms, we delegate the work to people who are looking to be entertained.
- Frontiers of a Paradigm: Exploring Human Computation with Digital Games
Markus Krause
- GiveALink Tagging Game: An Incentive for Social Annotation
Li Weng, Filippo Menczer
- Crowdsourcing Participation Inequality; A SCOUT Model for the Enterprise Domain
Osamuyimen Stewart, David Lubensky, Juan Huerta, Julie Marcotte, Cheng Wu, Andrzej Sakrajda
- Mutually Reinforcing Systems: A Method For The Acquisition Of Specific Data From Games With By-Products
John Ferguson, Marek Bell, Matthew Chalmers
- A Note on Human Computation Limits
Paul Rohwer
- Reconstructing the World in 3D: Bringing Games with a Purpose Outdoors
Kathleen Tuite, Noah Snavely, Zoran Popovic, Dun-Yu Hsiao, Adam Smith
- Improving Music Emotion Labeling Using Human Computation
Brandon Morton, Jacquelin Speck, Erik Schmidt, Youngmoo Kim
- Webpardy: Harvesting QA by HC
Hidir Aras, Markus Krause, Andreas Haller, Rainer Malaka
- Measuring Utility of Human-Computer Interactions
Michael Toomim, James Landay
- Translation by Iterative Collaboration between Monolingual Users
Chang Hu, Benjamin Bederson, Philip Resnik