Wednesday, February 9, 2011

Notes from "Crowdsourcing in Search and Data Mining" (CSDM) workshop

I attended the Crowdsourcing in Search and Data Mining (CSDM) workshop at the WSDM 2011 conference in Hong Kong, and kept some (informal) notes about the workshop:

Invited talk - Winter Mason, presenting his work on "Individual vs. Group Success in Social Networks". 

Winter started by presenting a summary of his HCOMP 2009 paper on how payment on Mechanical Turk affects quality and speed of completion (hint: it affects speed, not quality). He then moved on to how agents can collaborate to solve complex problems. How does the structure of the communication network affect the results? How does an individual's position in the network affect their performance? The participants played a game in which they tried to discover oil on a map. Users could see where their neighbors had searched for oil in the field, so they were, in a sense, guiding each other. As part of the experiment, the underlying, invisible graph connecting the players was varied to see the effect. Typically, as the clustering coefficient of the graph increased, the players copied each other more (i.e., less exploration). However, this did not have a statistically significant effect on "finding the peak of oil production": all graph structures performed similarly in terms of overall success, although there was some non-significant decrease in performance. This work points toward a lot of interesting future research: When should we let people talk to each other, and when should we let them work independently? What structure of the solution space indicates that people should collaborate rather than explore independently? (E.g., if there is a "hard to find" peak of oil production and many easier-to-find areas with moderate oil production, you want people to explore independently; if there is a single peak, collaboration helps.)

Guido Zuccon, Teerapong Leelanupab, Stewart Whiting, Joemon Jose and Leif Azzopardi. Crowdsourcing Interactions - A proposal for capturing user interactions through crowdsourcing.

The main question is how to capture the behavior of search engine users when you do not have a search engine (à la Google, Microsoft, Yahoo, etc.) at your disposal. The alternative is a lab study, but there the population is homogeneous, and creating a big data set is expensive.

The authors described how they built a task-oriented system on top of MTurk, mediating access to Bing, and examined how users used the search engine to identify answers to the questions posed (taken from the IIR track of TREC).

Richard McCreadie, Craig Macdonald and Iadh Ounis. Crowdsourcing Blog Track Top News Judgments at TREC.

The authors describe their experiences crowdsourcing relevance evaluations: "Find interesting stories on day d, for category c." Basic setup: display the results to the user, and measure speed, cost, and level of agreement (quality of assessments). They observe everything on the server, and conclude that MTurk is good, cheap, and fast. To ensure quality, they need at least three assessors per document. Workers seemed to just skim through the documents (~15 seconds per document), but the results were pretty consistent with the overall TREC results.
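The three-assessors-per-document setup boils down to a majority vote plus an agreement score; here is a minimal sketch in Python (the labels below are hypothetical, not from the paper):

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label among redundant assessments."""
    return Counter(labels).most_common(1)[0][0]

def agreement(labels):
    """Fraction of assessors who agree with the majority label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Three assessors judge whether a story is interesting for day d, category c.
judgments = ["interesting", "interesting", "not interesting"]
print(majority_label(judgments))   # majority wins despite one dissenter
print(agreement(judgments))        # 2 of 3 agree
```

With only three assessors, the agreement score is coarse (1.0 or 2/3 for binary labels), which is one reason consistency with the official TREC judgments is the more informative check.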

Carsten Eickhoff and Arjen de Vries. How Crowdsourcable is Your Task?

The authors recount how they fell in love with crowdsourcing, but then realized (surprise!) that there are cheaters out there ready to submit junk. Their major conclusions: the built-in reputation metrics on MTurk are pretty much useless, and gold testing tends to work well only in specific cases (well-defined, unambiguous answers). They suggest making HITs more "creative" and less "mechanical" to make them less susceptible to spam. Novelty helps, as it refreshes the mind of the worker.

Christopher Harris. You’re Hired! An Examination of Crowdsourcing Incentive Models in Human Resource Tasks.

How can we structure a resume-screening HIT to achieve both good precision and recall? HR screening is mainly a recall task (you do not want to lose good candidates). The initial worker screening included an English skills test, plus an "attention to detail" task that checks whether workers are paying attention. The author presented an experiment with different incentive treatments (no incentive, an unconditional bonus, and a bonus contingent on performance, framed negatively). The basic result: performance incentives increase completion time, and positive incentives increase performance.

Jing Wang, Siamak Faridani and Panagiotis Ipeirotis. Estimating Completion Time for Crowdsourced Tasks Using Survival Analysis Models.

An analysis of the MTurk market to identify how long it takes for a task to be completed. Effect of price: a 10x price gives a 40% speedup. Effect of grouping HITs: 1,000 HITs posted as a single group get done 7x faster than 1,000 HITs posted sequentially. The slides are available online.
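For flavor, here is a minimal Kaplan-Meier estimator, the basic tool of the survival analysis used in this kind of study; the durations below are made up for illustration, not taken from the paper:

```python
def kaplan_meier(durations, completed):
    """Kaplan-Meier survival estimate.

    durations: observed time for each HIT group (hours)
    completed: True if the group finished, False if still open (censored)
    Returns a list of (time, estimated probability of still being unfinished).
    """
    events = sorted(zip(durations, completed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    for t, done in events:
        if done:                                  # censored groups shrink the
            survival *= (at_risk - 1) / at_risk   # risk set but add no step
            curve.append((t, survival))
        at_risk -= 1
    return curve

# Hypothetical completion times (hours) for five HIT groups; one still running.
for t, s in kaplan_meier([2.0, 5.0, 5.5, 8.0, 24.0],
                         [True, True, True, True, False]):
    print(f"t={t:>4}h  S(t)={s:.2f}")
```

The censoring handling is the point: a HIT group that is still open at observation time tells you its completion time exceeds 24 hours, information that a naive average of finished tasks would throw away.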

Raynor Vliegendhart, Martha Larson, Christoph Kofler, Carsten Eickhoff and Johan Pouwelse. Investigating Factors Influencing Crowdsourcing Tasks with High Imaginative Load.

[jet lag was hitting pretty hard at that point, and I had to take off and get some rest...]

Invited talk - Thore Graepel: The Smarter Crowd: Active Learning, Knowledge Corroboration, and Collective IQs

Some ideas on how to use graphical models to model users and their expertise, in order to match them with appropriate items to work on. Basic idea: user modeling helps in understanding the workers and can improve active learning.

Omar Alonso. Perspectives on Infrastructure for Crowdsourcing.

Description of techniques that facilitate the development of advanced crowdsourcing systems (e.g., MapReduce, reputation systems, workflows, etc.). Can we devise a general computation system for the HPU (human processing unit), building on existing paradigms for computational systems? What are the fundamental building blocks that we need?

Abhimanu Kumar and Matthew Lease. Modeling Annotator Accuracies for Supervised Learning.

This paper examines where to allocate labeling effort when building a machine learning model. Basic idea: if we know the quality of the workers, most of the solutions perform pretty similarly.
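One way to read "if we know worker quality, the solutions behave similarly": with known accuracies, the natural way to combine binary votes is a log-odds-weighted vote. A hedged sketch (the accuracies and votes below are hypothetical, and the independence and uniform-prior assumptions are mine, not the paper's):

```python
import math

def weighted_vote(votes, accuracies):
    """Combine binary votes, weighting each worker by the log-odds of
    their known accuracy. Returns the posterior probability that the
    true label is 1, assuming a uniform prior and conditionally
    independent workers."""
    log_odds = 0.0
    for v, a in zip(votes, accuracies):
        w = math.log(a / (1 - a))        # reliable workers weigh more
        log_odds += w if v == 1 else -w
    return 1 / (1 + math.exp(-log_odds))

# Two mediocre workers say 0, one expert says 1: the expert outweighs them.
p = weighted_vote([0, 0, 1], [0.6, 0.6, 0.95])
print(round(p, 3))  # -> 0.894
```

Note how a single 95%-accurate worker overrides two 60%-accurate dissenters, something an unweighted majority vote can never do.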

Invited talk - Panos Ipeirotis: Crowdsourcing using Mechanical Turk: Quality Management and Scalability

I described our experiences in building systems for managing quality when dealing with imperfect human annotators (from our HCOMP 2010 paper), and on how to efficiently allocate resources in labeling when using the data to build machine learning models (KDD 2008 and working paper). I used plenty of examples from our AdSafe experience, and gave a brief glimpse into our latest explorations in using psychology and biology to influence worker behavior.
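The quality-management idea can be sketched as an EM-style loop that alternates between estimating the true labels and estimating each worker's accuracy. This is a toy illustration under my own simplifying assumptions (binary labels, uniform prior, a single accuracy per worker), not the actual algorithm from the HCOMP 2010 paper:

```python
def em_accuracy(labels, n_iter=20):
    """labels: dict (worker, item) -> 0/1 vote.
    Returns estimated worker accuracies and P(true label = 1) per item."""
    workers = {w for w, _ in labels}
    items = {i for _, i in labels}
    acc = {w: 0.7 for w in workers}          # optimistic initial accuracy
    for _ in range(n_iter):
        # E-step: per-item probability that the true label is 1
        p1 = {}
        for i in items:
            odds = 1.0                       # uniform prior: odds start at 1
            for w in workers:
                if (w, i) in labels:
                    a = acc[w]
                    odds *= a / (1 - a) if labels[(w, i)] == 1 else (1 - a) / a
            p1[i] = odds / (1 + odds)
        # M-step: accuracy = expected fraction of agreements with true label
        for w in workers:
            agree, n = 0.0, 0
            for i in items:
                if (w, i) in labels:
                    agree += p1[i] if labels[(w, i)] == 1 else 1 - p1[i]
                    n += 1
            acc[w] = min(max(agree / n, 0.01), 0.99)  # avoid degenerate 0/1
    return acc, p1

# Toy data: two reliable workers and one who always votes 1.
truth = [1, 0, 1, 1, 0]                      # unknown to the algorithm
votes = {}
for i, t in enumerate(truth):
    votes[("alice", i)] = t
    votes[("bob", i)] = t
    votes[("spammer", i)] = 1
acc, p1 = em_accuracy(votes)
print(acc)   # alice and bob score high, the spammer low
```

Even without any gold data, the loop separates the consistent workers from the always-vote-1 worker, because his votes disagree with the inferred labels on the negative items.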

Conclusions and Best Paper

The day ended with an overall discussion of the problems that we face with crowdsourcing. The discussion focused significantly on how to separate inherent uncertainty in the signal from uninformative noise, and on how to let the "informed minority" get the truth out without being drowned by the "tyranny of the majority." (A good example is the question "Is Obama a Grammy winner?", where most people will intuitively say "no" but the correct answer is "yes"; in redundancy-based approaches it is likely that the noise will bury the signal.) People also expressed concern that everyone is building their own little system from scratch, reinventing the wheel, instead of making a more coordinated effort to share experiences and infrastructure. The paper "How Crowdsourcable is Your Task?" received the most-innovative paper award, and the discussion continued over beers and other distilled beverages...
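The "tyranny of the majority" problem is easy to see in a small simulation: when most workers are confidently wrong, adding redundancy makes majority voting worse, not better. (A hypothetical sketch; the 0.3 per-worker accuracy is made up to mimic a "most people get it wrong" question like the Grammy one.)

```python
import random

random.seed(0)

def majority_correct_rate(p_correct, n_workers, trials=10_000):
    """Probability that a majority vote returns the true answer when
    each worker is independently correct with probability p_correct."""
    wins = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p_correct
                            for _ in range(n_workers))
        if correct_votes > n_workers / 2:
            wins += 1
    return wins / trials

# For a question most workers answer incorrectly, more redundancy just
# makes the majority more confidently wrong.
for n in (3, 11, 51):
    print(n, majority_correct_rate(0.3, n))
```

This is the Condorcet jury theorem running in reverse: redundancy amplifies whatever bias the crowd has, so it only helps when the average worker is better than chance on that particular question.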