Monday, July 26, 2010

Liveblogging from HCOMP 2010

Following last year's practice, I am blogging about the workshop this year as well. (If I do it well for a few more years, I hope to be a tenured blogger for HCOMP.)

The workshop was very well-attended, despite the strong competition for attention (there are 14 workshops at KDD this year).

8:40am, Invited Talk

The workshop started with an invited talk by Ross Smith from Microsoft, who talked about "Using Games to Improve Productivity in Software Engineering". He described how Microsoft used games internally to improve the quality of its products.

He particularly emphasized the attempts to use engineers to help with the localization/internationalization of the Microsoft products. One of the interesting things was the pride that engineers took in translating the messages into their own native language. The "my-work-is-in-the-product-that-you-are-using-mom"-factor was a strong motivation for engineers to contribute their time and effort to such volunteer efforts.

Ross also covered a variety of behavioral aspects of the process. For example, leaderboards have different effects, depending on how competitive the workers are in a particular country. When people compete, leaderboards are a positive factor. However, if someone is far ahead of the rest, this is typically demotivating, as many workers know that they cannot reach the top in any case.

Another interesting factor is that games should not be similar to the duties of the worker. If a programmer writes C++, then playing a game that requires the worker to write C++ is bad. First, the programmer may spend time in the game that he would have otherwise spent programming. Second, if a worker is far too good at a task and spends a lot of time on it, other workers cannot compete easily and get demotivated.

An interesting aspect going forward is that games are increasingly being used to tap into the "discretionary" time of the workers, so there is now competition to make the games more interesting, more attractive, more meaningful etc. For example, currently workers accumulate points in the games, and their reward is that these points are translated into dollars that can be donated to disaster relief efforts.

Finally, games should not come with a mandate from the management. (Luis von Ahn mentioned that when you say "play the game" people do not.) The counterexample was Japan, where the linguistic pride in getting the applications translated into Japanese worked (despite? due to?) the clear mandate from the management to play the game that aids the localization of Microsoft products.

9:10am, Session 1: Market Design

Task Search in a Human Computation Market 
Lydia Chilton (University of Washington) (*** presenting)
John Horton (Harvard University)
Robert Miller (MIT CSAIL)
Shiri Azenkot (University of Washington)

The paper described how workers tend to search for tasks on Amazon Mechanical Turk. The analysis indicated that "Most HITs Available" and "Most Recently Posted" are the rankings that workers most commonly use to find tasks. By monitoring HITs and scraping the website every 30 seconds, the authors figured out how quickly different tasks are being done.
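As a rough illustration of the scraping-based measurement, here is a minimal sketch that estimates how quickly a HIT group is being worked on from periodic snapshots of its "HITs available" count. The function name, the snapshot format, and the sample numbers are all hypothetical, and the rule of ignoring increases (which presumably reflect a requester posting more HITs) is an assumption, not necessarily what the authors did.

```python
def completion_rate_per_minute(snapshots):
    """Estimate HITs completed per minute for one HIT group.

    snapshots: list of (seconds_since_start, hits_available) pairs,
    sorted by time (hypothetical format, e.g. one pair per 30-second scrape).
    """
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots")
    elapsed = snapshots[-1][0] - snapshots[0][0]
    if elapsed <= 0:
        raise ValueError("snapshots must span a positive amount of time")

    completed = 0
    for (_, before), (_, after) in zip(snapshots, snapshots[1:]):
        # A drop in the available count is treated as completed work.
        # An increase is assumed to be a re-post by the requester and is ignored.
        if after < before:
            completed += before - after
    return completed * 60.0 / elapsed


# Made-up data: five scrapes, 30 seconds apart, with one re-post at t=90.
sample = [(0, 100), (30, 97), (60, 95), (90, 120), (120, 116)]
rate = completion_rate_per_minute(sample)
```

Note that this only sees net changes between scrapes, so HITs that are completed and replaced within the same interval go uncounted.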

They also ran a survey, targeting rankings that are NOT frequently chosen by workers, and compared the "best case" and "worst case" scenarios. Interestingly enough, there is almost a 30x difference in the rate of completion. This explains all the gaming that is going on today, where major requesters keep posting HITs within their HITgroups to keep their HITs on the first page.

This paper really highlights the negative aspects of the prioritization schemes currently used on Mechanical Turk. By making it easier for workers to find tasks to work on, and by employing some randomization in the presentation, Mechanical Turk could make completion times for the tasks much more predictable.

The Anatomy of a Large-Scale Human Computation Engine
Shailesh Kochhar (Metaweb Technologies) (*** presenting)
Stefano Mazzocchi (Metaweb Technologies)
Praveen Paritosh (Metaweb Technologies)

This paper described Rabj, the platform used by Metaweb to improve the quality of Freebase by having humans look at the ambiguous cases that cannot be handled well by automatic techniques. The basic goal is to take a system that is 99% accurate, and improve precision well above 99%.

Metaweb did not use Mechanical Turk for this task. Instead, they hired people through oDesk, first training them for a day so that they can do their tasks properly, and then letting them work. By building a long-term relationship, they were able to improve the quality of the results without resorting to overly complicated solutions to the worker quality problem. They use the oDesk API as well, and pay an hourly wage that varies from \$5 to \$15 per hour, depending on the complexity of the task.

One thing that was interesting is that they are paying per hour, and not per piece. This is a conscious choice. The distribution of completion times for various tasks follows a lognormal distribution. At the very tail, we have the hard tasks that need a lot of time. These are actually the tasks that Metaweb cares most about getting right. Paying by piece means that workers have an incentive to do these tasks quickly and move on to the next. Paying for time means that workers can spend more time on such hard tasks. The quality control process of Metaweb includes testing workers for throughput (a worker who is significantly slower than the others gets warned and then dismissed).

Sellers' problems in human computation markets
M. Six Silberman (University of California, Irvine) (*** presenting)
Joel Ross (University of California, Irvine)
Lilly Irani (University of California, Irvine)
Bill Tomlinson (University of California, Irvine)

M. Six Silberman discussed the problems that workers (aka sellers) face in the marketplace of Mechanical Turk. There are bad requesters that reject good work and do not pay, or try to scam workers. Given the increasing number of workers who rely on Turk for income, it is not surprising that workers are starting to demand guarantees of fairness. The paper even points to a "Turker Bill of Rights". Tools like Turkopticon help in that respect. (Btw, readers who are interested in the labor law aspects of crowdsourcing should definitely read the paper "Working the Crowd: Employment and Labor Law in the Crowdsourcing Industry" by Alek Felstiner.)

One interesting aspect of this presentation was its slideless nature. The speaker just read the conclusions from his notes. Although I found the mode of presentation difficult to follow, I think the message was clear: Do we care about the workers? Do we pay them fairly? There was significant discussion afterwards, and I bet this is the only place in KDD (or in any other CS conference) where people engaged in discussion about the fairness of minimum wage laws, issues of immigration and labor, and so on :-)

10:00am, Session 2: Human Computation in Practice [Coffee, Demos & Posters, all in parallel]

In the next session, we had a set of very interesting demos and posters. I did not have time to see them all, so find below my notes about the ones I did see.

Frontiers of a Paradigm -- Exploring Human Computation with Digital Games
Markus Krause (University of Bremen)
Aneta Takhtamysheva (University of Bremen)
Marion Wittstock (University of Bremen)
Rainer Malaka (University of Bremen)

Markus had a very interesting game for discovering synonyms and antonyms. You control a spaceship, and you try to shoot down the antonyms of the word given to you, and you try to collect the synonyms. This was a real arcade game, with graphics, collision detection, and so on. Markus mentioned that he wrote it in Flash, because it is fast, and because there are websites where you post your game for people to procrastinate, and then you get users effortlessly. He routinely gets 3 million players a month. Even very simple games (e.g., click the boxes) get 5000 users to play them.

GiveALink Tagging Game: An Incentive for Social Annotation
Li Weng (Indiana University)
Filippo Menczer (Indiana University)

This was a game to find words/tags that will take you from one page to another, essentially uncovering the semantic connections between pages.

Crowdsourcing Participation Inequality: A SCOUT Model for the Enterprise Domain
Osamuyimen Stewart (IBM Research)
David Lubensky (IBM Research)
Juan M. Huerta (IBM Research)
Julie Marcotte (IBM GBS)
Cheng Wu (IBM Research)
Andrzej Sakrajda (IBM Research)

Mutually Reinforcing Systems
John Ferguson (University of Glasgow)
Marek Bell (University of Glasgow)
Matthew Chalmers (University of Glasgow)

A Note on Human Computation Limits
Paul Rohwer (Indiana University)

This is a case study of two attempts to crowdsource writing a novel. The first attempt, by Penguin Books and De Montfort University, used a wiki to crowdsource a novel. The result was a failure. No organization, disconnected elements, incoherent result. When BBC attempted the same a couple of years later, the result was a success. The difference? BBC assigned a curator, who oversaw the process. Lesson? Any attempt to harness the wisdom of the crowds needs a reliable aggregator that will kick out the junky contributors and their contributions, keeping only the good contributions from the crowd.

Reconstructing the World in 3D: Bringing Games with a Purpose Outdoors
Kathleen Tuite (University of Washington)
Noah Snavely (Cornell University)
Dun-Yu Hsiao (University of Washington)
Adam Smith (UC Santa Cruz)
Zoran Popovic (University of Washington)

Interesting real-life game: The goal is to cover and create a 3D reconstruction of a city. Players get points when they go out, take a photo, and cover a part of a city/building that was not covered before. Using the images, the authors can reconstruct the buildings in 3D without gaping holes.

Improving Music Emotion Labeling Using Human Computation
Brandon G. Morton (Drexel University)
Jacquelin A. Speck (Drexel University)
Erik M. Schmidt (Drexel University)
Youngmoo E. Kim (Drexel University)

A game in which you listen to a song and try to guess the tempo and sentiment of the song, and agree with a co-listener. There is continuous, intermittent feedback about the choice of the other player. The player that moves first to the agreed location gets extra points as influencer. I make it sound more complicated than it is. I played it and it was very very intuitive and easy to play.

Webpardy: Harvesting QA by HC
Hidir Aras (University of Bremen)
Markus Krause (University of Bremen)
Andreas Haller (University of Bremen)
Rainer Malaka (University of Bremen)

Measuring Utility of Human-Computer Interaction
Michael Toomim (University of Washington)
James A. Landay (University of Washington)

A very interesting study about how the design of a HIT can influence participation. They change HIT parameters (price/design/etc) and examine for how long users will keep doing HITs. Reminded me a little bit of Dan Ariely's work on how motivation affects desire to work on a task.

Translation by Iterative Collaboration between Monolingual Users
Chang Hu (University of Maryland)
Benjamin B. Bederson (University of Maryland)
Philip Resnik (University of Maryland)

A demo that showed how two monolingual humans can collaborate to translate a document. They start with a machine translation, and one human examines which parts of the machine translation do not make sense. After rephrasing and sending back (again through machine translation), the other human checks if the translation makes sense and whether it corresponds to the original sentence that was translated. What I missed was how users can get motivated to participate in this system.

11:00am, Session 3: Task and Process Design

Sentence Recall Game: A Novel Tool for Collecting Data to Discover Language Usage Patterns
Jun Wang (Syracuse University) (*** presenting)
Bei Yu (Syracuse University)

This game worked as follows: The user looks at a sentence, then the sentence disappears, and the user has to type the sentence again. Typically people cannot retype the exact sentence but type something similar. The main outcome is that through this game we can discover paraphrases and (especially when played by non-natives) typical mistakes in specific language constructs.

Word Sense Disambiguation via Human Computation
Nitin Seemakurty (Carnegie Mellon University)
Jonathan Chu (Carnegie Mellon University)
Luis von Ahn (Carnegie Mellon University)
Anthony Tomasic (Carnegie Mellon University) (*** presenting)

The goal of this game is to disambiguate words (e.g., think of the different meanings of "bass" in "I can hear bass sounds" and "I like grilled bass"). The idea follows the ESP game, and asks users to type alternate words for the given underlined word in a phrase. If two people agree, they move on. Taboo words appear when their usage does not allow the disambiguation of a word (e.g., the word is associated with two senses). The experimental results clearly showed that users learn over time and perform better.

Quality Management on Amazon Mechanical Turk 
Panagiotis G. Ipeirotis (New York University)
Foster Provost (New York University)
Jing Wang (New York University)

For many tasks on Mechanical Turk, there are spammers submitting wrong results. Using repeated labeling and an algorithm like Dawid and Skene, we can estimate the error rates of the workers. The question is, can we infer from the confusion matrices who is a spammer? Error rate alone is not enough: Spammers that put everything in the majority class have lower error rates than honest but imperfect workers. Also, biased workers who are systematically off (e.g., more conservative or more liberal than other workers) end up having very high error rates. The solution is to compensate for the errors and see what the assigned class looks like after the correction. If the corrected labels are concentrated in one class, the worker is good. If they are spread across all classes, the worker is bad.
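To make the "compensate and look at the spread" idea concrete, here is a minimal sketch. It assumes that the class priors and each worker's confusion matrix are already known (e.g., estimated by something like Dawid and Skene), and it measures spread with a simple 0/1-cost score (one minus the sum of squared posterior probabilities). The function names, the example matrices, and the choice of spread score are assumptions for illustration, not necessarily the exact formulation in the paper.

```python
def error_rate(confusion, priors):
    """Plain error rate: probability that the assigned label differs from the true class.

    confusion[k][l] = P(worker assigns label l | true class is k).
    """
    return sum(p * (1.0 - confusion[k][k]) for k, p in enumerate(priors))


def corrected_label(confusion, priors, label):
    """P(true class | worker assigned `label`), by Bayes' rule."""
    joint = [priors[k] * confusion[k][label] for k in range(len(priors))]
    total = sum(joint)
    return [j / total for j in joint]


def spread_score(confusion, priors):
    """Expected spread of the corrected labels (0 = always concentrated in one class)."""
    n = len(priors)
    score = 0.0
    for label in range(n):
        p_label = sum(priors[k] * confusion[k][label] for k in range(n))
        if p_label == 0:
            continue  # the worker never uses this label
        posterior = corrected_label(confusion, priors, label)
        score += p_label * (1.0 - sum(q * q for q in posterior))
    return score


priors = [0.8, 0.2]                    # hypothetical: class 0 is the majority class
spammer = [[1.0, 0.0], [1.0, 0.0]]     # always answers "class 0"
flipper = [[0.0, 1.0], [1.0, 0.0]]     # systematically biased: always the wrong label
honest = [[0.7, 0.3], [0.3, 0.7]]      # imperfect, but carries real signal

for name, worker in [("spammer", spammer), ("flipper", flipper), ("honest", honest)]:
    print(name, round(error_rate(worker, priors), 3), round(spread_score(worker, priors), 3))
```

With these made-up numbers the spammer has the lowest plain error rate, yet its corrected labels just reproduce the priors, so it gets the worst spread score. The systematically biased worker has the highest error rate but a spread of zero, since its answers pin down the true class completely once corrected.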

Exploring Iterative and Parallel Human Computation Processes
Greg Little (MIT) (*** presenting)
Lydia B. Chilton (University of Washington)
Max Goldman (MIT CSAIL)
Robert C. Miller (MIT CSAIL)

The TurKit toolkit introduced the idea of iterative tasks: iterative elimination voting, iterative tasks in which workers build on each other's results, and so on. This paper examines the outcomes of different task designs. Basic question: Does it make sense to run tasks in parallel, or does it make sense to let workers build on each other's results? For description of images, iterative tends to be better, as people really build on each other's results. Similarly for transcriptions of highly noisy results. However, for tasks with shorter answers (e.g., coming up with company names) there is an interesting tradeoff: The iterative process tends to have a higher average, but the parallel one has higher variance. If you are interested in the max and not in the average rating of the responses, then parallel is better. Iterative will find the consensus, but it will not be great. Parallel will generate some disasters, but also some gems. So if the goal is to find the "best", then parallel processes (i.e., independence) should work best. However, if you are afraid of disastrous outcomes, then workers should interact to eliminate outliers.
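The average-versus-max tradeoff is easy to see in a toy simulation. The sketch below draws ratings for a batch of answers from a normal distribution, once with a higher mean and low variance (standing in for the iterative process) and once with a lower mean and high variance (standing in for the parallel one). All the numbers (means, standard deviations, batch size) are invented for illustration and are not taken from the paper.

```python
import random
import statistics


def simulate(mean, stdev, n_answers=6, n_trials=2000, seed=0):
    """Return (average rating, average best-of-batch rating) over many simulated batches."""
    rng = random.Random(seed)
    batch_averages, batch_bests = [], []
    for _ in range(n_trials):
        ratings = [rng.gauss(mean, stdev) for _ in range(n_answers)]
        batch_averages.append(statistics.fmean(ratings))
        batch_bests.append(max(ratings))
    return statistics.fmean(batch_averages), statistics.fmean(batch_bests)


# Hypothetical parameters: iterative = higher mean, lower variance;
# parallel = lower mean, higher variance.
iterative_avg, iterative_best = simulate(mean=7.0, stdev=0.5, seed=1)
parallel_avg, parallel_best = simulate(mean=6.5, stdev=1.5, seed=2)

print("iterative: average %.2f, best of batch %.2f" % (iterative_avg, iterative_best))
print("parallel:  average %.2f, best of batch %.2f" % (parallel_avg, parallel_best))
```

Under these assumptions the iterative setting wins on the average rating, while the parallel setting wins on the best answer in each batch, which is the pattern described above for short-answer tasks.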

Toward Automatic Task Design: A Progress Report
Eric Huang (Harvard University)
Haoqi Zhang (Harvard University) (*** presenting)
David C. Parkes (Harvard University)
Krzysztof Z. Gajos (Harvard University)
Yiling Chen (Harvard University)

The final talk of the workshop focused on optimizing task design, an area that I see as having significant potential for follow-up work. The basic question asked is: How should we optimally design a task for crowdsourcing, given a set of constraints? What will generate the best quality? What design aspects will improve speed? In a sense, how can we start moving crowdsourcing from ad-hoc execution into a mode in which we specify the task, and a black box optimizer selects all the appropriate aspects of the design for us? The paper gave some first results on predicting the quality and quantity of tags assigned to an image and showed that designs that are predicted to be optimal before execution indeed perform much better than designs that are suboptimal.

12:00noon, Concluding Remarks

Yours truly, at the end, was assigned the task of coming up with conclusions and describing the overall themes. I think the keyword of this workshop was "Design". Design for individual tasks (either games or MTurk HITs), and design of processes for handling such crowdsourced tasks in marketplaces. One theme that I would have liked to see more of is incentive design to motivate people to participate and contribute. But I was very happy overall.

After the concluding remarks there was some discussion of quality control and examining the robustness of crowdsourcing systems to manipulation attacks. While we have no definite answers on how to guarantee protection from coordinated attacks in the absence of ground truth, in the current settings we rarely see extensive collusion and coordination across attackers. Most of the current spammers are there to make an easy buck, and not to spend extensive amounts of time trying to scam pennies. (There are better targets for that.)

Of course, having ground truth for verification of answers and for worker evaluation can help significantly in that respect: Luis von Ahn mentioned the attack on reCAPTCHA from the 4chan clan, which randomly entered the word "penis" as one of the two words, hoping to fill in the digitizations of books with the word "penis". (Given that they were failing in 50% of the attempts, it was easy to isolate them and remove their entries.)
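As a generic illustration of how ground truth helps here (this is not a description of how reCAPTCHA actually works internally), the sketch below assumes that every submission pairs an answer on a known control item with an answer on an unknown item. Users who fail the control too often are isolated, and all of their answers on the unknown items are dropped. The function name, the tuple format, and the 25% threshold are hypothetical.

```python
from collections import defaultdict


def trusted_answers(submissions, max_failure_rate=0.25):
    """Keep answers to unknown items only from users who reliably pass the control.

    submissions: list of (user, control_correct, unknown_answer) tuples
    (hypothetical format). The threshold is an arbitrary choice for this sketch.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for user, control_correct, _ in submissions:
        attempts[user] += 1
        if not control_correct:
            failures[user] += 1

    return [
        answer
        for user, control_correct, answer in submissions
        # Drop the single failed attempt, and everything from unreliable users.
        if control_correct and failures[user] / attempts[user] <= max_failure_rate
    ]


# Made-up data: user "b" fails the control half of the time, as in the attack above.
log = [
    ("a", True, "book"),
    ("a", True, "page"),
    ("b", False, "spam"),
    ("b", True, "spam"),
    ("a", True, "word"),
    ("a", True, "text"),
]
kept = trusted_answers(log)
```

A user who guesses which of the two items is the control will fail it about half of the time, so even a loose threshold separates such users from honest ones who only make the occasional typo.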

The problem of handling a completely anonymous crowd, without any ground truth knowledge, and getting good results is hard to solve. Perhaps some security people will need to take a look and examine the theoretical guarantees.