Monday, June 29, 2009

LiveBlogging from HCOMP 2009

I am now attending HCOMP2009, the workshop on Human Computation that I previously advertised in this blog.

Overall, the workshop was well-attended with more than 40 people in the audience, pretty much filling the room. Below you can find the notes that I was keeping during the workshop. (See also the associated tweets.) I hope that they capture nicely some of the discussion for those of you that could not be there.

######################################################################

The workshop started with introductions, and then Luis von Ahn described MonoLingo, a human-powered system for translation that uses people who know ONLY ONE language. MonoLingo relies on the fact that machine translation is pretty good at this point, but not perfect. So MonoLingo starts by translating each word using a dictionary, giving multiple candidate interpretations for each word. The human (a native speaker of the target language) then selects a translation for each word and forms the sentence that makes the most sense.
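
To make the mechanism concrete, here is a minimal sketch of the word-by-word step; the toy dictionary and the function name are my own illustration, not MonoLingo's actual data or code:

    # Toy sketch of the word-by-word candidate generation step.
    # The dictionary below is a made-up example, not MonoLingo's data.
    toy_dictionary = {
        "casa": ["house", "home"],
        "blanca": ["white", "blank"],
    }

    def candidate_translations(sentence, dictionary):
        """Return, for each source word, its list of possible target words."""
        return [dictionary.get(word, [word]) for word in sentence.split()]

    # A native speaker of the target language then picks one candidate per
    # word and rearranges the picks into a fluent sentence.
    print(candidate_translations("casa blanca", toy_dictionary))
    # [['house', 'home'], ['white', 'blank']]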

To give users incentives to participate, MonoLingo presents itself as a language-learning site: as users learn the language, they complete small translations posed as exercises. These small translations then contribute to a larger translation task.

The system reminded me a little bit of the crowdsourcing-powered New York startup SpeakLike, which I had seen at the New York Tech Meetup.

I was also wondering how to best integrate MonoLingo with more advanced machine translation systems, e.g., Google Translate, which generate significantly better translations than the word-by-word translations produced by MonoLingo. Perhaps a more iterative approach (e.g., using the TurKit toolkit and its ideas), where people take turns improving the overall translation, might work wonders for this task.

######################################################################

The second talk of the workshop described "Herd It: Designing a social game to tag music" by Luke Barrington, Douglas Turnbull, Damien O’Malley, and Gert Lanckriet.

HerdIt uses an active learning approach to tag music: users tag music online, a machine learning algorithm is trained on those tags to tag additional songs, and the process repeats. (The approach reminds me of our own KDD08 paper on active learning with multiple noisy labelers.)
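
Here is a rough sketch of one way such an active-learning loop could look, where the human effort is directed to the songs the model is least certain about; the helper functions are toy stand-ins of my own, not the HerdIt implementation:

    import random

    def train_tag_classifier(labeled):
        """Pretend model: it simply remembers the songs tagged so far."""
        return labeled

    def uncertainty(model, song):
        """Pretend uncertainty; a real system would use classifier confidence."""
        return random.random() if song not in model else 0.0

    def collect_tags_from_players(song):
        """In HerdIt this is the game itself; here we just fake a tag."""
        return ["rock"]

    def active_learning_loop(songs, rounds=3, k=2):
        labeled = {}
        for _ in range(rounds):
            model = train_tag_classifier(labeled)
            # Ask the players about the songs the model is least certain about.
            batch = sorted(songs, key=lambda s: -uncertainty(model, s))[:k]
            for song in batch:
                labeled[song] = collect_tags_from_players(song)
        return labeled

    print(active_learning_loop(["song%d" % i for i in range(6)]))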

The HerdIt game starts playing the music, and the players see bubbles floating around, each one with a tag (e.g., rock, pop, romantic, ballad). Players get more points when they hit the bubbles that correspond to the most popular tags. The authors also added quizzes to the game (e.g., a song plays and the question "Does the singer have long hair?" appears). For the quizzes, there is a parimutuel prediction market running in the background, where users bet on the different outcomes/answers and the winners split the common pool for the bet.
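
For readers unfamiliar with parimutuel markets: all bets go into a common pool, and the players who picked the winning answer split the pool in proportion to their stakes. A minimal sketch (my own illustration, not the authors' code):

    def parimutuel_payouts(bets, winning_answer):
        """bets: list of (player, answer, stake). Winners split the whole
        pool in proportion to their stakes; losers get nothing."""
        pool = sum(stake for _, _, stake in bets)
        winners = [(player, stake) for player, answer, stake in bets
                   if answer == winning_answer]
        winning_stake = sum(stake for _, stake in winners)
        if winning_stake == 0:
            return {}                  # nobody picked the winning answer
        return {player: pool * stake / winning_stake for player, stake in winners}

    # Example: three players bet on "Does the singer have long hair?"
    bets = [("alice", "yes", 10), ("bob", "no", 10), ("carol", "yes", 20)]
    print(parimutuel_payouts(bets, "yes"))
    # {'alice': 13.33..., 'carol': 26.66...}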

The HerdIt game reminds me a little bit of MajorMiner, a music game introduced by Michael Mandel, a classmate of mine at Columbia University.

######################################################################

The third talk described "KissKissBan: A Competitive Human Computation Game for Image Annotation" by Chien-Ju Ho, Tao-Hsuan Chang, Jong-Chuan Lee, Jane Yung-Jen Hsu, and Kuan-Ta Chen.

The authors build on and improve the ESP game. They try to solve the "coalition problem" (players develop a strategy of typing a "common" word, e.g., "the", and proceeding to the next image) and the problem of limited tag diversity when the user base is small.

The basic idea is to turn the 2-player ESP game into a 3-player game: a new type of player is introduced, named the "blocker". The blocker tries to prevent the other two players from proceeding by typing the "obvious taboo" words that they are likely to use to describe the depicted image. The more words the blocker catches, the better the blocker's score. This encourages tag diversity, since the two "matcher" players have to attempt non-obvious tags to describe the image. The blocker can also observe what words the matcher players use across images; so, if two players develop a cheating strategy and often type some unrelated word (e.g., "the"), the blocker can catch that behavior and put such words on the "blocked" list. KissKissBan uses a zero-sum approach in which the blocker gets the points that the matchers lose. So the blocker has an incentive to enter many words that the matchers will then use. I suspect that the blocker also loses points for entering words that are NOT used by the matcher players.
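
As a toy illustration of the zero-sum scoring, under my own assumptions about the point values (the paper's actual numbers and rules may differ):

    def score_round(matcher_words, blocked_words, points_per_word=10):
        """Toy zero-sum scoring: every matcher word that hits the blocked
        list transfers its points from the matchers to the blocker. The
        point value is an assumption, not KissKissBan's actual rule; the
        suspected penalty for unused blocked words is omitted."""
        hits = matcher_words & blocked_words
        matcher_score = points_per_word * len(matcher_words - blocked_words)
        blocker_score = points_per_word * len(hits)
        return matcher_score, blocker_score

    matchers = {"dog", "grass", "park"}
    blocker = {"dog", "animal"}
    print(score_round(matchers, blocker))   # (20, 10)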

Unfortunately, it does not seem to work if the matchers have access to a communication channel outside the game (e.g., Skype, or being in the same room). Also, I would have liked to see some discussion of the cheating strategies that can be used in such games.

######################################################################

The fourth talk was on the paper "Community-based Game Design: Experiments on Social Games for Commonsense Data Collection" by Yen-Ling Kuo, Kai-Yang Chiang, Cheng-Wei Chan, Jong-Chuan Lee, Rex Wang, Edward Yu-Te Shen, and Jane Yung-Jen Hsu.

The authors describe a game-based approach for building "common-sense" ontologies using "virtual pets" that are "fed knowledge" by the friends of the player (e.g., in a Facebook-like setting). The game has some quiz-like templates (e.g., "______--likes to--______"), which are then filled in by friends of the player with sensible values (e.g., "a student--likes to--have no homework"). To make it fun, the pets compete online by playing such quizzes and become smarter, collecting "smart points". The pets that have the most knowledge and give the most sensible answers appear on the leaderboards. The pets get points when they give the same answers as the pets of other owners. The game is named the "Rapport Game".
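
A minimal sketch of the template filling and the agreement-based scoring; the template, the answers, and the point rule below are my own guesses, not the actual Rapport Game implementation:

    from collections import Counter

    # Made-up template and answers, just to illustrate the mechanism.
    template = "a {x} likes to {y}"
    answers = [
        {"x": "student", "y": "have no homework"},   # friend 1
        {"x": "student", "y": "have no homework"},   # friend 2, matches friend 1
        {"x": "student", "y": "sleep late"},         # friend 3
    ]

    def smart_points(answers):
        """A pet earns a point whenever its answer matches another pet's
        answer (an output-agreement rule; the real scoring may differ)."""
        counts = Counter(template.format(**a) for a in answers)
        return {fact: count - 1 for fact, count in counts.items()}

    print(smart_points(answers))
    # {'a student likes to have no homework': 1, 'a student likes to sleep late': 0}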

It sounded fun, with a very distinct East Asian flavor, reminding me of a social-networking-enabled Tamagotchi...

######################################################################

After the break, we attended the demos and posters. Unfortunately, there was not enough time to really examine the large number of posters and demos that were available.

From the demos, I liked TurKit a lot, not so much for the library and toolkit, but mainly for introducing the idea of "iterative tasks". People who have used Mechanical Turk know that it is very hard to get workers to do a good job on tasks with free-form responses. Having a voting round after the submission (e.g., "Is this answer good or not?") discourages fraud but does not generate high-quality results; it simply generates results that are "good enough". TurKit introduces the idea of iterations, essentially soliciting multiple rounds of submissions, where workers vote on whether the "new" submission is better than the "old" one. Looking at the handwriting recognition task makes it pretty clear why the approach works well: the first submitter takes a first pass, leaving blanks wherever the transcription is difficult. The second submitter fills in (some of) the blanks, the third improves it further, and so on. When there is no further improvement, the task ends. Very nice idea; I will be using it in the future.
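
Here is a rough Python sketch of the iterate-then-vote pattern; the helper functions are hypothetical placeholders standing in for real Mechanical Turk calls, not TurKit's actual API:

    def iterative_transcription(image, improve, vote, max_rounds=5):
        """Generic iterate-then-vote loop in the spirit of TurKit's
        iterative tasks. `improve(image, current)` asks a worker for an
        improved transcription; `vote(image, old, new)` asks other workers
        whether the new one is better. Both are placeholders here."""
        current = ""
        for _ in range(max_rounds):
            proposal = improve(image, current)
            if vote(image, current, proposal):
                current = proposal       # keep the improvement
            else:
                break                    # no further improvement; stop
        return current

    # Tiny simulated run, just to show the control flow.
    drafts = iter(["Pl..se pi.k up milk", "Please pick up milk", "Please pick up milk"])
    result = iterative_transcription(
        image="note.png",
        improve=lambda img, cur: next(drafts),
        vote=lambda img, old, new: new != old,
    )
    print(result)   # Please pick up milk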

I also learned about the internal crowdsourcing efforts at IBM (basic conclusion: giving a prize like an iPod works much better than trying to motivate people through other marketing and management efforts... a good lesson for people trying to deploy enterprise prediction markets...).

######################################################################

In the second part of the workshop, we saw some discussion of the role of game theory in the design of human computation games.

The position paper by Shaili Jain and David Parkes "The Role of Game Theory in Human Computation Systems" gave an outline of promising directions for future research in this area.

The basic idea is that human computation may benefit from game-theoretic concepts when designing the games. The basic argument is that game theory has already solved similar problems in settings like P2P networks, where it has been used to avoid the problem of free riding. Shaili gave a brief introduction to the game-theoretic analysis of some existing games and systems (e.g., PhotoSlap, the ESP game, and Yahoo! Answers).

The talk advocated modeling the user actions (e.g., in the ESP game, players choose between "easy" and "difficult" words), the corresponding costs and benefits for the users, and how these affect the outcome of the game. The nice outcome is that game theory helps predict the equilibria of such games, essentially predicting what their stable states will be.
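
As a toy example of the kind of analysis advocated, here is a tiny two-player "word choice" game with made-up payoffs and a brute-force search for pure-strategy Nash equilibria:

    from itertools import product

    # Made-up payoffs for a 2-player ESP-like game: each player chooses to
    # type "easy" (obvious) or "hard" (descriptive) words. Matching on easy
    # words is quick but low value; matching on hard words is worth more.
    # (The numbers are illustrative, not from the paper.)
    payoffs = {
        ("easy", "easy"): (2, 2),
        ("easy", "hard"): (0, 0),
        ("hard", "easy"): (0, 0),
        ("hard", "hard"): (3, 3),
    }
    strategies = ["easy", "hard"]

    def pure_nash_equilibria(payoffs, strategies):
        """Return strategy profiles where neither player gains by deviating."""
        equilibria = []
        for s1, s2 in product(strategies, repeat=2):
            u1, u2 = payoffs[(s1, s2)]
            best1 = all(u1 >= payoffs[(d, s2)][0] for d in strategies)
            best2 = all(u2 >= payoffs[(s1, d)][1] for d in strategies)
            if best1 and best2:
                equilibria.append((s1, s2))
        return equilibria

    print(pure_nash_equilibria(payoffs, strategies))
    # [('easy', 'easy'), ('hard', 'hard')]

In this toy version, both "everyone types easy words" and "everyone types hard words" are equilibria, which is exactly the kind of insight such models give about where a game can get stuck.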

Some of the intriguing open questions:
  1. How to quantify "fun".
  2. How to introduce altruism into the models.
  3. How to build game-theoretic models for the games described in the GWAP taxonomy (output-agreement, input-agreement, and inversion-problem games).
  4. How to use game-theoretic concepts from "population games" to model large numbers of interacting agents.
The high-level conclusion of the talk: it is good to build a game-theoretic model of each game, so that we can see how robust the game is to perturbations of the design choices.

#####################################################################

The game-theory discussion continued with the paper "On Formal Models for Social Verification" by Chien-Ju Ho and Kuan-Ta Chen.

The authors described how to use game theory to show the effect of sequential vs. parallel verification. Parallel verification is the process in which two users submit an answer to a question and get a reward if their answers match. Sequential verification is the process in which a user submits an answer that needs to match a pre-given "correct" answer (e.g., "guess the original question for this answer", which allows a sanity check of the answers). The paper derives the equilibria that result from each mechanism.
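
A minimal sketch of the two reward rules being compared (my own simplification of the mechanisms, not the paper's model):

    def parallel_reward(answer_a, answer_b, reward=1):
        """Output agreement: both players are rewarded only if they match."""
        return (reward, reward) if answer_a == answer_b else (0, 0)

    def sequential_reward(answer, reference_answer, reward=1):
        """The new answer must match an answer collected earlier."""
        return reward if answer == reference_answer else 0

    print(parallel_reward("cat", "cat"))       # (1, 1)
    print(sequential_reward("cat", "dog"))     # 0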

######################################################################

The paper "Efficient Human Computation: the Distributed Labeling Problem" by Ran Gilad-Bachrach, Aharon Bar Hillel, and Liat Ein-dor tackled the following problem: Using humans we can collect labels and tags that describe an object (e.g., an image). When the number of possible labels is large, then we will start seeing consistency problems as different labelers are not coordinated. The labelers may provide correct answers but due to polysemy they end up giving superficially different labels, even though they mean the same thing ("truck" and "lorry"). Or due to homonymy they give the same label even though they mean different things (e.g. "greyhound" the dog, and "greyhound" the bus company).

The authors describe graph-theoretic algorithms that can be used to resolve such problems and provide bounds on the optimality of the proposed approaches. The question arose of how to deal with the fact that the same user may not be self-consistent over time, as well as how to handle users with varying degrees of reliability.
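
To give a flavor of the graph-theoretic angle (this is a naive illustration of my own, not the authors' algorithm): treat labels as nodes, connect two labels whenever they are applied to the same object, and merge connected components into candidate synonym sets.

    from collections import defaultdict
    from itertools import combinations

    # Toy data: object -> labels provided by different labelers.
    labels_per_object = {
        "img1": {"truck", "lorry"},
        "img2": {"lorry", "vehicle"},
        "img3": {"dog", "greyhound"},
    }

    def synonym_clusters(labels_per_object):
        """Naive merging: labels co-applied to the same object are linked,
        and connected components become candidate synonym sets. The paper's
        algorithms (and bounds) are more careful than this."""
        graph = defaultdict(set)
        for labels in labels_per_object.values():
            for a, b in combinations(labels, 2):
                graph[a].add(b)
                graph[b].add(a)
        seen, clusters = set(), []
        for label in graph:
            if label in seen:
                continue
            stack, component = [label], set()
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(graph[node] - component)
            seen |= component
            clusters.append(component)
        return clusters

    print(synonym_clusters(labels_per_object))
    # e.g., [{'truck', 'lorry', 'vehicle'}, {'dog', 'greyhound'}]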

######################################################################

The last talk of the workshop was "Financial Incentives and the Performance of Crowds" by Winter Mason and Duncan J. Watts. The authors examine how Turkers respond to various levels of payment. Confirming my own observations, Turkers generally do more work for higher pay but do not produce much better work in terms of quality. So paying more gets more things done, but the quality remains statistically the same. (My own observation is that HITs paying more than 50 cents also tend to attract spammers, effectively decreasing the overall quality!)

In a test of anchoring, the authors asked workers whether they felt they had been fairly compensated. Interestingly enough, they felt that they were underpaid, valuing their work 2-3 cents more per HIT than the actual payment; the difference was consistent across payment levels (e.g., a 5-cent HIT is valued at 10 cents, a 10-cent HIT at 15 cents, etc.). So it may make sense to start with a low payment for the HITs and, once the psychological anchor is established, increase the payment to satisfy the workers who felt that the initial payment was too low :-)

#####################################################################

The overall themes that emerged from the workshop were rather clear. On the one hand, there is the experimental side of human computation, where researchers devise new incentives for users to participate, new types of actions, and new modes of interaction. On the other hand, there is the more abstract/theoretical side, where researchers try to model these actions and incentives and examine what theory predicts about these designs. Finally, there is work that examines what to do with the noisy results generated by such games and systems: how can we best handle the noise and use the generated data for data mining purposes?

The final question over dinner was where to organize the next instance of the workshop. There are so many disciplines coming together in this line of work that finding a commonly acceptable venue might be a challenge. Will the CHI people continue coming to KDD? Or will the data miners attend CHI for this workshop? What about the WWW conference? Would it be a good match? I personally favor KDD, but we will have to wait and see the result of the human computation on this issue...

(We also have a wiki with the bibliography of the area at http://hcomp2009.wikispaces.com/, which we will have to curate and organize better to reflect the results of this workshop.)

Tuesday, June 9, 2009

Google Fusion Tables: Databases on the Cloud

From the Google Research Blog: Google Fusion Tables.

Now it is possible to upload tabular data sets to Google, let other people use the data, and provide easy-to-use visualizations. There are no complicated joins or other heavy-duty relational features, but there is functionality to connect (fuse) tables. There is also functionality for discussing the contents of a data set.

Here is an early example. I took the data from a survey of Mechanical Turkers and imported it into Google Fusion Tables. Here is the resulting intensity map that shows the distribution of workers per country:



and the "lift" of the distribution of workers per state (we are comparing actual population percentage with percentage of Turkers):



I am truly excited about this feature. Just the idea that it will be possible to release "live" data sets, without having to set up complicated web interfaces or worry about security, SQL injections, and so on, makes this absolutely wonderful for me.

For comparison, see the corresponding visualizations from Many Eyes:





But the flexibility of Google Fusion Tables for data management makes up for its relative lack of visualization options.

My only real complaint: the 100MB limit. I was ready to upload my Mechanical Turk archive (see the related blog post) and let other people use it. Unfortunately, it is larger than the 100MB limit. If only I could use the extra storage that I bought from Google for my Gmail account...

Tuesday, June 2, 2009

Acceptance Rate: 100%

I have been thinking lately about the concept of binary decisions, mainly in the context of paper reviewing. Most of the time, the decision for a paper is binary: Accept a paper or not.

Any binary decision that depends on some explicit or implicit threshold will always be problematic. Whatever is at threshold+epsilon gets in; whatever is at threshold-epsilon is out. An epsilon difference generates significantly different outcomes. To make things worse, the area around the threshold is typically densely populated. No matter where we put the threshold, papers with small differences in quality, even under perfect quality assessment, get very different treatments.

Here is an alternative. Allow all authors to decide whether to publish their papers or not. With one condition: they must also publish the reviews together with the paper. The paper got three strong rejects and the authors still want to publish it? Fine!

If such a system were in place, most authors would seek to get good reviews instead of merely trying to pass the threshold into publish-land.

Going a step further? Keep the revised versions of the paper together with the reviews of the earlier versions. Did the authors address the comments? They can include a statement describing how. Later reviewers can take a look and see whether this is the case. This policy would also encourage the submission of only high-quality results.

This could also be combined with a requirement that the reviews come from at least some high-quality reviewers, but I will leave this for another post.