When I joined Stern, I was glad to find out about the "behavioral lab," which makes recruiting users much easier. We just specify the target demographics on a web form, and a pool of 1,000 registered (real) volunteers is notified. This greatly simplifies the process and ensures that the participants are genuinely willing to take part in the experiments. Still, the process is tiring: someone has to wait in the lab for the users, give directions, pay the participants, and so on.
One interesting alternative is Amazon Mechanical Turk, a service introduced by Amazon in November 2005. Mechanical Turk allows requesters to post a large number of small tasks and pay a small amount to whoever completes each one; posting a task takes only a few API calls, as the sketch after the list shows. Examples of such tasks include:
- can you see a person in the photo?
- is the document relevant to a query?
- is the review of this product positive or negative?
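To make the mechanics concrete, here is a minimal sketch of how such a task might be posted with the (modern) boto3 MTurk client; the title, reward, and external URL below are placeholders, and the sandbox endpoint keeps the experiment free.

```python
import boto3

# Sandbox endpoint, so experimenting does not spend real money;
# drop endpoint_url to post to the live marketplace.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An ExternalQuestion points workers to a page we host ourselves;
# the URL is a placeholder.
question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/relevance-task?pair=123</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Is this document relevant to the query?",
    Description="Read a short document and judge whether it answers the query.",
    Keywords="relevance, labeling, search",
    Reward="0.02",                    # per completed assignment, in USD
    AssignmentDurationInSeconds=300,  # time a worker has to answer
    LifetimeInSeconds=24 * 60 * 60,   # how long the task stays posted
    Question=question_xml,
)
print("Posted HIT:", hit["HIT"]["HITId"])
```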
One of the problems we faced was uncertainty about the validity of the submitted answers. We had no way to tell whether an answer submitted by a "Turker" was carefully considered or just a random response.
To avoid this problem, we decided to collect multiple, redundant answers for each question. To have enough redundancy to judge agreement, we asked for five answers per question and marked an answer as correct only if at least four of the five agreed. Furthermore, to discourage users from submitting random responses, we clarified in the instructions that we would pay only for submissions that agree with the responses of the other annotators. This follows the spirit of the ESP game by Luis von Ahn and gives the Turkers the right incentive to submit correct answers. Even though this approach increases the cost, it ensures that the received answers are consistent and the level of noise is low.
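To make the rule explicit, here is a small sketch of the four-out-of-five vote (the question ids and labels below are made up); on MTurk, collecting the five redundant answers corresponds to requesting five assignments per HIT.

```python
from collections import Counter

def aggregate(answers, required_agreement=4):
    """Return the consensus answer for one question, or None.

    `answers` holds the (five, in our setup) responses collected for a
    single question; a label is accepted only if at least
    `required_agreement` of them agree.
    """
    if not answers:
        return None
    label, votes = Counter(answers).most_common(1)[0]
    return label if votes >= required_agreement else None

# Example: three questions, five redundant answers each.
responses = {
    "q1": ["relevant", "relevant", "relevant", "relevant", "irrelevant"],
    "q2": ["relevant", "irrelevant", "relevant", "irrelevant", "relevant"],
    "q3": ["positive", "positive", "positive", "positive", "positive"],
}
for qid, answers in responses.items():
    print(qid, "->", aggregate(answers))  # q2 has no 4/5 consensus -> None
```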
A second approach for minimizing noise in the answers is the use of "qualification tests." Instead of letting users submit answers directly, we wanted to see whether they are competent enough to participate in these experiments. For example, we had a task soliciting relevance feedback for sets of queries and documents. To make sure that users follow the instructions, we first asked them to submit answers for already-labeled query-document pairs (in this case, pairs coming from TREC collections). We also required annotators to retake the qualification tests if they wanted to label a large number of query-document pairs, so that annotators who submit many evaluations also pass a proportionally larger number of qualification tests. However, the qualification tests slow down the process by a factor of 3 to 4. (Nothing comes for free :-)
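As a sketch of this gating logic, the snippet below requires one extra passed test per batch of labeled pairs; the batch size of 100 is a made-up number for illustration, not the ratio we actually used.

```python
PAIRS_PER_TEST = 100  # hypothetical ratio: one passed test per 100 labeled pairs

def tests_required(num_pairs_labeled):
    """Number of qualification tests a worker must have passed before
    being allowed to label this many query-document pairs."""
    return 1 + num_pairs_labeled // PAIRS_PER_TEST

def may_label_more(num_pairs_labeled, tests_passed):
    """True if the worker can keep labeling, False if a new test is due."""
    return tests_passed >= tests_required(num_pairs_labeled)

print(may_label_more(num_pairs_labeled=50, tests_passed=1))    # True
print(may_label_more(num_pairs_labeled=250, tests_passed=2))   # False -> retake
```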
Overall, our experience with Mechanical Turk has been very positive. The interface is clean and easy to program, and the answers come back quickly. It is not uncommon to submit thousands of requests in the evening and have all the results ready by the following morning.
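Collecting the overnight results is then just a matter of polling for submitted assignments the next morning; a rough boto3 sketch (the HIT id is a placeholder) looks like this:

```python
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# hit_ids would be the ids returned by create_hit the evening before.
hit_ids = ["PLACEHOLDER_HIT_ID"]

for hit_id in hit_ids:
    result = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )
    for assignment in result["Assignments"]:
        # assignment["Answer"] is a QuestionFormAnswers XML blob to parse.
        print(assignment["WorkerId"], assignment["AssignmentId"])
        mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])
```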
Now, let's see if the reviewers will like the results :-)
Update (Nov 24, 2007): Our first paper that uses Amazon Mechanical Turk for its experimental evaluation has been accepted at IEEE ICDE 2008 and is now available online. Hopefully, more will follow soon.