Sunday, January 25, 2009

Time Out New York and Mechanical Turk

My wife was browsing through the latest issue of Time Out New York, whose general theme is "Make Money." The magazine lists various ways to earn some extra money. Smiling, she pointed me to the following entry:

So, yes, Amazon Mechanical Turk is now officially mainstream, together with other occupations and activities like "be a nude model," "walk dogs," "be a magician," and "donate eggs."

Saturday, January 24, 2009

Mechanical Turk, Human Subjects, and IRBs

Academics who engage in user studies often have to apply to an IRB (Institutional Review Board) for permission to do research with human subjects. These applications mainly target researchers in the biomedical sciences, but to be on the safe side many universities also require IRB approval for usability studies, or for any other research that obtains personally identifiable information from the participants.

So, the question is: does one need IRB approval to use Mechanical Turk? The answer is "it depends." My own take is that for the majority of the tasks posted on MTurk, IRB approval is not required. It is required only when someone studies the Turkers themselves, not when posting tasks to be completed by Turkers.

Here is what I have found from [1] and [2]:

Am I proposing Human Subjects Research?

Research is considered to involve human subjects when an investigator conducting research obtains (1) data through intervention or interaction with a living individual, or (2) identifiable private information about a living individual.
(f) Human subject means a living individual about whom an investigator (whether professional or student) conducting research obtains:
(1) Data through intervention or interaction with the individual, or
(2) Identifiable private information.
In the case of Mechanical Turk, we do not retrieve any identifiable private information about a living individual. So, to be engaged in human subjects research, we would need to collect data through intervention or interaction with a living individual.

Let's see the corresponding definitions for intervention and interaction.

Intervention, as it pertains to research involving human subjects (defined in 46.102 within the Human Subject definition): includes both physical procedures by which data are gathered (for example, venipuncture) and manipulations of the subject or the subject's environment that are performed for research purposes.

Interaction, as it pertains to research involving human subjects (defined in 46.102 within the Human Subject definition): includes communication or interpersonal contact between investigator and subject.

Intervention: My own take is that we do not have any intervention. We do not physically interact with the Turkers, and we do not modify the environment of the subjects for research purposes.

Interaction: My own take is that we do not have any interaction with the Turkers either, as we simply post tasks and ask them to complete them. If this counted as interaction, then any visit to our own web pages could be classified as interaction as well.

Of course, the above are my own points of view. I am not a lawyer, so my interpretations may be wrong. Judge for yourself whether they are correct. (And let me know in the comments if you disagree.)

Some additional information is also available from San Francisco State University:
Content Experts/Consultants/Key Informants

It may not be necessary to get human subjects approval if the interview questions are addressed to experts and concern a particular policy, agency, program, technology, technique, or best practice. The questions are not about the interviewees themselves, but rather about the external topic. For instance, the questions will not include demographic queries about age, education, income, or other personal information.

Human Subjects review will be required when a researcher is interviewing individuals about content, but there is a research question or hypothesis involved, or an “agenda.” The researcher intends to analyze and generalize the results, that is, look for common themes in the collected data, try to universalize the interviewees’ experiences, or quantify the results in some way.

Examples of content expert projects that may not require human subjects review:

In all the following examples, the questions are focused on the facts about the program, policy, software, curriculum, procedures or project. The researcher will simply report the facts as they are related by the content experts. You may not need to submit a protocol or an informed consent form for human subjects approval if:
  • you are interviewing managers in a company about their billing procedures, or their use of a particular software program, or
  • you are interviewing or surveying teachers about what should be included in the development of a particular curriculum unit, or
  • you are interviewing entrepreneurs about the obstacles they faced in starting their own businesses, and how they overcame them, or
  • you are asking a panel of nurses and doctors to review your antismoking program for teens for correct medical content, or
  • you are interviewing social agency directors about their client intake procedures.
So, if you are doing annotation work, where you do not ask for personal opinions but instead ask Turkers (or any users, really) to tell you something about the "true state of the world," then an IRB is not required. If, however, you ask about personal experiences (and you happen to be in academia), then it seems that IRB approval may be required.

Friday, January 23, 2009

How do you charge the Mechanical Turk expenses?

From time to time I get emails asking me how I deal with the university bureaucracy when using MTurk for my research. One common question is about budgeting and charging the expenses:
Under what line item in your budget do you put the Mechanical Turk expenses? Don't you have to go through your HR department to "hire" the Turkers?
My answer: I do not hire anyone. I am simply paying Amazon for some software calls, in the same way that I am paying Amazon for their EC2 service, their S3 service, and so on.

All I know is that I place some requests using software tools, and Amazon returns the answers. I have no idea who completes the request. OK, I know that someone with the id AC5CHSQTK8O9Z completed the request, but this may be an Amazon employee (temp, contractor, part-time, or full-time). That's why I pay Amazon, not worker AC5CHSQTK8O9Z. It is Amazon's job to send tax forms to AC5CHSQTK8O9Z, to make sure that AC5CHSQTK8O9Z can legally be employed and paid, and to deal with all the HR stuff.
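To make the "software calls" point concrete, here is a minimal sketch of what posting a task programmatically can look like, using Python and the boto3 MTurk client (a present-day SDK, not the tools I actually use); the title, reward, and question HTML are made-up values for illustration only.

```python
# A minimal sketch: posting a HIT through Amazon's API with boto3.
# The endpoint and all task details below are illustrative only.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint; drop this argument to post to the live marketplace.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <p>Is the sentiment of this review positive or negative?</p>
      <!-- the actual form fields would go here -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

response = mturk.create_hit(
    Title="Classify the sentiment of a short review",  # illustrative
    Description="Read a review and label it as positive or negative.",
    Keywords="sentiment, labeling",
    Reward="0.05",                    # dollars, passed as a string
    MaxAssignments=3,                 # distinct workers per item
    LifetimeInSeconds=3 * 24 * 3600,  # how long the HIT stays available
    AssignmentDurationInSeconds=600,  # time a worker has to complete it
    Question=question_xml,
)
print("Posted HIT", response["HIT"]["HITId"])
```

The point of the sketch is the billing model: Amazon charges my requester account for the rewards plus its fee, and all I ever see of the workers is an opaque id.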

As a parallel example, at Starbucks I know that Mary prepared my doppio espresso, but I do not hire Mary as my barista; I pay Starbucks, and Starbucks pays Mary.

So, in my budget it goes under "software" and is paid together with my expenses for Amazon EC2, S3, and so on. Incidentally, this is exactly how I handle the coders on Rent-A-Coder who prepare customized software for me: I pay Exhedra Solutions (the owner of Rent-A-Coder), and they in turn pay the contractors who participate in their marketplace. Again, the charge goes under "software," since what I am buying from them is, in effect, custom-made software.

Thursday, January 22, 2009

Soliciting Reviews on Mechanical Turk

A few days ago, Mechanical Turk was featured in a not-so-flattering story: Mike Bayard, a Business Development Representative working for Belkin, had posted HITs on Mechanical Turk asking Turkers to write 5-star reviews on Amazon for a set of Belkin products that were getting mainly negative reviews. Furthermore, he asked Turkers, after posting the 5/5 review, to vote the negative reviews that appeared on Amazon as "not helpful." The story was picked up by major tech sites (Gizmodo, Slashdot) and, fortunately, users put the blame on Belkin and not on Mechanical Turk.

However, this got me thinking: how often is Mechanical Turk used for this sort of activity? Fortunately, a few weeks back I set up a crawler that visits Mechanical Turk periodically and keeps track of the tasks posted there.
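(For the curious, such a crawler can be quite simple. Here is a rough sketch in Python; the listing URL, the HTML selectors, and the output format are placeholders for illustration, not the actual code of my crawler.)

```python
# A rough sketch of a periodic MTurk crawler. The URL and the parsing
# below are placeholders; the real listing pages differ in the details.
import json
import time
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://www.mturk.com/mturk/findhits"  # hypothetical endpoint

def snapshot():
    """Fetch the first page of HIT groups and return a list of records."""
    html = requests.get(LISTING_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # Placeholder selectors: assume each HIT group sits in an element with
    # class "hit-group" that contains a title and a reward.
    for group in soup.select(".hit-group"):
        records.append({
            "crawled_at": datetime.now(timezone.utc).isoformat(),
            "title": group.select_one(".title").get_text(strip=True),
            "reward": group.select_one(".reward").get_text(strip=True),
        })
    return records

if __name__ == "__main__":
    while True:
        with open("mturk_archive.jsonl", "a") as out:
            for record in snapshot():
                out.write(json.dumps(record) + "\n")
        time.sleep(3600)  # crawl once an hour
```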

So, in the first use of this MTurk archive, I dug in and tried to find review-related HITs. I discovered about 100 HIT groups posted over the last couple of weeks. Except for the three HITs posted by Mike Bayard, the rest did not seem to explicitly solicit positive reviews. In fact, most of the requests seem legitimate and, in my opinion, ethical. Yes, in an ideal world all websites would get millions of users submitting reviews and generating network effects, but sometimes it is better to pay for a review than to have no review at all!



(The reviews are listed in a table contained within an IFRAME; you will need to visit the blog page to see it, as RSS readers typically do not render IFRAMEs.)

Now, the question is: are these solicited, paid reviews better or worse than reviews posted by users without any financial incentive? I expect to have some results on that front rather soon.

Wednesday, January 21, 2009

How good are you, Turker?

One common question when working with Mechanical Turk is: "How good are the Turkers? Can I trust their answers?" In a previous post I gave some pointers to the existing literature on estimating Turker quality from the returned responses, and Bob Carpenter has also developed an excellent Bayesian framework for the same task.

This line of work assumes that the only thing we have available is the Turkers' responses for the task at hand, and potentially for previous tasks as well.

An alternative direction is to examine whether Turkers can self-report their own quality. To see whether this direction is promising, we ran the following experiment on Amazon Mechanical Turk: we picked 1000 movie reviews from the sentiment analysis dataset collected by Pang and Lee and posted them on Amazon Mechanical Turk.

We asked the participants on Mechanical Turk to read the text of a movie review and estimate the star rating (from 0.1 to 0.9) that the movie critic had assigned to the movie. We also asked them to self-report how difficult it was to estimate the rating, on a scale from 0 (easiest) to 4 (most difficult).

Our first results were encouraging: there is a significant correlation between the "true" rating assigned by the author of each review (not visible to the Mechanical Turk workers) and the average rating assigned by the labelers. Across the full dataset, the correlation was approximately 0.7, indicating that Mechanical Turk workers can recognize sentiment effectively. However, a correlation of 0.7 is far from perfect and indicates that there is still a significant amount of noise.
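(As a rough sketch, this is the kind of computation involved; the file and column names below are hypothetical, not our actual data files.)

```python
# Sketch: correlation between the true ratings and the average Turker ratings.
# The column names ("review_id", "true_rating", "worker_rating") are assumed.
import pandas as pd
from scipy.stats import pearsonr

labels = pd.read_csv("mturk_labels.csv")  # one row per (review, worker) label

# Average the workers' ratings for each review, keeping the true rating.
per_review = labels.groupby("review_id").agg(
    true_rating=("true_rating", "first"),
    avg_worker_rating=("worker_rating", "mean"),
)

r, p = pearsonr(per_review["true_rating"], per_review["avg_worker_rating"])
print(f"Pearson correlation: {r:.2f} (p = {p:.1g})")
```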

The interesting part, though, comes when we break down the responses by self-reported difficulty. The figure below shows the average labeler rating as a function of the true rating, broken down by different levels of self-reported difficulty ($D=0$ are the easiest, $D\geq 3$ are the hardest).

Computing the correlations separately for each level of difficulty, we get a correlation of 0.99 (!) for reported difficulty $D=0$, 0.68 for $D=1$, 0.44 for $D=2$, and just 0.17 for $D \geq 3$.
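(The per-difficulty numbers come from repeating the same computation within each difficulty bucket; here is a sketch, again with hypothetical column names and with a per-label "difficulty" column holding the worker's self-report.)

```python
# Sketch: repeat the correlation within each self-reported difficulty level.
# Assumes the same hypothetical columns as above, plus a per-label
# "difficulty" column (0 = easiest, 4 = hardest).
import pandas as pd
from scipy.stats import pearsonr

labels = pd.read_csv("mturk_labels.csv")
labels["bucket"] = labels["difficulty"].clip(upper=3)  # pool D >= 3 together

for bucket, group in labels.groupby("bucket"):
    per_review = group.groupby("review_id").agg(
        true_rating=("true_rating", "first"),
        avg_worker_rating=("worker_rating", "mean"),
    )
    r, _ = pearsonr(per_review["true_rating"], per_review["avg_worker_rating"])
    name = f"D>={bucket}" if bucket == 3 else f"D={bucket}"
    print(f"{name}: correlation {r:.2f}")
```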

In other words, Turkers can accurately self-report how difficult it is to label an example correctly! Since example difficulty and labeling quality are strongly interconnected, this also means that they are good at estimating their own quality. (Puzzled how we can infer worker quality when the workers report example difficulty? Think of a well-prepared student and a badly prepared one taking the same exam: the well-prepared student will find the exam "easy," while the badly prepared student will find it "difficult.")

So, instead of devising sophisticated algorithms to estimate labeler quality, we can simply ask the Turkers: "How good are you?"