Thursday, March 19, 2009

Fitting a Power-Law with Censored Data

A few days back, I was describing how to model the distribution of waiting times for Mechanical Turk. There, I described how to use the maximum likelihood estimator to infer the parameter of the power-law, using as input to the estimator the duration of the completed tasks.

However, this approach introduces a bias: There are tasks that we observed that have not finished yet. Take a look for example at the two distribution plots for the completion time of tasks on Mechanical Turk:

You see that there is a "cutoff" in the pdf plot at some point between 512 and 1024 hours. Similarly, we see a sudden drop in the cdf plot. This is just the effect of keeping only the completed tasks, ignoring the tasks that are still running.

By using only the completed tasks, we effectively ignore the information provided by the incomplete tasks: knowing that a task has not finished after running for $U$ hours, gives us information that the duration of the task will be at least $U$: This is valuable information. Such data points are called censored. Let me give some background on censored data for readers not familiar with the concept.

What do you mean by calling a data point censored?

In general, we call a data point censored when we cannot get the exact value for the data point, but we can provide a lower bound for its value, or an upper value, or both. According to the type of bounds, the data points are classified as: right censored (lower bound), left censored (upper bound), and interval censored (both lower and upper bound).

By far the most typical type of censored data is the right censored data. The duration of the unfinished tasks on Mechanical Turk are right censored. In a salary survey if the income is marked as "100K+ per year" is also a right censored data point. The lifetime of a patient treated by a drug who is still alive at the end of the study is also a right censored data point.

A left censored point is an observation for which we know its maximum possible value. In a salary survey if the income is marked as "Less than 10K+ per year" this is a left censored data point.


How can we modify a maximum likelihood estimator to use censored data points?

Deriving the simple MLE estimator (no censored data points)

Let's see first how we derive a maximum likelihood estimator. I will use as an example the continuous power-law distribution but the process is largely the same for other distributions as well. For the continuous power law, the probability density function is

$Pr(x=x_i) = (\alpha-1) \cdot {x_i}^{-\alpha}$

First, let's say that we observe $n$ data points with values $x_1, \ldots, x_n$. We assume that the data points come indeed from a continuous power-law and we are trying to estimate the most likely parameter $\alpha$ that generated these points.

The likelihood of observing these data points is the probability of seeing the the value $x_1$ and the value $x_2$ and the value $x_3$, ... and the value $x_n$. So, assuming independence across the points, the likelihood function is:

$L(\alpha) = \prod_{i=1}^n Pr(x=x_i) $

By substituting the value $Pr(x=x_i)$ with the power-law instantiation, we have:

$L(\alpha) = \prod_{i=1}^n (\alpha-1) \cdot {x_i}^{-\alpha}$

We need to find the $\alpha$ that maximizes $l(\alpha)$. Instead of maximizing $l(\alpha)$ directly, we opt to maximize the logarithm which is also maximized at the same value, and it much easier to work with analytically:

$l(\alpha) = \ln(L(\alpha)) = \sum_{i=1}^n \ln\left( (\alpha-1) \cdot {x_i}^{-\alpha} \right)$
which results in:
$l(\alpha) = n \ln(\alpha-1) - \alpha \sum_{i=1}^n \log(x_i)$

To find the value of $\alpha$ that maximizes $l(\alpha)$, we take the derivative with respect to $\alpha$ and set the derivative equal to 0:

$\frac{d}{d \alpha}l(\alpha) = 0$

so

$n \frac{1}{\alpha-1} - \sum_{i=1}^n \log(x_i) = 0$

So the MLE estimator is:

$\alpha = 1 + n \left[\sum_{i=1}^n \ln(x_i) \right]^{-1}$

Let's move on now to the case of censored data.

Deriving the MLE estimator with right censored data

Now, let's say that we observe $n$+$m$ data points drawn from the distribution: We have $n$ completed tasks, with duration values $x_1, \ldots, x_n$ and $m$ non-completed tasks that had duration $u_1, \ldots, u_m$ when we observed them but they were still going on (and so their final duration is going to be longer than the currently observed one --- right censored points).

Again, we assume that the data points come indeed from a continuous power-law and we are trying to estimate the most likely parameter $\alpha$ that generated these points.

Now the likelihood of observing these data points is slightly different. We need to accommodate the fact that the duration for the censored observations will be longer than the currently observed one. So we get this version, (for more details see also the related Wikipedia entry) boldfacing the part that deals with the censored data:

$L(\alpha) = \prod_{i=1}^n Pr(x=x_i) \cdot$ $ \mathbf{\prod_{i=1}^m Pr(x>u_i)}$

Since $Pr(x>u_i) = {u_i}^{-(\alpha-1)}$ the likelihood becomes:

$L(\alpha) = \prod_{i=1}^n (\alpha-1) {x_i}^{-\alpha} \cdot \prod_{i=1}^m {u_i}^{-\alpha+1}$

Analogously to the case above, we take the log:

$l(\alpha) = \ln(L(\alpha))$

which becomes:

$l(\alpha) =n \ln(\alpha-1) -\alpha \sum_{i=1}^n \ln(x_i) + \sum_{i=1}^m \ln(u_i) -\alpha \sum_{i=1}^m \ln(u_i)$

and set the derivative equal to 0:

$\frac{d}{d \alpha}l(\alpha) = 0$
$n \frac{1}{\alpha-1} -\sum_{i=1}^n \ln(x_i) -\sum_{i=1}^m \ln(u_i) = 0$

So, here is the MLE estimator that uses censored data points as well:

$\alpha= 1 + n \left[ \sum_{i=1}^n \ln(x_i) +\sum_{i=1}^m \ln(u_i)\right]^{-1}$

As you will see the MLE estimator with censored data also contains the factor $\sum_{i=1}^m \ln(u_i) \right$ but the normalizing constant is still $n$ and not $n+m$.

So, what is the difference? For power laws, the higher the fraction of censored points, the lower the exponent, i.e., the tail is heavier: an expected outcome as the censored data points tend to represent unobserved tail points. For example, for Mechanical Turk, the exponent without censored data points is 1.35 and with censored data points is 1.29. Not a big difference in this case, but there are situations where this correction can change significantly the overall analysis (e.g., if the exponent is close to 2).

Monday, March 16, 2009

Turker Demographics vs Internet Demographics

My post about the demographics of Mechanical Turk workers is one of the most popular posts on this blog, and I often see the results being discussed in other forums as well. One of the common questions that I see is: "How these demographics compare to the general Internet population?"

Apparently, this is not a very easy question to answer. It is relatively hard to find publicly available data about the demographics of Internet users. Fortunately, I found some data from ComScore, dating back to June 2008. I also had data about Mechanical Turk from two separate surveys that I ran on October 2008 and on December 2008 (both asking 1000 Turkers). The results across the two MTurk surveys were rather consistent, indicating that the results are rather trustworthy.

So, how Turkers compare to the general US Internet population? The short answers:
  • Turkers are younger. 54% of Turkers are between 21-35 years old, compared to 22% of the general population.
  • Turkers are mainly female. 70% of the Turkers are female, compared to 50% of the general population.
  • Turkers have lower income. 65% of Turkers have household income less than 60K, compared to 45% of the general population.
  • Turkers have smaller families. 55% of Turkers do not have children, compared to the 40% of the general population.
  • Geographical distribution of Turkers and Internet users is similar.
  • Race composition of Turkers and Internet users is similar, although there are slightly more Asians on Mechanical Turk.
Of course, the last two bullets may be simply the result of the first: Younger people have lower income, do not have children, and live in smaller households.

For those of you that would like to have a more detailed look at the statistics, here is the corresponding table:


June 2008 October 2008 December 2008

Internet Turks Turks
Total Audience 100 100 100
Persons - Age

Persons: 15+ 85.9 100 100
Persons: 18+ 80.1 99.6 99.5
Persons: 21+ 74.3 92.9 91.1
Persons: 35+ 52.4 39.3 37.1
Persons: 50+ 24.3 11.2 10.7
Persons: 55+ 16.2 5.2 5.4
Persons: 2-11 9.5 0 0
Persons: 2-17 19.9 0.2 0.4
Persons: 6-11 7.4 0 0
Persons: 6-14 12 0 0
Persons: 9-14 8.9 0 0
Persons: 12-17 10.4 0.2 0.4
Persons: 12-24 22.9 19 21.5
Persons: 12-34 38 57.8 60
Persons: 12-49 66.2 87.4 88.2
Persons: 18-24 12.5 18.7 21.1
Persons: 18-34 27.6 57.5 59.7
Persons: 18-49 55.8 87.2 87.8
Persons: 21-34 21.9 53.3 53.9
Persons: 21-49 50 82.9 82
Persons: 25-34 15.1 38.8 38.6
Persons: 25-49 43.2 68.4 66.7
Persons: 25-54 51.3 75.2 72.3
Persons: 35-44 18.7 22.4 21.5
Persons: 35-49 28.2 29.7 28.1
Persons: 35-54 36.2 36.4 33.7
Persons: 35-64 46.8 41.4 38.8
Persons: 45-54 17.6 14 12.2
Persons: 45-64 28.1 19 17.4
Persons: 55-64 10.5 5 5.2
Persons: 65+ 5.7 0.7 1.1
Males - Age

All Males 49.5 28 36.6
Male: 15+ 42.1 28 36.6
Male: 18+ 39.1 27.8 36.3
Male: 21+ 36.1 24.7 32.4
Male: 35+ 25.7 9.5 11.3
Male: 50+ 12 2.8 2.6
Male: 55+ 8.1 1.4 1.1
Male: 2-11 4.9 0 0
Male: 2-17 10.4 0.1 0.2
Male: 6-11 3.9 0 0
Male: 6-14 6.3 0 0
Male: 9-14 4.5 0 0
Male: 12-17 5.5 0.1 0.2
Male: 12-24 11.6 7.5 9.1
Male: 12-34 18.9 17.3 24.2
Male: 12-49 32.5 25 33.9
Male: 18-24 6.1 7.4 8.9
Male: 18-34 13.4 17.2 23.9
Male: 18-49 27.1 24.9 33.7
Male: 21-34 10.4 15.2 21.1
Male: 21-49 24.1 22.9 30.8
Males: 25-34 7.3 9.8 15
Male: 25-49 20.9 17.6 24.8
Male: 25-54 24.8 19 26.3
Males: 35-44 9.1 6 8
Male: 35-49 13.7 7.7 9.7
Male: 35-54 17.5 9.1 11.2
Male: 35-64 22.6 10.6 12.3
Male: 45-54 8.4 3.1 3.2
Male: 45-64 13.5 4.5 4.3
Males: 55-64 5.1 1.4 1.1
Males: 65+ 3 0 0.1
Females - Age

All Females 50.5 72 63.4
Female: 15+ 43.8 72 63.4
Female: 18+ 41 71.9 63.3
Female: 21+ 38.2 68.2 58.7
Female: 35+ 26.8 29.8 25.8
Female: 50+ 12.3 8.3 8.1
Female: 55+ 8.1 3.8 4.3
Female: 2-11 4.6 0 0
Female: 2-17 9.5 0.1 0.1
Female: 6-11 3.6 0 0
Female: 6-14 5.7 0 0
Female: 9-14 4.5 0 0
Female: 12-17 4.9 0.1 0.1
Female: 12-24 11.3 11.5 12.3
Female: 12-34 19.1 40.5 35.9
Female: 12-49 33.6 62.4 54.3
Female: 18-24 6.4 11.5 12.2
Female: 18-34 14.2 40.5 35.8
Female: 18-49 28.7 62.4 54.1
Female: 21-34 11.5 38.1 32.8
Female: 21-49 25.9 60 51.2
Females: 25-34 7.8 28.9 23.6
Female: 25-49 22.3 50.9 41.9
Female: 25-54 26.5 56.2 46
Females: 35-44 9.5 16.4 13.4
Female: 35-49 14.5 21.9 18.4
Female: 35-54 18.7 27.3 22.4
Female: 35-64 24.1 30.8 26.5
Female: 45-54 9.2 10.9 9
Female: 45-64 14.6 14.5 13.1
Females: 55-64 5.4 3.6 4.1
Females: 65+ 2.6 0.7 1
HH Income (US)

HHI USD: Less than 15,000 6 11.4 12.9
HHI US: Under $25K 9.3 22.8 23.1
HHI US: Under $60K 44.5 64.8 60.5
HHI US: $60K+ 55.5 34.8 39.1
HHI US: $75K+ 43 22.7 27.5
HHI USD: 15,000 - 24,999 3.4 11.4 10.1
HHI USD: 25,000 - 39,999 9.9 21.8 18.9
HHI USD: 40,000 - 59,999 25.3 20.2 18.6
HHI USD: 60,000 - 74,999 12.6 12.1 11.6
HHI USD: 75,000 - 99,999 17.7 10.2 11.5
HHI USD: 100,000 or more 25.3 12.5 16
Region (US)

Region US:West North Central 7.6 5.8 7.5
Region US:Mountain 6.9 6.4 7.4
Region US:Pacific 15.4 13.3 15.7
Region US:New England 5.5 6.4 4.7
Region US:Mid Atlantic 14.2 13.9 15.8
Region US:South Atlantic 18.7 19.2 19.9
Region US:East South Central 5.1 8.3 5.2
Region US:West South Central 10.5 10.7 9
Region US:East North Central 16.1 15.7 14.8
Children


Children:No 39.3 52.7 57.6
Children:Yes 60.7 47.3 42.3
HH Size


HH Size: 1 4.4 17.7 17.3
HH Size: 2 24.2 28.9 30.6
HH Size: 3 21.4 19.7 19.2
HH Size: 4 25.3 20.5 21.9
HH Size: 5+ 24.8 12.9 10.7
HH Size: 1-2 28.5 46.6 47.8
HH Size: 3+ 71.5 33.5 32.7
Race


Race:White 87.3 82.7 82
Race:Black 8 6.5 5.3
Race:Asian 1.6 5.7 6.8
Race:Other 3.1 4.9 5.8








Wednesday, March 11, 2009

Mechanical Turk: Profitable or Not?

I was chatting with a couple of students of mine, who are trying to build a startup using Mechanical Turk. So, they asked me what is the value of the tasks that are being posted every day on Mechanical Turk.

Since I have the archive of the tasks posted on MTurk for the last couple of months, this was an easy question to answer. A simple query on the database, find the new HITs posted every hour, group by day, and here is the plot:

Long story short, the average value of HITs posted in any day is around \$2000. I have not analyzed the distribution of the values, but it seems to be (not surprisingly) a power-law or a lognormal.

The 2K/day value means that the average revenue per day for Amazon is around 200 per day (10% of the requester's payment), or 6K/month. This hardly covers the expense of dedicating a developer to the service!

It seems that Mechanical Turk is not generating any significant revenue for Amazon. It is also unlikely that it generates any profit. And we know these days what happens to products that are not generating profits...

Thankfully Amazon uses Mechanical Turk for its own purposes, so there are second-degree benefits for Amazon to keep the service around. I truly hope though that the service will attract more customers soon.

Update: I should also clarify that my figures are slight underestimates of the actual figures: I can only "see" Mechanical Turk through the eyes of an average worker. So I cannot see if a requester asks multiple people to complete the same HIT and, sometimes, I cannot observe the details for HITs for which I have not passed the qualification test. I still think that the numbers will be at the same order of magnitude.

Update 2: If you want more details see my previous post on the dynamics of Mechanical Turk and the chart at http://hyperion.stern.nyu.edu/mturk/ that shows the available HITs and rewards at any given time.

Friday, March 6, 2009

Human Computation Workshop (HCOMP 2009)

Are you interested in Mechanical Turk and applications? Do you think that the ESP Game and ReCAPTHA's are great ideas and have some ideas that you want to share with the world? Or you are just interested in learning about human computation in general?

Now you can! We (an embarrassingly long list of organizers) are organizing a half-day workshop at KDD 2009 in Paris this year. The workshop will be on June 28th, and submissions are due on April 18, 2009 8pm Eastern Time. You can find more details about the submission requirements and find other useful information at http://www.hcomp2009.org

Below you can is the complete Call for Papers, for those that enjoy reading such stuff:
Human Computation Workshop (HCOMP 2009)
KDD-09 Workshop, Paris France
June 28, 2009

We invite you to participate in the first annual Human Computation Workshop
(HCOMP 2009), to be held on June 28th in conjunction with the KDD-09
conference (http://www.sigkdd.org/kdd2009/) in Paris, France.

Human computation is a new research area that studies the process of
channeling the vast internet population to perform tasks or provide data
towards solving difficult problems that no known efficient computer
algorithms can yet solve. The goal of this half-day workshop is to bring
together academic and industry researchers in a stimulating discussion of
existing human computation applications (e.g. games, CAPTCHAs, Mechanical
Turk) and future directions of this new subject area. We solicit papers
related to various aspects of both general human computation techniques and
specific applications, e.g. general design principles; implementation;
cost-benefit analysis; theoretical approaches; privacy and security
concerns; and incorporation of machine learning / artificial intelligence
techniques. An integral part of this workshop will be a demo session where
participants can showcase their human computation applications.

Detailed information about the workshop and submission procedures can be
found at http://www.hcomp2009.org. We solicit long papers (at most nine
pages), short papers (at most four pages) and demos. Demo submissions must
include either a previously published paper or a one page extended abstract.
Deadline for submission is April 18, 2009 8pm Eastern Time.

Organizing Committee
---------------------
Paul Bennett (Microsoft Research)
Raman Chandrasekar (Microsoft Research)
Max Chickering (Microsoft Live Labs)
Panos Ipeirotis (New York University)
Edith Law (Carnegie Mellon University)
Anton Mityagin (Microsoft Live Labs)
Foster Provost (New York University)
Luis von Ahn (Carnegie Mellon University)

Positive or Negative? The Economist, Prediction Markets, and Mechanical Turk

A few days back, the Economist published a short story about prediction markets, saying that the concept has not taken off as many people were expecting. Chris Masse from MidasOracle got a little obsessed about the piece, publishing many blog posting about the article, and even making calls to everyone to post answers and discussion in their blog. Everyone who just read the blog post would think that the Economist article was pretty much tearing down the concept of prediction markets, their practicality, and their usefulness.

Michael Giberson actually responded saying that the article was pretty much a run-of-the mill article, and there was nothing to respond to and that the article was not even negative. Chris listed a long list of argument why the article was negative.

OK, so is it positive or negative? Well, let the crowd decide. I posted the article on Mechanical Turk, asking 100 Turkers to read the article and rate it in a scale from 0 (most negative) to 10 (most positive) about the sentiment of the article towards prediction markets.

Here are the results:


Well, the average was a 5.8/10, meaning that the average detected sentiment was pretty much neutral with some hints on positivity.

The lesson: never let your own judgment drive your decisions. Measure and analyze the data!

Sorry Chris! Too much drama for nothing!