Thursday, February 5, 2009

Your estimated waiting time: Infinite!

In my last post I showed some statistics about the completion time of the tasks submitted on Mechanical Turk. After taking a more careful look, it seems that there is a serious problem with the MTurk marketplace: the average time for completing a task is actually infinite!

OK, got your attention?

Of course, I do not mean that nothing ever gets done on MTurk. I have completed plenty of tasks, and I am pretty sure other people did as well. However, there is a subtle problem indicating that the long term prospects for Mechanical Turk do not look good.

Let me explain my thinking.

By looking at the distribution of completion times, it was easy to visually inspect the count-count plot, and see that it is a power-law distribution. Having a power-law distribution of completion times (aka a "heavy tail" distribution) is by itself a pretty alarming fact. In contrast to the "well-behaving" systems with exponentially-distributed waiting times, a system with a heavy-tail distribution can frequently generate waiting times that are way above the average waiting time.

This is actually not a big problem by itself. Many computer systems tend to have workloads that have heavy tails of completion times, mainly due to the self-similar burstiness of the requests. Actually by looking at the distribution of the requests, it is often possible to infer the average completion time of a task, even under heavy-tail workloads.

So, here is the question of the day: What is the average completion time for a Mechanical Turk task?

The easy answer is to examine the completed past tasks and report the average time from the sample. Unfortunately, this is an incorrect answer: the sample mean and the distribution mean can be two very different values when dealing with power-law distributions.

Let's see how we can estimate the distribution of the task completion time, and (by extension) its mean and variance. We start by getting the count-count plot, which depicts on the x-axis the duration of a task in hours, and on the y-axis the number of tasks that required exactly that many hours for completion.


The first task is to take this (highly noisy) count-count plot, and convert it into the (complementary) CDF plot. We show below the revised plot, where the x-axis is again the duration of the task, and the y-axis is the probability $Pr(duration \geq x)$, i.e., the probability that a randomly picked task lasted longer than x.


The plot is now much clearer and has a power-lae structure for "small" completion times (i.e. less than 120 hours / 5 days). Then we see a drop as we approach longer task durations. This is actually an artifact of the "censoring" in the collected data sample. We have been observing the marketplace for a month now, so the samples does not contain longer completion times.

Now, given these values, how do we estimate the parameters of the power-law distribution and get the average completion time?

Since we have in our sample discrete completion times, the pdf of the power-law distribution is the Zeta distribution, with:

$Pr\{ duration = x \} = x^{-\alpha}/\zeta(\alpha)$

where $\zeta(\alpha) = \sum_{n=1}^\infty \frac{1}{n^\alpha}$ is the Riemann zeta function and $\alpha$ is the parameter of the distribution.

How can we estimate the parameter $\alpha$? As I have blogged previously, fitting a regression line on the plot is a bad idea! Instead it is much better to use the maximum likelihood estimator:

$\frac{\zeta^\prime(\hat{\alpha})}{\zeta(\hat{\alpha})} = -\frac{1}{n} \sum_{i=1}^{n} \ln(x_i)$

In our case, the $x_i$ values are the observed durations of the tasks.

I keep the task durations in a table Task_Duration(HITid, duration) in a relational database, so computing the value $\frac{1}{n} \sum_{i=1}^n \ln {x_i}$ is a simple query:
SELECT SUM(log(duration)*cnt)/SUM(cnt)
FROM (SELECT duration, COUNT(HITid) AS cnt
FROM Task_Duration GROUP BY duration) AS A

In the case of Mechanical Turk, we have:

$\frac{\zeta^\prime(\hat\alpha)}{\zeta(\hat{\alpha})} = -2.3926$

By looking up the table with the values of $\frac{\zeta^\prime(\hat\alpha)}{\zeta(\hat{\alpha})}$



we find that the most likely value for $\hat\alpha_{MTurk} = 1.35$.

And now the interesting part! A power-law distribution with $\alpha \leq 2$ does not have a well defined mean value! (It is infinite.) In such cases, the sample mean is never a good representation of the distribution mean. Instead, the mean value for the sample is expected to increase over time, without ever converging to a stable value.

Is this the case for Mechanical Turk? To check whether this is (so far) confirmed by the data collected until today, I plotted the average task duration for all tasks that finished in each date. Here is the alarming image:

In a stable system, you would expect the average completion time to increase initially (due to censoring effects, i.e. not observing long tasks) and then stabilize. However, what we can see is that indeed the average completion time is increasing, reinforcing the evidence that the system has a power-law distribution of completion times, with infinite mean. Not an encouraging sign at all!

I am now curious to investigate further the marketplace in terms of its queueing behavior. What is the arrival process? What is the serving process? What causes the bottlenecks? Is it the type of the tasks (boring, long tasks left to die?) or it is due to the behavior of the Turkers (doing a few HITs and then taking long breaks)?

Anyone more familiar with scheduling and queuing theory, please give pointers to the comments!