Monday, February 23, 2009

Why Failing to Fail is a Failure (Prediction Market edition)

The Oscars brought a new wave of predictions, reviving the discussion about prediction markets a little bit. There is a lot of discussion today about who got it right, who got it wrong, and some silly bragging along the lines of "we got them all correct, we are SOOOO good".

No, Mr. Right, if you got them all right, then you are a failure. If your markets do not fail, they are a failure!

Let's look at the claims from HubDub, who believe that they nailed the results, in contrast to InTrade, which missed the "Best Actor" award. Here are the results for the six major categories from the two exchanges, together with the probabilities for the frontrunners:


Category             HubDub   InTrade
Best Picture         98%      90%
Best Director        76%      90%
Best Actor           63%      70.0% (wrong)
Best Actress         87%      85%
Best Sup. Actor      100%     95%
Best Sup. Actress    64%      58.8%



So the question is: How many contracts should they have gotten right in order to claim good accuracy?

The knee-jerk answer is "all of them". Unfortunately, that answer is also incorrect. Treating the prices as probabilities and assuming the categories are independent, the probability of getting all six frontrunners right is the product of the individual probabilities. According to the reported numbers, for HubDub the probability of getting all the answers correct is 0.98*0.76*0.63*0.87*1.0*0.64=0.26. For InTrade, the corresponding probability is 0.287.

Is there a more likely outcome? Yes. For HubDub, according to their own numbers, the most likely outcome was to pick 5 out of 6 frontrunners correctly, with probability 0.42. For InTrade, the most likely outcome was also to pick 5 out of 6 frontrunners correctly, with probability 0.43. Here is the complete table:


Successes   Probability HubDub   Probability InTrade
0           0                    2.00E-05
1           1.00E-04             3.00E-04
2           0.00524              0.00624
3           0.0595               0.05285
4           0.24713              0.22217
5           0.42641              0.431
6           0.26162              0.28742



And here is the expected number of successes and the respective probability for the two exchanges:

So, the most likely outcome for both exchanges was to guess 5 out of 6 correctly! Not 6 out of 6! In this respect, InTrade was actually better calibrated than HubDub, since it got 5 out of the 6 frontrunners right.

Just in case you want to run your own experiments, here is the short Java simulation:
public class PM_probability {

    // Simulate one Oscar night: each category is an independent coin flip
    // that succeeds with the probability reported by the exchange.
    public static int countSuccesses(double[] probabilities) {
        int successes = 0;
        for (double d : probabilities) {
            if (Math.random() < d) successes++;
        }
        return successes;
    }

    public static void main(String[] args) {
        // InTrade frontrunner prices; replace with the HubDub numbers to get the other column
        double[] prices = {0.90, 0.90, 0.75, 0.85, 0.95, 0.588};
        int[] histogram = new int[prices.length + 1];
        int N = 100000;
        for (int i = 0; i < N; i++) {
            histogram[countSuccesses(prices)]++;
        }
        // Empirical probability of each number of correct frontrunners
        for (int i = 0; i < histogram.length; i++) {
            System.out.println(i + "\t" + 1.0 * histogram[i] / N);
        }
    }
}
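If you prefer exact numbers over simulation, the same table can be computed with a small dynamic program (the "Poisson binomial" distribution). Here is a sketch; it reuses the InTrade prices from the simulation above, so swap in the HubDub numbers to reproduce the other column:

public class PM_exact {

    // Exact distribution of the number of correct frontrunners, for independent
    // events with different probabilities (the "Poisson binomial" distribution)
    public static double[] successDistribution(double[] probabilities) {
        double[] dist = new double[probabilities.length + 1];
        dist[0] = 1.0;  // before looking at any category: zero successes with certainty
        for (double p : probabilities) {
            // update from high to low so each category is counted exactly once
            for (int k = dist.length - 1; k >= 1; k--) {
                dist[k] = dist[k] * (1 - p) + dist[k - 1] * p;
            }
            dist[0] *= (1 - p);
        }
        return dist;
    }

    public static void main(String[] args) {
        // InTrade frontrunner prices, as in the simulation above
        double[] prices = {0.90, 0.90, 0.75, 0.85, 0.95, 0.588};
        double[] dist = successDistribution(prices);
        for (int k = 0; k < dist.length; k++) {
            System.out.println(k + "\t" + dist[k]);
        }
    }
}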

Guessing all the frontrunners correctly is something to brag about ONLY if the reported confidences are high enough. If they are not and you still get them all right, then the markets are biased and NOT accurate.

If your markets succeed more often than they should, then your markets have failed!

Tuesday, February 10, 2009

Luis von Blog

Via My Biased Coin: Luis von Ahn has started blogging. Be forewarned: if you do not have at least 30 minutes available, do not start reading!

I am wondering whether Luis managed to put the time I spent procrastinating on his blog to some productive use. Because, you know, I am a young assistant professor, and I should actually be writing papers and teaching instead of goofing around.

Thursday, February 5, 2009

Your estimated waiting time: Infinite!

In my last post I showed some statistics about the completion time of the tasks submitted on Mechanical Turk. After taking a more careful look, it seems that there is a serious problem with the MTurk marketplace: the average time for completing a task is actually infinite!

OK, got your attention?

Of course, I do not mean that nothing ever gets done on MTurk. I have completed plenty of tasks, and I am pretty sure other people have as well. However, there is a subtle problem indicating that the long-term prospects for Mechanical Turk do not look good.

Let me explain my thinking.

A visual inspection of the count-count plot of completion times suggests that they follow a power-law distribution. Having a power-law (aka "heavy-tailed") distribution of completion times is by itself a pretty alarming fact. In contrast to "well-behaved" systems with exponentially distributed waiting times, a system with a heavy-tailed distribution frequently generates waiting times that are far above the average waiting time.
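To get a feeling for how different the two cases are, here is a quick sketch (with made-up parameters, not MTurk data): it samples an exponential and a Pareto distribution, both scaled to have mean 1, and counts how often each produces a waiting time of more than 10 times the mean.

import java.util.Random;

public class TailComparison {

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 1000000;
        double alpha = 2.5;                 // made-up Pareto exponent (chosen so the mean is finite)
        double xm = (alpha - 1) / alpha;    // scale chosen so that the Pareto mean equals 1

        int expExtreme = 0, paretoExtreme = 0;
        for (int i = 0; i < n; i++) {
            double exp = -Math.log(1 - rnd.nextDouble());                    // exponential, mean 1
            double pareto = xm / Math.pow(1 - rnd.nextDouble(), 1 / alpha);  // Pareto, mean 1
            if (exp > 10) expExtreme++;        // waiting time more than 10x the mean
            if (pareto > 10) paretoExtreme++;
        }
        System.out.println("Exponential, Pr(wait > 10x mean): " + 1.0 * expExtreme / n);
        System.out.println("Pareto,      Pr(wait > 10x mean): " + 1.0 * paretoExtreme / n);
    }
}

Even with a relatively tame Pareto exponent, the heavy-tailed distribution generates such extreme waits an order of magnitude more often, and the gap widens quickly for more extreme thresholds.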

A heavy tail is actually not a big problem by itself. Many computer systems tend to have workloads with heavy-tailed completion times, mainly due to the self-similar burstiness of the requests. By looking at the distribution of the requests, it is often possible to infer the average completion time of a task, even under heavy-tailed workloads.

So, here is the question of the day: What is the average completion time for a Mechanical Turk task?

The easy answer is to examine the completed past tasks and report the average time from the sample. Unfortunately, this is an incorrect answer: the sample mean and the distribution mean can be two very different values when dealing with power-law distributions.
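To see this effect in action, here is a small sketch (synthetic data, not the MTurk measurements): it draws samples from a continuous power law with the exponent estimated later in the post and prints the running sample mean, which keeps drifting upwards instead of converging.

import java.util.Random;

public class RunningMean {

    public static void main(String[] args) {
        Random rnd = new Random(1);
        double alpha = 1.35;   // the exponent estimated for MTurk below; any alpha <= 2 behaves this way
        double sum = 0;

        for (int i = 1; i <= 10000000; i++) {
            // continuous power-law sample with density ~ x^(-alpha) on [1, infinity),
            // drawn with inverse-CDF sampling
            double x = Math.pow(1 - rnd.nextDouble(), -1.0 / (alpha - 1));
            sum += x;
            if (Integer.bitCount(i) == 1) {   // report at powers of two
                System.out.println(i + "\t running mean = " + sum / i);
            }
        }
    }
}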

Let's see how we can estimate the distribution of the task completion time, and (by extension) its mean and variance. We start by getting the count-count plot, which depicts on the x-axis the duration of a task in hours, and on the y-axis the number of tasks that required exactly that many hours for completion.


The first step is to take this (highly noisy) count-count plot and convert it into the (complementary) CDF plot. We show below the revised plot, where the x-axis is again the duration of the task, and the y-axis is the probability $Pr(duration \geq x)$, i.e., the probability that a randomly picked task lasted at least x hours.


The plot is now much clearer and has a power-law structure for "small" completion times (i.e., less than 120 hours / 5 days). Then we see a drop as we approach longer task durations. This is actually an artifact of the "censoring" in the collected data sample. We have been observing the marketplace for a month now, so the sample does not contain longer completion times.
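For completeness, going from the raw (duration, count) pairs to the complementary CDF takes only a few lines. A sketch (the toy counts in main are made up, just to show the computation):

import java.util.Map;
import java.util.TreeMap;

public class Ccdf {

    // counts.get(d) = number of tasks that took exactly d hours to complete
    public static TreeMap<Integer, Double> ccdf(TreeMap<Integer, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) total += c;

        TreeMap<Integer, Double> result = new TreeMap<>();
        long atLeast = total;   // number of tasks with duration >= the current x
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), 1.0 * atLeast / total);  // Pr(duration >= x)
            atLeast -= e.getValue();
        }
        return result;
    }

    public static void main(String[] args) {
        // made-up toy counts: 70 tasks of 1 hour, 20 of 2 hours, 10 of 5 hours
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        counts.put(1, 70);
        counts.put(2, 20);
        counts.put(5, 10);
        ccdf(counts).forEach((x, p) -> System.out.println(x + "\t" + p));
    }
}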

Now, given these values, how do we estimate the parameters of the power-law distribution and get the average completion time?

Since the completion times in our sample are discrete, the natural form of the power-law distribution is the Zeta distribution, with probability mass function:

$Pr\{ duration = x \} = x^{-\alpha}/\zeta(\alpha)$

where $\zeta(\alpha) = \sum_{n=1}^\infty \frac{1}{n^\alpha}$ is the Riemann zeta function and $\alpha$ is the parameter of the distribution.

How can we estimate the parameter $\alpha$? As I have blogged previously, fitting a regression line on the plot is a bad idea! Instead it is much better to use the maximum likelihood estimator:

$\frac{\zeta^\prime(\hat{\alpha})}{\zeta(\hat{\alpha})} = -\frac{1}{n} \sum_{i=1}^{n} \ln(x_i)$

In our case, the $x_i$ values are the observed durations of the tasks.

I keep the task durations in a table Task_Duration(HITid, duration) in a relational database, so computing the value $\frac{1}{n} \sum_{i=1}^n \ln {x_i}$ is a simple query:
SELECT SUM(log(duration) * cnt) / SUM(cnt)
FROM (SELECT duration, COUNT(HITid) AS cnt
      FROM Task_Duration
      GROUP BY duration) AS A

In the case of Mechanical Turk, we have:

$\frac{\zeta^\prime(\hat\alpha)}{\zeta(\hat{\alpha})} = -2.3926$

By looking up a table with the values of $\frac{\zeta^\prime(\hat\alpha)}{\zeta(\hat{\alpha})}$, we find that the most likely value is $\hat\alpha_{MTurk} = 1.35$.
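If you do not have such a table handy, you can also solve the equation numerically. Here is a rough sketch: it approximates $\zeta(\alpha)$ and $\zeta^\prime(\alpha)$ with truncated sums plus an integral tail correction, and bisects on $\alpha$ until the ratio matches the measured value of -2.3926.

public class ZetaMle {

    // zeta'(alpha)/zeta(alpha), approximated with truncated sums plus an integral tail correction
    static double zetaRatio(double alpha) {
        int N = 100000;
        double zeta = 0, zetaPrime = 0;
        for (int n = 1; n <= N; n++) {
            double term = Math.pow(n, -alpha);
            zeta += term;                       // sum of n^-alpha
            zetaPrime -= Math.log(n) * term;    // sum of -ln(n) * n^-alpha
        }
        double tail = Math.pow(N, 1 - alpha) / (alpha - 1);   // approximates sum_{n>N} n^-alpha
        zeta += tail;
        zetaPrime -= tail * (Math.log(N) + 1 / (alpha - 1));  // matching tail correction for zeta'
        return zetaPrime / zeta;
    }

    public static void main(String[] args) {
        double target = -2.3926;   // minus the average log-duration computed by the SQL query above
        double lo = 1.01, hi = 5.0;
        // zeta'(a)/zeta(a) increases from -infinity towards 0 as alpha grows, so bisection works
        for (int iter = 0; iter < 60; iter++) {
            double mid = (lo + hi) / 2;
            if (zetaRatio(mid) < target) lo = mid; else hi = mid;
        }
        System.out.println("alpha_hat ~ " + (lo + hi) / 2);
    }
}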

And now the interesting part! A power-law distribution with $\alpha \leq 2$ does not have a well-defined mean value! (It is infinite.) In such cases, the sample mean is never a good representation of the distribution mean. Instead, the sample mean is expected to keep increasing as we collect more data, without ever converging to a stable value.

Is this the case for Mechanical Turk? To check whether this is (so far) confirmed by the data collected until today, I plotted the average duration of all tasks that finished on each date. Here is the alarming image:

In a stable system, you would expect the average completion time to increase initially (due to censoring effects, i.e., not observing long tasks) and then stabilize. What we see instead is that the average completion time keeps increasing, reinforcing the evidence that the system has a power-law distribution of completion times with infinite mean. Not an encouraging sign at all!

I am now curious to investigate the marketplace further in terms of its queueing behavior. What is the arrival process? What is the service process? What causes the bottlenecks? Is it the type of the tasks (boring, long tasks left to die?) or is it due to the behavior of the Turkers (doing a few HITs and then taking long breaks)?

Anyone more familiar with scheduling and queueing theory, please leave pointers in the comments!

Tuesday, February 3, 2009

Duration of Tasks on Mechanical Turk

OK, time for some more statistics about Mechanical Turk. I wanted to see how long it takes for the different tasks posted on Mechanical Turk to be completed, so I looked at the tasks posted over the last month and measured how long each of them remained available.

I plotted the duration of the tasks in hours against the total number of HITs that took that many hours to be completed. What was the resulting distribution? The usual suspect: a power law! Here is the plot:

Here is also a dynamic visualization, using the Google Visualization API (so that it gets updated dynamically as I gather more data):


(Since most feed readers do not support IFRAMEs, those of you reading this post in a reader will need to visit the blog to see the scatterplot.)

Interestingly enough, I observed the same behavior for individual HITs as well: while many of them are completed within a short period of time, some take a very long time. I am not sure what exactly causes the long tails in this process, but it seems worth investigating.

Sunday, February 1, 2009

Monitoring the Dynamics of Mechanical Turk

Every time that I post a task on Mechanical Turk, I have the same experience: Turkers start working very fast on the posted HITs, and it seems that the task will be completed in a few hours. Then, as time progresses, the rate slows down.

So, I have been wondering why this happens. Did someone else post many HITs, burying my own? Does it make sense to post my HITs on some specific day, or at some specific time, when Turkers work a lot and there are only a few competing HITs?

To answer this question, I started collecting data from Mechanical Turk, so that I could examine the dynamics of the system. As a first outcome of this effort, I built a small dashboard that shows the number of posted projects, the number of available HITs (a project may have many HITs), and the total amount of rewards available on MTurk.



As a side note, it was a nice opportunity for me to actually write some code. I built the dashboard using the Google Visualization API (pretty cool!). Now I am learning about exporting data sets using the Google Data Source APIs, which will allow for easy embedding of the generated charts and will let third parties get direct access to my data.

Since this is the very first version of the dashboard, I welcome any comments. What else would you like to see? Is there something that you do not like? Let me know in the comments!

Is it unethical to pay for reviews?

My previous post about soliciting reviews using Mechanical Turk generated some reactions from the companies listed in the post. It seems that there is a perceived stigma associated with populating a website with paid reviews, and the companies felt obliged to clarify that they are soliciting unbiased reviews, even though that was already clear from the description of the task posted on Mechanical Turk.

This got me thinking: Why do we consider paid reviews to be inferior to their unpaid counterparts?

Yes, most of today's websites hosting reviews for anything under the sun would love to have armies of users contributing their reviews voluntarily. There is, however, a thing called "network effects": the website needs some content to attract users, who will in turn write additional reviews. But how do you get the reviews to start with? Paying for them seems like a legitimate approach, in my opinion.

Before the advent of the web, most "reviewers" (aka critics) were paid by their publishers to objectively evaluate the products in their area of interest. Nobody complained about the fact that the critics were being paid to write reviews.

The complaints came when the reviewer was biased, or (even worse) when the reviewer was paid to write a non-objective review, effectively deceiving the readers.

So, it is the same game with Mechanical Turk. There is nothing wrong with soliciting reviews in exchange for payment. However, the requester should make clear that the reviews have to be unbiased and should reflect the actual experience of the reviewer.

The outrage over Belkin was not due to the solicitation of reviews on Mechanical Turk. It was due to the solicitation of biased, positive reviews for a product that seemed to be a flop, which was equivalent to deceiving Belkin's customers.