Thursday, November 20, 2008

The Faces of Mechanical Turk

Andy Baio is yet another enthusiast of Mechanical Turk:
When you experiment with Amazon's Mechanical Turk, it feels like magic. You toss 500 questions into the ether, and the answers instantly start rolling in from anonymous workers around the world. It was great for getting work done, but who are these people? I've seen the demographics, but that was too abstract for me.

Last week, I started a new Turk experiment to answer two questions: what do these people look like, and how much does it cost for someone to reveal their face?

So Andy paid Turkers 50 cents to upload their photo of theirs with a handwritten note saying why they Turk. So here is how Turkers look like and why they Turk!

Needless to say, this is already hanging outside of my office door. This picture is going into a slide in my Mechanical Turk talk, so that I can give a good answer to the question "Who are these people and why do they do it?"

(via Brendan O'Connor)

Thursday, November 13, 2008

Social Annotation of the NYT Corpus?

While I am waiting for the arrival of the New York Times Annotated Corpus, I have been thinking about the different tasks that we could use the corpus for. For some tasks, we might have to run additional extraction systems, to identify entities that are not currently marked. So, for example, we could use the OpenCalais system to extract patent issuances, company legal issues, and so on.

And then, I realized that most probably, tens of other groups will end up doing the same, over and over again. So, why not run such tasks once, and store them for others to use? In other words, we could have a "wiki-style" contribution site, where different people could submit their annotations, letting other people use them. This would save a significant amount of computational and human resources. (Freebase is a good example of such an effort.)

Extending the idea even more, we could have reputational metrics around these annotations, where other people provide feedback on the accuracy, comprehensiveness, and general quality of the submitted annotations.

Is there any practical problem with the implementation of this idea? I understand that someone needs access to the corpus to start with, but I am trying to think of more high-level obstacles (e.g., copyright, or conflict with the interests of publishers)?

Wednesday, November 5, 2008

Use of Excel-generated HTML Considered Harmful

This was one of the most strange bugs that I had to resolve.

While I was writing the blog post about computing electoral correlations across states using prediction markets, I wanted to include a table with some results, to illustrate how different states are correlated.

So, I prepared the table in Excel, and then copied and pasted it on Blogger.

Then a strange thing happened: My Feedburner feed stopped working. Nobody received any updates, and suddenly the number of subscribers fell to zero.

Trying to figure out that was wrong, I got a message that my feed was bigger than 512Kb. Admittedly, my table was kind of big, with more than 300 entries. So, I decided to trim it down, to 30-50 rows.

After that fix my feed started working again.

I was still puzzled though why the problem did not appear earlier, given that I have written some pretty long posts (e.g., Why People Participate on Mechanical Turk?) and I never exceeded the 512Kb limit.

Well, the problem was not over. Even though my feed was working, the post about computing electoral correlations across states using prediction markets did not appear in Google Reader, and in other readers. However, the reader on my cell phone was displaying the post. Very very strange.

I followed all the troubleshooting steps on Feedburner, nothing.

So, I decided to take a closer look at the HTML source. I was in for a surprise! The table that I copied and pasted from Excel, had a seriously fat, ugly, and problematic HTML code.

As an example, instead of having a table cell written as "<td>NTH.DAKOTA</td>" it had the following code instead:
<td class="xl63" style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(149, 179, 215); border-width: medium medium 0.5pt; background: rgb(219, 229, 241) none repeat scroll 0% 0%; font-size: 10pt; color: black; font-weight: 400; text-decoration: none; font-family: Calibri; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;">NTH.DAKOTA</td>
This not only resulted in a seriously padded HTML, it was also generating validation problems, causing Google Reader to reject the post and not display it at all.

Solution? Nuking by regular expression. I replaced all the "<td [^>]+>" instances with "<td>", and I had a seriously trimmed table from 116Kb (!) to 7Kb. After that, Google picked the post within seconds....

Lesson? Never, ever use an Excel-generated table in Blogger. Or if you need to do that, make sure to remove all the fat...

Monday, November 3, 2008

Academic Trees

The summer of 2004, after completing my thesis, I found myself with plenty of time on my hands. So, I decided that it would be fun to research my academic genealogy. I knew the advisor of my advisor, Hector Garcia-Molina, and it was rather easy to find his advisor, Gio Widerhold. Gio had also listed John Amsden Starkweather as his own advisor.

Going beyond that was proven kind of difficult. I had to order the thesis of John Starkweather and see the dedication there: His advisor was Carl Porter Duncan. In a similar pattern, and spending considerable time at the library, I managed to dig my genealogy back to 1800's and to Hermann von Helmhotz. After that, I hit the entry at Chemical Genealogy and relied on the tree there.

After that, I posted my academic ancestors tree on my home page, letting friends and (academic) family know about it.

Today, through chain of events, I happened to run into that also contains my genealogy and goes back to 1000AD. By expanding the tree as much as possible, I managed to get a pretty impressive printouts, taking four 11x17 pages :-)

Until now, my tree was going back "only" to 1500's and to Pierre Richer de Belleval, who was teaching in Avignon, France. Now, I can proudly say that my tree goes back 1000AD, and its oldest roots are Greek Byzantines, including names such as Ioannis Mauropous, Michail Psellos, and Grigorios Palamas.

Accuracy of the information? I have no idea. But I have something to talk about when I go back to Greece for the winter break.

Sunday, November 2, 2008

Computing State Correlations in Elections Using (Only) Prediction Markets

(The post below is largely based on the work of Nikolay. I am responsible for all the mistakes.)

One thing that always puzzled me was how to compute correlations across electoral results in the state level. In 2006, there was some discussion about the accuracy of the prediction markets in predicting the outcome of the senate races. From the Computational Complexity blog:
For example, the markets gave a probability of winning 60% for each of Virginia and Missouri and the democrats needed both to take the senate. If these races were independent events, the probability that the democrats take both is 36% or a 64% chance of GOP senate control assuming no other surprises.
However, everyone will agree that races across states are not independent events. For example, if Obama wins Georgia (currently trading at 0.36), the probability of winning Ohio will be higher than 0.70, the current price of the Ohio.DEM contract. (As we will see later in the post, the price of Ohio, given that Democrats win Georgia, is close to 0.81.)

So, what we would like to estimate is, for two states A and B, what is the probability $Pr\{A|B\}$ that a candidate wins state $A$, given that the candidate won state $B$.

One way to model these dependencies is to run conditional prediction markets but this leads to an explosion of possible contracts. Participation and liquidity is not great even in the current markets for state-level elections, there is little hope that combinatorial markets will attract significant interest.

Another way to compute these probabilities is to use and expand the model described in my previous post about modeling volatility in prediction markets. Let's see how to do that.

Expressing Conditional Contracts using Ability Processes

Following this model, for each state we have a "ability difference processes" $S(t)$ that tracks the difference in abilities of the two candidates. If at expiration time $T$, $S(T)$ is positive, the candidate wins the state; otherwise the candidate loses. So, we can write:

$$Pr\{A|B\} = Pr\{ S_A(T) \geq 0 | S_B(T) \geq 0 F(t) \}$$

where $F(t)$ is the information available at time $t$. Using Bayes rule:

$Pr\{A|B\} = \frac{Pr\{ S_A(T)\geq 0, S_B(T)\geq 0 | F(t) \}}{Pr\{ S_B(T)\geq 0 | F(t) \}}$

In the equation above, $Pr\{ S_B(T)\geq 0 | F(t) \}$ is simply the price of the contract for state $B$ at time $t$, i.e., $\pi_B(t)$.

Pricing Joint Contracts using Ito Diffusions

The challenging term is the price of the joint contract $Pr\{ S_A(T)\geq 0, S_B(T)\geq 0 | F(t) \}$.

To price this contract, we generalize the Brownian motion model, and we assume that the joint movement of $S_A(t)$ and $S_B(t)$ is a 2d Brownian motion. Of course, we do not want the 2d motion to be independent! Since the $S_A(t)$ and $S_B(t)$ represent the abilities, they are correlated! So, we assume that the two Brownian motions have some (unknown) correlation $\rho$. Intuitively, if they are perfectly correlated, when $S_A(t)$ goes up, then $S_B(t)$ goes up by the same amount. If they are not correlated, the movement of $S_A$ does not give any information about the movement of $S_B$, and in this case $Pr\{A|B\} = Pr\{A\}$

Without going into much details, in this model the price of the joint contract is:

$\pi_{AB}(t) = Pr\{ S_A(T)>0, S_B(T)>0 | F(t) \} = N_\rho ( N^{-1}(\pi_A(t)), N^{-1}(\pi_B(t)) )$

where $N_\rho$ is the CDF of the standard bivariate normal distribution with correlation $\rho$ and $N^{-1}$ is the inverse CDF of the standard normal. Intuitively, this is nothing more than the generalization of the result that we presented in the previous post.

However, the big question remains: How do we compute the value $\rho$?

Now the neat part: We can infer $\rho$ by observing the past time series of the two state-level contracts. Why is that?

First of all, we know that the price changes of the contracts are given by:

$d\pi(t) = V(\pi(t), t) dW$, which gives $dW = \frac{d\pi(t)}{ V( \pi(t), t)} $,

We can observe $d\pi(t)$ over time. We also know that $V( \pi(t), t) = \frac{1}{\sqrt{T-t}} \cdot \varphi( N^{-1}( \pi(t) ) )$ is the instantaneous volatility of the a contract trading at price $\pi(t)$ at time $t$.

So essentially we take the price differences over time and we normalize them by the expected volatility. This process generates the "normalized" changes in abilities, over time and across states.

Therefore, we can now use standard correlation measures of time series to infer the hidden correlation of the ability processes. (And then compute the conditional probability.) If the two ability processes were powered by independent Brownian motions $W_A$ and $W_B$, then $dW_A$ and $dW_B$ would not exhibit any correlation. If the two processes are correlated, then we can measure their cross-correlation by observing their past behavior.

Now, by definition of cross-correlation we get:

$\rho \approx \Sigma_{i=o}^t \frac{ (\pi_A(i+1) - \pi_A(i)) \cdot (\pi_B(i+1) - \pi_B(i)) }{ V(\pi_A(i), i) \cdot V(\pi_B(i), i) }$

Conditional Probabilities using InTrade

OK, if you stayed with me so long, here are some of the strong correlations as observed and computed based on the InTrade data. How to read the table? If Democrats win state B, what is the probability Pr(A|B) that they will also win state A? To make comparisons easy, we also list the current price of the contracts A and B. The "lift" shows how much the conditional probability increases compared to the base probability. I skipped the cases when a state has very high probability, i.e., above 0.9 (as they are either uninformative) or very low probability, i.e., less than 0.2 (as they are highly unlikely to happen). I also list only state pairs with lift larger than 1.10. You can also get the list as an Excel spreadsheet.