Thursday, November 20, 2008

The Faces of Mechanical Turk

Andy Baio is yet another enthusiast of Mechanical Turk:
When you experiment with Amazon's Mechanical Turk, it feels like magic. You toss 500 questions into the ether, and the answers instantly start rolling in from anonymous workers around the world. It was great for getting work done, but who are these people? I've seen the demographics, but that was too abstract for me.

Last week, I started a new Turk experiment to answer two questions: what do these people look like, and how much does it cost for someone to reveal their face?

So Andy paid Turkers 50 cents to upload their photo of theirs with a handwritten note saying why they Turk. So here is how Turkers look like and why they Turk!

Needless to say, this is already hanging outside of my office door. This picture is going into a slide in my Mechanical Turk talk, so that I can give a good answer to the question "Who are these people and why do they do it?"

(via Brendan O'Connor)

Thursday, November 13, 2008

Social Annotation of the NYT Corpus?

While I am waiting for the arrival of the New York Times Annotated Corpus, I have been thinking about the different tasks that we could use the corpus for. For some tasks, we might have to run additional extraction systems, to identify entities that are not currently marked. So, for example, we could use the OpenCalais system to extract patent issuances, company legal issues, and so on.

And then, I realized that most probably, tens of other groups will end up doing the same, over and over again. So, why not run such tasks once, and store them for others to use? In other words, we could have a "wiki-style" contribution site, where different people could submit their annotations, letting other people use them. This would save a significant amount of computational and human resources. (Freebase is a good example of such an effort.)

Extending the idea even more, we could have reputational metrics around these annotations, where other people provide feedback on the accuracy, comprehensiveness, and general quality of the submitted annotations.

Is there any practical problem with the implementation of this idea? I understand that someone needs access to the corpus to start with, but I am trying to think of more high-level obstacles (e.g., copyright, or conflict with the interests of publishers)?

Wednesday, November 5, 2008

Use of Excel-generated HTML Considered Harmful

This was one of the most strange bugs that I had to resolve.

While I was writing the blog post about computing electoral correlations across states using prediction markets, I wanted to include a table with some results, to illustrate how different states are correlated.

So, I prepared the table in Excel, and then copied and pasted it on Blogger.

Then a strange thing happened: My Feedburner feed stopped working. Nobody received any updates, and suddenly the number of subscribers fell to zero.

Trying to figure out that was wrong, I got a message that my feed was bigger than 512Kb. Admittedly, my table was kind of big, with more than 300 entries. So, I decided to trim it down, to 30-50 rows.

After that fix my feed started working again.

I was still puzzled though why the problem did not appear earlier, given that I have written some pretty long posts (e.g., Why People Participate on Mechanical Turk?) and I never exceeded the 512Kb limit.

Well, the problem was not over. Even though my feed was working, the post about computing electoral correlations across states using prediction markets did not appear in Google Reader, and in other readers. However, the reader on my cell phone was displaying the post. Very very strange.

I followed all the troubleshooting steps on Feedburner, nothing.

So, I decided to take a closer look at the HTML source. I was in for a surprise! The table that I copied and pasted from Excel, had a seriously fat, ugly, and problematic HTML code.

As an example, instead of having a table cell written as "<td>NTH.DAKOTA</td>" it had the following code instead:
<td class="xl63" style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(149, 179, 215); border-width: medium medium 0.5pt; background: rgb(219, 229, 241) none repeat scroll 0% 0%; font-size: 10pt; color: black; font-weight: 400; text-decoration: none; font-family: Calibri; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;">NTH.DAKOTA</td>
This not only resulted in a seriously padded HTML, it was also generating validation problems, causing Google Reader to reject the post and not display it at all.

Solution? Nuking by regular expression. I replaced all the "<td [^>]+>" instances with "<td>", and I had a seriously trimmed table from 116Kb (!) to 7Kb. After that, Google picked the post within seconds....

Lesson? Never, ever use an Excel-generated table in Blogger. Or if you need to do that, make sure to remove all the fat...

Monday, November 3, 2008

Academic Trees

The summer of 2004, after completing my thesis, I found myself with plenty of time on my hands. So, I decided that it would be fun to research my academic genealogy. I knew the advisor of my advisor, Hector Garcia-Molina, and it was rather easy to find his advisor, Gio Widerhold. Gio had also listed John Amsden Starkweather as his own advisor.

Going beyond that was proven kind of difficult. I had to order the thesis of John Starkweather and see the dedication there: His advisor was Carl Porter Duncan. In a similar pattern, and spending considerable time at the library, I managed to dig my genealogy back to 1800's and to Hermann von Helmhotz. After that, I hit the entry at Chemical Genealogy and relied on the tree there.

After that, I posted my academic ancestors tree on my home page, letting friends and (academic) family know about it.

Today, through chain of events, I happened to run into that also contains my genealogy and goes back to 1000AD. By expanding the tree as much as possible, I managed to get a pretty impressive printouts, taking four 11x17 pages :-)

Until now, my tree was going back "only" to 1500's and to Pierre Richer de Belleval, who was teaching in Avignon, France. Now, I can proudly say that my tree goes back 1000AD, and its oldest roots are Greek Byzantines, including names such as Ioannis Mauropous, Michail Psellos, and Grigorios Palamas.

Accuracy of the information? I have no idea. But I have something to talk about when I go back to Greece for the winter break.

Sunday, November 2, 2008

Computing State Correlations in Elections Using (Only) Prediction Markets

(The post below is largely based on the work of Nikolay. I am responsible for all the mistakes.)

One thing that always puzzled me was how to compute correlations across electoral results in the state level. In 2006, there was some discussion about the accuracy of the prediction markets in predicting the outcome of the senate races. From the Computational Complexity blog:
For example, the markets gave a probability of winning 60% for each of Virginia and Missouri and the democrats needed both to take the senate. If these races were independent events, the probability that the democrats take both is 36% or a 64% chance of GOP senate control assuming no other surprises.
However, everyone will agree that races across states are not independent events. For example, if Obama wins Georgia (currently trading at 0.36), the probability of winning Ohio will be higher than 0.70, the current price of the Ohio.DEM contract. (As we will see later in the post, the price of Ohio, given that Democrats win Georgia, is close to 0.81.)

So, what we would like to estimate is, for two states A and B, what is the probability $Pr\{A|B\}$ that a candidate wins state $A$, given that the candidate won state $B$.

One way to model these dependencies is to run conditional prediction markets but this leads to an explosion of possible contracts. Participation and liquidity is not great even in the current markets for state-level elections, there is little hope that combinatorial markets will attract significant interest.

Another way to compute these probabilities is to use and expand the model described in my previous post about modeling volatility in prediction markets. Let's see how to do that.

Expressing Conditional Contracts using Ability Processes

Following this model, for each state we have a "ability difference processes" $S(t)$ that tracks the difference in abilities of the two candidates. If at expiration time $T$, $S(T)$ is positive, the candidate wins the state; otherwise the candidate loses. So, we can write:

$$Pr\{A|B\} = Pr\{ S_A(T) \geq 0 | S_B(T) \geq 0 F(t) \}$$

where $F(t)$ is the information available at time $t$. Using Bayes rule:

$Pr\{A|B\} = \frac{Pr\{ S_A(T)\geq 0, S_B(T)\geq 0 | F(t) \}}{Pr\{ S_B(T)\geq 0 | F(t) \}}$

In the equation above, $Pr\{ S_B(T)\geq 0 | F(t) \}$ is simply the price of the contract for state $B$ at time $t$, i.e., $\pi_B(t)$.

Pricing Joint Contracts using Ito Diffusions

The challenging term is the price of the joint contract $Pr\{ S_A(T)\geq 0, S_B(T)\geq 0 | F(t) \}$.

To price this contract, we generalize the Brownian motion model, and we assume that the joint movement of $S_A(t)$ and $S_B(t)$ is a 2d Brownian motion. Of course, we do not want the 2d motion to be independent! Since the $S_A(t)$ and $S_B(t)$ represent the abilities, they are correlated! So, we assume that the two Brownian motions have some (unknown) correlation $\rho$. Intuitively, if they are perfectly correlated, when $S_A(t)$ goes up, then $S_B(t)$ goes up by the same amount. If they are not correlated, the movement of $S_A$ does not give any information about the movement of $S_B$, and in this case $Pr\{A|B\} = Pr\{A\}$

Without going into much details, in this model the price of the joint contract is:

$\pi_{AB}(t) = Pr\{ S_A(T)>0, S_B(T)>0 | F(t) \} = N_\rho ( N^{-1}(\pi_A(t)), N^{-1}(\pi_B(t)) )$

where $N_\rho$ is the CDF of the standard bivariate normal distribution with correlation $\rho$ and $N^{-1}$ is the inverse CDF of the standard normal. Intuitively, this is nothing more than the generalization of the result that we presented in the previous post.

However, the big question remains: How do we compute the value $\rho$?

Now the neat part: We can infer $\rho$ by observing the past time series of the two state-level contracts. Why is that?

First of all, we know that the price changes of the contracts are given by:

$d\pi(t) = V(\pi(t), t) dW$, which gives $dW = \frac{d\pi(t)}{ V( \pi(t), t)} $,

We can observe $d\pi(t)$ over time. We also know that $V( \pi(t), t) = \frac{1}{\sqrt{T-t}} \cdot \varphi( N^{-1}( \pi(t) ) )$ is the instantaneous volatility of the a contract trading at price $\pi(t)$ at time $t$.

So essentially we take the price differences over time and we normalize them by the expected volatility. This process generates the "normalized" changes in abilities, over time and across states.

Therefore, we can now use standard correlation measures of time series to infer the hidden correlation of the ability processes. (And then compute the conditional probability.) If the two ability processes were powered by independent Brownian motions $W_A$ and $W_B$, then $dW_A$ and $dW_B$ would not exhibit any correlation. If the two processes are correlated, then we can measure their cross-correlation by observing their past behavior.

Now, by definition of cross-correlation we get:

$\rho \approx \Sigma_{i=o}^t \frac{ (\pi_A(i+1) - \pi_A(i)) \cdot (\pi_B(i+1) - \pi_B(i)) }{ V(\pi_A(i), i) \cdot V(\pi_B(i), i) }$

Conditional Probabilities using InTrade

OK, if you stayed with me so long, here are some of the strong correlations as observed and computed based on the InTrade data. How to read the table? If Democrats win state B, what is the probability Pr(A|B) that they will also win state A? To make comparisons easy, we also list the current price of the contracts A and B. The "lift" shows how much the conditional probability increases compared to the base probability. I skipped the cases when a state has very high probability, i.e., above 0.9 (as they are either uninformative) or very low probability, i.e., less than 0.2 (as they are highly unlikely to happen). I also list only state pairs with lift larger than 1.10. You can also get the list as an Excel spreadsheet.


Friday, October 31, 2008

The New York Times Annotated Corpus

Last week, I was invited to give a talk at a conference at the New York Public Library, about the preservation of news. I talked about our research in the Economining project, where we are trying to find the "economic value" of textual content on the Internet.

As part of the presentation, I discussed some problems that I had in the past with obtaining well-organized news corpora that are both comprehensive and also easily accessible using standard tools. Factiva has an excellent database of articles, exported in a richly annotated XML format but unfortunately Factiva prohibits data mining of the content of its archives.

The librarians in the conference were very helpful in offerring suggestions and acknowledging that providing content for data mining purposes should be one of the goals of any preservation effort.

So, yesterday I received an email from Dorothy Carner informing me about the availability of The New York Times Corpus, a corpus of 1.8 million articles from The New York Times, dating from 1987 until 2007. The details are available from but let me repeat some of the interesting facts here (the emphasis below is mine):

The New York Times Annotated Corpus is a collection of over 1.8 million articles annotated with rich metadata published by The New York Times between January 1, 1987 and July 19, 2007.

With over 650,000 individually written summaries and 1.5 million manually tagged articles, The New York Times Annotated Corpus has the potential to be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction.

The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF). Developed by a consortium of the world’s major news agencies, NITF is an internationally recognized standard for representing the content and structure of news documents. To learn more about NITF please visit the NITF website.

Highlights of The New York Times Annotated Corpus include:

  • Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
  • Over 650,000 article summaries written by the staff of The New York Times Index Department.
  • Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
  • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at
  • Java tools for parsing corpus documents from xml into a memory resident object.

To learn more about The New York Times Annotated Corpus please read the PDF Overview.

Yes, 1.8 million articles, in richly annotated XML, with summaries, with hierarchically categorized articles, and with verified annotations of people, locations, and organizations! Expect the corpus to be a de facto standard for many text-centric research efforts! Hopefully more organizations are going to follow the example of New York Times and we are going to see such publicly available corpora from other high-quality sources. (I know that Associated Press has an archive of almost 1Tb of text, in computerized form, and hopefully we will see something similar from them as well.)

How can you get the corpus? It is available from LDC, for 300 USD for non-members; members should get this for free.

I am looking forward to receiving the corpus and start playing!

Monday, October 20, 2008

Modeling Volatility in Prediction Markets, Part II

In the previous post, I described how we can estimate the volatility of prediction markets using additional prediction market contracts, aka options on prediction markets. I finished indicating that techniques that can be used to price options for stocks, are not directly applicable in the prediction market context.

Now, I will review a different modeling approach that builds on the spirit of Black-Scholes but is properly adapted for the prediction market context. This model has been developed by Nikolay, and is described in the paper "Modeling Volatility in Prediction Markets".

Modeling Prediction Markets as Competitions

Let's consider the simple case of a contract with a binary outcome. For example, who will win the presidential election? McCain or Obama?

The basic modeling idea is to assume that each competing party has an ability $S_i(t)$ that evolves over time , moving as a Brownian motion. (A simplified example of such ability would be the number of voters for a party, the number of points in a sports game, and so on.) At the expiration of the contract at time $T$ , the party $i$ with the higher ability $S_i(T)$ wins.

Actually, to have a more general case, we can use a generalized form of the Brownian motion, an Ito diffusion, that allows for the abilities to have a drift $\mu_i$ over time (i.e., the average rate of growth), and different volatilities $\sigma_i$ . The quantity that we need to monitor is the difference of the two ability processes $S(t)=S_1(t)-S_2(t)$ . If at the expiration of the contract at time $T$ we have $S(T)>0$ , then party 1 wins. If $S(T)$ is less than 0, then party 2 wins. Interestingly, the difference $S(t)$ is also an Ito diffusion, with $\mu=\mu_1-\mu_2$ , $\sigma=\sqrt{\sigma_1^2+\sigma_2^2-2\rho \sigma_1 \sigma_2}$ , where $\rho$ is the correlation of the two ability processes. Under this scenario, the price of the contract $\pi(t)$ at time $t$ is:

$\pi(t) = Pr\{ S(T)>0 | S(t) \}$

which can be written as:

$\pi(t) = N\Big(\frac{S(t) + \mu \cdot (T-t)}{\sigma \cdot \sqrt{T-t} } \Big)$

where $N(x) =\frac{1}{2} \Big[ 1 + erf\Big( \frac{x}{\sqrt{2}} \Big) \Big]$ is the CDF of the normal distribution with mean 0, and standard deviation 1 and $erf(x)$ is the error function. Notice that as time $t$ gets closer to the expiration, the denominator gets close to 0, which makes the ratio closer to $\infty$ or $-\infty$, and price $\pi(t)$ gets close to 0 or 1. However, if $S(t)$ is close to 0 (i.e., the two parties are almost equivalent), then we observe increasingly higher instability as we get close to expiration, as small changes in the difference $S(t)$ can have a significant effect in the outcome.

For example, consider two parties: party 1 with an ability that has positive drift $\mu_1=0.2$ and volatility $\sigma_1=0.3$, and party 2 with negative drift $\mu_2=-0.2$ and higher volatility $\sigma_2=0.6$. In this case, assuming no correlation, the difference is a diffusion with drift $\mu=0.4$ and volatility $\sigma=0.67$. Here is one scenario of the evolution, and below you can see the price of the contract, as time evolves.

As you may observe from the example, the red line (party 1) is for the most time above the blue line (party 2), which causes the green line (the difference) to be above 0. As the contract gets close to expiration, the contract gets closer and closer to 1 (i.e., party 1 will win). Close to the end, the blue line catches up, which causes the prediction market contract to have a big swing from almost 1 to 0.5, but then swings back up as party 1 finally finishes at the expiration above party 2.

So far, we generated a nice simulation but our results depend on knowing the parameters of the underlying "ability processes". Since we never get to observe these values, what is the use of all this exercise?
Well, the interesting thing is that by using the price function, we can now proceed to derive its volatility. Without going into the details, we can prove that the volatility of the prediction market contract is:

$V(t) = \rac{1}{\sqrt{T-t}} \cdot \varphi( N^{-1}( \pi(t) ) )$

where $N^{-1}(x)$ is the inverse CDF of the standard normal distribution and $\varphi(x)=\frac{exp( (-x^2)/2)}{\sqrt{2\pi}}$ is the density of the standard normal distribution.

In other words, volatility depends only on the current price of the contract and time to expiration! Anything else is irrelevant! Drifts do not matter: they are priced already in the current price of the contract, since we know where the drift will lead at expiration. The magnitude of the volatilities are also priced into the current contract price: higher volatilities cause the contract price to get closer to 0.5, as it is easier for $S(t)$ to move above and below 0 when it has high volatility. Furthermore, the direction of the volatilities of the underlying abilities is indifferent as they can move the difference into either direction with equal probability. (The only assumption is that the volatilities of the underlying abilities processes do not change over time.)

Volatility Surface

So, what this model implies for the volatility of the prediction markets? First of all, the model says that volatility increases as we move closer to the expiration, as long as the price of the contract is not 0 or 1. For example, assuming that now we have $t=0$ and expiration is at $T=1$, the volatility is expected to increase as follows:

So, how volatility changes with different contract prices? As you can see, volatility is highest when the contract trades at around 0.5, and gets close to 0 when price is 0 or 1.

And just to combine the two plots and present a nice 3d plot, with the present being at $t=0$ and expiration at $T=1$:

The experimental section in the paper "Modeling Volatility in Prediction Markets" (shorter conference paper presented at ACM EC'09), indicates that the actual volatility observed in the InTrade prediction markets fits well the current model.

Now, given this model, we can judge what is a "noise movement" and what is actually a "significant move" in prediction markets. Furthermore, we can provide an "error margin" for each day, indicating the confidence bounds for the market price.

I will post more applications of this model in the next few days. We will see how to price the X contracts on InTrade, and a way to compute correlations of the outcomes of state elections, given simply the past movements of their corresponding prediction markets.

Modeling Volatility in Prediction Markets, Part I

A few weeks back, I was thinking about the concept of uncertainty in prediction markets. The price of a contract in a prediction market today gives us the probability that an event will happen. For example, the contract 2008.PRES.OBAMA is trading at 84.0, indicating that there is an 84% chance that Obama will win the presidential election.

Unfortunately, we have no idea about the stability and robustness of this estimate. How likely it is that the contract will fall tomorrow to 80%? How likely it is to jump to 90%? By treating the contract price as a "deterministic" number, we do not capture such information. We need to treat the price as a random variable with its own probability distribution, out of which we observe just the mean by looking at the prediction market.

However, to fully understand the stability of the price we need further information, beyond just the mean of the probability, revealed by the current contract price.

A first step is to look at the volatility of the price. One approach is to look at the past trading behavior, but this analysis will give us the past volatility, not the expected future volatility of the contract.

Predicting Future Volatility using Options

So, how can we estimate the future volatility of a prediction market contract?

There is a market approach to solve this problem. Namely, we can run prediction markets on the results of the prediction markets!

Recently, Intrade has introduced such contracts, the so-called X contracts (listed under "Politics->Options: US Election" from the sidebar). For example, the contract "X.22OCT.OBAMA.>80.0" pays 1 USD if the contract "2008.PRES.OBAMA" will be higher than 80.0 on Wed 22 Oct 2008. Traditionally, the threshold defined in the options contract is called strike price (e.g., the strike price for X.22OCT.OBAMA.>80.0 is 80.0).

A set of such contracts can reveal the distribution of the probability of the event for the underlying contract 2008.PRES.OBAMA. In other words, we can see not only what is the mean probability that Obama will be elected president but we can also see the expected downside risk or upside potential of the 2008.PRES.OBAMA contract. For example, the X.22OCT.OBAMA.>80.0 has a price of 90.0, indicating a 90% chance that the 2008.PRES.OBAMA contract will be above 80.0 on Oct 22nd.

Now, given enough contracts, with strike prices at various levels, we can estimate the probability distribution for the likely prices of the contract. For example, we can have contracts with strike price 10, 20, ..., 90 that will give us the probability that the contract will trade above 10, 20, ... and 90 points at some specific point in time, which corresponds to the expiration date of the options contract. So for each date, we need 9 contracts, if we need to have a 10 column histogram that describes the distribution.

Note that if we want to estimate the probability distribution dynamics we will need to setup 9 contracts for each date that we want to measure. Of course, this implies that we have plenty of liquidity in the markets if we want to rely purely on the market for such estimates.

Pricing Options and the Black-Scholes Formula

A natural question is: Can we price such "options on options" contracts?

This will at least give us some guidance on the likely prices of such contracts, if not for anything else, but to just start the market at the appropriate level. (For example, if we have a market scoring mechanism.)

There is significant research in Finance on pricing options for stocks. The Black-Scholes formula is one of the most well-known examples for deriving prices for options on stocks. The basic idea behind Black-Scholes is that the underlying stock price follows a Brownian motion, moving randomly up and down. Then by extracting the probability that this random stock move will reach various levels, it is possible to derive the option prices. (Terrence Tao has a very easy to read 3-page note explaining the Black-Scholes formula and a longer blog posting.)

Why not applying directly this model to price options on prediction markets? There are a few fundamental problems but the most important one is the bounded price of the underlying prediction market contract. The price of a prediction market contract cannot go below 0 or above 1, so the Brownian motion assumption is invalid. In fact, if we try to apply the Black-Scholes model on a prediction market, we get absurd results.

In the next post, I will review an adaptation of the Black-Scholes model that works well for prediction markets, and leads to some very interesting results!

Sunday, October 12, 2008

Student websites

I am just posting this to provide links to the pages of my students, so that Google indexes their websites.

Saturday, October 4, 2008

Reviewing the Reviewers

I received today the latest issue of TOIS, and the title of the editorial by Gary Marchionini caught my eye: "Reviewer Merits and Review Control, in an Age of Electronic Manuscript Management Systems". The article makes the case for using the electronic management systems to allow for grading of the reviewer efforts and allow for memory of the reviewing process, including both the reviews and the reviewer ratings.

In principle, I agree with the idea. Having the complete reviewing history for each reviewer, and for each journal and conference, can bring several improvements in the process:

1. Estimating and Fixing Biases

One way to see the publication process is as noisy labeling of an example, where the true labels are "accept" or "reject". The reviewers can be modeled as noisy processes, each with its own sensitivity and specificity. The perfect reviewer has sensitivity=1, i.e., marks as "accept" all the "true accepts", and has specificity=1, i.e., marks as "reject" all the "true rejects".

Given enough noisy ratings, it is possible to use statistical techniques to infer what is the "true label" for each paper, and infer at the same time the sensitivity and specificity of each reviewer. Bob Carpenter has presented a hierarchical Bayesian model that can be used for this purpose, but simpler maximum likelihood models, like the one of Dawid and Skene, also work very well. In my own (synthetic) experiments the MLE method worked almost perfectly for recovering the quality characteristics of the reviewers and to recover the true labels of the papers (of course, without the uncertainty estimates that the Bayesian methods provide.)

One issue with such a model? The assumption that we have an underlying "true" label. For people with different backgrounds and research interests, what is a "true accept" and what a "true reject" is not easy to define even with perfect reviewing.

2. Reviewer Ratings

Reviewer reviewing by the editors

The statistical approaches described above reduce the quality of a reviewer into two metrics. However, these ratings only show agreement of the recommendations with the "true" value (publish or not). They say nothing about other aspects of the review: comprehensiveness, depth, timeliness, helpfulness, are all important aspects that need to be captured using different methods.

Marchionini mentions that current manuscript management systems allow the editors to rate reviewers in terms of timeliness and in terms of quality. By following the references, I ran into the article Reviewer Merits, published in Information Processing and Management, where the Editors-in-Chief of many IR journals stated:
Electronic manuscript systems easily provide time data for reviewers and some offer rating scales and note fields for editors to evaluate review quality. Many of us (editors) are beginning to use these capabilities and, over time, we will be able to have systematic and persistent reviewer quality data. Graduate students, faculty, chairs, and deans should be aware that these data are held.
Now, while I agree with reviewer accountability, I think that this statement is not worded properly. I find the use of the phrase "should be aware" as semi-threatening. ("We, the editors, are rating you... remember that!")

If reviewer quality history is being kept, then the reviewers should be aware and have access to it. Being reminded that "your history is out there somewhere" is not the way to go. If reviewer quality is going to be a credible evaluation metric, the reviewers need to know how well they did. (Especially junior reviewers, and especially when the review does not meet the quality standards.)

Furthermore, if the editors are the ones rating the reviewers, then who controls the quality of these ratings? How do we know that the evaluation is fair and accurate? Notice that if we have a single editorial quality rating per review, then the statistical approaches described above do not work.

Reviewer reviewing by the authors

In the past, I have argued that authors should rate reviewers. My main point in that post was to propose a system that will encourage reviewers to participate by rewarding the highly performing reviewers. (There is a similar letter to Science, named "Rewarding Reviewers.") Since authors will have to provide multiple feedback points, it is much easier to correct the biases in the reviewer ratings of the authors.

3. Reviewer History and Motivation

If we have a history of reviewers, we should not forget potential side-effects. One clear issue that I see, is motivation. If "reviews of reviewers" become a public record, then it is not clear how easy it will be to recruit reviewers.

Right now, many accept invitations to review, knowing that they will be able to do a decent job. If the expectations increase, it will be natural for people to reject invitations, focusing only on a few reviews for which they can do a great job. Arguably, reviewer record is never going to be as important for evaluation as other metrics, as research productivity or teaching, so it is unlikely to get more time devoted to it.

So, there will always be the tradeoff: more reviews or better reviews?

One solution that I have proposed in the past: Impose a budget! Any researcher should remove from the reviewing system the workload it generates. Five papers submitted (not accepted) within a year? The researcher needs to review 3x5 = 15 papers to remove the workload that these five papers generated. (See also the article "In Search of Peer Reviewers" that has the same ideas.)

4. Training Reviewers

So, suppose that we have the system in place to keep reviewer history, we have solved the issue of motivation, and now one facet of researcher reputation is the reviewer quality score. How do we learn how to review properly? A system that generates a sensitivity and specificity of a reviewer can provide some information on how strict or lenient a reviewer is, compared to others.

However, we need something more than that. What makes a review constructive? What makes a review fair? In principle, we could rely on academic advising to pass such qualities to newer generations of researchers. In practice, when someone starts reviewing a significant volume of papers, there is no advisor or mentor to oversee the process.

Therefore, we need some guidelines. An excellent set of guidelines is given in the article "A Peer Review How-To". Let me highlight some nuggets:

Reviewers make two common mistakes. The first mistake is to reflexively demand that more be done. Do not require experiments beyond the scope of the paper, unless the scope is too narrow.
Do not reject a manuscript simply because its ideas are not original, if it offers the first strong evidence for an old but important idea.

Do not reject a paper with a brilliant new idea simply because the evidence was not as comprehensive as could be imagined.

Do not reject a paper simply because it is not of the highest significance, if it is beautifully executed and offers fresh ideas with strong evidence.

Seek a balance among criteria in making a recommendation.

Finally, step back from your own scientific prejudices

And now excuse me, because I have to review a couple of papers...

Thursday, October 2, 2008

VP Debate and Prediction Market Volatility

I was watching the VP debate on CNN, and CNN was reporting the reactions of "undecided Ohio voters" to what the VP candidates were saying. Although interesting, it was not satisfying. I wanted a better way to see the real time reactions. Blogs were relatively slow to post, and mainstream media were simply describing the minutia of the debate. What is the solution? Easy. Prediction markets!

I remembered that Intrade has a contract VP.DEBATE.OBAMA, "Barack Obama's Intrade value will increase more than John McCain's following the VP debate"

So, during the debate, I was following the fluctuations of the contract's price to measure the reactions. Here is how the contract moved from 8.30pm EST since 10.30pm EST. (The debate started at 9pm EST, and lasted until 10.30pm EST.)

At the beginning, the contract was below 50.0%, reflecting probably that the fact that Palin was giving reasonable and coherent responses, disappointing perhaps those that were expecting material for a Saturday Night Live performance.

However, at the second 45 minutes of the debate, as the discussion moved into foreign policy issues, the contract started moving up, as Biden started giving more immediate answers, and Palin started avoiding questions and replied using stereotypical, canned answers.

What I found interesting was the significant increase in variance as the debate came close to the end. Prices fluctuated widely during the closing statements of the two VP candidates.

This increased volatility as the contract comes to a close, is actually a fact that we observed consistently in many contracts over time: when the contract is not close to 0.0 or 1.0, the price fluctuates widely as we get close to expiration. While I could explain this intuitively, I did not have a solid theoretical understanding of why.

So, what to do in this case? You simply ask a PhD student to explain it to you! I asked Nikolay Archak, and within a few weeks, Nikolay had the answer.

The basic result:
  • Volatility increases as contract price gets closer to 0.5,
  • Volatility decreases as contract price gets closer to 0.0 or to 1.0,
  • Volatility increases as we get close to the expiration, and approaches infinity if price is not 0.0 or 1.0.
More information about the basic ideas of the model and about the technical details in a later post.

Tuesday, September 30, 2008

Sarah Palin and Markov Models

How good are n-gram Markov models for language modeling?

Apparently pretty good for modeling the responses of Sarah Palin during her last couple of interviews! Check them out:

Friday, September 12, 2008

How Much Turking Pays?

After reporting the results about "why Turkers Turk," I received a set of questions about further things that people would like to know about the Turkers. One of the most common questions was about the compensation of Turkers: "How much do they make by Turking?"

Well, there is no question about Mechanical Turk, that Mechanical Turk cannot answer, so here we go. I posted the very same question on MTurk, asking people about their average compensation per week. Without further ado, here are the results:

A small number of people make more than $100 per week, about 20% make more than $20 per week, and the majority get less than $20. So indeed it seems unlikely that people work on MTurk for a living.

It is much more likely that people actually enjoy what they are doing, and getting some cash is a nice side-effect. Furthermore, it is work that can be done even while working, and doing the tasks on MTurk helps other people. My own gut feeling is that the research about the motivations that drive people to contribute to open source projects can also be applied here to explain why Turkers turk.

(Info: The current survey paid 5 cents per HIT, and received responses for 200 Turkers. I will keep running the survey to collect 1,000 responses and will report if I see any significant changes. But so far the results seem remarkably stable.)

Thursday, September 11, 2008

Why People Participate on Mechanical Turk, Now Tabulated

A few months back, I decided to ask Turkers about their motivation for participating on Mechanical Turk. I found their responses quite fascinating, so I decided to list them in their raw format, without any further tabulation and processing.

However, as the time passed, I realized that I wanted to have the results in a more summarized and accessible format. Therefore, I bit the bullet and organize the results. Of course, I had no time for such a big task. So, what to do? First, I hired two coders using, to read and identify the main reasons listed in the responses. The two coders agreed on 9 broad categories:

A. To Kill Time
B. Fruitful way to spend free time (Instead of watching TV, Not to waste time, Rather than playing video games/online games, Sense of purpose when watching TV, Something to do during downtime in work)
C. Income purposes (Gas, Bills, Make money, Credit card, Groceries, School, Help family)
D. Pocket change/extra cash (Hobbies, Mad money, Buy personal stuff)
E. For entertainment, for fun, interesting, addiction
F. Challenge, self-competition
G. Unemployed, no regular job, s part-time job
H. To sharpen/ To keep mind sharp
I. Learn English

Then, I simply listed the responses on Mechanical Turk, and asked (new) Turkers to identify the category (or categories) for each response. Here are the percentages for each category (note that one response can be classified into multiple categories):

and the actual percentages:

A. 20.50%
B. 14.00%
C. 49.00%
D. 34.00%
E. 42.00%
F. 5.50%
G. 3.50%
H. 3.50%
I. 4.00%

So, we can see that many Turkers complete such tasks to get some extra cash and pay for gas (may be we should wish for high oil prices :-) but there is a significant fraction that does it for fun, because they consider Turking interesting, and sometimes even addicting!

I still consider the responses themselves more interesting than the tabulated version, so go and take a look yourself!

Tuesday, September 9, 2008

Links to Mechanical Turk Articles

I came back after the summer break, and I found a long list of articles regarding Mechanical Turk. So, let's give a list of links with small commentary, to start the new blogging season:
  • First, David Pennock points out that Mechanical Turk can be used for dubious activities as well. Interestingly enough, in the comments, the creator of some of the suspicious HITs mentioned in the article replies and argues that his actions are legitimate marketing. You may disagree, but when a alleged "spammer" shows up to defend his actions, it is unlikely to be a true black-hat spammer.
  • From the comments, we get another great pointer to the Floozy Speak blog, presenting a survey of why Tukers participate and complete tasks on Mechanical Turk. Nicely organized, it confirms my earlier post on the motivating factors for Turkers to participate (I have to finally tabulate this responses and post the summary... long overdue...).
  • To my great disappointment, I realized that ReadWriteWeb has a sensationalistic article titled " Amazon's Mechanical Turk Used for Fraudulent Activities." In contrast to David Pennock's post, it portrays Mechanical Turk as a marketplace of spammers. Even the last "positive" paragraph is phrased in way that reinforces the negative message. Not sure is this is what Brynn Evans was thinking about Mechanical Turk when she collected the screenshots of the spammers! My own experience indicates that Turkers have very good judgment and they avoid the tasks that look spammy (Lukas provides further evidence about that). I would put this article in the same category as all those articles in the mass media that portray the Internet as a place for kidnappers, paedophiles, scammers, and gamblers. Yes, in any system where you have human participation, you will find such people! Are we going to run away?
  • Bob Carpenter describes briefly his approach for modeling annotator's accuracy. His hierarchical Bayesian model builds on previous research in epidemiology, and allows us to estimate the quality characteristics of each annotator, and estimate the most likely "correct" responses for questions that get multiple answers. The model generates: (a) a set of labelers with quality characteristics (specificity and sensitivity), and (b) a dataset with a class balance. Matching the two, we get the final annotated dataset. I found the model very inspiring!
  • Finally, Brendan posts a description of their EMNLP'08 paper on using Mechanical Turk for completing a variety of NLP tasks. As far as I can tell, this is the first published study that examines the accuracy of Turkers in perfoming various annotation tasks, and provides plenty of information for people who are looking for some "wetlab statistics". I truly wish that I had a reference to add it to our KDD'08 paper, where we demonstrated the value of using multiple noisy annotations. While we had results indicating performance characteristics under different levels of noise, reviewers were curious to find out about the actual noise level on systems like Mechanical Turk. This EMNLP'08 paper provides plenty of such data!
Lots of great stuff!

Thursday, August 14, 2008

Mechanical Turk worker quality and HIT difficulty

Brendan O'Connor from Dolores Labs posted two excellent visualizations (1, 2) of durations of tasks on Mechanical Turk.

By looking at these images we can see easily how different workers behave on Mechanical Turk. Some are working systematically and fast, others are slower. These visualizations allow us to categorize workers into different categories.

If we have the "correct" response for each HIT (see the excellent post on the topic by Bob Carpenter), then we can augment the visualization with by coloring each HIT accordingly, and we can see easily which workers are fast, systematic, and give correct answers.

Given such data, we can also examine the dual problem: How easy is to work on particular HITs? Some HITs are going to be harder than others, will take longer to complete, and will have higher error rate.

Notice, though, that while for annotators we would like to avoid the ones with higher error rate, for HITs it may be beneficial to get the correct answers for the difficult cases. In fact, it may makse sense to try to allow only high-quality workers to work on the hard HITs, and allow the "low quality" annotators to work on the easy HITs. Devising an optimizal HIT-worker allocation strategy seems to be an interesting problem for future research.

Wednesday, July 30, 2008

Mechanical Turk Allows Bulk Submissions Using Templates

Mechanical Turk is a very useful service that I have used repeatedly in the past for a variety of tasks. While it was a great tool, one of the shortcomings was the lack of a web-based interface support to create batches of tasks. Using the web it was only possible to create a single task at a time, potentially asking many Turkers to complete it.

If someone wanted to generate a large number of similar tasks (e.g., submit 1000 query-document pairs for relevance judgments as opposed to 1 query-document pair) then the only option was to use some command-line tools, or use the MTurk API. Admittedly, the tools were easy to use, but still the need to resort to "programming" was a problem for many people with no programming background.

So, today Amazon released the ability to generate "templates" that will automate the generation of such tasks. The basic idea: You generate a template, and specify in the template a variable placeholder (e.g., #document). Then you upload a separate file with the data that will be used to populate the placeholder, and you are done. A set of HITs are then generated automatically.

Amazon has already created some basic HIT templates that illustrate the process, showing example tasks such as "Get Product Name from Image," "Are these products the same," "Evaluate the Effectiveness of Search Results" and so on. The Getting Started Guide illustrates how easy is to generate the batches of HITs.

Nice! I may finally generate some batch HITs myself, instead of asking my students to do all the work for me :-)

Monday, July 28, 2008

JDMR: Officially live

It seems that the website of the Journal of Database Management Research is now live.

The current title of the website is "The Proceedings of the VLDB Endowment (PVLDB)". Since the first issue of this journal will contain the papers that have been accepted for publication at
VLDB 2008, it seems that simply renaming the proceedings of VLDB was not deemed acceptable. Therefore, according to the transition plans:
All regular (research) papers selected for presentation at the VLDB Conference, beginning 2008, will be offered publication in a new Journal, called "Proceedings
of the VLDB Endowment" (PVLDB or "Proc. VLDB" for short). Volume 1 of PVLDB will appear in Aug 2008. There will be no separate VLDB Conference Proceedings.

Papers accepted for presentation to the VLDB Conference were reviewed once more, by the same set of reviewers, before being accepted for publication in PVLDB. This was a "light" review round, in this first transition year, to make sure that reviewer suggestions had been adopted in so far as practically possible.
The new hybrid journal will start accepting submissions on August 1st, 2008.

Sunday, July 27, 2008

Using The New York Times Reader

A few weeks back, I installed on my computer the "New York Times Reader". It is an application from The New York Times that runs quietly in the background, downloading locally all the articles of NY Times published over the last week. It also provides its own non-browser interface for browsing through the articles. The layout emulates more the layout of a paper newspaper than the layout of a web edition. I have used the reader a little bit when I downloaded it, but then I forgot about it and kept reading the news over the web.

Today, though, I found myself stuck on a 10-hr flight to Greece, with no Internet connectivity. Well, no problem. Actually I enjoy such long flights *because* there is no Internet connectivity and I can really focus on whatever I am doing, without (voluntarily or not) interruptions.

After going through all my email, I answered all the emails that were staying in my inbox for a while, and then I started reading blogs using the offline option of Google Reader. Unfortunately, reading blogs offline is not a very enjoyable experience. Some blogs are simply pointing to external articles, some others have only partial feeds, and some others are not meaningful to read without going over the comments and the discussion. So, quickly, I ran out of stuff to do.

Then, I noticed that I had the paper version of New York Times in front of me. I tried to read a little bit, just to realize that it is a royal pain to read a newspaper with the layout of New York Times on a plane. New York Times deserves and needs a coffee table, not a tray that can barely fit the laptop.

At that time, I realized that I had the Reader available on my laptop. Not sure if it syncs, I opened it. Fortunately, it has been quietly syncing all the material, and now I had one week of New York Times articles at my disposal. They layout was nice, the font type excellent, and the interface very intuitive. Plus the ability to go through all the sections (some of them published only once a week) is a big advantage. Therefore, I happily read one week worth of NY Times (ok, impossible, but it felt like that) on my laptop, ignoring completely the paper version sitting next to me.

Then, I noticed the "search" link. I went to the search screen and I started typing various queries. Well, was I surprised! Search was immediate, "results-as-you-type"-style. Plus the results were nicely organized as a newspaper, and ordered by relevance going from left to right. Here is a screenshot of the results for the query "economy":

Next step: See what is this "Topic Explorer". This generates a result screen like this:
Not very impressive initially, but the more interesting thing happens when you click an article, and you see a list of all the associated topics:
Very easy to go through related articles, easy to see the level of interest for each topic, and so on. I guess a little bit further visualization could also help. Also, some extra work to allow for faceted navigation would make the interface even more interesting. But it is definitely an enjoyable experience as-is, demonstrating the power of truly online interfaces over interfaces that simply try to emulate paper.

Sunday, July 13, 2008

Options Markets for Prediction Markets?

Over the last few months, together with George Tziralis, we have been doing research on the efficiency of the prediction markets.

At the very basic level, we want to examine how fast the market can incorporate into the price of a contract all the available information. Our experiments, so far, on InTrade show that the price of a contact tends to be "unpredictable" when using purely historic price data. In other words, the markets on InTrade tend to be "weakly efficient".

In fact, some pretty extensive time-series tests showed that the price movements are almost a random walk. Furthermore the spectrum of the random walk movements (after doing a Fourier transform) tends to be something close to 1/f noise (or pink noise). Assuming that each price change indeed captures the effect of a real-time event (this is a big assumption), then we can conclude that the importance of the events that happen tends to follow a "power-law": there is the "big" events that move the market significantly, but there is also a large number of minor events that cummulatively can move the market significantly, even though none of them is of any particular importance.

The next question that we wanted to examine, was whether the price changes have the same importance in different times. By analyzing the markets we observed that the prediction markets, just like the financial markets, exhibit the phenomenon of volatility clustering. In other words, there are periods in which the market tends to move up and down a lot, and periods in which the prices are relatively stable.

What are the implications of this findings? If we want to assign an importance value in a price change, we have to take into consideration the (past and future) volatility of the prices. Going from 60 to 80 in a period of low volatility signals a much more important event compared to the case of going from 60 to 80 in a period of high volatility.

With George, using the general family of the ARCH models we showed that we can model and predict nicely the volatility of the markets, and we can then estimate properly the importance of the events that correspond to these price changes.

Even though such models are good, there are limits on their predictive power.

A much better approach for estimating the future volatility of a market is to allow people to directly trade volatility. In financial markets, this happens by allowing people to buy options for a given stock. The values of the options give a good idea on what is the expected, future volatility of a stock (and its expected upward or downward movement). Therefore, having options allow us to estimate how robust and stable a particulat contract will be in the future.

In principle there is nothing that prevents us from having such derivative markets on top of the existing prediction markets. For example, we could have contacts such as "Obama@0.80/01Sep2008", which would allow people to buy for 0.80 cents, on September 1st 2008, a contract for Obama winning the presidency. If the actual contract is trading above 80%, the contract will turn a profit. (This is exactly equivalent to the existing options for stocks.)

The prices of such options give a good estimation of how volatile the contract is going to be in the future. For example, if Obama is trading today at 0.65 and noone is willing to buy the 0.80 contract for September, then traders do not believe that Obama will reach that level by September. On the other hand, if the price of the "Obama@0.80/01Sep2008" call option trades at 0.05, then people believe that the contract has good chances of being above 0.05 by September.

Using such values we can estimate the "upside volatility" of the contract. (The corresponding "put" contracts will show what is the estimated volatility on the downside.)

Of course, while such ideas are nice, we should not forget that markets work only when there is liquidity. And given the relatively low liquidity for the existing, primary prediction markets, there is little hope that such derivative markets for "options on prediction markets" will have even close to the necessary liquidity.

Monday, June 23, 2008

Massive Data and the End of the Scientific Method

I have been reading the last issue of Wired, which has just arrived in my mailbox, and which is provocatively titled "The End of Science."

The opening article by Chris Anderson, discusses how the availability of huge amounts of data allows us to use effectively data mining techniques and make discoveries without using any underlying scientific model or hypothesis.

The article starts with a quote of George Box:
All models are wrong but some are useful.
and Peter Norvig from Google rephrased that to today's massive dataset era:
All models are wrong and increasingly you can succeed without them.
Being a data junkie myself, I cannot disagree that you can achieve significant things by "letting the data speak". In fact, computer scientists very rarely work by strictly following the "scientific method" of hypothesis formulation, experimentation, and statistical testing to see if the experiments agree with the theory.

However, by thinking of this change as a paradigm shift, I started thinking about the possibility to go through a period in which we will actually know less about the world as we transition to a "new way of doing things".

Look at the quote of Chris Anderson:
Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
If we indeed proceed for a scientific model like this, then at some point we will end up missing some of the elegance and intuition that is offered by the (imperfect) models that are being dropped in favor of blackbox models that are simply trained by the available data. And when theory will end up catching up to the current experimental state of the art, we will end up developing new models once again.

Thursday, June 12, 2008

Statistical Significance of Sequential Comparisons

Suppose that we have a new technique and we need to compare it with some existing baseline. For that, we can run some tests (e.g., using multiple data sets, multiple users). Assume for simplicity that we have only a binary outcome of each test:
  1. Our technique outperformed the baseline
  2. The baseline outperformed our technique
The basic option is to select beforehand the size of the "test set" (i.e., how many data sets we will use, or how many users) and we have N results. A simple test to detect the statistical significance of the difference is to run a simple "sign test" and check whether the difference of the baseline and our technique is significant. The table below illustrates which combinations of "yes" and "no" results generate a statistically significant outcome, according to a one-tailed sign test:

The gray cells indicate significance at the 10% level, the red cells indicate significance at the 5% level, the yellow cells significance at the 1% level, the green at 0.1%, and the blue ones at the 0.001% level.

Now, suppose that we want to run experiments with the minimum possible cost, in terms of asking questions. Therefore, we keep running tests until reaching a desired level of statistical significance (say at 1% level). When we reach the level of statistical significance, we stop. Otherwise we keep increasing the sample size.

This is a pretty common practice and occurs a lot when we do not manage to hit the level of statistical significance from the first attempt. Keep increasing the size of the sample, hoping that we will manage to hit statistical significance.

There is however a hidden bias in this practice, similar to the hidden bias that appears in the Monty Hall problem. We stop as soon as we see something favorable, ignoring the possibility that the next few cases may reduce the statistical significance of the findings.

In principle, if we want to absolutely correct, if the first sample of size N does not give statistical significance then we pick a new sample, ignoring all previous samples, and conduct the test again. At the end, to discover the statistical significance of the findings, we apply the Bonferroni correction, that ensures that we have overconfidence in the statistical significance of the findings. (Intuitively, after running 20 experiments, at least one of them is expected to show 5% confidence by chance, so the statistical significance level will have to be multiplied by the number of experiments.)

However, I do not know what is the correction in the case of sequential experiments, in which we do not reject the sample cases that did not return statistical significance. It may be easy to work it out analytically but I am too lazy to do it.

Friday, May 23, 2008

The (Statistical) Significance of the Impact Factor

Being in the middle of my tenure track, I cannot skip running into different ways that people use to evaluate research. One of the most common ways to evaluate papers (at least at a very high level) is to look at the impact factor of the journal, and classify the paper as "published in a top journal," "published in an OK journal," or "published in a B-class journal". I have argued in the past that this is a problematic practice, and an article published in Nature provides the evidence. To summarize the reasoning: articles published within the same journal have widely different citation numbers, therefore using the average is simply misleading.

I think that the best example that I have heard that illustrates the problem of reporting averages of highly-skewed distributions is from Paul Krugman's book "The Conscience of a Liberal":
...Bill Gates walks into a bar, the average wealth of the bar's clientele soars...
This is exactly what happens when evaluating papers using impact factors for journals. So, this introduces two problems:
  • If you evaluate a paper using the impact factor of the journal, the evaluation is almost always a significant overestimate or a significant underestimate of the paper's "impact". (Assuming that citations measure "impact".) Read the analysis below for an illustrating example.
  • The impact factor itself is a very brittle metric, as it is heavily influenced by a few outliers. If indeed the in-journal citation distribution is a power-law, then the impact factor itself is a useless metric.
To make this more clear, I will pick as an example the ACM Transactions of Information Systems. The journal has a rather impressive impact factor for a computer science journal, with an increasing trend:
Now, let's try to dissect the 5.059 impact factor for 2006. The impact factor is the number of citations generated in 2006, pointing to the papers published in 2005 and 2004, divided by the total number of published articles. According to ISI Web of Knowledge, we have:
2006 Impact Factor

Cites in 2006 to articles published in:
2005 = 25
2004 = 147
Sum: 172

Number of articles published in:
2005 = 15
2004 = 19
Sum: 34

Calculation: 172/34 = 5.059
Now, let's split down these numbers by publication. By looking at the number of citations per publication, we can see that there is a single paper "Evaluating collaborative filtering recommender systems" by Herlocker, which has almost 30 citations in 2006. Taking this single publication out, the impact factor is reduced to 4.3.

In fact, if we take out of the calculations the papers published in the Special Issue for Recommender Systems (Jan 2004), then the impact factor drops even more, and comes close to 2.5. At the same time, the impact factor of the papers published in the special issue is much higher, getting closer to 15.0 or so.

Given the unusual high impact of that special issue, we can expect for the 2007 impact factor for TOIS to decrease substantially. It would not be surprising to see the impact factor for 2007 to be in the pre-2003 levels.

This simple example illustrate that the impact factor rarely represents the "average" paper published in the journal. There are papers that are significantly stronger than the impact factor illustrates and papers that are significantly weaker. (Implication: Authors that use the impact factor of the journals as a representative metric of the quality of their research, they use a metric that is almost never representative.)

Therefore, a set of other metrics may be preferable. The obvious choices is to use the median instead of the average, and report the Gini coefficient for the papers published in the journal. The Gini coefficient will show how representative is the impact factor. Next step is to examine the distribution of the number of citations within the journals. Is it a power-law, or an exponential? (I was not able to locate an appropriate reference.) Having these answers can lead to better analysis and easier comparisons.

Monday, May 12, 2008

Experimental Repeatability or simply Open Source?

This year SIGMOD and KDD started playing with the idea of experimental repeatability. The basic idea is to generate guidelines and processes that will encourage repeatability of the experiments presented in many papers.

The reasons are rather obvious: We need to be able to reproduce the experiment, to avoid any hidden bias, catch errors, and even avoid outright fraud. Furthermore, this encourages publications of techniques that are easy to implement and test. Why do we care? If the method is impossible to implement then it is an obstacle to research progress. A published paper that claims to be the state of the art, but is not reproducible may prevent other reproducible methods from being published, just for lack of comparison with the current state of the art.

Now, to achieve experimental repeatability we need two things:
  • Access to the data sets
  • Access to the code
Both parts tend to have issues: When someone uses multi-terrabyte data sets, it is highly unclear how to give access to such data to outsiders. (Our work on the evolution of web databases used a 3.3Tb dataset -- I have no idea how to even make the data available.) Other issues include copyrighted datasets, e.g., archives of newspaper articles. Despite these issues, I believe that at the end it is relatively easy to give access to the used datasets. See, for example, the UCI Machine Learning Repository, the UCR Time Series, the Linguistic Data Consortium, the Wharton Research Data Services (WRDS), and Daniel Lemire's set of pointers. (Feel free to post more pointers in the comments.)

The second aspect is access to the underlying code. One may argue that instead of giving access to the code we should describe clearly how to implement the algorithms, give the settings, and so on. This avoids any intellectual property issues, and everyone is happy. Personally, I do not buy this. No matter how nicely someone implements someone else's algorithms, nobody is going to spend much of time optimizing the code for a competing technique. This may lead to flawed experimental comparisons. Another alternative is to use common datasets and simply pick the performance numbers from the published paper, without reimplementing the competing technique. (This works only when the underlying hardware is irrelevant -- e.g., for precision/recall experiments in information retrieval.)

My own take? Encourage publication of open source software. If the code is open and available, comparisons are easy, and the whole issue of experimental repeatability becomes moot. No need for committees to verify that the reported results are indeed correct, no need to upload code into machines with different architecture, making sure that the code runs without any segmentation faults, and so on. If the code is available, even if the results are incorrect, someone will catch that in the future. (If the results are incorrect, the code and data is available, and nobody cares to replicate the results, then experimental repeatability is a moot point.)

Now, it is easy to talk about open source, but anyone who tries knows what a pain it is to take the scripts used to run experiments and make them ready to use by anyone else. (Or even to be reused later, from the author :-) Therefore, we need to give further incentives. The idea of the JMLR journal to have a track for submissions of open source software; this track serves as "a venue for collection and dissemination of open source software"

Perhaps this is the way to proceed, an alternative to the "experimental repeatability requirements" that may be too difficult to follow.