Friday, October 31, 2008

The New York Times Annotated Corpus

Last week, I was invited to give a talk at a conference at the New York Public Library, about the preservation of news. I talked about our research in the Economining project, where we are trying to find the "economic value" of textual content on the Internet.

As part of the presentation, I discussed some problems that I had in the past with obtaining well-organized news corpora that are both comprehensive and also easily accessible using standard tools. Factiva has an excellent database of articles, exported in a richly annotated XML format but unfortunately Factiva prohibits data mining of the content of its archives.

The librarians at the conference were very helpful, offering suggestions and acknowledging that providing content for data mining purposes should be one of the goals of any preservation effort.

So, yesterday I received an email from Dorothy Carner informing me about the availability of The New York Times Corpus, a corpus of 1.8 million articles from The New York Times, dating from 1987 until 2007. The details are available online, but let me repeat some of the interesting facts here (the emphasis below is mine):

The New York Times Annotated Corpus is a collection of over 1.8 million articles annotated with rich metadata published by The New York Times between January 1, 1987 and June 19, 2007.

With over 650,000 individually written summaries and 1.5 million manually tagged articles, The New York Times Annotated Corpus has the potential to be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction.

The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF). Developed by a consortium of the world’s major news agencies, NITF is an internationally recognized standard for representing the content and structure of news documents. To learn more about NITF please visit the NITF website.

Highlights of The New York Times Annotated Corpus include:

  • Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
  • Over 650,000 article summaries written by the staff of The New York Times Index Department.
  • Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
  • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at
  • Java tools for parsing corpus documents from xml into a memory resident object.

To learn more about The New York Times Annotated Corpus please read the PDF Overview.

Yes, 1.8 million articles, in richly annotated XML, with summaries, with hierarchically categorized articles, and with verified annotations of people, locations, and organizations! Expect the corpus to become a de facto standard for many text-centric research efforts! Hopefully more organizations are going to follow the example of The New York Times, and we are going to see such publicly available corpora from other high-quality sources. (I know that Associated Press has an archive of almost 1 TB of text, in computerized form, and hopefully we will see something similar from them as well.)

How can you get the corpus? It is available from LDC, for 300 USD for non-members; members should get this for free.

I am looking forward to receiving the corpus and starting to play with it!

Monday, October 20, 2008

Modeling Volatility in Prediction Markets, Part II

In the previous post, I described how we can estimate the volatility of prediction markets using additional prediction market contracts, aka options on prediction markets. I finished by pointing out that the techniques used to price options on stocks are not directly applicable in the prediction market context.

Now, I will review a different modeling approach that builds on the spirit of Black-Scholes but is properly adapted for the prediction market context. This model has been developed by Nikolay, and is described in the paper "Modeling Volatility in Prediction Markets".

Modeling Prediction Markets as Competitions

Let's consider the simple case of a contract with a binary outcome. For example, who will win the presidential election? McCain or Obama?

The basic modeling idea is to assume that each competing party has an ability $S_i(t)$ that evolves over time, moving as a Brownian motion. (A simplified example of such an ability would be the number of voters for a party, the number of points in a sports game, and so on.) At the expiration of the contract at time $T$, the party $i$ with the higher ability $S_i(T)$ wins.

Actually, to have a more general case, we can use a generalized form of the Brownian motion, an Ito diffusion, that allows the abilities to have a drift $\mu_i$ over time (i.e., an average rate of growth) and different volatilities $\sigma_i$. The quantity that we need to monitor is the difference of the two ability processes, $S(t)=S_1(t)-S_2(t)$. If at the expiration of the contract at time $T$ we have $S(T)>0$, then party 1 wins. If $S(T)<0$, then party 2 wins. Interestingly, the difference $S(t)$ is also an Ito diffusion, with drift $\mu=\mu_1-\mu_2$ and volatility $\sigma=\sqrt{\sigma_1^2+\sigma_2^2-2\rho \sigma_1 \sigma_2}$, where $\rho$ is the correlation of the two ability processes. Under this scenario, the price of the contract $\pi(t)$ at time $t$ is:

$\pi(t) = Pr\{ S(T)>0 | S(t) \}$

which can be written as:

$\pi(t) = N\Big(\frac{S(t) + \mu \cdot (T-t)}{\sigma \cdot \sqrt{T-t} } \Big)$

where $N(x) =\frac{1}{2} \Big[ 1 + erf\Big( \frac{x}{\sqrt{2}} \Big) \Big]$ is the CDF of the normal distribution with mean 0 and standard deviation 1, and $erf(x)$ is the error function. Notice that as time $t$ gets closer to the expiration, the denominator gets close to 0, which pushes the ratio towards $\infty$ or $-\infty$, and the price $\pi(t)$ gets close to 1 or 0, respectively. However, if $S(t)$ is close to 0 (i.e., the two parties are almost equivalent), then we observe increasingly higher instability as we get close to expiration, as small changes in the difference $S(t)$ can have a significant effect on the outcome.
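To make the formula concrete, here is a minimal Python sketch of the price function, using only the standard library. The function and parameter names (`contract_price`, `difference_params`) are introduced here for illustration and do not come from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # normal distribution with mean 0 and standard deviation 1


def difference_params(mu1, sigma1, mu2, sigma2, rho=0.0):
    """Drift and volatility of S(t) = S_1(t) - S_2(t) for correlation rho."""
    mu = mu1 - mu2
    sigma = math.sqrt(sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2)
    return mu, sigma


def contract_price(s_t, mu, sigma, t, T):
    """pi(t) = N((S(t) + mu*(T-t)) / (sigma*sqrt(T-t)))."""
    remaining = T - t
    if remaining <= 0:
        # At expiration the outcome is known: party 1 wins iff S(T) > 0.
        return 1.0 if s_t > 0 else 0.0
    z = (s_t + mu * remaining) / (sigma * math.sqrt(remaining))
    return STD_NORMAL.cdf(z)


# Two evenly matched parties with no drift: the contract trades at 0.5.
print(contract_price(s_t=0.0, mu=0.0, sigma=1.0, t=0.0, T=1.0))
```

The same lead $S(t)$ produces a more extreme price as $t$ approaches $T$, because the denominator shrinks while the numerator stays roughly fixed.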

For example, consider two parties: party 1 with an ability that has positive drift $\mu_1=0.2$ and volatility $\sigma_1=0.3$, and party 2 with negative drift $\mu_2=-0.2$ and higher volatility $\sigma_2=0.6$. In this case, assuming no correlation, the difference is a diffusion with drift $\mu=0.4$ and volatility $\sigma=0.67$. Here is one scenario of the evolution, and below you can see the price of the contract, as time evolves.

As you may observe from the example, the red line (party 1) is above the blue line (party 2) most of the time, which causes the green line (the difference) to be above 0. As the contract gets close to expiration, the contract price gets closer and closer to 1 (i.e., party 1 will win). Close to the end, the blue line catches up, which causes the prediction market contract to have a big swing from almost 1 to 0.5, but the price then swings back up as party 1 finally finishes above party 2 at the expiration.
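A scenario like this one can be generated with a few lines of Python. This is only a sketch under stated assumptions: both abilities start at 0, time is discretized into 250 equal steps, and the random seed is arbitrary. None of these choices come from the paper, and a different seed gives a different path than the one described above.

```python
import math
import random
from statistics import NormalDist


def simulate_contract(mu1=0.2, sigma1=0.3, mu2=-0.2, sigma2=0.6,
                      T=1.0, steps=250, seed=0):
    """Simulate two uncorrelated ability processes and the contract price.

    Returns a list of (t, S_1(t), S_2(t), pi(t)) tuples.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    dt = T / steps
    mu = mu1 - mu2                                # drift of the difference
    sigma = math.sqrt(sigma1 ** 2 + sigma2 ** 2)  # volatility, with rho = 0
    s1 = s2 = 0.0                                 # assumed starting abilities
    path = []
    for k in range(steps + 1):
        t = k * dt
        diff = s1 - s2
        if k < steps:
            remaining = T - t
            z = (diff + mu * remaining) / (sigma * math.sqrt(remaining))
            price = norm.cdf(z)
        else:
            # At expiration the contract settles at 1 or 0.
            price = 1.0 if diff > 0 else 0.0
        path.append((t, s1, s2, price))
        if k < steps:
            # Euler step for each ability: drift plus a Gaussian shock.
            s1 += mu1 * dt + sigma1 * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            s2 += mu2 * dt + sigma2 * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return path


path = simulate_contract()
for t, s1, s2, price in path[::50]:
    print(f"t={t:.2f}  S1={s1:+.3f}  S2={s2:+.3f}  price={price:.3f}")
```

Plotting the three series from `path` against `t` gives pictures of the same kind as the ones in this post.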

So far, we have generated a nice simulation, but our results depend on knowing the parameters of the underlying "ability processes". Since we never get to observe these values, what is the use of this whole exercise?

Well, the interesting thing is that by using the price function, we can now proceed to derive its volatility. Without going into the details, we can prove that the volatility of the prediction market contract is:

$V(t) = \frac{1}{\sqrt{T-t}} \cdot \varphi( N^{-1}( \pi(t) ) )$

where $N^{-1}(x)$ is the inverse CDF of the standard normal distribution and $\varphi(x)=\frac{\exp(-x^2/2)}{\sqrt{2\pi}}$ is the density of the standard normal distribution.
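For readers who want to see where this comes from, here is a short sketch of the argument under the assumptions of the model; see the paper for the full treatment. Write the difference process as $dS(t) = \mu \, dt + \sigma \, dW(t)$, and let $X(t)$ be the argument of $N$ in the price formula (the symbol $X$ is introduced here only for this sketch):

$X(t) = \frac{S(t) + \mu \cdot (T-t)}{\sigma \cdot \sqrt{T-t}}, \quad \pi(t) = N(X(t))$

By Ito's lemma, the coefficient of $dW(t)$ in $d\pi(t)$ is the derivative of the price with respect to $S$, multiplied by $\sigma$:

$\frac{\partial \pi}{\partial S} \cdot \sigma = \varphi(X(t)) \cdot \frac{1}{\sigma \cdot \sqrt{T-t}} \cdot \sigma = \frac{\varphi(X(t))}{\sqrt{T-t}}$

The $dt$ term vanishes, because $\pi(t)$ is a conditional probability of a fixed event and therefore a martingale. Since $X(t) = N^{-1}(\pi(t))$, we get:

$d\pi(t) = \frac{1}{\sqrt{T-t}} \cdot \varphi( N^{-1}( \pi(t) ) ) \, dW(t)$

Notice that $\sigma$ cancels out and $\mu$ enters only through the current price $\pi(t)$.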

In other words, volatility depends only on the current price of the contract and the time to expiration! Anything else is irrelevant! Drifts do not matter: they are already priced into the current price of the contract, since we know where the drift will lead at expiration. The magnitude of the volatilities is also priced into the current contract price: higher volatilities cause the contract price to get closer to 0.5, as it is easier for $S(t)$ to move above and below 0 when it has high volatility. Furthermore, the direction of the moves of the underlying abilities is irrelevant, as they can move the difference in either direction with equal probability. (The only assumption is that the volatilities of the underlying ability processes do not change over time.)

Volatility Surface

So, what does this model imply for the volatility of prediction markets? First of all, the model says that volatility increases as we move closer to the expiration, as long as the price of the contract is not 0 or 1. For example, assuming that now we have $t=0$ and expiration is at $T=1$, the volatility is expected to increase as follows:

So, how does volatility change with different contract prices? As you can see, volatility is highest when the contract trades at around 0.5, and gets close to 0 when the price is close to 0 or 1.

And just to combine the two plots and present a nice 3d plot, with the present being at $t=0$ and expiration at $T=1$:
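The numbers behind such plots are easy to reproduce. Below is a minimal sketch that evaluates $\frac{1}{\sqrt{T-t}} \cdot \varphi( N^{-1}( \pi ) )$ on a small grid of prices and times; the function names and the grid values are illustrative choices, not taken from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def price_volatility(price, t, T=1.0):
    """Volatility of the contract price: phi(N^{-1}(price)) / sqrt(T - t)."""
    if t >= T:
        raise ValueError("the contract has already expired")
    if price <= 0.0 or price >= 1.0:
        # Limit case: the density phi vanishes as N^{-1}(price) goes to +/- infinity.
        return 0.0
    x = STD_NORMAL.inv_cdf(price)
    return STD_NORMAL.pdf(x) / math.sqrt(T - t)


def volatility_surface(prices, times, T=1.0):
    """Rows are indexed by time, columns by contract price."""
    return [[price_volatility(p, t, T) for p in prices] for t in times]


prices = [0.1, 0.3, 0.5, 0.7, 0.9]
times = [0.0, 0.5, 0.9, 0.99]
print("t \\ price " + "".join(f"{p:8.2f}" for p in prices))
for t, row in zip(times, volatility_surface(prices, times)):
    print(f"{t:9.2f}  " + "".join(f"{v:8.3f}" for v in row))
```

Each row is symmetric around a price of 0.5, and every column grows as $t$ approaches $T$, which are the two effects shown in the plots.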

The experimental section in the paper "Modeling Volatility in Prediction Markets" (a shorter conference version was presented at ACM EC'09) indicates that the actual volatility observed in the InTrade prediction markets fits the model well.

Now, given this model, we can judge what is a "noise movement" and what is actually a "significant move" in prediction markets. Furthermore, we can provide an "error margin" for each day, indicating the confidence bounds for the market price.
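As a rough illustration of such an error margin, the sketch below turns the volatility formula into a band for the next price change. It rests on assumptions that are not part of the model as described here: the step is short relative to the time left, the price change over the step is treated as approximately normal with standard deviation $\varphi( N^{-1}( \pi ) ) \cdot \sqrt{\Delta t/(T-t)}$, and the 1.96 multiplier (a two-sided 95% band) is an arbitrary choice. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def step_std(price, time_left, step=1.0):
    """Approximate standard deviation of the price change over one step.

    `time_left` and `step` must be in the same unit (e.g., days).
    """
    if not 0.0 < price < 1.0:
        return 0.0
    if not 0.0 < step < time_left:
        raise ValueError("step must be positive and shorter than time_left")
    x = STD_NORMAL.inv_cdf(price)
    return STD_NORMAL.pdf(x) * math.sqrt(step / time_left)


def price_band(price, time_left, step=1.0, z=1.96):
    """Approximate band for the next price, clipped to [0, 1]."""
    half_width = z * step_std(price, time_left, step)
    return max(0.0, price - half_width), min(1.0, price + half_width)


def is_significant_move(old_price, new_price, time_left, step=1.0, z=1.96):
    """True if the move falls outside the band, i.e., it is unlikely to be noise."""
    return abs(new_price - old_price) > z * step_std(old_price, time_left, step)


# A contract at 0.84 with 15 days to expiration (illustrative numbers).
print(price_band(0.84, time_left=15))
print(is_significant_move(0.84, 0.80, time_left=15))
```

Because the band depends only on the current price and the fraction of time that the step represents, it can be computed directly from the market price, without estimating any parameter of the ability processes.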

I will post more applications of this model in the next few days. We will see how to price the X contracts on InTrade, and a way to compute correlations of the outcomes of state elections, given simply the past movements of their corresponding prediction markets.

Modeling Volatility in Prediction Markets, Part I

A few weeks back, I was thinking about the concept of uncertainty in prediction markets. The price of a contract in a prediction market today gives us the probability that an event will happen. For example, the contract 2008.PRES.OBAMA is trading at 84.0, indicating that there is an 84% chance that Obama will win the presidential election.

Unfortunately, we have no idea about the stability and robustness of this estimate. How likely is it that the contract will fall to 80% tomorrow? How likely is it to jump to 90%? By treating the contract price as a "deterministic" number, we do not capture such information. We need to treat the price as a random variable with its own probability distribution, out of which we observe just the mean by looking at the prediction market.

However, to fully understand the stability of the price we need further information, beyond just the mean of the probability, revealed by the current contract price.

A first step is to look at the volatility of the price. One approach is to look at the past trading behavior, but this analysis will give us the past volatility, not the expected future volatility of the contract.

Predicting Future Volatility using Options

So, how can we estimate the future volatility of a prediction market contract?

There is a market approach to solve this problem. Namely, we can run prediction markets on the results of the prediction markets!

Recently, Intrade has introduced such contracts, the so-called X contracts (listed under "Politics->Options: US Election" in the sidebar). For example, the contract "X.22OCT.OBAMA.>80.0" pays 1 USD if the contract "2008.PRES.OBAMA" is higher than 80.0 on Wed 22 Oct 2008. Traditionally, the threshold defined in the options contract is called the strike price (e.g., the strike price for X.22OCT.OBAMA.>80.0 is 80.0).

A set of such contracts can reveal the distribution of the probability of the event for the underlying contract 2008.PRES.OBAMA. In other words, we can see not only what is the mean probability that Obama will be elected president but we can also see the expected downside risk or upside potential of the 2008.PRES.OBAMA contract. For example, the X.22OCT.OBAMA.>80.0 has a price of 90.0, indicating a 90% chance that the 2008.PRES.OBAMA contract will be above 80.0 on Oct 22nd.

Now, given enough contracts, with strike prices at various levels, we can estimate the probability distribution for the likely prices of the contract. For example, we can have contracts with strike prices 10, 20, ..., 90 that will give us the probability that the contract will trade above 10, 20, ..., and 90 points at some specific point in time, which corresponds to the expiration date of the options contract. So for each date, we need 9 contracts if we want a 10-column histogram that describes the distribution.

Note that if we want to estimate the dynamics of the probability distribution, we will need to set up 9 contracts for each date that we want to measure. Of course, this requires plenty of liquidity in the markets if we want to rely purely on the market for such estimates.
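The conversion from 9 option prices to a 10-column histogram is just a matter of taking differences. Here is a minimal sketch; the function name is hypothetical, and all the quotes in the example are made up for illustration, except for the 0.90 at strike 80, which is the price mentioned above.

```python
def histogram_from_options(prob_above, low=0.0, high=100.0):
    """Turn P(price > strike) quotes into probabilities for each price bin.

    `prob_above` maps each strike to the option price, read as a probability.
    Returns a list of ((bin_start, bin_end), probability) pairs.
    """
    strikes = sorted(prob_above)
    edges = [low] + strikes + [high]
    # The price is above `low` with probability 1 and above `high` with probability 0.
    survival = [1.0] + [prob_above[k] for k in strikes] + [0.0]
    bins = []
    for i in range(len(edges) - 1):
        mass = survival[i] - survival[i + 1]
        if mass < 0:
            raise ValueError("quotes must not increase with the strike price")
        bins.append(((edges[i], edges[i + 1]), mass))
    return bins


# Hypothetical quotes for one expiration date; only the strike-80 value is from the post.
quotes = {10: 0.99, 20: 0.98, 30: 0.97, 40: 0.96, 50: 0.95,
          60: 0.94, 70: 0.93, 80: 0.90, 90: 0.25}
for (start, end), mass in histogram_from_options(quotes):
    print(f"{start:5.0f} to {end:5.0f}: {mass:.2f}")
```

In a thin market the quotes may not be monotone in the strike price, which is why the sketch raises an error instead of silently returning negative probabilities.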

Pricing Options and the Black-Scholes Formula

A natural question is: Can we price such "options on options" contracts?

This will at least give us some guidance on the likely prices of such contracts, if for nothing else than to start the market at the appropriate level. (For example, if we have a market scoring mechanism.)

There is significant research in Finance on pricing options for stocks. The Black-Scholes formula is one of the most well-known examples for deriving prices for options on stocks. The basic idea behind Black-Scholes is that the underlying stock price follows a Brownian motion, moving randomly up and down. Then by extracting the probability that this random stock move will reach various levels, it is possible to derive the option prices. (Terence Tao has a very easy to read 3-page note explaining the Black-Scholes formula and a longer blog posting.)

Why not apply this model directly to price options on prediction markets? There are a few fundamental problems, but the most important one is the bounded price of the underlying prediction market contract. The price of a prediction market contract cannot go below 0 or above 1, so the Brownian motion assumption is invalid. In fact, if we try to apply the Black-Scholes model to a prediction market, we get absurd results.
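One way to see the problem is to compute how much probability a stock-style model puts on impossible prices. The sketch below assumes the usual Black-Scholes setup, in which the price follows a geometric Brownian motion (taken here without drift), and computes the probability that the modeled price ends above 1. The starting price of 0.84 is the one quoted above; the volatility and the horizon are made-up values for illustration.

```python
import math
from statistics import NormalDist


def prob_price_above_one(price_now, sigma, horizon):
    """P(price at the horizon > 1) under driftless geometric Brownian motion.

    Under this assumption, log(price) at the horizon is normal with mean
    log(price_now) - sigma^2 * horizon / 2 and variance sigma^2 * horizon.
    """
    d = (math.log(1.0 / price_now) + 0.5 * sigma ** 2 * horizon) / (
        sigma * math.sqrt(horizon)
    )
    return 1.0 - NormalDist().cdf(d)


# A contract trading at 0.84, with an assumed volatility of 0.5 over one unit of time.
print(prob_price_above_one(0.84, sigma=0.5, horizon=1.0))
```

Any positive number here is already too much: the model assigns real probability to a contract that pays at most 1 being worth more than 1.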

In the next post, I will review an adaptation of the Black-Scholes model that works well for prediction markets, and leads to some very interesting results!

Sunday, October 12, 2008

Student websites

I am just posting this to provide links to the pages of my students, so that Google indexes their websites.

Saturday, October 4, 2008

Reviewing the Reviewers

I received today the latest issue of TOIS, and the title of the editorial by Gary Marchionini caught my eye: "Reviewer Merits and Review Control, in an Age of Electronic Manuscript Management Systems". The article makes the case for using the electronic management systems to allow for grading of the reviewer efforts and allow for memory of the reviewing process, including both the reviews and the reviewer ratings.

In principle, I agree with the idea. Having the complete reviewing history for each reviewer, and for each journal and conference, can bring several improvements in the process:

1. Estimating and Fixing Biases

One way to see the publication process is as noisy labeling of an example, where the true labels are "accept" or "reject". The reviewers can be modeled as noisy processes, each with its own sensitivity and specificity. The perfect reviewer has sensitivity=1, i.e., marks as "accept" all the "true accepts", and has specificity=1, i.e., marks as "reject" all the "true rejects".

Given enough noisy ratings, it is possible to use statistical techniques to infer what is the "true label" for each paper, and infer at the same time the sensitivity and specificity of each reviewer. Bob Carpenter has presented a hierarchical Bayesian model that can be used for this purpose, but simpler maximum likelihood models, like the one of Dawid and Skene, also work very well. In my own (synthetic) experiments, the MLE method worked almost perfectly for recovering both the quality characteristics of the reviewers and the true labels of the papers (of course, without the uncertainty estimates that the Bayesian methods provide).
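To give a flavor of how the maximum likelihood approach works, here is a minimal sketch of a two-class EM estimator in the spirit of Dawid and Skene. It is not the code from the experiments mentioned above: the function names, the majority-vote initialization, the add-one smoothing, and the synthetic reviewer qualities are all choices made for this illustration.

```python
import random


def dawid_skene_binary(votes, n_iter=50):
    """EM for binary labels. `votes` is a list of (paper, reviewer, label),
    with label 1 for "accept" and 0 for "reject".

    Returns (P(true accept) per paper, sensitivity, specificity per reviewer).
    """
    papers = sorted({p for p, _, _ in votes})
    reviewers = sorted({r for _, r, _ in votes})
    by_paper = {p: [] for p in papers}
    for p, r, y in votes:
        by_paper[p].append((r, y))
    # Initialize the posteriors with the fraction of "accept" votes per paper.
    post = {p: sum(y for _, y in v) / len(v) for p, v in by_paper.items()}
    sens, spec = {}, {}
    for _ in range(n_iter):
        # M-step: re-estimate the accept rate and each reviewer's quality,
        # with add-one smoothing so that the estimates stay strictly in (0, 1).
        prior = sum(post.values()) / len(papers)
        num_s = dict.fromkeys(reviewers, 1.0)
        den_s = dict.fromkeys(reviewers, 2.0)
        num_c = dict.fromkeys(reviewers, 1.0)
        den_c = dict.fromkeys(reviewers, 2.0)
        for p, r, y in votes:
            num_s[r] += post[p] * y
            den_s[r] += post[p]
            num_c[r] += (1.0 - post[p]) * (1 - y)
            den_c[r] += 1.0 - post[p]
        sens = {r: num_s[r] / den_s[r] for r in reviewers}
        spec = {r: num_c[r] / den_c[r] for r in reviewers}
        # E-step: posterior probability that each paper is a "true accept".
        for p, v in by_paper.items():
            accept, reject = prior, 1.0 - prior
            for r, y in v:
                accept *= sens[r] if y else 1.0 - sens[r]
                reject *= 1.0 - spec[r] if y else spec[r]
            post[p] = accept / (accept + reject)
    return post, sens, spec


def synthetic_votes(n_papers, quality, seed=0):
    """Generate true labels and noisy votes; quality[r] = (sensitivity, specificity)."""
    rng = random.Random(seed)
    truth = {p: int(rng.random() < 0.5) for p in range(n_papers)}
    votes = []
    for p, label in truth.items():
        for r, (s, c) in quality.items():
            p_accept = s if label else 1.0 - c
            votes.append((p, r, int(rng.random() < p_accept)))
    return truth, votes


quality = {"A": (0.9, 0.9), "B": (0.8, 0.7), "C": (0.7, 0.9)}
truth, votes = synthetic_votes(300, quality)
post, sens, spec = dawid_skene_binary(votes)
for r in sorted(quality):
    print(r, round(sens[r], 2), round(spec[r], 2))
```

With only three reviews per paper the estimates are noisy, so this kind of model becomes more useful as the reviewing history grows.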

One issue with such a model? The assumption that we have an underlying "true" label. For people with different backgrounds and research interests, what is a "true accept" and what is a "true reject" is not easy to define, even with perfect reviewing.

2. Reviewer Ratings

Reviewer reviewing by the editors

The statistical approaches described above reduce the quality of a reviewer into two metrics. However, these ratings only show agreement of the recommendations with the "true" value (publish or not). They say nothing about other aspects of the review: comprehensiveness, depth, timeliness, helpfulness, are all important aspects that need to be captured using different methods.

Marchionini mentions that current manuscript management systems allow the editors to rate reviewers in terms of timeliness and in terms of quality. By following the references, I ran into the article Reviewer Merits, published in Information Processing and Management, where the Editors-in-Chief of many IR journals stated:

Electronic manuscript systems easily provide time data for reviewers and some offer rating scales and note fields for editors to evaluate review quality. Many of us (editors) are beginning to use these capabilities and, over time, we will be able to have systematic and persistent reviewer quality data. Graduate students, faculty, chairs, and deans should be aware that these data are held.

Now, while I agree with reviewer accountability, I think that this statement is not worded properly. I find the use of the phrase "should be aware" semi-threatening. ("We, the editors, are rating you... remember that!")

If reviewer quality history is being kept, then the reviewers should be aware and have access to it. Being reminded that "your history is out there somewhere" is not the way to go. If reviewer quality is going to be a credible evaluation metric, the reviewers need to know how well they did. (Especially junior reviewers, and especially when the review does not meet the quality standards.)

Furthermore, if the editors are the ones rating the reviewers, then who controls the quality of these ratings? How do we know that the evaluation is fair and accurate? Notice that if we have a single editorial quality rating per review, then the statistical approaches described above do not work.

Reviewer reviewing by the authors

In the past, I have argued that authors should rate reviewers. My main point in that post was to propose a system that will encourage reviewers to participate by rewarding the highly performing reviewers. (There is a similar letter to Science, named "Rewarding Reviewers.") Since authors will have to provide multiple feedback points, it is much easier to correct the biases in the reviewer ratings of the authors.

3. Reviewer History and Motivation

If we have a history of reviewers, we should not forget potential side-effects. One clear issue that I see is motivation. If "reviews of reviewers" become a public record, then it is not clear how easy it will be to recruit reviewers.

Right now, many accept invitations to review, knowing that they will be able to do a decent job. If the expectations increase, it will be natural for people to reject invitations, focusing only on a few reviews for which they can do a great job. Arguably, the reviewer record is never going to be as important for evaluation as other metrics, such as research productivity or teaching, so it is unlikely to get more time devoted to it.

So, there will always be the tradeoff: more reviews or better reviews?

One solution that I have proposed in the past: Impose a budget! Any researcher should remove from the reviewing system the workload that he or she generates. Five papers submitted (not accepted) within a year? Assuming that each submission gets three reviews, the researcher needs to review 3x5 = 15 papers to remove the workload that these five papers generated. (See also the article "In Search of Peer Reviewers" that has the same ideas.)

4. Training Reviewers

So, suppose that we have the system in place to keep reviewer history, we have solved the issue of motivation, and now one facet of researcher reputation is the reviewer quality score. How do we learn how to review properly? A system that generates a sensitivity and specificity of a reviewer can provide some information on how strict or lenient a reviewer is, compared to others.

However, we need something more than that. What makes a review constructive? What makes a review fair? In principle, we could rely on academic advising to pass such qualities to newer generations of researchers. In practice, when someone starts reviewing a significant volume of papers, there is no advisor or mentor to oversee the process.

Therefore, we need some guidelines. An excellent set of guidelines is given in the article "A Peer Review How-To". Let me highlight some nuggets:

Reviewers make two common mistakes. The first mistake is to reflexively demand that more be done. Do not require experiments beyond the scope of the paper, unless the scope is too narrow.

Do not reject a manuscript simply because its ideas are not original, if it offers the first strong evidence for an old but important idea.

Do not reject a paper with a brilliant new idea simply because the evidence was not as comprehensive as could be imagined.

Do not reject a paper simply because it is not of the highest significance, if it is beautifully executed and offers fresh ideas with strong evidence.

Seek a balance among criteria in making a recommendation.

Finally, step back from your own scientific prejudices

And now excuse me, because I have to review a couple of papers...

Thursday, October 2, 2008

VP Debate and Prediction Market Volatility

I was watching the VP debate on CNN, and CNN was reporting the reactions of "undecided Ohio voters" to what the VP candidates were saying. Although interesting, it was not satisfying. I wanted a better way to see the real-time reactions. Blogs were relatively slow to post, and mainstream media were simply describing the minutiae of the debate. What is the solution? Easy. Prediction markets!

I remembered that Intrade has a contract VP.DEBATE.OBAMA, "Barack Obama's Intrade value will increase more than John McCain's following the VP debate".

So, during the debate, I was following the fluctuations of the contract's price to measure the reactions. Here is how the contract moved from 8:30pm EST until 10:30pm EST. (The debate started at 9pm EST and lasted until 10:30pm EST.)

At the beginning, the contract was below 50.0%, probably reflecting the fact that Palin was giving reasonable and coherent responses, perhaps disappointing those who were expecting material for a Saturday Night Live performance.

However, in the second 45 minutes of the debate, as the discussion moved to foreign policy issues, the contract started moving up, as Biden started giving more immediate answers, and Palin started avoiding questions and replying with stereotypical, canned answers.

What I found interesting was the significant increase in variance as the debate came close to the end. Prices fluctuated widely during the closing statements of the two VP candidates.

This increased volatility as the contract comes to a close is actually something that we have observed consistently in many contracts over time: when the contract is not close to 0.0 or 1.0, the price fluctuates widely as we get close to expiration. While I could explain this intuitively, I did not have a solid theoretical understanding of why.

So, what to do in this case? You simply ask a PhD student to explain it to you! I asked Nikolay Archak, and within a few weeks, Nikolay had the answer.

The basic result:
  • Volatility increases as contract price gets closer to 0.5,
  • Volatility decreases as contract price gets closer to 0.0 or to 1.0,
  • Volatility increases as we get close to the expiration, and approaches infinity if the price is not 0.0 or 1.0.

I will give more information about the basic ideas of the model and about the technical details in a later post.