Friday, May 23, 2008

The (Statistical) Significance of the Impact Factor

Being in the middle of my tenure track, I cannot avoid running into the different ways that people use to evaluate research. One of the most common ways to evaluate papers (at least at a very high level) is to look at the impact factor of the journal and classify the paper as "published in a top journal," "published in an OK journal," or "published in a B-class journal." I have argued in the past that this is a problematic practice, and an article published in Nature provides the evidence. To summarize the reasoning: articles published within the same journal have widely different citation counts, so using the average is simply misleading.

The best example I have heard that illustrates the problem of reporting averages of highly skewed distributions comes from Paul Krugman's book "The Conscience of a Liberal":
...Bill Gates walks into a bar, the average wealth of the bar's clientele soars...
This is exactly what happens when we evaluate papers using the impact factor of the journal. This introduces two problems:
  • If you evaluate a paper using the impact factor of the journal, the evaluation is almost always a significant overestimate or a significant underestimate of the paper's "impact". (Assuming that citations measure "impact".) See the analysis below for an illustrative example.
  • The impact factor itself is a very brittle metric, as it is heavily influenced by a few outliers. If the in-journal citation distribution indeed follows a power law, then the impact factor itself is a useless metric; the short simulation right after this list shows how badly the mean can misrepresent such a distribution.
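To see why the mean of a heavy-tailed distribution is so fragile, here is a minimal simulation sketch in Python. The Pareto shape parameter and the sample size are arbitrary choices for illustration, not estimates from any real journal.

    import random
    import statistics

    random.seed(0)

    # Hypothetical citation counts drawn from a heavy-tailed Pareto
    # distribution -- simulated data, not from any real journal.
    citations = [int(random.paretovariate(1.5)) for _ in range(200)]

    print("mean  :", round(statistics.mean(citations), 2))  # pulled up by a few outliers
    print("median:", statistics.median(citations))          # the "typical" paper
    print("max   :", max(citations))                        # the Bill Gates of the sample

Rerunning this with a different seed will typically move the mean around noticeably while the median barely changes; that is precisely the brittleness described above.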
To make this more concrete, I will use ACM Transactions on Information Systems as an example. The journal has a rather impressive impact factor for a computer science journal, with an increasing trend.
Now, let's try to dissect the 5.059 impact factor for 2006. The impact factor is the number of citations generated in 2006 pointing to the papers published in 2004 and 2005, divided by the total number of articles published in those two years. According to ISI Web of Knowledge, we have:
2006 Impact Factor

Cites in 2006 to articles published in:
2005 = 25
2004 = 147
Sum: 172

Number of articles published in:
2005 = 15
2004 = 19
Sum: 34

Calculation: 172/34 = 5.059
Now, let's break down these numbers by publication. Looking at the number of citations per publication, we can see that a single paper, "Evaluating collaborative filtering recommender systems" by Herlocker et al., received almost 30 citations in 2006. Taking this single publication out, the impact factor drops to 4.3.
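A quick back-of-the-envelope check of these numbers in Python (the ~30 citations for the Herlocker et al. paper is the approximate figure quoted above):

    # Check of the 2006 impact factor figures quoted above.
    cites_2006 = {2005: 25, 2004: 147}   # citations in 2006 to papers from each year
    articles   = {2005: 15, 2004: 19}    # articles published in each year

    impact_factor = sum(cites_2006.values()) / sum(articles.values())
    print(round(impact_factor, 3))       # 172 / 34 = 5.059

    # Remove the single most-cited paper (Herlocker et al., roughly 30 citations in 2006).
    top_paper_cites = 30                 # approximate figure, as quoted above
    adjusted = (sum(cites_2006.values()) - top_paper_cites) / (sum(articles.values()) - 1)
    print(round(adjusted, 2))            # (172 - 30) / 33 ≈ 4.3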

In fact, if we remove from the calculation the papers published in the Special Issue for Recommender Systems (January 2004), the impact factor drops even further, coming close to 2.5. At the same time, the impact factor of the special-issue papers taken alone is much higher, somewhere around 15.0.

Given the unusually high impact of that special issue, we can expect the 2007 impact factor for TOIS to decrease substantially. It would not be surprising to see the 2007 impact factor return to pre-2003 levels.

This simple example illustrates that the impact factor rarely represents the "average" paper published in the journal. There are papers that are significantly stronger than the impact factor suggests and papers that are significantly weaker. (Implication: authors who use the impact factor of a journal as a representative metric of the quality of their research are using a metric that is almost never representative.)

Therefore, a set of other metrics may be preferable. The obvious choices are to use the median instead of the average, and to report the Gini coefficient of the citations of the papers published in the journal. The Gini coefficient shows how representative the impact factor actually is; a sketch of both metrics follows below. The next step is to examine the distribution of the number of citations within each journal. Is it a power law, or an exponential? (I was not able to locate an appropriate reference.) Having these answers would lead to better analysis and easier comparisons.
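Here is a minimal Python sketch of the two suggested metrics. The per-paper citation counts in the example list are made up purely for illustration; in practice they would come from a citation database such as ISI Web of Knowledge.

    import statistics

    def gini(values):
        """Gini coefficient of non-negative counts (0 = perfectly equal, close to 1 = highly skewed)."""
        xs = sorted(values)
        n, total = len(xs), sum(xs)
        if n == 0 or total == 0:
            return 0.0
        # Standard formula based on the rank-weighted sum of the sorted values.
        rank_weighted = sum((i + 1) * x for i, x in enumerate(xs))
        return (2 * rank_weighted) / (n * total) - (n + 1) / n

    # Hypothetical per-paper citation counts for one journal (illustration only).
    citations = [0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 30]

    print("mean  :", round(statistics.mean(citations), 2))  # the impact-factor-style view
    print("median:", statistics.median(citations))          # the typical paper
    print("gini  :", round(gini(citations), 2))             # how concentrated the citations are

A high Gini coefficient would signal that a handful of papers dominate the citation counts, i.e., that a mean-based impact factor says little about the typical paper in the journal.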

Monday, May 12, 2008

Experimental Repeatability or simply Open Source?

This year SIGMOD and KDD started playing with the idea of experimental repeatability. The basic idea is to generate guidelines and processes that will encourage repeatability of the experiments presented in many papers.

The reasons are rather obvious: we need to be able to reproduce the experiments to avoid hidden bias, catch errors, and even prevent outright fraud. Furthermore, this encourages the publication of techniques that are easy to implement and test. Why do we care? If a method is impossible to implement, it becomes an obstacle to research progress. A published paper that claims to be the state of the art but is not reproducible may prevent other, reproducible methods from being published, simply for lack of a comparison against the current state of the art.

Now, to achieve experimental repeatability we need two things:
  • Access to the data sets
  • Access to the code
Both parts tend to have issues: when someone uses multi-terabyte data sets, it is highly unclear how to give outsiders access to such data. (Our work on the evolution of web databases used a 3.3TB dataset -- I have no idea how to even make the data available.) Other issues include copyrighted datasets, e.g., archives of newspaper articles. Despite these issues, I believe that in the end it is relatively easy to give access to the datasets used. See, for example, the UCI Machine Learning Repository, the UCR Time Series, the Linguistic Data Consortium, the Wharton Research Data Services (WRDS), and Daniel Lemire's set of pointers. (Feel free to post more pointers in the comments.)

The second aspect is access to the underlying code. One may argue that instead of giving access to the code we should describe clearly how to implement the algorithms, give the settings, and so on. This avoids any intellectual property issues, and everyone is happy. Personally, I do not buy this. No matter how carefully someone reimplements someone else's algorithms, nobody is going to spend much time optimizing the code for a competing technique. This can lead to flawed experimental comparisons. Another alternative is to use common datasets and simply pick the performance numbers from the published paper, without reimplementing the competing technique. (This works only when the underlying hardware is irrelevant -- e.g., for precision/recall experiments in information retrieval.)

My own take? Encourage the publication of open source software. If the code is open and available, comparisons are easy, and the whole issue of experimental repeatability becomes moot. There is no need for committees to verify that the reported results are indeed correct, no need to upload code to machines with a different architecture and make sure that it runs without segmentation faults, and so on. If the code is available, even if the results are incorrect, someone will catch that in the future. (And if the results are incorrect, the code and data are available, and nobody cares to replicate the results, then experimental repeatability is a moot point anyway.)

Now, it is easy to talk about open source, but anyone who has tried knows what a pain it is to take the scripts used to run experiments and make them ready to use by anyone else. (Or even to be reused later by the author :-) Therefore, we need to give further incentives. The JMLR journal has the idea of a track for submissions of open source software; this track serves as "a venue for collection and dissemination of open source software".

Perhaps this is the way to proceed: an alternative to "experimental repeatability requirements" that may be too difficult to follow.

Wednesday, May 7, 2008

SCECR 2008: Symposium on Statistical Challenges in Electronic Commerce Research

Those of you who live in the NYC area and are interested in social networks, user-generated content, and statistical approaches to problems in the area may want to consider attending a two-day symposium organized at NYU on May 18th and 19th. Below, I attach the call for participation.
We're pleased to invite you to participate in the 2008 Symposium on Statistical Challenges in Electronic Commerce Research, to be hosted in New York City by NYU Stern's Center for Digital Economy Research, on May 18th and May 19th, 2008.

The theme of this year's symposium is "Social Networks and User-Generated Content". The symposium features over 35 excellent talks by researchers from economics, information systems, machine learning, marketing and statistics. Our keynote speakers are Daryl Pregibon (Google) and Duncan Watts (Yahoo Research).

To find out more about how to register and attend the symposium, visit
http://w4.stern.nyu.edu/ceder/events.cfm?doc_id=7911. The registration fee for academic attendants will be $250.