Monday, May 12, 2008

Experimental Repeatability or simply Open Source?

This year SIGMOD and KDD started playing with the idea of experimental repeatability. The basic idea is to establish guidelines and processes that encourage repeatability of the experiments presented in published papers.

The reasons are rather obvious: we need to be able to reproduce the experiments in order to uncover hidden biases, catch errors, and even detect outright fraud. Furthermore, this encourages the publication of techniques that are easy to implement and test. Why do we care? If a method is impossible to implement, it becomes an obstacle to research progress. A published paper that claims to be the state of the art but is not reproducible may prevent other, reproducible methods from being published, simply for lack of a comparison against the current state of the art.

Now, to achieve experimental repeatability we need two things:
  • Access to the data sets
  • Access to the code
Both parts tend to have issues: When someone uses multi-terabyte data sets, it is highly unclear how to give outsiders access to such data. (Our work on the evolution of web databases used a 3.3TB dataset -- I have no idea how to even make the data available.) Other issues include copyrighted datasets, e.g., archives of newspaper articles. Despite these issues, I believe that in the end it is relatively easy to give access to the datasets used. See, for example, the UCI Machine Learning Repository, the UCR Time Series, the Linguistic Data Consortium, the Wharton Research Data Services (WRDS), and Daniel Lemire's set of pointers. (Feel free to post more pointers in the comments.)
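To make this concrete, here is a minimal Python sketch of how a reader might pull a small public dataset straight from one of these repositories and start experimenting. It assumes the classic iris file is still hosted at its long-standing UCI path, so the URL may need adjusting if the repository has moved things around.

```python
# Minimal sketch: fetching a small public dataset for an experiment.
# Assumes the classic "iris" file is still at its long-standing
# UCI Machine Learning Repository path; adjust the URL if it has moved.
import csv
import io
import urllib.request

UCI_IRIS_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
)

def load_iris(url=UCI_IRIS_URL):
    """Download the CSV file and return a list of (features, label) pairs."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # the file ends with a blank line
            continue
        *features, label = record
        rows.append(([float(x) for x in features], label))
    return rows

if __name__ == "__main__":
    data = load_iris()
    print(f"Loaded {len(data)} labeled examples")
```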

The second aspect is access to the underlying code. One may argue that, instead of giving access to the code, we should describe clearly how to implement the algorithms, give the parameter settings, and so on. This avoids any intellectual property issues, and everyone is happy. Personally, I do not buy this. No matter how faithfully someone reimplements someone else's algorithm, nobody is going to spend much time optimizing the code for a competing technique. This can lead to flawed experimental comparisons. Another alternative is to use common datasets and simply take the performance numbers from the published paper, without reimplementing the competing technique. (This works only when the underlying hardware is irrelevant -- e.g., for precision/recall experiments in information retrieval.)
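As a small illustration of why such numbers transfer across papers, here is a sketch in Python: precision and recall are just set arithmetic over the retrieved and relevant items, so the machine that produced them is irrelevant. The document identifiers below are made up for illustration.

```python
# Sketch: precision/recall are hardware-independent.
# They are pure set arithmetic over retrieved vs. relevant items,
# so numbers reported in one paper can be reused in another.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 3 of the 4 retrieved documents are relevant,
# out of 6 relevant documents overall.
p, r = precision_recall({"d1", "d2", "d3", "d7"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```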

My own take? Encourage the publication of open source software. If the code is open and available, comparisons are easy, and the whole issue of experimental repeatability becomes moot. There is no need for committees to verify that the reported results are correct, no need to upload code onto machines with a different architecture and make sure it runs without segmentation faults, and so on. If the code is available, even if the results are incorrect, someone will catch that in the future. (And if the results are incorrect, the code and data are available, and nobody cares to replicate the results, then experimental repeatability is a moot point anyway.)

Now, it is easy to talk about open source, but anyone who has tried knows what a pain it is to take the scripts used to run experiments and make them ready for use by anyone else. (Or even to be reused later by the author :-) Therefore, we need to give further incentives. One such incentive is the idea of the JMLR journal to have a track for submissions of open source software; this track serves as "a venue for collection and dissemination of open source software".
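For concreteness, here is a hypothetical sketch of what "ready for use by anyone else" might mean in practice: no hard-coded paths, an explicit random seed, and the parameters recorded next to the results they produced. The run_method placeholder stands in for whatever technique a paper actually evaluates; everything here is illustrative, not anyone's actual release.

```python
# Hypothetical sketch of the cleanup that makes an experiment script
# reusable: no hard-coded paths, an explicit random seed, and results
# written out together with the parameters that produced them.
# "run_method" is a placeholder for the actual technique under study.
import argparse
import json
import random

def run_method(data, seed):
    """Placeholder for the actual technique; returns a dummy score."""
    rng = random.Random(seed)
    return {"score": rng.random(), "n_examples": len(data)}

def main():
    parser = argparse.ArgumentParser(description="Reproducible experiment driver")
    parser.add_argument("dataset", help="path to the input data file")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="results.json")
    args = parser.parse_args()

    with open(args.dataset) as f:
        data = f.readlines()

    results = run_method(data, args.seed)
    # Record the exact parameters alongside the results.
    with open(args.output, "w") as f:
        json.dump({"params": vars(args), "results": results}, f, indent=2)

if __name__ == "__main__":
    main()
```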

Perhaps this is the way to proceed, an alternative to the "experimental repeatability requirements" that may be too difficult to follow.