Monday, June 23, 2008

Massive Data and the End of the Scientific Method

I have been reading the latest issue of Wired, which has just arrived in my mailbox and which is provocatively titled "The End of Science."

The opening article, by Chris Anderson, discusses how the availability of huge amounts of data allows us to use data mining techniques effectively and make discoveries without any underlying scientific model or hypothesis.

The article starts with a quote from George Box:
All models are wrong but some are useful.
which Peter Norvig of Google rephrased for today's era of massive datasets:
All models are wrong and increasingly you can succeed without them.
Being a data junkie myself, I cannot disagree that you can achieve significant things by "letting the data speak". In fact, computer scientists very rarely work by strictly following the "scientific method" of hypothesis formulation, experimentation, and statistical testing to see if the experiments agree with the theory.

However, thinking of this change as a paradigm shift made me wonder whether we might go through a period in which we actually know less about the world as we transition to a "new way of doing things".

Consider this quote from Chris Anderson:
Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
If we indeed adopt a scientific model like this, then at some point we will lose some of the elegance and intuition offered by the (imperfect) models being dropped in favor of black-box models that are simply trained on the available data. And when theory eventually catches up with the current experimental state of the art, we will start developing new models once again.

Thursday, June 12, 2008

Statistical Significance of Sequential Comparisons

Suppose that we have a new technique and we need to compare it with an existing baseline. For that, we can run some tests (e.g., using multiple data sets or multiple users). Assume for simplicity that each test has only a binary outcome:
  1. Our technique outperformed the baseline
  2. The baseline outperformed our technique
The basic option is to select beforehand the size of the "test set" (i.e., how many data sets or how many users we will use), so that we end up with N results. A simple way to check the statistical significance of the difference is to run a "sign test" on the wins and losses of our technique against the baseline. The table below illustrates which combinations of wins and losses generate a statistically significant outcome, according to a one-tailed sign test:

The gray cells indicate significance at the 10% level, the red cells at the 5% level, the yellow cells at the 1% level, the green cells at the 0.1% level, and the blue cells at the 0.001% level.
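As a quick sketch of how such a table can be reproduced: the one-tailed sign-test p-value for k wins out of N tests is just the upper tail of a Binomial(N, 1/2) distribution. The few lines of Python below (my own illustration; the range of N and the list of significance levels are arbitrary choices) print, for each sample size, the minimum number of wins needed to reach each level.

from math import comb

def sign_test_pvalue(wins, n):
    """One-tailed sign-test p-value: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# For each sample size n, the smallest number of wins reaching each significance level.
levels = [0.10, 0.05, 0.01, 0.001, 0.00001]
for n in range(5, 21):
    needed = [next((k for k in range(n + 1) if sign_test_pvalue(k, n) <= alpha), None)
              for alpha in levels]
    print(n, needed)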

Now, suppose that we want to run the experiments at the minimum possible cost, in terms of the number of questions asked. So we keep running tests until we reach a desired level of statistical significance (say, the 1% level). Once we reach that level, we stop; otherwise, we keep increasing the sample size.

This is a pretty common practice, and it happens a lot when we do not manage to reach the desired level of statistical significance on the first attempt: we keep increasing the sample size, hoping that we will eventually hit significance.

There is, however, a hidden bias in this practice, similar to the hidden bias that appears in the Monty Hall problem. We stop as soon as we see something favorable, ignoring the possibility that the next few cases might reduce the statistical significance of the findings.
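A small simulation sketch (my own illustration, in Python; the cap of 50 tests and the number of simulated runs are arbitrary choices) shows the size of this bias: even when the technique and the baseline are truly equivalent, stopping as soon as the sign test dips below the 5% level declares "significance" well above 5% of the time.

import random
from math import comb

def sign_test_pvalue(wins, n):
    """One-tailed sign-test p-value: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def sequential_experiment(max_n=50, alpha=0.05):
    """Add one test at a time; stop as soon as the sign test looks significant."""
    wins = 0
    for n in range(1, max_n + 1):
        wins += random.random() < 0.5   # under the null, each outcome is a fair coin flip
        if sign_test_pvalue(wins, n) <= alpha:
            return True                 # declared "significant", stop early
    return False                        # never reached significance

random.seed(0)
runs = 2000
false_positives = sum(sequential_experiment() for _ in range(runs))
print(f"nominal level: 5%, observed false-positive rate: {100 * false_positives / runs:.1f}%")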

In principle, if we want to be absolutely correct, then when the first sample of size N does not give statistical significance, we should pick a new sample, ignore all previous samples, and conduct the test again. At the end, to assess the statistical significance of the findings, we apply the Bonferroni correction, which ensures that we are not overconfident in the statistical significance of the findings. (Intuitively, after running 20 experiments, at least one of them is expected to show significance at the 5% level purely by chance, so the significance level has to be multiplied by the number of experiments.)
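For concreteness, a minimal sketch of the Bonferroni adjustment for this "fresh sample each time" procedure (the numbers in the example are made up for illustration):

def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: each raw p-value is multiplied by the
    number of experiments, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

# Example: 20 independent experiments, one of which happened to hit p = 0.04.
raw = [0.04] + [0.50] * 19
print(bonferroni(raw)[0])   # 0.8 -- no longer significant at the 5% level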

However, I do not know what the correction is for the case of sequential experiments, in which we do not discard the samples that failed to reach statistical significance. It may be easy to work out analytically, but I am too lazy to do it.