Statistical Significance of Sequential Comparisons

Thursday, June 12, 2008

Suppose that we have a new technique and we need to compare it with some existing baseline. For that, we can run some tests (e.g., using multiple data sets, multiple users). Assume for simplicity that we have only a binary outcome of each test:

Our technique outperformed the baseline
The baseline outperformed our technique

The basic option is to fix the size of the "test set" beforehand (i.e., how many data sets or how many users we will use), which gives us N results. A simple way to check whether the difference between the baseline and our technique is statistically significant is to run a "sign test". The table below illustrates which combinations of "yes" and "no" results generate a statistically significant outcome, according to a one-tailed sign test:

The gray cells indicate significance at the 10% level, the red cells indicate significance at the 5% level, the yellow cells significance at the 1% level, the green at 0.1%, and the blue ones at the 0.001% level.
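For readers who want to reproduce individual cells of such a table, here is a minimal sketch of the one-tailed sign test. The function name `sign_test_p_value` is made up for this illustration; the only assumption is the usual null hypothesis that each test is a fair coin flip between "our technique wins" and "the baseline wins".

```python
from math import comb

def sign_test_p_value(wins: int, losses: int) -> float:
    """One-tailed sign test p-value.

    Under the null hypothesis each test is a fair coin flip, so the number
    of wins X follows Binomial(wins + losses, 0.5). The p-value is
    P(X >= wins), i.e. the chance of doing at least this well by luck.
    """
    n = wins + losses
    # Count the outcomes with at least `wins` wins, out of 2**n equally likely ones.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n

print(sign_test_p_value(8, 2))  # 56/1024 = 0.0546875, significant at 10% but not at 5%
print(sign_test_p_value(9, 1))  # 11/1024 = 0.0107421875, significant at 5% but not at 1%
```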

Now, suppose that we want to run experiments with the minimum possible cost, in terms of asking questions. Therefore, we keep running tests until reaching a desired level of statistical significance (say at 1% level). When we reach the level of statistical significance, we stop. Otherwise we keep increasing the sample size.

This is a pretty common practice, and it occurs a lot when we do not manage to hit the level of statistical significance on the first attempt: we keep increasing the size of the sample, hoping that we will eventually reach statistical significance.

There is however a hidden bias in this practice, similar to the hidden bias that appears in the Monty Hall problem. We stop as soon as we see something favorable, ignoring the possibility that the next few cases may reduce the statistical significance of the findings.
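To get a feeling for the size of this bias, we can simulate the "stop as soon as it looks significant" procedure when there is no real difference between the two techniques. The sketch below is only an illustration: the names `critical_wins` and `false_positive_rate`, the cap of 100 tests, and the number of simulated runs are all choices made for this example.

```python
import random
from math import comb

def critical_wins(n: int, alpha: float):
    """Smallest number of wins out of n that a one-tailed sign test calls
    significant at level alpha, or None if even n out of n is not enough."""
    tail = 0
    threshold = None
    for k in range(n, -1, -1):
        tail += comb(n, k)            # outcomes with at least k wins
        if tail / 2 ** n > alpha:     # p-value for k wins is too large
            break
        threshold = k
    return threshold

def false_positive_rate(max_n: int, alpha: float, trials: int, seed: int = 0) -> float:
    """Fraction of simulated null experiments (fair coin flips) in which we
    would have stopped and declared significance at some point up to max_n."""
    rng = random.Random(seed)
    needed = [critical_wins(n, alpha) for n in range(max_n + 1)]
    hits = 0
    for _ in range(trials):
        wins = 0
        for n in range(1, max_n + 1):
            wins += rng.random() < 0.5            # one more test, 50/50 outcome
            if needed[n] is not None and wins >= needed[n]:
                hits += 1                         # we would stop and claim success
                break
    return hits / trials

# With no real difference, a fixed-size test at the 5% level is wrong at most
# 5% of the time; checking after every single test is wrong more often.
print(false_positive_rate(max_n=100, alpha=0.05, trials=2000))
```

The exact figure depends on the cap and on the random seed, but the rate of false "discoveries" comes out above the nominal 5%, because every additional look is another chance to get lucky.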

In principle, if we want to be absolutely correct, then whenever the first sample of size N does not give statistical significance we should pick a new sample, ignoring all previous samples, and conduct the test again. At the end, to assess the statistical significance of the findings, we apply the Bonferroni correction, which ensures that we are not overconfident in the statistical significance of the findings. (Intuitively, if we run 20 experiments, we expect about one of them to show significance at the 5% level purely by chance, so the p-value has to be multiplied by the number of experiments.)
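The arithmetic behind the parenthetical remark above can be sketched in a few lines. The helper name `bonferroni` and the example p-values are made up for this illustration, and the "at least one" calculation assumes the 20 experiments are independent.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of
    experiments, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Chance that at least one of 20 independent null experiments looks
# significant at the 5% level: about 0.64.
prob_any = 1 - 0.95 ** 20
print(round(prob_any, 2))

# Hypothetical example: two fresh samples, with raw p-values 0.01 and 0.04.
print(bonferroni([0.01, 0.04]))  # [0.02, 0.08]
```

Equivalently, we can leave the p-values alone and divide the significance threshold by the number of experiments.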

However, I do not know what the correction is in the case of sequential experiments, in which we keep the sample cases that did not yield statistical significance and simply add more. It may be easy to work it out analytically, but I am too lazy to do it.
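Short of an analytical answer, the overall error rate of the sequential procedure can at least be computed numerically. The sketch below is one possible approach and not an established correction: it assumes that we look at the results after every single test, that we use the same significance threshold at every look, and that we stop after at most `max_n` tests. All function names are hypothetical.

```python
from math import comb

def critical_wins(n, alpha):
    """Smallest number of wins out of n significant at level alpha in a
    one-tailed sign test, or None if no outcome is significant."""
    tail, threshold = 0, None
    for k in range(n, -1, -1):
        tail += comb(n, k)
        if tail / 2 ** n > alpha:
            break
        threshold = k
    return threshold

def overall_false_positive_rate(max_n, per_look_alpha):
    """Exact probability, when each test is a fair coin flip, of stopping
    with a 'significant' result at any look from 1 to max_n."""
    alive = {0: 1.0}   # wins so far -> probability of being here, not yet stopped
    stopped = 0.0
    for n in range(1, max_n + 1):
        nxt = {}
        for wins, p in alive.items():
            nxt[wins] = nxt.get(wins, 0.0) + p / 2          # next test is a loss
            nxt[wins + 1] = nxt.get(wins + 1, 0.0) + p / 2  # next test is a win
        need = critical_wins(n, per_look_alpha)
        if need is not None:
            for wins in [w for w in nxt if w >= need]:
                stopped += nxt.pop(wins)   # these paths stop and claim success
        alive = nxt
    return stopped

def first_safe_per_look_alpha(max_n, target, candidates):
    """First candidate threshold whose overall error rate is at most target."""
    for alpha in candidates:
        if overall_false_positive_rate(max_n, alpha) <= target:
            return alpha
    return None

print(overall_false_positive_rate(8, 0.05))  # 13/256 = 0.05078125, already above 5%
print(first_safe_per_look_alpha(100, 0.05, [0.05, 0.02, 0.01, 0.005, 0.002, 0.001]))
```

The idea is to track only the paths that have not yet triggered a stop, so the probability mass that ever crosses the threshold is counted exactly once. Searching over a short list of candidate thresholds is crude, but it gives a usable per-look level for a given cap on the number of tests.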