- Our technique outperformed the baseline
- The baseline outperformed our technique
Gray cells indicate significance at the 10% level, red cells at the 5% level, yellow cells at the 1% level, green cells at the 0.1% level, and blue cells at the 0.001% level.
Now, suppose that we want to run experiments at the minimum possible cost, in terms of the number of questions we ask. We therefore keep running tests until we reach a desired level of statistical significance (say, the 1% level). Once we reach that level, we stop; otherwise, we keep increasing the sample size.
This is a pretty common practice, and it comes up especially when we do not hit the desired level of statistical significance on the first attempt: we keep increasing the sample size, hoping that we will eventually reach statistical significance.
There is, however, a hidden bias in this practice, similar to the hidden bias that appears in the Monty Hall problem: we stop as soon as we see something favorable, ignoring the possibility that the next few cases would reduce the statistical significance of the findings.
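To see how large this bias can get, here is a minimal simulation sketch of the procedure in Python. The data model (normal with mean zero), the one-sample t-test from scipy, and all the parameter values (batch size, cap on the sample size, number of trials) are my own assumptions for illustration, not part of the original setup. The data are drawn under the null hypothesis, so there is no real effect by construction, and every "significant" result the procedure declares is a false positive.

```python
# Simulate the "keep adding data until significant" procedure under the
# null hypothesis (no real effect) and count how often it falsely
# declares significance at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_trial(start_n=10, step=10, max_n=500, alpha=0.05):
    """Test after every batch of new observations; stop at first p < alpha.

    Returns True if the procedure ever declares significance. Since the
    data are drawn from the null (mean 0), True means a false positive.
    """
    data = rng.normal(loc=0.0, scale=1.0, size=start_n)
    while len(data) <= max_n:
        # One-sample t-test against the true mean of 0.
        _, p = stats.ttest_1samp(data, popmean=0.0)
        if p < alpha:
            return True   # stop as soon as the result looks "significant"
        data = np.concatenate([data, rng.normal(0.0, 1.0, size=step)])
    return False          # gave up without reaching significance

trials = 2000
false_positives = sum(optional_stopping_trial() for _ in range(trials))
print(f"false positive rate: {false_positives / trials:.1%}")
```

On typical runs the reported rate comes out at several times the nominal 5%, which is exactly the hidden bias described above: peeking after every batch gives chance many opportunities to masquerade as a real effect.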
In principle, if we want to be absolutely correct, then when the first sample of size N does not give statistical significance, we should pick a fresh sample, ignoring all previous samples, and conduct the test again. At the end, to assess the statistical significance of the findings, we apply the Bonferroni correction, which guards against overconfidence in the statistical significance of the findings. (Intuitively, after running 20 experiments at the 5% level, at least one of them is expected to appear significant purely by chance, so each p-value has to be multiplied by the number of experiments, or, equivalently, the significance threshold divided by it.)
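For completeness, here is a minimal sketch of that correction in Python: with k independent experiments, each raw p-value is multiplied by k (and capped at 1) before being compared against the desired significance level. The raw p-values below are made up purely for illustration.

```python
# Bonferroni correction for k independent experiments: adjusted p = p * k,
# capped at 1, compared against the original significance level.
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values."""
    k = len(p_values)
    return [min(p * k, 1.0) for p in p_values]

raw = [0.004, 0.03, 0.20, 0.012, 0.55]   # hypothetical raw p-values
for p, p_adj in zip(raw, bonferroni(raw)):
    # Only results that survive the correction should be reported
    # as significant at the 5% level.
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, "
          f"significant at 5%: {p_adj < 0.05}")
```

Note that the first experiment above (raw p = 0.004) stays significant after adjustment, while the second (raw p = 0.03) does not: running five experiments has made the 0.03 unremarkable.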
However, I do not know what the correction is in the case of sequential experiments, in which we do not discard the samples that failed to reach statistical significance. It may be easy to work out analytically, but I am too lazy to do it.