Monday, June 23, 2008

Massive Data and the End of the Scientific Method

I have been reading the latest issue of Wired, which has just arrived in my mailbox, and which is provocatively titled "The End of Science."

The opening article, by Chris Anderson, discusses how the availability of huge amounts of data allows us to effectively use data-mining techniques and make discoveries without any underlying scientific model or hypothesis.

The article starts with a quote from George Box:
All models are wrong but some are useful.
which Peter Norvig of Google rephrased for today's massive-dataset era:
All models are wrong and increasingly you can succeed without them.
Being a data junkie myself, I cannot disagree that you can achieve significant things by "letting the data speak". In fact, computer scientists very rarely work by strictly following the "scientific method" of hypothesis formulation, experimentation, and statistical testing to see whether the experiments agree with the theory.
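A toy sketch of what "letting the data speak" means in practice (everything here is my own illustration, not from the article): a predictor that only averages nearby observations can track a physical quantity quite well without ever encoding the underlying physics, here the small-angle pendulum formula T = 2π√(L/g).

```python
import math

g = 9.81  # gravitational acceleration, m/s^2

def period_model(length):
    """The mechanistic model: small-angle pendulum period T = 2*pi*sqrt(L/g)."""
    return 2 * math.pi * math.sqrt(length / g)

# "Training data": observed (length, period) pairs -- synthesized here,
# but in principle they could come from measurements alone.
data = [(l / 10.0, period_model(l / 10.0)) for l in range(1, 51)]

def period_blackbox(length):
    """A model-free predictor: average the periods of the two nearest
    observed lengths. It knows no physics, only the data."""
    nearest = sorted(data, key=lambda p: abs(p[0] - length))[:2]
    return sum(t for _, t in nearest) / 2

# The data-driven answer tracks the model's answer closely,
# without any mechanistic explanation.
print(period_model(1.23), period_blackbox(1.23))
```

The black-box predictor gives useful answers, but unlike the formula it offers no insight into why the period depends on length and not, say, on mass, which is exactly the kind of intuition at stake in this debate.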

However, if we think of this change as a paradigm shift, it raises the possibility that we will go through a period in which we actually know less about the world, as we transition to a "new way of doing things".

Consider this quote from Chris Anderson:
Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
If we indeed adopt a model of science like this, then at some point we will miss some of the elegance and intuition offered by the (imperfect) models being dropped in favor of black-box models that are simply trained on the available data. And when theory eventually catches up with the experimental state of the art, we will develop new models once again.