Model Selection in Genomics. (Editorials)
Shmulevich, Ilya, Environmental Health Perspectives
With the discovery of DNA, the completion of genome sequencing of a number of organisms, and the advent of powerful high-throughput measurement technologies such as microarrays, it is now commonly said that biology has gone through a revolution. But I have also heard it said that biology is only about to go through a scientific revolution, much as physics did in the 17th century. With messianic hope, people foretell the coming of the Newton of biology, but it is up to us, the scientific community, to set the stage for that to happen.
Both views are valid, each in its own sense. The discovery of DNA and the more recent development of powerful new technologies have certainly revolutionized our understanding of the inner workings of life and allowed us to probe deep into the machinery of living organisms, much as the Copernican system and Galileo's telescope helped revolutionize astronomy. It was Sir Isaac Newton, however, who placed science on a solid footing by formalizing existing knowledge in terms of mathematical models and universal laws. In some sense, this was the real scientific revolution because it permitted prediction of physical phenomena in a general setting, as opposed to simply describing individual observations. The difference is profound. Although a mathematical equation can adequately describe a given set of observations, it may lack the universality needed for making predictions. Kepler's equations pertained to planets in our solar system. Newton's laws could be used to predict what would happen to two arbitrary bodies anywhere in the universe. The universality of a scientific theory coupled with mathematical modeling allows us to make testable predictions. This ability will have a profound effect on the field of biology.
The hallmarks of a great scientific theory are universality and simplicity. Newton's law of gravity is a case in point. The fact that the force of attraction between any two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them is both universal and simple. These issues are especially important today in the rapidly evolving field of genomics, where formal mathematical and computational methods are becoming indispensable. So what should be our guiding principles, our beacons of scientific inquiry? One such fundamental principle underpinning all scientific investigation is Ockham's razor, also called the "law of parsimony."
Consider the following, seemingly straightforward problem. We are presented with a set of data, represented as pairs of numbers (x,y). In each pair, the first number (x) is an independent variable and the second number (y) is a dependent variable. The problem is to choose whether to fit a line (of the form y = a + bx) or a parabolic function (of the form y = a + bx + cx^2). The knee-jerk response might be as follows: Let's fit the parabolic function, since the linear function is clearly a special case of it, just by letting c = 0; thus, the parabola will always provide a better fit to our data set. After all, if it so happens that our data points are arranged on a line, the estimation of parameters (a, b, and c) will simply reveal that c is indeed equal to zero and the parabolic function will reduce to a linear one. Thus, it would seem, three "adjustable" parameters are better than two. Of course, such reasoning could be taken ad absurdum if we had freedom to choose as many parameters as we like. …
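The line-versus-parabola comparison described above can be tried numerically. What follows is only a minimal sketch, not anything taken from the editorial: the data points are made up for illustration, the helper names `fit_polynomial` and `rss` are hypothetical, and "fit" is assumed to mean ordinary least squares. It shows that the residual sum of squares of the parabola is never larger than that of the line, which is exactly why the knee-jerk argument is tempting.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients [a, b, c, ...].

    Solves the normal equations (A^T A) coeffs = A^T y by Gaussian
    elimination with partial pivoting. Adequate for small, well-scaled
    examples such as this one.
    """
    n = degree + 1
    # Normal equations: M[i][j] = sum of x^(i+j), v[i] = sum of y * x^i
    M = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    v = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]

    # Forward elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        v[col], v[pivot] = v[pivot], v[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n):
                M[row][k] -= factor * M[col][k]
            v[row] -= factor * v[col]

    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (v[i] - tail) / M[i][i]
    return coeffs


def rss(xs, ys, coeffs):
    """Residual sum of squares of the polynomial with the given coefficients."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (y - predicted) ** 2
    return total


# Hypothetical, made-up data: roughly y = 2x with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.3, 1.8, 4.1, 6.2, 7.9, 10.4]

line = fit_polynomial(xs, ys, 1)       # [a, b]
parabola = fit_polynomial(xs, ys, 2)   # [a, b, c]

rss_line = rss(xs, ys, line)
rss_parabola = rss(xs, ys, parabola)

print("line coefficients:    ", line)
print("parabola coefficients:", parabola)
print("RSS line:    ", rss_line)
print("RSS parabola:", rss_parabola)
```

Because the line is the special case c = 0 of the parabola, the least-squares parabola can always match or beat the line on the data it was fitted to. The improvement says nothing, by itself, about which model would predict new observations better.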