Like other ancient texts, the human genome has been copied many times. With three billion letters to reproduce, errors creep in. Some defective versions are destined not to be copied again; they are lethal mutations. But some variants are harmless, or at least not fatal, just as some spellings are slightly different but mean the same thing: Americans write "favor," Britons "favour." Other spellings affect the meaning of the sentence in which they occur--in genomic terms, the gene--but leave it in a more or less readable and useful form.
Molecular biologists call such common single-letter variations SNPs--short for single nucleotide polymorphisms. Nucleotides are the four letters of the genomic alphabet, A, T, C and G; "polymorphism" is from the Greek and means simply "many forms." Scientists have described roughly two million SNPs in the human genome and are likely to discover a few million more. When you want to look for a gene, SNPs are like marker flags. For one person, flag number 1,000,000 might be green and number 1,000,001 yellow, while another person may have different colors in either or both locations.
Suppose we want to find genes associated with a common, complex disease such as diabetes. We might recruit two groups of people--those who have type 2 diabetes and those who do not--and check whether the groups differ systematically in any of their flag colors. If most people in one group have a yellow flag in position 1,000,000 while most of those in the other group have a blue flag at the same spot, a gene linked to the disease may lurk nearby on the genome.
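For readers who think in code, the comparison just described can be sketched in a few lines of Python. This is a toy illustration, not a real association study: each person is a string of flag colors (G, Y, B), the function names and the 0.5 cutoff are invented for the example, and actual studies rely on proper statistical tests rather than a fixed gap in frequencies.

```python
from collections import Counter


def flag_frequencies(group):
    """For each flag position, return the fraction of people with each color."""
    n_positions = len(group[0])
    freqs = []
    for pos in range(n_positions):
        counts = Counter(person[pos] for person in group)
        freqs.append({color: n / len(group) for color, n in counts.items()})
    return freqs


def candidate_positions(cases, controls, threshold=0.5):
    """Return positions where some color's frequency differs between the
    two groups by at least `threshold` (an arbitrary, illustrative cutoff)."""
    hits = []
    pairs = zip(flag_frequencies(cases), flag_frequencies(controls))
    for pos, (case_f, ctrl_f) in enumerate(pairs):
        colors = set(case_f) | set(ctrl_f)
        gap = max(abs(case_f.get(c, 0.0) - ctrl_f.get(c, 0.0)) for c in colors)
        if gap >= threshold:
            hits.append(pos)
    return hits


# Made-up data: three flags per person, four people per group.
cases = ["GYB", "GYB", "GYG", "YYB"]      # people with the disease
controls = ["GBB", "GBB", "YBG", "GYB"]   # people without it

print(candidate_positions(cases, controls))  # -> [1]
```

Only the middle flag separates the two groups here: every case carries yellow, while three of the four controls carry blue. That is the kind of spot where a disease-linked gene might lurk nearby.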
Of course, with two million flags to check, searching for genes can be excruciatingly laborious. The solution? A Haplotype Genome Project.
It turns out that for long stretches of the genome, your flag color at one position may also predict the color of the next hundred or so flags. Why? Because your ancestors tended to inherit that stretch of the genome, flags and genes included, as one continuous segment. Like the slips of an ancient scholar copying a text, the variations they introduced into the genome tend to be reproduced by their descendants. These stretches are haplotypes--chunks of genomic text that appear repeatedly within a given population. For every newly discovered haplotype that covers, say, 100 SNPs, we would have to determine the color of only a single flag to know the color of all of them. If we could do that for the entire genome (an unlikely scenario), we could reduce a hundredfold the number of flags we need to track down--from two million to twenty thousand. …
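The arithmetic, and the idea of reading one flag to learn a whole block, can be made concrete with a short Python sketch. The haplotype table below is invented purely for illustration, as are the function names; real haplotype blocks vary in length and rarely predict every flag perfectly.

```python
def flags_to_check(total_flags, flags_per_haplotype):
    """Number of flags to read if one flag stands in for each block
    (ceiling division, so a partial final block still counts)."""
    return -(-total_flags // flags_per_haplotype)


# Hypothetical lookup table: the color of the one flag we read
# determines the colors of every flag in the same block.
HAPLOTYPES = {
    "green": ["green", "yellow", "yellow", "blue"],
    "yellow": ["yellow", "yellow", "green", "blue"],
}


def infer_block(tag_color):
    """Fill in a whole block of flags from its single tag flag."""
    return HAPLOTYPES[tag_color]


print(flags_to_check(2_000_000, 100))  # -> 20000
print(infer_block("green"))            # -> ['green', 'yellow', 'yellow', 'blue']
```

With two million flags and blocks of 100, only twenty thousand flags would need to be read, which matches the hundredfold saving described above.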