A Statistical Multiprobe Model for Analyzing Cis and Trans Genes in Genetical Genomics Experiments with Short-Oligonucleotide Arrays
Alberts, Rudi, Terpstra, Peter, Bystrykh, Leonid V., de Haan, Gerald, Jansen, Ritsert C., Genetics
Short-oligonucleotide arrays typically contain multiple probes per gene. In genetical genomics applications a statistical model for the individual probe signals can help in separating "true" differential mRNA expression from "ghost" effects caused by polymorphisms, misdesigned probes, and batch effects. It can also help in detecting alternative splicing, start, or termination.
In a genetical genomics experiment, a panel of 30 genetically different recombinant inbred mice was derived from a cross between parental strains C57BL/6 (B6) and DBA/2 (D2) (JANSEN and NAP 2001; BYSTRYKH et al. 2005). These 30 mice were profiled with Affymetrix MG-U74Av2 arrays (12,422 probe sets), using RNA isolated from hematopoietic stem cells. The observed array data were background corrected and quantile normalized (BOLSTAD et al. 2003; GAUTIER et al. 2004). Although various methods have been developed to compute a single expression value per probe set for further data analysis (e.g., ZHANG et al. 2003; WU and IRIZARRY 2004; MANLY et al. 2005), we here develop an alternative statistical method that exploits more fully the information contained in the individual probe signals.
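As an illustration of the quantile-normalization step mentioned above, the following is a minimal NumPy sketch. It is not the implementation used for the data in this study; the function name `quantile_normalize` and the probes-by-arrays layout are assumptions made for the example, and ties are broken by order rather than averaged.

```python
import numpy as np

def quantile_normalize(signals):
    """Force every array (column) to share one empirical distribution.

    signals: 2-D array, rows = probes, columns = arrays (assumed layout).
    """
    signals = np.asarray(signals, dtype=float)
    # Sort order of the probes within each array; ties are broken by position.
    order = np.argsort(signals, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(signals, order, axis=0)
    # Reference distribution: mean across arrays at each rank.
    reference = sorted_vals.mean(axis=1)
    # Write the reference value for each rank back to the probe holding that rank.
    normalized = np.empty_like(signals)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized
```

After this step every column contains the same set of values, so that differences between arrays in overall intensity distribution no longer contribute to differences between individual probe signals.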
Differential expression for a given gene can result from trans-regulation by other genes or from cis-regulation due to variation in the region of the gene itself (altering functional motifs in the promoter region, changing the stability of the mRNA, or modifying the gene product in such a way that the feedback loop is shifted; JANSEN and NAP 2004). In either case signal differences are expected to be rather stable across probes (Figure 1A). The differences may, however, also change from one probe to another, due to known or unknown single-nucleotide polymorphisms (SNPs) or microdeletions between mRNA transcripts of different samples (Figure 1B; see also DOSS et al. 2005), due to misdesigned probes (the majority of probe sets in Figure 1 are sequence verified; see also MECHAM et al. 2004), or due to other known or unknown factors (e.g., alternative transcription). In such cases computing a single expression value per probe set, as in the current methods, leads to a loss of biologically relevant information. This will also hold for future alternative transcription/splicing arrays with probes located in different exons and not in the last exon or 3'-untranslated region only, as in the MG-U74Av2 array used in our experiment (SHAROV et al. 2005). When the differences in signal between probes match information in alternative splicing databases, they indicate that alternative transcription/splicing, and not an SNP, is the cause.
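The contrast between signal differences that are stable across probes and differences that change from probe to probe can be sketched as a test for probe-by-genotype interaction. The example below is a simplified fixed-effects illustration, not the statistical model developed in this article: it assumes that samples fall into two genotype classes (for example, the B6 and D2 allele at a marker), it ignores correlation between probes measured on the same array, and the function name `probe_by_genotype_f` is hypothetical.

```python
import numpy as np

def probe_by_genotype_f(signal, genotype):
    """F statistic for probe-by-genotype interaction within one probe set.

    signal:   (n_samples, n_probes) array of log-scale probe signals.
    genotype: length n_samples array of 0/1 genotype labels (assumed coding).
    Returns (F, df_interaction, df_residual).
    """
    signal = np.asarray(signal, dtype=float)
    n_samples, n_probes = signal.shape
    y = signal.ravel()  # sample-major: all probes of sample 0, then sample 1, ...
    g = np.repeat(np.asarray(genotype, dtype=float), n_probes)
    probe = np.tile(np.arange(n_probes), n_samples)
    probe_dummies = (probe[:, None] == np.arange(n_probes)).astype(float)

    # Additive model: one level per probe plus a genotype shift shared by all probes.
    x_add = np.column_stack([probe_dummies, g])
    # Full model: every probe has its own genotype shift.
    x_full = np.column_stack([probe_dummies, probe_dummies * g[:, None]])

    def rss(x):
        beta, *_ = np.linalg.lstsq(x, y, rcond=None)
        resid = y - x @ beta
        return float(resid @ resid)

    rss_add, rss_full = rss(x_add), rss(x_full)
    df_int = n_probes - 1
    df_res = y.size - 2 * n_probes
    f_stat = ((rss_add - rss_full) / df_int) / (rss_full / df_res)
    return f_stat, df_int, df_res
```

A probe set whose genotype difference is the same for every probe gives an F statistic near 1, whereas a single probe with a deviating difference (as would be expected for a probe overlapping an SNP) inflates the statistic.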
Genes colocalizing within 20 Mb of their QTL are termed here cis genes, and genes mapping elsewhere are termed trans genes. The cis genes show many more probe-specific QTL effects than the trans genes do (Figure 1, C and D). It is expected that probe sets carrying the more influential polymorphisms between probe and transcript will be picked up as cis-acting with probe-specific QTL effects. Indeed, 10 cis-acting genes carry currently known SNPs in one or more of their probes, and the directions of the probe-specific QTL effects are in agreement with the SNPs; i.e., the mouse allele perfectly matching the probes on the array has the higher signal (allowing us to use the array for genotyping as well; see also JANSEN and NAP 2001; ROSTOKS et al. 2005). Further lab research (e.g., sequencing of the D2 strain) will make clear how many of the cis-acting effects are caused by SNPs, alternative transcription, or other (hidden) factors.
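The 20-Mb rule for labeling a gene as cis or trans can be written down in a few lines. This is only a sketch: the function and argument names are hypothetical, positions are assumed to be given in megabases on a common map, and treating a distance of exactly 20 Mb as cis is an assumption.

```python
CIS_WINDOW_MB = 20.0  # window stated in the text

def classify_gene(gene_chrom, gene_pos_mb, qtl_chrom, qtl_pos_mb,
                  window_mb=CIS_WINDOW_MB):
    """Return "cis" if the gene lies within window_mb of its QTL, else "trans"."""
    same_chrom = gene_chrom == qtl_chrom
    if same_chrom and abs(gene_pos_mb - qtl_pos_mb) <= window_mb:
        return "cis"
    # Different chromosome, or same chromosome but farther than the window.
    return "trans"
```

For example, a gene at 55 Mb on chromosome 7 with a QTL at 62 Mb on the same chromosome would be labeled cis, whereas the same gene with a QTL on chromosome 11 would be labeled trans.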
Our analysis separates probe sets that are "consistently" cis across probes from those that are more "probe-specific" cis and should be investigated in more detail in silico or in the lab. Figure 1C shows that P-values for probe-by-QTL interaction can be very extreme. On the one hand, probe-specific QTL effects can be large relative to e_ij (and thus statistically significant); on the other hand, they can still be small relative to the average QTL effect (and thus biologically not really relevant). …