On Relevance of Codon Usage to Expression of Synthetic and Natural Genes in Escherichia coli

Article excerpt


A recent investigation concluded that codon bias did not affect expression of green fluorescent protein (GFP) variants in Escherichia coli, while stability of an mRNA secondary structure near the 5' end played a dominant role. We demonstrate that combining the two variables using regression trees or support vector regression yields a biologically plausible model with better support in the GFP data set and in other experimental data: codon usage is relevant for protein levels if the 5' mRNA structures are not strong. Natural E. coli genes had weaker 5' mRNA structures than the examined set of GFP variants and did not exhibit a correlation between the folding free energy of 5' mRNA structures and protein expression.

In genomes, natural selection may act on silent sites of codons to make translation of highly expressed genes more efficient, an effect linked primarily to abundances of tRNA isoacceptor molecules (Ikemura 1985; Bulmer 1987; Kanaya et al. 1999). Codon choice may also be linked to formation of secondary structures in mRNA that reduce protein levels, as has been shown with haplotypes of the human COMT gene (Nackley et al. 2006). Kudla et al. (2009) have recently reported an experiment that contributes toward understanding how synonymous codon usage shapes gene expression. They constructed a library of 154 synthetic variants of a green fluorescent protein (GFP) gene that varied randomly at synonymous sites while retaining the original amino acid sequence. The authors concluded that codon usage (CU) bias did not correlate with protein levels measured as fluorescence of the GFP, but that the minimum free energy of an mRNA secondary structure in a 42-nucleotide region at [-4,37] that overlaps the start codon ("hairpin stability") is of great significance. CU bias was quantified by the widely used codon adaptation index (CAI) method (Sharp and Li 1987), essentially a measure of the distance of a gene's codon usage to the codon usage of a predefined set of highly expressed genes. The CAI and some of its more recent alternatives, such as measure independent of length and composition (MILC) (Supek and Vlahovicek 2005), have been shown to be a viable surrogate for gene expression in various unicellular organisms. Additionally, in a multiple linear regression of rank fluorescence against a number of sequence-derived attributes, including CAI and the abovementioned hairpin stability, Kudla et al. (2009) did not find CAI to contribute significantly toward the prediction of protein levels, in contrast to the hairpin stability.
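To make the CAI mentioned above concrete, the following is a minimal sketch of the usual formulation: each codon receives a relative adaptiveness w (its count in the reference set divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The function names, the toy reference sequence, and the pseudocount of 0.5 for codons absent from the reference set are assumptions of this sketch, not details taken from Kudla et al. (2009) or from the present article.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(seq):
    """Split an in-frame coding sequence into codons (a trailing partial codon is dropped)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Return w for every codon: count / count of the most used synonymous codon.

    Assumption of this sketch: codons never seen in the reference set are
    given `pseudocount` so that the geometric mean stays defined.
    """
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    families = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        families[aa].append(codon)
    w = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no codon-choice information
        adjusted = {c: counts[c] or pseudocount for c in family}
        top = max(adjusted.values())
        for c in family:
            w[c] = adjusted[c] / top
    return w


def cai(gene, w):
    """Geometric mean of w over the informative codons of `gene`."""
    logs = [math.log(w[c]) for c in codons(gene) if c in w]
    if not logs:
        raise ValueError("no informative codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference "highly expressed" set (hypothetical sequence, for illustration only).
w = relative_adaptiveness(["CTGCTGCTGCTAAAAAAAAAG"])
print(cai("CTGAAA", w))  # uses only the preferred Leu and Lys codons
print(cai("CTAAAG", w))  # uses less preferred synonymous codons
```

A gene built only from the codons preferred in the reference set scores 1, and the score falls as rarer synonymous codons are used, which is the sense in which the CAI measures distance to the codon usage of highly expressed genes.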

Both the codon adaptation index and the 5' mRNA secondary structures influence protein levels in the Kudla et al. data: The described statistical analyses, however, failed to address the case in which a nonlinear three-way dependency between hairpin stability, codon usage, and fluorescence might exist; data are visualized in Figure 1, A-C, and in figure 2B in Kudla et al. Such complex patterns in data are readily captured by the support vector machines (SVM) algorithm, reviewed in Noble (2006) and Ben-Hur et al. (2008). We have employed the SVM with a radial basis function kernel to regress fluorescence against both hairpin stability and CAI simultaneously (Figure 1B) and computed the Pearson's correlation coefficient in cross-validation (here denoted as Q) between true and predicted values of fluorescence (see File S1). A linear model based solely on hairpin stability as employed by Kudla et al. (Figure 1A) can explain Q² = 38.6% of variance in protein levels, while the nonlinear SVM regression that takes CAI into account explains Q² = 52.2% of variance. The difference in Q is statistically significant at P = 10⁻¹⁹⁰ (paired t-test). Note that Kudla et al. utilize the Spearman rank correlation coefficient (r) in their article; the hairpin stability would explain r² = 44.6% of the variance in expression levels if the requirement for a linear relationship were abandoned in this manner. …
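As an illustration of how a cross-validated correlation such as Q can be obtained, the sketch below computes out-of-fold predictions and then the Pearson correlation between true and predicted values, using a one-variable least-squares line as a stand-in for the linear baseline. It is not the procedure of File S1: the number of folds, the random seed, the function names, and the synthetic data are all assumptions made for illustration, and no values from the GFP data set are reproduced. Any other regressor, for example an RBF-kernel SVM from a machine-learning library, could in principle be passed in place of `fit_line`.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x


def cross_validated_q(xs, ys, fit, k=10, seed=0):
    """Correlation between true values and predictions made on held-out folds.

    `k` and `seed` are assumptions of this sketch, not values from the article.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    predicted = [None] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            predicted[i] = model(xs[i])  # each point is predicted by a model that never saw it
    return pearson(ys, predicted)


# Synthetic stand-in data (NOT the Kudla et al. measurements): a noisy linear
# relation between a folding free energy and a fluorescence-like response.
rng = random.Random(1)
stability = [rng.uniform(-15.0, -2.0) for _ in range(154)]
fluorescence = [2.0 * s + rng.gauss(0.0, 5.0) for s in stability]

q = cross_validated_q(stability, fluorescence, fit_line)
print(f"Q = {q:.3f}, Q^2 = {q * q:.3f}")
```

Squaring the returned value gives the cross-validated fraction of variance explained, which is how the Q² figures above are to be read; comparing two models then amounts to comparing their Q values over the same folds.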