Proteomics: Protein Identification Using Online Databases

Bioinformatics, the discipline of biology that employs computerized search algorithms and extensive databases of biological information to investigate biological processes and relationships, has grown exponentially in the past decade. Genomics continues to be the best-known and most data-rich area of bioinformatics; the Human Genome Project, as well as the sequencing of genomes from many other species, has amassed genetic data from laboratories around the world. These data are available in public databases such as those maintained by the National Center for Biotechnology Information (NCBI; web addresses for the websites discussed in this article are listed in Table 1). Many genomics-oriented educational activities have been developed to allow students to use genomic data repositories to study biological questions (BSCS, 2003; Buxeda & Moore-Russo, 2003; Wefer, 2003; Herron et al., 2010).

In addition to genomics, proteomics is a growing discipline of bioinformatics. Alongside the extensive databases of nucleotide sequences, there are databases containing the amino acid sequences of proteins isolated from species as diverse as bacteria and humans. These databases allow protein properties to be discovered and analyzed in a manner analogous to the analysis of nucleotide sequences. Thus, proteomics is the area of bioinformatics that makes use of these amino acid sequence databases and allows examination of all the proteins expressed by a cell, tissue, or organism.

Why put such emphasis on proteins? Proteins are almost exclusively responsible for cellular function and metabolism, as well as for much of cellular structure. Therefore, proteins determine the phenotype of the cell and the organism. Being able to identify and examine the proteins present in cells or tissues, and compare protein expression among groups, can provide important information concerning an organism's physiology, health, or evolutionary history. Thus, proteomics has been the recent focus of much research and technology development (Yates et al., 2009), and the field will grow in importance as scientists explore the links among genes, protein expression, and biological function (Gstaiger & Aebersold, 2009).

Background

The activity begins with data files generated by undergraduates at Franklin & Marshall College (F&M) during a semester-long proteomics laboratory focused on environmental stress in yeast. Baker's yeast (Saccharomyces cerevisiae) is an excellent study organism for bioinformatics laboratories because its genome has been sequenced, more than 58,000 S. cerevisiae protein sequences are available in the UniProt database, and its complex eukaryotic metabolism is comparable to that of multicellular organisms like plants and animals.

Here, we briefly describe the process by which these data files were produced; detailed descriptions of these lab procedures can be found in the background materials on the Teaching Bioinformatics website (Table 1). Students chose environmental stresses to investigate, including heat, glucose starvation, high ethanol concentration, and hydrogen peroxide (oxidative stress). Differences in protein expression between control and stressed yeast were determined using two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) (Figure 1). The students located proteins of interest (those that changed in abundance after stress) and cut them from the 2D-PAGE gels. The proteins of interest were then digested into smaller peptides using the enzyme trypsin. A liquid chromatography-tandem mass spectrometer (LC-MS/MS) measured the molecular mass of each peptide, as well as the masses of additional "daughter" peptides generated by fragmentation during the analysis.
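
The digestion and mass-measurement steps described above can also be illustrated computationally. The following Python sketch is not part of the F&M laboratory procedure or of any database search tool. It assumes the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), it ignores missed cleavages and chemical modifications, and it uses standard monoisotopic residue masses. The function names and the example sequence are hypothetical and chosen only for illustration.

```python
# Standard monoisotopic masses (in daltons) of amino acid residues,
# i.e. each amino acid minus the water lost in forming a peptide bond.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # one water is added back per intact peptide


def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    Simplifying assumption: complete digestion with no missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Return the monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER_MASS


# Hypothetical example sequence, not taken from the student data files.
example_protein = "MKWVTFISLLRPAGK"
for peptide in tryptic_digest(example_protein):
    print(peptide, round(peptide_mass(peptide), 4))
```

Comparing a list of theoretical masses like this one with the masses measured by the instrument is, in broad terms, what allows a database search to propose which protein a set of peptides came from.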

Students begin this activity by downloading the data files generated by F&M undergraduates from the Teaching Bioinformatics website. They use these peptide mass data to identify proteins associated with stress responses in yeast. …