Representation of Protein-Sequence Information by Amino Acid Subalphabets

By Andersen, Claus A. F.; Brunak, Soren | AI Magazine, Spring 2004

Proteins typically contain 20 different amino acids, which have been selected during evolution from a much larger pool of possibilities that exists in Nature. Protein sequences are constructed from this alphabet of 20 amino acids, and most proteins with a sequence length of 200 amino acids or more contain all 20, albeit with large differences in frequency. Some amino acids are very common, but others are rare. The human genome encodes at least 100,000 to 200,000 different protein sequences, with lengths ranging from small peptides with 5 to 10 amino acids to large proteins with several thousand amino acids.

A key problem in constructing computational methods for the analysis of protein data is how to represent the sequence information (Baldi and Brunak 2001). The literature contains many examples of how to deal with the fact that the 20 amino acids are related to one another in terms of biochemical properties. The situation is closely analogous to natural language alphabets, where two vowels might be more "similar" than any vowel-consonant pair, a consideration that matters, for example, when constructing speech-synthesis algorithms.

In this article, we do not attempt to cover every approach to representing protein sequences computationally; instead, we restrict the review to recent developments in the area of amino acid subalphabets, where the idea is to discover groups of amino acids that can be lumped together, thus giving rise to alphabets with fewer than 20 symbols. These subalphabets can then be used to rewrite, or reencode, the original protein sequence, with the aim of improving the performance of an AI algorithm that is designed to detect a particular functional feature and receives the simplified input. The idea is completely general, and similar approaches might be relevant in other symbol-sequence data domains, for example, in natural language processing.
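To make the reencoding step concrete, the following minimal Python sketch rewrites a protein sequence in a reduced alphabet. The six-group partition, the group symbols, and the names GROUPS, build_mapping, and reencode are hypothetical choices made for illustration only; the partition is a rough biochemical grouping and is not one of the subalphabets reviewed in this article.

```python
# Hypothetical partition of the 20 amino acids into 6 groups.
# Illustrative only: the grouping and the representative symbols are assumptions.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}


def build_mapping(groups):
    """Turn {group_symbol: member_letters} into {amino_acid: group_symbol}."""
    mapping = {}
    for symbol, members in groups.items():
        for aa in members:
            if aa in mapping:
                raise ValueError(f"amino acid {aa!r} assigned to two groups")
            mapping[aa] = symbol
    return mapping


def reencode(sequence, mapping, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet.

    Letters that are not in the mapping (e.g. ambiguity codes) become `unknown`.
    """
    return "".join(mapping.get(aa, unknown) for aa in sequence.upper())


MAPPING = build_mapping(GROUPS)

if __name__ == "__main__":
    print(reencode("MKVLDE", MAPPING))
```

Under this illustrative grouping, the sequence MKVLDE is rewritten as AKAADD; a different partition of the amino acids would give a different reencoding.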

It should be mentioned that alphabet expansion can also be advantageous in some cases, that is, rewriting sequences in expanded alphabets whose symbols each cover more than one original symbol, thus encoding significant correlations between individual symbols directly into the rewritten sequence. For example, deoxyribonucleic acid (DNA) sequences contain four different nucleotides (ACGT), but a rewrite as dinucleotides (AA, AC, AG, ...) or trinucleotides (AAA, AAC, AAG, ...) might lead to a DNA representation in which functional patterns are easier for machine learning algorithms to detect. This is the case, for example, when artificial neural networks are used to detect the small part of the DNA that actually encodes proteins (Hebsgaard et al. 1996). The protein-encoding part of the DNA in the human genome is a few percent of the total DNA in the chromosomes; therefore, the problem is to detect protein-encoding segments in a "sea" of noncoding DNA. This task is made easier when the sequences are also analyzed as dinucleotides (16-symbol alphabet) and trinucleotides (64-symbol alphabet) (Hebsgaard et al. 1996).
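The alphabet expansion described above can be sketched in a few lines of Python. The function names kmer_alphabet and to_kmers are hypothetical, and the use of overlapping windows that advance one nucleotide at a time is an assumption; the text does not specify how the sequences were segmented in the cited work.

```python
from itertools import product


def kmer_alphabet(k, letters="ACGT"):
    """Enumerate every symbol of the expanded alphabet (all words of length k)."""
    return ["".join(p) for p in product(letters, repeat=k)]


def to_kmers(sequence, k):
    """Rewrite a sequence as overlapping k-mers (assumed window step of 1)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    print(len(kmer_alphabet(2)), len(kmer_alphabet(3)))
    print(to_kmers("ACGTA", 2))
    print(to_kmers("ACGTA", 3))
```

The expanded alphabets have 4^2 = 16 and 4^3 = 64 symbols, matching the dinucleotide and trinucleotide alphabet sizes given in the text. Each expanded symbol carries the correlation between neighboring nucleotides that a single-letter encoding leaves implicit.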

In proteins, the common 20-letter amino acid alphabet contains groups of amino acids with similar biochemical properties, which can be merged for improved computational analysis. Thus, we obtain subalphabets with fewer than 20 symbols. However, one should be careful when merging individual amino acids solely on the basis of their biochemical properties because many functional patterns in proteins are embedded as sequence correlations. This contextual aspect is again similar to natural language, where the pronunciation of the four As in the sentence "Mary had a little lamb" requires three different phonemes because the contexts of the As are different (Sejnowski and Rosenberg 1987).

Likewise, amino acids do not contribute independently to the function or structure of a protein. The amino acid alanine, for example, is found in many different types of protein structure depending on the surrounding amino acids. In this sense, amino acid sequences should be read in the same way that natural language sequences are read, where the short- and long-range symbol correlations are essential for the pronunciation. …
