Magazine article AI Magazine

Representation of Protein-Sequence Information by Amino Acid Subalphabets

Article excerpt

Proteins typically contain 20 different amino acids, which have been selected during evolution from a much larger pool of possibilities that exists in Nature. Protein sequences are constructed from this alphabet of 20 amino acids, and most proteins with a sequence length of 200 amino acids or more contain all 20, albeit with large differences in frequency. Some amino acids are very common, but others are rare. The human genome encodes at least 100,000 to 200,000 different protein sequences, with lengths ranging from small peptides with 5 to 10 amino acids to large proteins with several thousand amino acids.

A key problem when constructing computational methods for analysis of protein data is how to represent the sequence information (Baldi and Brunak 2001). The literature contains many different approaches to handling the fact that the 20 amino acids are related to one another in terms of biochemical properties. The situation is analogous to natural language alphabets, where, for example, two vowels might be treated as more "similar" than any vowel-consonant pair when constructing speech-synthesis algorithms.

In this article, we do not attempt to cover every way of representing protein sequences computationally; instead, we restrict the review to recent developments in the area of amino acid subalphabets, where the idea is to discover groups of amino acids that can be lumped together, thus giving rise to alphabets with fewer than 20 symbols. These subalphabets can then be used to rewrite or reencode the original protein sequence, with the aim of improving the performance of an AI algorithm designed to detect a particular functional feature when it receives the simplified input. The idea is completely general, and similar approaches might be relevant in other symbol-sequence data domains, for example, in natural language processing.
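The reencoding step described above can be sketched in a few lines of Python. The grouping used here is purely illustrative: it is a hypothetical six-symbol subalphabet based on broad biochemical classes, not a grouping taken from the article, and the names `GROUPS`, `build_mapping`, and `reencode` are assumptions introduced for this example.

```python
# Hypothetical six-symbol subalphabet (an assumption for illustration only;
# the article does not prescribe this particular grouping).
GROUPS = {
    "h": "AVLIMC",  # hydrophobic, aliphatic or sulfur-containing
    "a": "FWY",     # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KRH",     # positively charged
    "-": "DE",      # negatively charged
    "s": "GP",      # small or conformationally special
}


def build_mapping(groups):
    """Invert {symbol: members} into {amino_acid: symbol}."""
    mapping = {}
    for symbol, members in groups.items():
        for amino_acid in members:
            if amino_acid in mapping:
                raise ValueError(f"{amino_acid} is assigned to two groups")
            mapping[amino_acid] = symbol
    return mapping


MAPPING = build_mapping(GROUPS)


def reencode(sequence, mapping=MAPPING, unknown="x"):
    """Rewrite a one-letter protein sequence in the reduced alphabet.

    Letters outside the 20 standard amino acids map to `unknown`.
    """
    return "".join(mapping.get(aa, unknown) for aa in sequence.upper())


if __name__ == "__main__":
    print(reencode("MKVLA"))  # prints "h+hhh"
```

The reencoded string has the same length as the original sequence, so position-specific information is preserved; only the identity of each residue is coarsened to its group.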

Alphabet expansion can also be advantageous in some cases: sequences are rewritten in a larger alphabet in which each symbol covers more than one original symbol, thus encoding significant correlations between individual symbols directly into the rewritten sequence. For example, deoxyribonucleic acid (DNA) sequences contain four different nucleotides (A, C, G, T), but a rewrite as dinucleotides (AA, AC, AG, ...) or trinucleotides (AAA, AAC, AAG, ...) might lead to a DNA representation where functional patterns are easier to detect by machine learning algorithms. This is the case when artificial neural networks are used to detect the small part of the DNA that actually encodes proteins (Hebsgaard et al. 1996). The protein-encoding part of the DNA in the human genome is a few percent of the total DNA in the chromosomes; therefore, the problem is to detect protein-encoding segments in a "sea" of noncoding DNA. This task is made easier when the sequences are also analyzed as dinucleotides (16-symbol alphabet) and trinucleotides (64-symbol alphabet) (Hebsgaard et al. 1996).
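As a rough illustration of alphabet expansion, the sketch below enumerates the dinucleotide and trinucleotide alphabets and rewrites a DNA string in them. Whether the windows should overlap is not stated in the text, so the `step` parameter is an assumption (1 gives overlapping windows, `k` gives non-overlapping ones), and the function names are hypothetical.

```python
from itertools import product


def kmer_alphabet(k, letters="ACGT"):
    """All k-letter symbols over the nucleotide alphabet, in sorted order."""
    return ["".join(symbol) for symbol in product(letters, repeat=k)]


def rewrite_as_kmers(sequence, k, step=1):
    """Rewrite a DNA sequence as a list of k-nucleotide symbols.

    step=1 yields overlapping windows; step=k yields non-overlapping ones.
    Trailing nucleotides that do not fill a whole window are dropped.
    """
    if k < 1 or step < 1:
        raise ValueError("k and step must be positive")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


if __name__ == "__main__":
    print(len(kmer_alphabet(2)))                 # prints 16
    print(len(kmer_alphabet(3)))                 # prints 64
    print(rewrite_as_kmers("ACGTA", 2))          # prints ['AC', 'CG', 'GT', 'TA']
    print(rewrite_as_kmers("ACGTAC", 3, step=3)) # prints ['ACG', 'TAC']
```

Each expanded symbol carries the joint identity of neighboring nucleotides, which is how short-range correlations end up encoded directly in the input presented to a learning algorithm.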

In proteins, the common 20-letter amino acid alphabet contains groups of amino acids with similar biochemical properties, which can be merged for improved computational analysis. This yields subalphabets with fewer than 20 symbols. However, one should be careful when merging individual amino acids solely based on their biochemical properties because many functional patterns in proteins are embedded as sequence correlations. This contextual aspect is again similar to natural language, where the pronunciation of the four As in the sentence "Mary had a little lamb" requires three different phonemes because the contexts of the As are different (Sejnowski and Rosenberg 1987).

Likewise, amino acids do not contribute independently to the function or structure of proteins. The amino acid alanine, for example, is found in many different types of protein structure depending on the surrounding amino acids. In this sense, amino acid sequences should be read in the same way that natural language sequences are read, where the short- and long-range symbol correlations are essential for the pronunciation. …
