Magazine article AI Magazine

Using Machine Learning to Design and Interpret Gene-Expression Microarrays

Article excerpt

Almost every cell in the body of an organism has the same deoxyribonucleic acid (DNA). Genes are portions of this DNA that code for proteins or (less commonly) other large biomolecules. As Hunter covers in his introductory article in this special issue (and, for completeness, we review in the next section of this article), a gene is expressed through a two-step process in which the gene's DNA is first transcribed into ribonucleic acid (RNA), which is then translated into the corresponding protein. A novel technology of gene-expression microarrays--whose development started in the second half of the 1990s and is having a revolutionary impact on molecular biology--allows one to monitor the DNA-to-RNA portion of this fundamental biological process.

Why should this new development in biology interest researchers in machine learning and other areas of AI? Although the ability to measure transcription of a single gene is not new, the ability to measure the transcription of all the genes in an organism at once is new. Consequently, the amount of data that biologists need to examine is overwhelming. Many of the data sets we describe in this article consist of roughly 100 samples, where each sample contains expression measurements for about 10,000 genes taken on a gene-expression microarray. Suppose 50 of these samples come from patients with one disease, and the other 50 come from patients with a different disease. Finding some combination of genes whose expression levels can distinguish these two groups of patients is a daunting task for a human but a relatively natural one for a machine learning algorithm. Of course, this example also illustrates a challenge that microarray data pose for machine learning algorithms--the dimensionality of the data (the number of genes) is high compared to the typical number of data points (the number of samples).
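To make the two-group example concrete, the following is a minimal sketch in Python, not a method taken from this article. It generates synthetic expression values (the data, the function names, and the choice of "informative" genes are all hypothetical) and ranks genes by a simple two-sample t-like score, one common way to look for genes whose expression differs between two groups. It uses fewer genes than a real microarray so that it runs quickly.

```python
import random
import statistics

def make_synthetic_data(n_per_class=50, n_genes=1000, informative=(3, 7),
                        shift=3.0, seed=0):
    """Return (samples, labels) of made-up expression values.

    Assumption for illustration only: every gene is Gaussian noise, and the
    genes listed in `informative` are shifted upward in the second group.
    """
    rng = random.Random(seed)
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in informative:
                    row[g] += shift
            samples.append(row)
            labels.append(label)
    return samples, labels

def rank_genes(samples, labels):
    """Rank gene indices by the magnitude of a Welch-style t-statistic."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        # Standard error of the difference between the two group means.
        se = (statistics.variance(a) / len(a)
              + statistics.variance(b) / len(b)) ** 0.5
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / se)
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)

samples, labels = make_synthetic_data()
top_two = rank_genes(samples, labels)[:2]
print(sorted(top_two))
```

Scoring each gene separately is only a starting point. With far more genes than samples, some uninformative genes will score well by chance, so any selection of this kind would need to be validated on samples that were held out from the ranking step.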

The preceding paragraph gives one natural example of how one can apply machine learning to microarray data. There are many other tasks that arise in analyzing microarray data and correspondingly many ways in which machine learning is applicable. We present a number of such tasks, with an effort to describe each task concisely and give concrete examples of how researchers have addressed such tasks, together with brief summaries of their results. (1) Before discussing these particular tasks and approaches, we summarize the relevant biology and biotechnology. This article closes with future research directions, including the analysis of several new types of high-throughput biological data, similar to microarray data, that are becoming available based on other advances in biotechnology.

Some Relevant Introductory Biology

The genes of an organism are expressed through the production of proteins, (2) the building blocks of life. This process of gene expression occurs in all organisms, from bacteria to plants to humans. Each gene encodes a specific protein, (3) and at each point in the life of a given cell, various proteins are being produced. By turning the production of specific proteins on and off, an organism responds to environmental and biological situations, such as stress, and to different developmental stages, such as cell division.

Genes are contained in the DNA of the organism. The mechanism by which proteins are produced from their corresponding genes is a two-step process (figure 1). The first step is the transcription of a gene from DNA into a temporary molecule known as RNA. During the second step--translation--cellular machinery builds a protein using the RNA message as a blueprint. Although there are exceptions to this process, these steps (along with DNA replication) are known as the central dogma of molecular biology.


One property that DNA and RNA have in common is that each is a chain of chemicals known as bases. (4) In the case of DNA, these bases are adenine, cytosine, guanine, and thymine, commonly referred to as A, C, G, and T, respectively. RNA has the same set of four bases, except that instead of thymine, RNA has uracil--commonly referred to as U. …
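The relationship between the two base alphabets can be sketched in a few lines of Python. This is a simplified illustration rather than a model of the cellular machinery: it assumes the input is the gene's coding strand, so the RNA reads the same as the DNA with each T replaced by U. The function name and the example sequence are hypothetical.

```python
DNA_BASES = set("ACGT")

def transcribe(coding_strand):
    """Return the RNA sequence for a DNA coding-strand sequence.

    Simplifying assumption: the input is the coding strand, so the RNA
    has the same bases except that thymine (T) becomes uracil (U).
    """
    seq = coding_strand.upper()
    unknown = set(seq) - DNA_BASES
    if unknown:
        raise ValueError(f"not DNA bases: {sorted(unknown)}")
    return seq.replace("T", "U")

print(transcribe("ATGGCTTAA"))  # AUGGCUUAA
```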
