Academic journal article Environmental Health Perspectives

Using Decision Forest to Classify Prostate Cancer Samples on the Basis of SELDI-TOF MS Data: Assessing Chance Correlation and Prediction Confidence

Academic journal article Environmental Health Perspectives

Using Decision Forest to Classify Prostate Cancer Samples on the Basis of SELDI-TOF MS Data: Assessing Chance Correlation and Prediction Confidence

Article excerpt

Class prediction using "omics" data is playing an increasing role in toxicogenomics, diagnosis/prognosis, and risk assessment. These data are usually noisy and represented by relatively few samples and a very large number of predictor variables (e.g., genes of DNA microarray data or m/z peaks of mass spectrometry data). These characteristics manifest the importance of assessing potential random correlation and overfitting of noise for a classification model based on omics data. We present a novel classification method, decision forest (DF), for class prediction using omics data. DF combines the results of multiple heterogeneous but comparable decision tree (DT) models to produce a consensus prediction. The method is less prone to overfitting of noise and chance correlation. A DF model was developed to predict presence of prostate cancer using a proteomic data set generated from surface-enhanced laser deposition/ ionization time-of-flight mass spectrometry (SELDI-TOF MS). The degree of chance correlation and prediction confidence of the model was rigorously assessed by extensive cross-validation and randomization testing. Comparison of model prediction with imposed random correlation demonstrated biologic relevance of the model and the reduction of overfitting in DF. Furthermore, two confidence levels (high and low confidences) were assigned to each prediction, where most misclassifications were associated with the low-confidence region. For the high-confidence prediction, the model achieved 99.2% sensitivity and 98.2% specificity. The model also identified a list of significant peaks that could be useful for biomarker identification. DF should be equally applicable to other omics data such as gene expression data or metabolomic data. The DF algorithm is available upon request. Key words: bioinformatics, chance correlation, class prediction, classification, decision forest, prediction confidence, prostate cancer, proteomics, SELDI-TOF. Environ Health Perspect 112:1622-1627 (2004). doi: 10.1289/txg.7109 available via http://dx.doi.org/ [Online 5 August 2004]

**********

Recent technologic advances in the fields of "omics," including toxicogenomics, hold great promise for the understanding of the molecular basis of health and disease, and toxicity. Prospective further advances could significantly enhance our capability to study toxicology and improve clinical protocols for early detection of various types of cancer, disease states, and treatment outcomes. Classification methods, because of their power to unravel patterns in biologically complex data, have become one of the most important bioinformatics approaches investigated for use with omics data. Classification uses supervised learning techniques (Tong et ah 2003b) to fit the samples into the predefined categories based on patterns of omics profiles or predictor variables (e.g., gene expressions in DNA microarray). The fitted model is then validated using either a cross-validation method or an external test set. Once validated, the model could be used for prediction of unknown samples.

A number of classification methods have been applied to microarray gene expression data (Ben-Dor et al. 2000; Simon et al. 2003; Slonim 2002), including artificial neural networks (Khan et al. 2001), K-nearest neighbor (Olshen and Jain 2002), Decision Tree (DT; Zhang et al. 2001), and support vector machines (SVMs; Brown et al. 2000). Some of the same methods have been applied similarly to proteomic data generated from surface-enhanced laser deposition/ionization time-of-flight mass spectrometry (SELDI-TOF MS) for molecular diagnostics (Adam et ah 2002; Bali et al. 2002). For example, Petricoin et al (2002a, 2002b) developed classification models for early detection of ovarian and prostate cancers (PCAs) on the basis of SELDI-TOF MS data using a genetic algorithm-based SVM.

Omics data present challenges for most classification methods because a) the number of predictor variables normally far exceeds the sample size and b) most data are unfortunately very noisy. …

Search by... Author
Show... All Results Primary Sources Peer-reviewed

Oops!

An unknown error has occurred. Please click the button below to reload the page. If the problem persists, please try again in a little while.