Motivation: Analyzing gene expression data in its chromosomal context is a recent development in cancer research. However, currently available methods fail to account for variation in the distance between genes, gene density and genomic features (e.g. GC content) when identifying chromosomal regions of increased or decreased gene expression.
Results: We have developed a model-based scan statistic that accounts for these aspects of the complex landscape of the human genome when identifying chromosomal regions of extreme gene expression. The method can be applied to gene expression data regardless of the microarray platform used to generate it. To demonstrate its accuracy and utility, we applied it to a breast cancer gene expression dataset and tested its ability to predict regions containing medium-to-high level DNA amplification (DNA ratio values >2). A classifier developed from the scan statistic results had a 10-fold cross-validated classification rate of 93% and a positive predictive value of 88%. This result strongly suggests that the model-based scan statistic, together with the expression characteristics of a chromosomal region of increased gene expression, can be used to accurately predict chromosomal regions containing amplified genes.
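For readers unfamiliar with the two reported metrics, the following minimal sketch shows how a classification rate and a positive predictive value can be computed from predicted and observed amplification calls. It is an illustration only and is not the authors' R code: the function name `classification_metrics` and the toy values in `dna_ratio` and `predicted` are hypothetical, and the only detail taken from the text is the amplification threshold (DNA ratio >2).

```python
import math


def classification_metrics(predicted, actual):
    """Return (classification rate, positive predictive value) for boolean calls."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    correct = sum(1 for p, a in pairs if p == a)
    rate = correct / len(pairs)
    # PPV is undefined when no region is predicted positive.
    called = true_pos + false_pos
    ppv = true_pos / called if called else math.nan
    return rate, ppv


# Hypothetical toy data: one DNA ratio and one prediction per chromosomal region.
dna_ratio = [3.1, 1.2, 0.9, 1.8, 2.5]
actual = [r > 2 for r in dna_ratio]  # amplified if DNA ratio > 2, as in the text
predicted = [True, True, False, False, True]  # made-up classifier output

rate, ppv = classification_metrics(predicted, actual)
print(f"classification rate = {rate:.2f}, PPV = {ppv:.2f}")
```

In a 10-fold cross-validation, the same quantities would be computed on the held-out regions of each fold and then combined; the fold assignment is omitted here because the text does not describe it.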
Availability: R functions implementing the method are available from the author upon request.
Contact: fcouples@umich.edu.