Computational analysis of molecular sequence data is a key component in solving three critical biological problems of the 21st century: how genes interact to produce living cells, how gene malfunction causes disease, and how complex, multicellular organisms evolved from simple, unicellular organisms. The late 20th-century revolution in DNA sequencing technology has produced an exponentially growing wealth of data that can be used to answer these and similar questions. The World Wide Web has made it possible to integrate data from diverse, distant sources and make it universally available. The field of computational molecular biology was born at the confluence of these two revolutions.

In my research in computational molecular biology, I study the role of gene duplication in the acquisition of new gene function and the evolution of vertebrate genomes. (A genome is the complete set of genes in an organism.) New genes arise through gene duplications, errors during cell division that result in extra copies of genes. These extra copies subsequently mutate to take on new functional roles in the cell. The duplication of large regions, ranging from chromosomal segments to the entire genome, is believed to have played a crucial role in early vertebrate evolution. According to this hypothesis, the new genes that resulted from these massive duplications are responsible for the evolution of innovations, such as skeletal structure, limbs, and a complex central nervous system, that distinguish vertebrates from other life forms. If we can understand how these genes acquired new functions following duplication, we will have a better understanding of how we evolved and of the role those genes play in vertebrates living today.

This project contains a number of interesting open problems in both information retrieval and algorithms.
To understand the evolution of new function in duplicated genes, sequence data must be combined with other types of biological data. This leads to problems in web-based data management and retrieval, including data mining, analysis, and visualization of large biological datasets that are diverse, distributed, and noisy. Algorithmic and combinatorial problems arise in reconstructing the history of genomic duplications and rearrangements that led to the modern genome. For example, while individual duplicate genes can be found using algorithms based on approximate string matching, the problems of identifying entire sets of genes that were duplicated simultaneously and estimating the age of these large-scale duplications remain open.

I welcome discussions with computer science students and computer scientists who are interested in either the algorithmic or the information retrieval aspects of computational molecular biology.
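
For readers from computer science, the approximate string matching mentioned above can be illustrated with a small sketch. It is a toy example, not the method used in this research: it computes unit-cost edit distance between short DNA strings and flags pairs that differ by at most a chosen fraction of their length. The function names, the threshold `max_fraction`, and the sample sequences are all hypothetical, and real analyses typically rely on scored sequence alignment rather than plain edit distance.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def candidate_duplicates(genes, max_fraction=0.2):
    """Return name pairs whose sequences differ by at most max_fraction
    of the longer sequence's length (a hypothetical similarity cutoff)."""
    pairs = []
    for (name1, seq1), (name2, seq2) in combinations(genes.items(), 2):
        longest = max(len(seq1), len(seq2))
        if longest and edit_distance(seq1, seq2) / longest <= max_fraction:
            pairs.append((name1, name2))
    return pairs


# Made-up sequences for illustration only.
genes = {
    "geneA": "ACGTACGTAC",
    "geneB": "ACGTACGTAA",
    "geneC": "TTTTTTTTTT",
}
print(candidate_duplicates(genes))
```

A pairwise comparison like this only finds individual similar genes. It says nothing about which genes were duplicated together or when, which is the open problem described above.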