Computer Science Thesis Proposal

Wednesday, May 31, 2017 - 2:00pm


8102 Gates Hillman Centers


HONGYI XIN, Ph.D. Student

DNA read mapping is an important problem in bioinformatics. With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of many medical and genetic applications critically depends on computational methods that process this enormous amount of sequence data quickly and accurately. However, due to the repetitive nature of the human genome and the limitations of sequencing technology, current read mapping methods still fall short of achieving both high performance and high sensitivity.

In this proposal, I break down the DNA read mapping problem into four subproblems: intelligent seed extraction, efficient filtration of incorrect seed locations, high-performance extension, and accurate and efficient read cloud mapping. I provide novel computational techniques for each subproblem, including: 1) a seed selection algorithm that optimally divides a read into low-frequency seeds; 2) a SIMD-friendly bit-parallel filtering algorithm that quickly estimates whether two strings are highly similar; 3) a generalization of a state-of-the-art approximate string matching algorithm that measures genetic similarity with more realistic metrics; and 4) a mapping strategy that uses the characteristics of a new sequencing technology, read cloud sequencing, to map NGS reads with higher accuracy and higher efficiency.

Thesis Committee:
Carl Kingsford (Chair)
Jian Ma
Phil Gibbons
Iman Hajirasouliha (Weill Cornell Medical College)
Bill Bolosky (Microsoft Research)
