A Master of Science Thesis in Electrical Engineering submitted by Omniyah Gul Mohammed, entitled "DNA Base-Calling," May 2010. Both soft and hard copies of the thesis are available.
The human genome sequence, consisting of approximately three billion bases, was decoded in the last decade as a result of the Human Genome Project. Information derived from the genomic sequences of different species is expected to contribute substantially to advances in fields such as medicine, forensics and agriculture. The ability to decipher genetic material is of great importance to researchers trying to improve the diagnosis of genetic diseases, design drugs that target specific genes, detect bacteria that may pollute air or water, and explore species ancestry. The impact of DNA sequencing in these fields has created a need to efficiently automate the mapping of signals obtained from sequencing machines to their corresponding sequence of bases, a process referred to as DNA base-calling. This thesis approaches the base-calling problem through pattern recognition, that is, the classification of raw data into classes based on prior knowledge or statistical information extracted from the data. Two new frameworks are proposed that use Artificial Neural Networks (ANN) and Polynomial Classifiers (PC) to model electropherogram traces. Data is obtained from the Sorenson Molecular Genealogy Foundation (SMGF) and the National Center for Biotechnology Information (NCBI) trace archive. Pre-processing, which includes de-correlation, de-convolution and normalization, is applied to minimize or eliminate data imperfections that are primarily attributed to the nature of the chemical reactions involved in DNA sequencing. Discriminative features that characterize chromatogram traces are subsequently extracted and passed to the classifiers, which assign each event to its respective class: A, C, T or G. The models are trained such that they are not restricted to the type of organism the chromatogram belongs to or to the chemistry involved in obtaining the chromatogram.
The base-calling accuracy achieved is compared with the existing standards, PHRED and the ABI KB base-caller, in terms of deletion, insertion and substitution errors. Experimental evidence indicates that the implemented models achieve a higher base-calling accuracy than PHRED and a performance comparable to that of the ABI KB base-caller. The results obtained demonstrate the potential of the proposed models for efficient and accurate DNA base-calling.
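
The abstract does not state how the deletion, insertion and substitution errors are counted. As an illustration only, the sketch below assumes a standard unit-cost edit-distance alignment between a reference sequence and a called sequence; the function name count_base_call_errors is hypothetical and is not taken from the thesis, PHRED or the ABI KB base-caller.

```python
def count_base_call_errors(reference, called):
    """Count substitution, insertion and deletion errors in `called`
    relative to `reference`, using a unit-cost edit-distance alignment.

    Assumption: all three error types cost 1. When several optimal
    alignments exist, the split between error types may differ, but the
    total always equals the edit distance.
    """
    n, m = len(reference), len(called)

    # dist[i][j] = edit distance between reference[:i] and called[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # every reference base is missing: i deletions
    for j in range(1, m + 1):
        dist[0][j] = j  # every called base is extra: j insertions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if reference[i - 1] == called[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )

    # Trace back along one optimal alignment and tally the error types.
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if reference[i - 1] == called[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + mismatch:
                counts["substitutions"] += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1  # reference base absent from the call
            i -= 1
        else:
            counts["insertions"] += 1  # called base absent from the reference
            j -= 1
    return counts


if __name__ == "__main__":
    # One wrong base (C called as G) at the second position.
    print(count_base_call_errors("ACGT", "AGGT"))
    # {'substitutions': 1, 'insertions': 0, 'deletions': 0}
```

Dividing the total number of errors by the reference length would give one possible error rate; whether the thesis normalizes its counts this way is not stated in the abstract.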