As DNA databases continue to grow at an exponential rate, the need for more efficient solutions to basic problems in computational biology grows ever more pressing.
These problems range from the principal questions driving evolutionary science:
How can we accurately infer the history of genes, individuals and species? How
can we separate the signal from the noise in our data? How can we visualise that
signal? - to the purely practical: How can we efficiently store all this data? With
these goals in mind, this thesis mounts a computational combination attack on a
variety of topics in bioinformatics and phylogenetics:
A program is designed and implemented for solving the Maximum Parsimony
problem - in essence, finding phylogenetic trees having the fewest mutations.
This program generally outperforms existing highly optimised programs when
using a single CPU, and unlike these earlier programs, offers highly efficient
parallelisation across multiple CPUs for further speedup.;
A program is designed and implemented for compressing databases of DNA
sequences. This program outperforms general-purpose compression by taking
advantage of the special "treelike" structure of DNA databases, using a novel
data structure, the "leaky move-to-front hashtable", to achieve speed gains.;
A data visualisation technique is introduced that concisely summarises the
"treelikeness" of phylogenetic datasets on a ternary plot. Each dataset is represented by a single point, allowing multiple datasets, or multiple treatments
of a dataset, to be displayed on a single diagram.;
We demonstrate problems with a standard phylogenetic analysis methodology
in which a single tree is assumed a priori. We argue for a shift towards network
methods that can in principle reject the hypothesis of a single tree.;
Motivated by a phylogenetic problem, a fast new algorithm is developed for finding the mode(s) of a multinomial distribution, and an exact analysis of its
complexity is given.