Distributions on bicoloured evolutionary trees : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Mathematics at Massey University

Massey University
The Author
A central and challenging problem in contemporary biology is how to accurately reconstruct evolutionary trees from DNA sequence data. This thesis addresses three themes from this endeavour -- comparison, consistency and confidence intervals -- by analysing distributions arising from phylogenetic trees. Toward the first theme, the distribution of the symmetric difference metric on pairs of binary and phylogenetic trees is studied, and a number of new results are obtained. These theorems, as well as a result on another tree metric, answer previous conjectures in this area. Also under the theme of comparison, we analyse distributions on bicoloured trees arising from the principle of parsimony. A streamlined proof is given of an elegant theorem which allows an efficient comparison of how much better a maximum parsimony tree fits given data than a randomly-chosen tree. A dual distribution, where the tree is fixed and the data varies, is also analysed, answering a recent unsolved problem. We then consider the theoretical accuracy of tree-building methods, concentrating on the statistical property of consistency. Under a simple stochastic model on bicoloured trees, conditions for the consistency of frequently-used methods based on parsimony and compatibility are examined. It is shown that even in "best possible" conditions both methods can be inconsistent, though a strong sufficient condition for compatibility is given. The analysis is extended for a molecular clock. Finally, procedures are described for placing confidence intervals around phylogenies, and limitations on the sort of confidence intervals possible are given. Ways to efficiently implement these procedures are then considered -- in particular, approximate methods, applications to sets of taxa of size four, and simplifications under a molecular clock. The rate at which sequence data must grow as a function of the number of taxa for confidence intervals to converge to a single tree is also considered.
The arguments in this thesis are primarily combinatorial and stochastic. In the hope that their implications will also interest biologists, some space has been given to motivating and explaining the biological relevance of the results presented.
Evolutionary trees, Phylogenetic trees, Phylogenetics