Lineage specific evolution and phylogenetic analysis : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biomathematics at Massey University, Palmerston North, New Zealand
Phylogenetic models generally assume a homogeneous, time-reversible, stationary
process. These assumptions are often violated by the real, far more complex,
evolutionary process. This thesis is centered on non-homogeneous, lineage-specific,
properties of molecular sequences. It consists of several related but independent studies.
LineageSpecificSeqgen, an extension to the Seq-Gen program, which allows generation
of sequences with changes in the proportion of variable sites, is introduced. This
program is then used in a simulation study showing that changes in the proportion of
variable sites can reduce tree estimation accuracy, and that tree reconstruction under the
best-fit model chosen using a relative test can yield an incorrect tree. In this case,
absolute model fit, which is less commonly used, was a better predictor of tree estimation
accuracy. This study found that increased taxon sampling of lineages that have
undergone a change in the proportion of variable sites was critical for accurate tree
reconstruction and that, in contrast to some earlier findings, the accuracy of maximum
parsimony is adversely affected by such changes.
This thesis also addresses the well-known long-branch attraction artifact. A nonparametric
bootstrap test to identify changes in the substitution process is introduced,
validated, and applied to the case of Microsporidia, a highly reduced intracellular
parasite. Microsporidia was first thought to be an early-branching eukaryote, but is now
believed to be sister to, or included within, fungi. Its apparent basal eukaryote position
is considered a result of long-branch attraction due to an elevated evolutionary rate in
the microsporidian lineage. This study shows that long-branch estimates and basal
positioning of Microsporidia both correlate with increased proportions of radical
substitutions in the microsporidian lineage. In simulated data, such increased
proportions of radical substitutions lead to erroneous long-branch estimates. These
results suggest that the long microsporidian branch is likely to be a result of an
increased proportion of radical substitutions on that branch, rather than increased
evolutionary rate per se.
The focus of the last study is the intriguing case of Mesostigma, a freshwater green alga
for which contradictory phylogenetic relationships have been inferred. While some studies
placed Mesostigma within the Streptophyta lineage (which includes land plants), others
placed it as the deepest green algal divergence. This basal positioning is regarded as a
result of long-branch attraction due to poor taxon sampling. Reinvestigation of a 13-
taxon mitochondrial amino acid dataset and a sub-dataset of 8 taxa reveals that site
sampling, and in particular the treatment of missing data, is just as important a factor for
accurate tree reconstruction as taxon sampling. This study identifies a difficulty in
recreating the long-branch attraction observed for the 8-taxon dataset in simulated data.
The cause is likely to be the smaller number of amino acid characters per site in
simulated data compared to real data, highlighting the fact that there are properties of
the evolutionary process that are yet to be accurately modeled.
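
As an illustration of the quantity compared above, the number of amino acid characters per site can be read as the count of distinct amino acids observed in each alignment column. The following is a minimal sketch under that reading, not the procedure used in the thesis: the function name states_per_site, the toy alignment, and the choice to ignore "-", "?" and "X" as gap or missing-data symbols are all assumptions made for the example.

```python
from statistics import mean

# Assumed symbols for gaps and missing data; these are excluded from the counts.
MISSING = set("-?X")


def states_per_site(alignment):
    """Return the number of distinct amino acids observed in each column.

    `alignment` is a list of equal-length strings, one per taxon.
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    counts = []
    for column in zip(*alignment):
        observed = {aa.upper() for aa in column} - MISSING
        counts.append(len(observed))
    return counts


# Hypothetical toy alignment (4 taxa, 6 sites), for illustration only.
toy_alignment = [
    "MKV-LA",
    "MRVQLA",
    "MKIQ?S",
    "MKVQLT",
]

per_site = states_per_site(toy_alignment)
mean_states = mean(per_site)
print(per_site, round(mean_states, 3))
```

Applying the same function to a real alignment and to alignments simulated on the inferred tree would give one simple way to compare the two distributions. How missing data are handled changes the counts, so that choice should be stated alongside any such comparison.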