Primary homology in DNA and protein sequence has long been used to infer a
relationship between similar sequences. However, gene sequence, and thus protein
sequence, can change over time. In evolutionary biology that time can be millions of
years, and related sequences may become unrecognisable via primary homology. This is
demonstrated most effectively in chapter 4a (figure 10). Conversely, the number of
folds that proteins can adopt is limited by the attractions between residues and is
therefore finite. This means that similar folds may arise via convergence between
evolutionarily unrelated DNA sequences.
This thesis examines a process that wrings more information from the primary
protein sequence than is usually used and finds other factors that can support or
refute the placement of a protein sequence within the family in question. Two quite
different proteins are used to demonstrate the flexibility of the workflow: the Major
Vault Protein, whose monomers make up the enigmatic vault particle, and the argonaute
family of proteins (AGO and PIWI), which appear to have a major hand in quelling
parasitic nucleic acid and controlling endogenous gene expression.
Principally, the method relies on prediction of three-dimensional structure. This
requires at least a partially solved crystal structure, but once one exists the method
should be suitable for any protein. Whole genome sequencing is now routine practice,
but annotation of the resultant sequence lags behind for lack of skilled personnel.
Automated pipelines do a good job of annotating close homologs, but more effort
is needed for correct annotation of the exponentially growing data bank of
uncharacterised (and wrongly characterised) proteins. Lastly, in deference to budding
biologists the world over, I have tried to find free, stable software that can be used
on an ordinary personal computer by a researcher with minimal computer literacy to
help with this task.