An information theoretic approach to language relatedness : a dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University
This dissertation examines the prospect of applying information theoretic principles to help solve problems in historical linguistics. The Minimum Message Length principle attributed to Chris Wallace (similar to the Minimum Description Length principle of Jorma Rissanen) is used to judge the goodness of hypotheses in the field of historical linguistics. The idea is that a hypothesis which, together with its data, can be described in a shorter message is better than one that requires a longer message. The collection of linguistic data tracing the derivation of some 2714 words in Modern Cantonese and Modern Beijing from their forms in a reconstruction of Middle Chinese is described, as is the transformation of this data into a format suitable for use with software developed for this project. Heuristics for inferring Probabilistic Finite State Automata (PFSA) from such data are reviewed and some new heuristics are introduced. (In this dissertation, the abbreviation PFSA denotes both the singular and the plural of these machines, the "A" being understood to represent both Automaton and Automata.) These heuristics are then applied to training data and benchmark results are presented. Finally, the inference process is applied to the actual linguistic data, which allows a conjecture to be formed regarding the relative closeness of the Chinese dialects to their reconstructed ancestor.
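The message-length comparison described above can be illustrated with a minimal Python sketch. It is not the software developed for this project: the PFSA representation, the function names, the toy words and the flat cost of 8 bits per transition are all assumptions made for illustration, and the model cost in particular is a crude stand-in for a proper Minimum Message Length encoding of a PFSA.

```python
import math

def data_bits(pfsa, start, words):
    """Bits needed to encode the words given the PFSA:
    the sum of -log2(p) over every transition taken."""
    total = 0.0
    for word in words:
        state = start
        for symbol in word:
            next_state, prob = pfsa[state][symbol]
            total += -math.log2(prob)
            state = next_state
    return total

def model_bits(pfsa, bits_per_transition=8.0):
    """Assumed model cost: a flat number of bits per transition.
    This is a simplification, not Wallace's MML coding scheme."""
    n_transitions = sum(len(arcs) for arcs in pfsa.values())
    return n_transitions * bits_per_transition

def message_length(pfsa, start, words, bits_per_transition=8.0):
    """Two-part message: describe the hypothesis, then the data given it."""
    return model_bits(pfsa, bits_per_transition) + data_bits(pfsa, start, words)

# Hypothetical PFSA as {state: {symbol: (next_state, probability)}}.
# "#" is an assumed end-of-word marker.
one_state = {0: {"a": (0, 1 / 3), "b": (0, 1 / 3), "#": (0, 1 / 3)}}
structured = {
    0: {"a": (1, 0.75), "b": (2, 0.25)},
    1: {"b": (3, 1.0)},
    2: {"a": (3, 1.0)},
    3: {"#": (0, 1.0)},
}

many_words = ["ab#"] * 15 + ["ba#"] * 5   # toy data, not the Chinese word lists
few_words = ["ab#", "ba#"]

for label, words in (("many", many_words), ("few", few_words)):
    for name, pfsa in (("one_state", one_state), ("structured", structured)):
        print(label, name, round(message_length(pfsa, 0, words), 2))
```

Under these assumptions the larger machine gives the shorter total message once there is enough data to repay its extra transitions, while the one-state machine is preferred for the two-word sample. This is the trade-off between hypothesis complexity and fit to the data that the principle is meant to capture.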