An analysis of the missing data methodology for different types of data : a thesis presented in partial fulfilment of the requirements for the degree of Master of Applied Statistics at Massey University
Missing data is an eternal problem in data analysis. It is widely recognised that data is costly to collect, and the methods used to deal with missing data in the past relied on case deletion. There is no one overall best fix, but many different methodologies to use in different situations. This study was motivated by the writer's time spent analysing data in the nutrition study, and realising how much data was wasted by case deletion, and subsequently how this could bias inferences formed from the results. A better method (or methods), of dealing with missing data (than case deletion) is required, to ensure valuable information is not lost. What is being done: What is in the literature? The literature on this topic has exploded with new methods in recent times. Algorithms have been written and incorporated based on these methods into a number of statistical packages and add-on libraries. Statistical packages are also reviewed for their practicality and application in this area. The nutrition data is then applied to different methodologies, and software packages to assess different types of imputation. A set of questions are posed; based on type of data, type of missingness, extent of missingness, the required end use of the data, the size of the dataset, and how extensive that analysis needs to be. This can guide the investigator into using an appropriate form of imputation for the type of data at hand. A comparison of imputation methods and results is given with the principal result that imputing missing data is a very worthwhile exercise to reduce bias in survey results, which can be achieved by any researcher analysing their own data. Further to this, a conjecture is given for using Data Augmentation for ordinal data, particularly Likert scales. Previously this has been restricted to either person or item mean imputation, or hot deck methods. Using model based methods for imputation is far superior for other types of data. Model based methods for Likert data are achieved by means of inserting the linear by linear association model into standard missing data methodology.