Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
2 results
Search Results
Source attribution models using random forest for whole genome sequencing data : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics, School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand
(Massey University, 2025-07-14) Smith, Helen
Foodborne diseases, such as campylobacteriosis, represent a significant risk to public health. Preventing the spread of Campylobacter species requires knowledge of sources of human infection. Current methods of source attribution are designed to be used with a small number of genes, such as the seven housekeeping genes of the original multilocus sequence typing (MLST) scheme, and encounter issues when presented with whole genome data. Higher resolution data, however, offers the potential to differentiate within source groups (i.e., between different ruminant species in addition to differentiating between ruminants and poultry), which is poorly achieved with current methods. Random forest is a tree-based machine learning algorithm which is suitable for analysing data sets with large numbers of predictor variables, such as whole genome sequencing data. A known issue with tree-based predictive models occurs when an observation presented for prediction contains levels of a variable that were not present in the set of observations with which the model was trained. This is almost certain to occur with genomic data, which has a potentially ever-growing set of alleles for any single gene. This thesis investigates the use of ordinal encoding of categorical variables to address the ‘absent levels’ problem in random forest models. Firstly, a method of encoding is adapted, based on correspondence analysis (CA) of a class by level contingency table, to be unbiased in the presence of absent levels.
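As an illustration only, the idea of encoding levels from a class by level contingency table can be sketched in a few lines of Python. The table, the level names, and the helper names `ca_level_scores` and `encode` are hypothetical, and the fallback used for an absent level (a score of 0, the mass-weighted mean of the CA axis) is an assumption made for this sketch, not the adaptation developed in the thesis.

```python
import numpy as np

def ca_level_scores(table):
    """First-axis correspondence analysis scores (principal coordinates)
    for the columns (levels) of a class-by-level contingency table.
    Assumes every row and column has a positive total."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row (class) masses
    c = P.sum(axis=0)                    # column (level) masses
    # Standardised residuals from the independence model
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Column principal coordinates on the first axis (sign is arbitrary)
    return Vt[0] / np.sqrt(c) * sv[0]

# Hypothetical training counts: rows are source classes, columns are alleles
levels = ["a1", "a2", "a3"]
table = [[30, 5, 10],    # poultry
         [4, 25, 12]]    # ruminant
scores = dict(zip(levels, ca_level_scores(table)))

def encode(level, scores, fallback=0.0):
    """Return the numeric code for a level. A level absent from training
    has no CA score, so a fallback is used (an assumption of this sketch)."""
    return scores.get(level, fallback)

print(encode("a1", scores))   # code learned from the training table
print(encode("a9", scores))   # absent level, falls back to 0.0
```

Once every level has a numeric code, a tree can split on the encoded variable as if it were ordinal, so an unseen allele is routed by its code instead of having no defined branch.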
Secondly, a new method of encoding is introduced which utilises a set of supplementary information on the category levels themselves (i.e., the sequence information of alleles) and encodes them, as well as any new levels, according to their similarity or dissimilarity to each other via the method of principal coordinates analysis (PCO). Thirdly, based on the method of canonical analysis of principal coordinates (CAP), the encoding information of the levels from the CA on the contingency table is combined with the encoding information of the levels from the PCO on the dissimilarity matrix of the supplementary levels information, with a canonical correlation analysis (CCorA). Potential issues when using out-of-bag (OOB) data following variable encoding are then explored and an adaptation to the holdout variable importance method is introduced which is suitable for use with all methods of encoding. This thesis finishes by applying the CAP method of encoding to a random forest predictive model for source attribution of whole genome sequencing data from the Source Assigned Campylobacteriosis in New Zealand (SACNZ) study. The advantage of adding core genes and accessory genes as predictor variables is investigated, and the attribution results are compared to the results from a previously published study which used the asymmetric island model on the same set of isolates and the seven MLST genes.

Prediction of students' performance through data mining : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, Auckland, New Zealand
(Massey University, 2020) Umer Baloch, Rahila
Government funding to higher education providers is based upon graduate completions rather than on student enrollments.
Therefore, unfinished degrees or delayed degree completions are major concerns for higher education providers since these problems impact their long-term financial security and overall cost-effectiveness. Providers consequently need to develop strategies for improving the quality of their education to ensure increased enrollment and retention rates. This study uses predictive modeling techniques for assisting providers with real-time identification of struggling students in order to improve their course retention rates. Predictive models utilizing student demographic and other behavioral data gathered from an institutional learning platform have been developed to predict whether a student should be classed as at risk of failing a course or not. Identification of at-risk students will help instructors take proactive measures, such as offering students extra help and other timely support. The outcomes of this study will, therefore, provide a safety net for students as well as education providers in improving student engagement and retention rates. The computational approaches adopted in this study include machine learning techniques in combination with educational process mining methods. Results show that multi-purpose predictive models that were designed to operate across a variety of different courses could not be generalized due to the complexity and diversity of the courses. Instead, a meta-learning approach for recommending the best classification algorithms for predicting students’ performance is demonstrated. The study reveals how process-unaware learning platforms that do not accurately reflect ongoing learner interactions can enable the discovery of student learning practices. It holds value in reconsidering predictive modeling techniques by supplementing the analysis with contextually relevant process models that can be extracted from stand-alone activities of process-unaware learning platforms.
This provides a prescriptive approach for conducting empirical research on predictive modeling with educational data sets. The study contributes to the fields of learning analytics and education process mining by providing a distinctive use of predictive modeling techniques that can be effectively applied to real-world data sets.
