Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 10 of 12
  • Item
    Source attribution models using random forest for whole genome sequencing data : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics, School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand
    (Massey University, 2025-07-14) Smith, Helen
    Foodborne diseases, such as campylobacteriosis, represent a significant risk to public health. Preventing the spread of Campylobacter species requires knowledge of sources of human infection. Current methods of source attribution are designed to be used with a small number of genes, such as the seven housekeeping genes of the original multilocus sequence typing (MLST) scheme, and encounter issues when presented with whole genome data. Higher resolution data, however, offers the potential to differentiate within source groups (i.e., between different ruminant species in addition to differentiating between ruminants and poultry), which is poorly achieved with current methods. Random forest is a tree-based machine learning algorithm which is suitable for analysing data sets with large numbers of predictor variables, such as whole genome sequencing data. A known issue with tree-based predictive models occurs when new levels of a variable are present in an observation for prediction which were not present in the set of observations with which the model was trained. This is almost certain to occur with genomic data, which has a potentially ever-growing set of alleles for any single gene. This thesis investigates the use of ordinal encoding of categorical variables to address the ‘absent levels’ problem in random forest models. Firstly, a method of encoding is adapted, based on correspondence analysis (CA) of a class-by-level contingency table, to be unbiased in the presence of absent levels. Secondly, a new method of encoding is introduced which utilises a set of supplementary information on the category levels themselves (i.e., the sequence information of alleles) and encodes them, as well as any new levels, according to their similarity or dissimilarity to each other via the method of principal coordinates analysis (PCO). Thirdly, based on the method of canonical analysis of principal coordinates (CAP), the encoding information of the levels from the CA on the contingency table is combined with the encoding information of the levels from the PCO on the dissimilarity matrix of the supplementary levels information, with a canonical correlation analysis (CCorA). Potential issues when using out-of-bag (OOB) data following variable encoding are then explored and an adaptation to the holdout variable importance method is introduced which is suitable for use with all methods of encoding. This thesis finishes by applying the CAP method of encoding to a random forest predictive model for source attribution of whole genome sequencing data from the Source Assigned Campylobacteriosis in New Zealand (SACNZ) study. The advantage of adding core genes and accessory genes as predictor variables is investigated, and the attribution results are compared to the results from a previously published study which used the asymmetric island model on the same set of isolates and the seven MLST genes.
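    As a rough illustration of the first encoding idea (not the author's implementation), the sketch below scores each allele by its coordinate on the first axis of a correspondence analysis of the class-by-level contingency table, maps alleles unseen in training (the 'absent levels') to a neutral score of zero (an assumption made here purely for illustration), and fits a random forest on the encoded predictor.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def ca_level_scores(labels, levels):
    """First-axis correspondence-analysis scores for the levels of a
    categorical predictor, from the class-by-level contingency table."""
    table = pd.crosstab(labels, levels).astype(float)
    P = table.values / table.values.sum()
    r = P.sum(axis=1)                         # row (class) masses
    c = P.sum(axis=0)                         # column (level) masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    # principal coordinates of the levels on the first CA axis
    scores = Vt[0] * sing[0] / np.sqrt(c)
    return dict(zip(table.columns, scores))

# toy data: one categorical 'allele' predictor and a source class label
rng = np.random.default_rng(0)
train_alleles = rng.choice(["a1", "a2", "a3", "a4"], size=200)
train_class = np.where(np.isin(train_alleles, ["a1", "a2"]),
                       "poultry", "ruminant")

scores = ca_level_scores(train_class, train_alleles)
X_train = np.array([[scores[a]] for a in train_alleles])

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, train_class)

# prediction data contains an absent level "a5"; map it to a neutral
# score instead of failing, so the forest can still produce a prediction
test_alleles = ["a1", "a5", "a3"]
X_test = np.array([[scores.get(a, 0.0)] for a in test_alleles])
print(rf.predict(X_test))
```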
  • Item
    Prediction of students' performance through data mining : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, Auckland, New Zealand
    (Massey University, 2020) Umer Baloch, Rahila
    Government funding to higher education providers is based upon graduate completions rather than on student enrollments. Therefore, unfinished degrees or delayed degree completions are major concerns for higher education providers since these problems impact their long-term financial security and overall cost-effectiveness. Providers therefore need to develop strategies for improving the quality of their education to ensure increased enrollment and retention rates. This study uses predictive modeling techniques for assisting providers with real-time identification of struggling students in order to improve their course retention rates. Predictive models utilizing student demographic and other behavioral data gathered from an institutional learning platform have been developed to predict whether or not a student should be classed as at risk of failing a course. Identification of at-risk students will help instructors take proactive measures, such as offering students extra help and other timely support. The outcomes of this study will, therefore, provide a safety net for students as well as education providers in improving student engagement and retention rates. The computational approaches adopted in this study include machine learning techniques in combination with educational process mining methods. Results show that multi-purpose predictive models that were designed to operate across a variety of different courses could not be generalized due to the complexity and diversity of the courses. Instead, a meta-learning approach for recommending the best classification algorithms for predicting students’ performance is demonstrated. The study reveals how data from process-unaware learning platforms, which do not accurately reflect ongoing learner interactions, can still enable the discovery of student learning practices. It holds value in reconsidering predictive modeling techniques by supplementing the analysis with contextually-relevant process models that can be extracted from stand-alone activities of process-unaware learning platforms. This provides a prescriptive approach for conducting empirical research on predictive modeling with educational data sets. The study contributes to the fields of learning analytics and education process mining by providing a distinctive use of predictive modeling techniques that can be effectively applied to real-world data sets.
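    A minimal sketch of the basic at-risk prediction setup described above, using invented feature names and synthetic data rather than the institutional data used in the thesis; the thesis itself goes further, building per-course models and a meta-learning recommender for classification algorithms.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# illustrative features only: the actual study used institutional
# demographic and learning-platform activity data not reproduced here
rng = np.random.default_rng(1)
n = 500
students = pd.DataFrame({
    "logins_per_week": rng.poisson(5, n),
    "forum_posts": rng.poisson(2, n),
    "assignment_submissions": rng.integers(0, 6, n),
    "prior_gpa": rng.normal(2.8, 0.6, n),
})
# synthetic target: lower engagement -> higher chance of being at risk
risk_score = (-0.3 * students["logins_per_week"]
              - 0.8 * students["assignment_submissions"]
              - 0.5 * students["prior_gpa"]
              + rng.normal(0, 1, n))
students["at_risk"] = (risk_score > risk_score.quantile(0.75)).astype(int)

X, y = students.drop(columns="at_risk"), students["at_risk"]
model = GradientBoostingClassifier(random_state=0)
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())
```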
  • Item
    Mining complex trees for hidden fruit : a graph–based computational solution to detect latent criminal networks : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Technology at Massey University, Albany, New Zealand.
    (Massey University, 2019) Robinson, David
    The detection of crime is a complex and difficult endeavour. Public and private organisations – focusing on law enforcement, intelligence, and compliance – commonly apply the rational isolated actor approach premised on observability and materiality. This is manifested largely as conducting entity-level risk management sourcing ‘leads’ from reactive covert human intelligence sources and/or proactive sources by applying simple rules-based models. Focusing on discrete observable and material actors simply ignores that criminal activity exists within a complex system deriving its fundamental structural fabric from the complex interactions between actors, with the least observable actors likely to be both criminally proficient and influential. The graph-based computational solution developed to detect latent criminal networks is a response to the inadequacy of the rational isolated actor approach that ignores the connectedness and complexity of criminality. The core computational solution, written in the R language, consists of novel entity resolution, link discovery, and knowledge discovery technology. Entity resolution enables the fusion of multiple datasets with high accuracy (mean F-measure of 0.986 versus competitors' 0.872), generating a graph-based expressive view of the problem. Link discovery comprises link prediction and link inference, enabling the high-performance detection (accuracy of ~0.8 versus ~0.45 for relevant published models) of unobserved relationships such as identity fraud. Knowledge discovery uses the fused graph generated and applies the “GraphExtract” algorithm to create a set of subgraphs representing latent functional criminal groups, and a mesoscopic graph representing how this set of criminal groups is interconnected. Latent knowledge is generated from a range of metrics including the “Super-broker” metric and attitude prediction. The computational solution has been evaluated on a range of datasets that mimic an applied setting, demonstrating a scalable (tested on ~18 million node graphs) and performant (~33 hours runtime on a non-distributed platform) solution that successfully detects relevant latent functional criminal groups in around 90% of cases sampled and enables the contextual understanding of the broader criminal system through the mesoscopic graph and associated metadata. The augmented data assets generated provide a multi-perspective systems view of criminal activity, enabling advanced, informed decision making across the microscopic, mesoscopic and macroscopic spectrum.
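    The GraphExtract algorithm and the Super-broker metric are not reproduced here; as a loose stand-in for the link discovery and latent-group extraction steps, the sketch below uses networkx's generic Jaccard-coefficient link prediction, modularity-based community detection, and a quotient graph for a mesoscopic view, all on a toy graph.
```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# toy entity graph standing in for the fused, entity-resolved data
G = nx.karate_club_graph()

# link discovery stand-in: score unobserved pairs by Jaccard coefficient
candidates = sorted(nx.jaccard_coefficient(G),
                    key=lambda t: t[2], reverse=True)[:5]
print("top predicted links:", [(u, v, round(s, 2)) for u, v, s in candidates])

# latent-group extraction stand-in: modularity-based communities,
# then a mesoscopic view with one node per detected group
communities = [set(c) for c in greedy_modularity_communities(G)]
mesoscopic = nx.quotient_graph(G, communities, relabel=True)
print("group sizes:", [len(c) for c in communities])
print("mesoscopic edges:", mesoscopic.number_of_edges())
```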
  • Item
    Learning object metadata interchange mechanism : a thesis presented in partial fulfillment of the requirements for the degree of Master of Information Science at Massey University, Palmerston North, New Zealand
    (Massey University, 2005) Zhang, Yuejun
    In spite of the current lack of conceptual clarity in the multiple definitions and uses, the term learning objects is still frequently used in content creation and aggregation in the online-learning field. In the meantime, considerable efforts have been initiated in the past few years for the standardization of metadata elements for consistent description of learning objects, so that learning objects can be identified, searched and retrieved effectively and efficiently across multiple contexts. However, there are currently a large number of standardization bodies and an even larger number of ongoing standards initiatives in the learning field, and different learning object repositories are likely to apply different metadata schemas to meet the specific needs of their intended communities. An interchange mechanism for the conversion between various metadata schemas, therefore, becomes necessary for intensive interoperability. In this thesis, we first make a brief introduction to the concept of learning objects and then the term metadata, followed by a description of the functional requirements of learning objects, the purposes of metadata, and the importance of metadata for learning objects. After that, this thesis investigates metadata schemas in various fields in general, focusing on several mainstream metadata specifications developed for learning objects in particular. The differences among these metadata schemas for learning objects are analyzed and a mapping between their elements is identified. On the basis of the literature review, a framework for interchange of metadata schemas is proposed and a prototype to demonstrate the functionalities of the framework is developed. To achieve high scalability and accuracy in the developed system, a so-called LOM-intermediated approach is suggested, and a so-called dynamic-database methodology is adopted. The LOM-intermediated approach significantly simplifies the metadata mapping issues by undertaking the schema-to-schema mapping as a schema-LOM-schema mapping, while the dynamic-database methodology effectively prevents any data loss arising as a by-product of the LOM-intermediated approach. The prototype currently generates and outputs XML metadata in IMS, EdNA, Dublin Core and LOM. It is a web-based three-tier architecture, using Java technologies for implementation, MySQL as the database server and JDBC for database access.
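    A toy sketch of the schema-LOM-schema idea using flat dictionaries; the element names are heavily simplified assumptions and bear only a loose resemblance to the real Dublin Core, LOM and IMS bindings (the prototype itself works with XML and a database, not in-memory dictionaries).
```python
# minimal sketch of the schema -> LOM -> schema mapping idea; real
# LOM / Dublin Core / IMS bindings are far richer than shown here
DC_TO_LOM = {
    "title": "general.title",
    "creator": "lifecycle.contribute.entity",
    "description": "general.description",
}
LOM_TO_IMS = {
    "general.title": "imsmd:title",
    "lifecycle.contribute.entity": "imsmd:contribute",
    "general.description": "imsmd:description",
}

def dc_to_ims(record: dict) -> dict:
    """Convert a Dublin Core record to IMS via the LOM intermediary."""
    lom = {DC_TO_LOM[k]: v for k, v in record.items() if k in DC_TO_LOM}
    return {LOM_TO_IMS[k]: v for k, v in lom.items()}

print(dc_to_ims({"title": "Intro to Statistics",
                 "creator": "Massey University",
                 "description": "A first course in statistics."}))
```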
  • Item
    Using students' participation data to understand their impact on students' course outcomes : a thesis presented in partial fulfilment of the requirements for the MPhil degree at Massey University, Albany, New Zealand, Master of Philosophy degree in Information Technology
    (Massey University, 2016) Esnaashari, Shadi
    Many students with diverse needs are enrolled in university courses. Not all of these students succeed in completing their courses. Faculty members are keen to identify students who are at risk of failing their courses early enough to help them by providing timely feedback so that students can meet the requirements of their courses. There are many studies using educational data mining algorithms which aim to identify at-risk students by predicting students’ course outcomes, for example, from their forum activities, content requests, and time spent online. This study addresses this issue by clustering the students’ course outcomes using students’ class participation data which can be obtained from various online education technological solutions. Using data mining in educational systems as an analytical tool offers researchers new opportunities to trace students’ digital footprints in various course-related activities and analyse students’ traced data to help the students in their learning processes and teachers in their educational practices. In this study the focus is not only on finding at-risk students but also on using data to improve the learning process and support personalized learning. In-class participation data was collected through audience participation tools, while out-of-class participation data was collected from Stream and combined with the qualitative and quantitative data from questionnaires. The participation data were collected from 5 different courses in the mainstream university programs. Our first aim was to understand the perceptions of students regarding the effect of participation and of using the audience participation tools in class, and their effects on students’ learning processes. Moreover, we wanted to identify to what extent their perceptions matched their final course outcomes. Therefore, the tool has been used in different mainstream courses from different departments. The results of our study show that students who participated more, and who thought that the tool helped them to learn, were more engaged, showed increased interest in the course, and eventually achieved the highest scores. This finding supports the view that in-class participation is critical to learning and academic success.
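    A small sketch of the clustering step described above: synthetic participation features (the feature names are invented here) are standardised, clustered with k-means, and the clusters are then compared against final grades.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n = 150
data = pd.DataFrame({
    "in_class_responses": rng.poisson(8, n),   # audience-tool answers
    "stream_logins": rng.poisson(20, n),       # out-of-class activity
    "forum_posts": rng.poisson(3, n),
})
# synthetic outcome loosely tied to participation, for illustration only
data["final_grade"] = (2.0 * data["in_class_responses"]
                       + 0.8 * data["stream_logins"]
                       + rng.normal(0, 10, n))

X = StandardScaler().fit_transform(data.drop(columns="final_grade"))
data["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# compare the average outcome across participation clusters
print(data.groupby("cluster")["final_grade"].mean().round(1))
```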
  • Item
    Realism in synthetic data generation : a thesis presented in fulfilment of the requirements for the degree of Master of Philosophy in Science, School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand
    (Massey University, 2017) McLachlan, Scott
    There are many situations where researchers cannot make use of real data because either the data does not exist in the required format or privacy and confidentiality concerns prevent release of the data. The work presented in this thesis has been undertaken in the context of security and privacy for the Electronic Healthcare Record (EHR). In these situations, synthetic data generation (SDG) methods are sought to create a replacement for real data. In order to be a proper replacement, that synthetic data must be realistic, yet no method currently exists to develop and validate realism in a unified way. This thesis investigates the problem of characterising, achieving and validating realism in synthetic data generation. A comprehensive domain analysis provides the basis for new characterisation and classification methods for synthetic data, as well as a previously undescribed but consistently applied generic SDG approach. In order to achieve realism, an existing knowledge discovery in databases approach is extended to discover realistic elements inherent to real data. This approach is validated through a case study. The case study demonstrates the realism characterisation and validation approaches as well as establishes whether or not the synthetic data is a realistic replacement. This thesis presents the ATEN framework which incorporates three primary contributions: (1) the THOTH approach to SDG; (2) the RA approach to characterise the elements and qualities of realism for use in SDG; and (3) the HORUS approach for validating realism in synthetic data. The ATEN framework presented is significant in that it allows researchers to substantiate claims of success and realism in their synthetic data generation projects. The THOTH approach is significant in providing a new structured way for engaging in SDG. The RA approach is significant in enabling a researcher to discover and specify realism characteristics that must be achieved synthetically. The HORUS approach is significant in providing a new practical and systematic validation method for substantiating and justifying claims of success and realism in SDG works. Future efforts will focus on further validation of the ATEN framework through a controlled multi-stream synthetic data generation process.
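    The THOTH, RA and HORUS approaches are not reproduced here; the sketch below only illustrates the generic pattern of deriving characteristics from real data, generating synthetic values from them, and applying one simple realism check (a two-sample Kolmogorov-Smirnov test), all on made-up data.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# stand-in 'real' EHR-like variable, e.g. patient ages (synthetic here)
real_ages = rng.normal(52, 16, 1000).clip(0, 100)

# 'discover' characteristics of the real data (just mean and sd here),
# then generate synthetic values from those characteristics
mu, sigma = real_ages.mean(), real_ages.std()
synthetic_ages = rng.normal(mu, sigma, 1000).clip(0, 100)

# one simple realism check: are the two distributions indistinguishable?
ks = stats.ks_2samp(real_ages, synthetic_ages)
print(f"KS statistic {ks.statistic:.3f}, p-value {ks.pvalue:.3f}")
```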
  • Item
    Maximising the effectiveness of threat responses using data mining : a piracy case study : this thesis presented in partial fulfillment of the requirements for the degree of Master of Information Sciences in Information Technology, School of Engineering and Advanced Technology at Massey University, Albany, Auckland, New Zealand
    (Massey University, 2015) Lee, Seung Jun
    Companies with limited budgets must decide how best to defend against threats. This thesis presents and develops a robust approach to grouping together threats which present the highest (and lowest) risk, using film piracy as a case study. Techniques like cluster analysis can be used effectively to group together sites based on a wide range of attributes, such as income earned per day and estimated worth. The attributes of high-earning and low-earning websites could also give some useful insight into policy options which might be effective in reducing earnings by pirate websites. For instance, are all low value sites based in a country with effective internet controls? A practical data mining technique such as a decision tree or classification tree could help rightsholders to interpret these attributes. The purpose of analysing the data was to answer the three main research questions of this thesis. It was found that, as predicted, there were two natural clusters of the most complained-about sites (high income and low income). This means that rightsholders should focus their efforts and resources on only high income sites, and ignore the others. It was also found that the main significant factors or key critical variables for separating high-income vs low-income rogue websites included daily page-views, number of internal and external links, social media shares (i.e. social network engagement) and elements of the page structure, including HTML page and JavaScript sizes. Further research should investigate why these factors were important in driving website revenue higher. For instance, why is high revenue associated with smaller HTML pages and less JavaScript? Is it because the pages are simply faster to load? A similar pattern is observed with the number of links. These results could form a study looking into what attributes make e-commerce successful more broadly. It is important to note that this was a preliminary study only looking at the Top 20 rogue websites suggested by the Google Transparency Report (2015). Whilst these account for the majority of complaints, a different picture may emerge if we analysed more sites, and/or selected them based on different sets of criteria, such as the time period, geographic location, content category (software versus movies, for example), and so on. Future research should also extend the clustering technique to other security domains.
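    A sketch of the two analysis steps described above, clustering sites into high- and low-income groups and then fitting a shallow decision tree to see which attributes separate them; the site attributes and income values are synthetic, not the Google Transparency Report data.
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
n = 20   # mirrors a small Top-20-style sample; all values are synthetic
sites = pd.DataFrame({
    "daily_pageviews": rng.lognormal(10, 1, n),
    "external_links": rng.integers(50, 2000, n),
    "social_shares": rng.integers(0, 5000, n),
    "html_kb": rng.integers(20, 400, n),
})
sites["est_daily_income"] = sites["daily_pageviews"] * 0.002 + rng.normal(0, 50, n)

# two natural clusters on estimated income (high vs low earners)
sites["cluster"] = (KMeans(n_clusters=2, n_init=10, random_state=0)
                    .fit_predict(sites[["est_daily_income"]]))

# shallow tree to interpret which attributes separate the clusters
features = ["daily_pageviews", "external_links", "social_shares", "html_kb"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(sites[features], sites["cluster"])
print(export_text(tree, feature_names=features))
```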
  • Item
    A comparison of univariate and multivariate statistical and data mining approaches to the behavioural and biochemical effects of vestibular loss related to the hippocampus : a thesis submitted in partial fulfilment of the requirements of the MApplStat in Applied Statistics, Massey University, Manawatu
    (Massey University, 2013) Smith, Paul F
    Vestibular dysfunction is associated with a complex syndrome of cognitive and anxiety disorders. However, most studies have used simple univariate analyses of the effects of vestibular loss on behaviour and brain function. In this thesis, univariate statistical approaches were compared with multivariate statistical and data mining approaches to the behavioural and neurochemical effects of bilateral vestibular deafferentation (BVD). Using linear mixed model analyses, including repeated measures analyses of variance and analyses with the covariance structure of the repeated measures specified, rats with BVD were found to exhibit increased locomotor activity, reduced rearing and reduced thigmotaxis. By contrast, there were no significant differences between BVD and sham control animals in the elevated plus maze and the BVD animals exhibited a longer escape latency in the elevated T maze, with no change in avoidance latency. In the spatial T maze, the BVD animals demonstrated a significant decrease in accuracy compared to the sham control animals. Using linear discriminant analysis, cluster analysis, random forest classification and support vector machines, BVD animals could be distinguished from sham controls by their behavioural syndrome. Using multiple linear regression and random forest regression, the best predictors of performance in the spatial T maze were whether the animals had received a BVD or sham lesion, and the duration of rearing. In the neurochemical data set, the expression of 5-7 glutamate receptor subunits was measured in 3 different subregions of the rat hippocampus, at various times following BVD, using western blotting. In the 6 month group, half of the animals underwent training in a T-maze. Using multivariate analyses of variance, there was no significant effect of surgery for any hippocampal subregion. Linear discriminant analysis could not determine a linear discriminant function that could separate BVD from sham control animals. A random forest classification analysis was also unsuccessful in this respect. However, for the 6 month data set, T maze training had a significant effect independently of surgery. The results of these experiments suggest that BVD results in profound spatial memory deficits that are not associated with large changes in the expression of glutamate receptors in the hippocampus. The results of the multivariate statistical and data mining analyses, applied to both the behavioural and neurochemical data sets, suggested that research in this field of neuroscience would benefit from analysing multiple variables in relation to one another, rather than simply conducting univariate analyses. Since the different behavioural and neurochemical variables do interact with one another, it is important to determine the nature of these interactions in the analyses conducted. However, this will require researchers to design experiments in which multiple variables can be measured under one set of conditions.
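    A sketch of the classification comparison described above, contrasting linear discriminant analysis with a random forest on synthetic behavioural variables; the variable names, group sizes and effect sizes are invented for illustration and are not the thesis data.
```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 40                              # 20 BVD and 20 sham animals, synthetic
group = np.repeat(["BVD", "sham"], n // 2)
locomotion = np.where(group == "BVD", rng.normal(120, 15, n), rng.normal(90, 15, n))
rearing = np.where(group == "BVD", rng.normal(10, 4, n), rng.normal(22, 4, n))
thigmotaxis = np.where(group == "BVD", rng.normal(0.5, 0.1, n), rng.normal(0.7, 0.1, n))
X = np.column_stack([locomotion, rearing, thigmotaxis])

# compare a linear multivariate classifier with a tree-based one
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("random forest", RandomForestClassifier(n_estimators=200,
                                                             random_state=0))]:
    acc = cross_val_score(model, X, group, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.2f}")
```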
  • Item
    A study of frequent pattern mining in transaction datasets : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Palmerston North, New Zealand
    (Massey University, 2011) Xu, Luofeng
    Within data mining, the efficient discovery of frequent patterns—sets of items that occur together in a dataset—is an important task, particularly in transaction datasets. This thesis develops effective and efficient algorithms for frequent pattern mining, and considers the related problem of how to learn and utilise the characteristics of the particular datasets being investigated. The first problem considered is how to mine frequent closed patterns in dynamic datasets, where updates to the dataset are performed. The standard approach to this problem is to use a standard pattern mining algorithm and simply rerun it on the updated dataset. An alternative method is proposed in this thesis that is significantly more efficient provided that the size of the updates is relatively small. Following this is an investigation of the pattern support distribution of transaction datasets, which measures the numbers of times each pattern appears within the dataset. The evidence for the pattern support distribution of real retail datasets obeying a power law is investigated using qualitative appraisals and statistical goodness-of-fit tests, and the power law is found to be a good model. Based on this, the thesis demonstrates how to efficiently estimate the pattern support distribution based on sampling techniques, reducing the computational cost of finding this distribution. The last major contribution of the thesis is to consider novel ways to set the main user-specified parameter of frequent pattern mining, the minimum support, which defines how many times a pattern needs to be seen before it is ‘frequent’. This is a critical parameter, and very hard to set without a lot of knowledge of the dataset. A method to enable the user to specify looser requirements for what they require from the mining is proposed, based on the assumption of a power-law-based pattern support distribution and fuzzy logic techniques.
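    A sketch of the sampling idea for estimating the support distribution: count supports in a random sample of transactions and scale up to the full dataset. For brevity it counts single-item supports only, whereas the thesis works with general frequent patterns; the heavy-tailed toy data is an assumption made to mimic retail transactions.
```python
import random
from collections import Counter

random.seed(6)

# toy transaction dataset: each transaction is a set of item ids, with
# item popularity roughly heavy-tailed to mimic retail data
items = list(range(1, 201))
weights = [1 / i for i in items]              # heavy-tailed popularity
transactions = [set(random.choices(items, weights=weights, k=8))
                for _ in range(10_000)]

def support_counts(txns):
    """Count how many transactions contain each item."""
    counts = Counter()
    for t in txns:
        counts.update(t)
    return counts

full = support_counts(transactions)
sample = support_counts(random.sample(transactions, 1_000))

# scale sampled supports up to the full dataset size and compare
for item in [1, 10, 100]:
    est = sample[item] * len(transactions) / 1_000
    print(f"item {item}: true support {full[item]}, sampled estimate {est:.0f}")
```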
  • Item
    Quantification of individual rugby player performance through multivariate analysis and data mining : a thesis presented for the fulfilment of the requirements for the degree of Doctor of Philosophy at Massey University, Albany, New Zealand
    (Massey University, 2003) Bracewell, Paul J
    This doctoral thesis examines the multivariate nature of performance to develop a contextual rating system for individual rugby players on a match-by-match basis. The data, provided by Eagle Sports, is a summary of the physical tasks completed by the individual in a match, such as the number of tackles, metres run and number of kicks made. More than 130 variables were available for analysis. Assuming that the successful completion of observed tasks is an expression of ability enables the extraction of the latent dimensionality of the data, or key performance indicators (KPIs), which are the core components of an individual's skill-set. Multivariate techniques (factor analysis) and data mining techniques (self-organising maps and self-supervising feed-forward neural networks) are employed to reduce the dimensionality of match performance data and create KPIs. For this rating system to be meaningful, the underlying model must use suitable data, and the end model itself must be transparent, contextual and robust. The half-moon statistic was developed to promote transparency, understanding and interpretation of dimension reduction neural networks. This novel non-parametric multivariate method is a tool for determining the strength of a relationship between input variables and a single output variable, whilst not requiring prior knowledge of the relationship between the input and output variables. This resolves the issue of transparency, which is necessary to ensure the rating system is contextual. A hybrid methodology is developed to combine the most appropriate KPIs into a contextual, robust and transparent univariate measure for individual performance. The KPIs are collapsed to a single performance measure using an adaptation of quality control ideology where observations are compared with perfection rather than the average to suit the circumstances presented in sport. The use of this performance rating and the underlying key performance indicators is demonstrated in a coaching setting. Individual performance is monitored with the use of control charts enabling changes in form to be identified. This enables the detection of strengths/weaknesses in the individual's underlying skill-set (KPIs) and skills. This process is not restricted to rugby or sports data and is applicable in any field where a summary of multivariate data is required to understand performance.
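    A sketch of the general workflow described above: reduce per-match statistics to a small number of KPI-like factors, collapse them into a single rating, and flag matches that fall outside control limits. The data, the variable names and the simple averaging step are illustrative assumptions; the half-moon statistic and the quality-control adaptation that compares to perfection are not reproduced here.
```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n_matches = 30
# synthetic per-match counts standing in for the Eagle Sports variables
stats = np.column_stack([
    rng.poisson(12, n_matches),   # tackles made
    rng.poisson(60, n_matches),   # metres run
    rng.poisson(5, n_matches),    # kicks
    rng.poisson(3, n_matches),    # turnovers won
])

# extract two latent KPI-like factors from the match statistics
scores = FactorAnalysis(n_components=2, random_state=0) \
    .fit_transform(StandardScaler().fit_transform(stats))

# collapse the KPIs to a single per-match rating (a simple mean here)
rating = scores.mean(axis=1)

# basic control-chart style check: flag matches beyond +/- 2 SD
centre, sd = rating.mean(), rating.std()
flagged = np.where(np.abs(rating - centre) > 2 * sd)[0]
print("matches outside control limits:", flagged.tolist())
```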