Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294


Search Results

Now showing 1 - 10 of 39
  • Item
    Distinguishing plant and milk proteins and their interactions in hybrid cheese using confocal Raman microscopy with machine learning
    (Elsevier Limited, United Kingdom, 2026-01-01) Lu D; McGoverin C; Roy D; Acevedo-Fani A; Singh H; Waterland M; Zheng Y; Ye A
The increasing demand for plant-based alternatives to milk protein has led to the development of hybrid processed cheese analogues (HPCAs) combining plant proteins and casein. However, their complex microstructure and molecular interactions remain poorly understood. This study integrated confocal Raman spectroscopy with advanced machine learning for high-resolution spatial mapping and molecular characterization of HPCAs containing mung bean protein isolate (MPI) or hemp protein isolate (HPI) with casein. This integration helped distinguish between protein sources and elucidate structural changes. The addition of casein changed the HPI structure, promoting structural disorder, disulfide bond rearrangement, and a sharp decrease in the tyrosine doublet ratio from 4.5 in HPI100 to 1.2 in HPI50. Conversely, casein interaction with MPI led to microstructural segregation and changes in β-sheet content (from 53 % in MPI100 to 20 % in MPI30). This integrated method represents a powerful tool for analysing protein structure and interactions in complex food systems.
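The classification step described here can be illustrated with a minimal sketch: a nearest-centroid classifier that assigns a spectrum to a protein source by its correlation with each class's mean spectrum. The spectra below are synthetic and the band positions are arbitrary assumptions, not the authors' Raman pipeline.

```python
import math
import random

random.seed(0)

def synth_spectrum(peak, n=64, noise=0.05):
    # toy "Raman spectrum": a Gaussian band at channel `peak` plus noise
    return [math.exp(-((i - peak) ** 2) / 18.0) + random.gauss(0, noise)
            for i in range(n)]

def correlation(a, b):
    # Pearson correlation between two spectra
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def nearest_centroid(spectrum, centroids):
    # assign to the class whose mean spectrum correlates best
    return max(centroids, key=lambda c: correlation(spectrum, centroids[c]))

# two hypothetical classes: a "plant" band near channel 20, "casein" near 40
train = {"plant": [synth_spectrum(20) for _ in range(10)],
         "casein": [synth_spectrum(40) for _ in range(10)]}
centroids = {c: [sum(col) / len(col) for col in zip(*specs)]
             for c, specs in train.items()}

label = nearest_centroid(synth_spectrum(20), centroids)
```

In practice each pixel of a confocal Raman map would be classified this way, producing the spatial protein-source maps the abstract describes.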
  • Item
    A Hormetic Approach to the Value-Loading Problem: Preventing the Paperclip Apocalypse
    (Springer Nature Singapore Pte Ltd, 2025-10-06) Henry NIN; Pedersen M; Williams M; Martin JLB; Donkin L
    The value-loading problem is a major obstacle to creating Artificial Intelligence (AI) systems that align with human values and preferences. Central to this problem is the establishment of safe limits for repeatable AI behaviors. We introduce hormetic alignment, a paradigm to regulate the behavioral patterns of AI, grounded in the concept of hormesis, where low frequencies or repetitions of a behavior have beneficial effects, while high frequencies or repetitions are harmful. By modeling behaviors as allostatic opponent processes, we can use either Behavioral Frequency Response Analysis (BFRA) or Behavioral Count Response Analysis (BCRA) to quantify the safe and optimal limits of repeatable behaviors. We demonstrate how hormetic alignment solves the ‘paperclip maximizer’ scenario, a thought experiment where an unregulated AI tasked with making paperclips could end up converting all matter in the universe into paperclips. Our approach may be used to help create an evolving database of ‘values’ based on the hedonic calculus of repeatable behaviors with decreasing marginal utility. Hormetic alignment offers a principled solution to the value-loading problem for repeatable behaviors, augmenting current techniques by adding temporal constraints that reflect the diminishing returns of repeated actions. It further supports weak-to-strong generalization – using weaker models to supervise stronger ones – by providing a scalable value system that enables AI to learn and respect safe behavioral bounds. This paradigm opens new research avenues for developing computational value systems that govern not only single actions but the frequency and count of repeatable behaviors.
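The hormetic dose-response idea can be sketched numerically: model the net value of repeating a behavior n times as a benefit with diminishing returns minus a harm that grows with repetition, then take the safe limit as the largest count whose net value is still non-negative. The functional forms below are illustrative assumptions, not the paper's BFRA/BCRA formulation.

```python
import math

def net_value(n, benefit=1.0, harm=0.02):
    # inverted-U hormetic curve: logarithmic benefit (diminishing returns)
    # minus quadratic harm from over-repetition
    return benefit * math.log1p(n) - harm * n ** 2

def safe_limit(max_n=1000):
    # largest repetition count whose net value is still non-negative
    return max(n for n in range(max_n + 1) if net_value(n) >= 0)

peak = max(range(1001), key=net_value)  # optimal repetition count
limit = safe_limit()                    # safe upper bound on repetitions
```

With these parameters the curve peaks at 5 repetitions and crosses zero after 11, mirroring the idea that a behavior is beneficial at low counts and harmful beyond a quantifiable bound.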
  • Item
    Accurate machine learning model for human embryo morphokinetic stage detection
    (Springer Science+Business Media, LLC, 2025-08-20) Misaghi H; Cree L; Knowlton N
Purpose: The ability to detect, monitor, and precisely time the morphokinetic stages of human pre-implantation embryo development plays a critical role in assessing their viability and potential for successful implantation. Therefore, there is a need for accurate and accessible tools to analyse embryos. This work describes a highly accurate machine learning model designed to predict 17 morphokinetic stages of pre-implantation human development, an improvement on existing models. This model provides a robust tool for researchers and clinicians, enabling the automation of morphokinetic stage prediction, standardising the process, and reducing subjectivity between clinics. Method: A computer vision model was built on a publicly available dataset for embryo morphokinetic stage detection. The dataset contained 273,438 labelled Embryoscope/Embryoscope+ embryo images. The dataset was split 70/10/20 into training/validation/test sets. Two different deep learning architectures were trained and tested, one using EfficientNet-V2-Large and the other using EfficientNet-V2-Large with the addition of fertilisation time as input. A new postprocessing algorithm was developed to reduce noise in the predictions of the deep learning model and detect the exact time of each morphokinetic stage change. Results: The proposed model reached an overall test F1-score of 0.881 and accuracy of 87% across 17 morphokinetic stages on an independent test set. Conclusion: The proposed model shows a 17% accuracy improvement compared to the best models on the same dataset. Therefore, our model can accurately detect morphokinetic stages in static embryo images as well as detect the exact timings of stage changes in a complete time-lapse video.
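The post-processing idea, cleaning noisy frame-by-frame stage predictions and extracting stage-change times, can be sketched as a sliding-window majority vote followed by a monotonicity constraint (development stages cannot go backwards). This is a generic sketch under those assumptions, not the paper's published algorithm.

```python
from collections import Counter

def smooth(stages, win=5):
    # sliding-window majority vote over per-frame stage predictions
    half = win // 2
    out = []
    for i in range(len(stages)):
        window = stages[max(0, i - half): i + half + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out

def enforce_monotonic(stages):
    # embryo stages only advance: clamp each frame to the running maximum
    out, cur = [], stages[0]
    for s in stages:
        cur = max(cur, s)
        out.append(cur)
    return out

def transitions(stages):
    # frame indices where the predicted stage changes
    return [i for i in range(1, len(stages)) if stages[i] != stages[i - 1]]

# noisy per-frame stage predictions from a hypothetical classifier
noisy = [1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 3, 3, 2, 3, 3]
clean = enforce_monotonic(smooth(noisy))
```

Mapping the surviving transition indices back to frame timestamps would give the exact stage-change times the paper's pipeline reports.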
  • Item
    A machine learning-guided semi-empirical model for predicting single-sided natural ventilation rates
    (Elsevier B V, 2025-10-01) Han JM; Wu W; Malkawi A
Most state-of-the-art natural ventilation models were developed for a single mode: single-sided ventilation, cross ventilation, or buoyancy-driven ventilation. Natural ventilation (NV) of a single zone, however, may vary between modes across seasons depending on the design and operation of other building systems. This paper tailors machine-learning-embedded semi-empirical models to predict the natural ventilation rate in a single zone. Model development consists of two parts: 1) semi-empirical model development for single-sided ventilation with a local context, and 2) a machine-learning-driven component to accurately predict a specific lab condition. Using a case study, a series of steps was taken to validate model accuracy against estimated flowrates for given window operable areas. Firstly, the contextual inputs, localized wind speed, and window models were investigated. Finally, we developed a machine learning model to predict the localized lab environment using pressure sensor data on the façade. A random forest model was trained and fine-tuned to predict localized pressure coefficients (Cp). Over 75 % of the predicted values fall within the model's ± 1 standard deviation credible interval, demonstrating not only high predictive reliability but also suitability for integration into empirical ventilation models. These results highlight the model's potential as a robust input generator for semi-empirical frameworks with locally collected weather data, particularly in applications involving window operation control and site-specific model calibration.
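The role of a predicted pressure coefficient in a semi-empirical model can be sketched with the standard orifice equation, where wind pressure on the façade is derived from Cp and a reference wind speed. The discharge coefficient, window area, and wind speed below are assumed example values; in the paper's framework Cp would come from the trained random forest rather than being fixed.

```python
import math

RHO = 1.2  # air density, kg/m^3

def wind_pressure(cp, u_ref):
    # façade wind pressure from a pressure coefficient and reference speed:
    # dP = 0.5 * rho * Cp * U^2  (Pa)
    return 0.5 * RHO * cp * u_ref ** 2

def orifice_flow(cd, area, dp):
    # standard orifice equation: Q = Cd * A * sqrt(2*|dP|/rho)  (m^3/s)
    return cd * area * math.sqrt(2 * abs(dp) / RHO)

dp = wind_pressure(cp=0.4, u_ref=3.0)       # ~2.16 Pa
q = orifice_flow(cd=0.6, area=1.0, dp=dp)   # flow through a 1 m^2 opening
```

Feeding ML-predicted, site-specific Cp values into this kind of equation is what makes the framework "machine-learning-guided" while keeping the semi-empirical physics intact.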
  • Item
    Essays on finance and deep learning : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Finance, School of Economics and Finance, Massey University
    (Massey University, 2025-07-25) Pan, Guoyao
    This thesis aims to broaden the application of deep learning techniques in financial research and comprises three essays that make meaningful contributions to the related literature. Essay One integrates deep learning into the Hub Strategy, a novel chart pattern analysis method, to develop trading strategies. Utilizing deep learning models, which analyze chart patterns alongside data such as trading volume, price volatility, and sentiment indicators, the strategy forecasts stock price movements. Tests on U.S. S&P 500 index stocks indicate that Hub Strategy trading methods, when integrated with deep learning models, achieve an annualized average return of approximately 25%, significantly outperforming the benchmark buy-and-hold strategy's 9.6% return. Risk-adjusted metrics, including Sharpe ratios and Jensen’s alpha, consistently demonstrate the superiority of these trading strategies over both the buy-and-hold approach and standalone Hub Strategy trading rules. To address data snooping concerns, multiple tests validate profitability, and an asset pricing model with 153 risk factors and Lasso-OLS (Ordinary Least Squares) regressions confirms its ability to capture positive alphas. Essay Two utilizes deep learning techniques to explore the relationships between the abnormal return and its explanatory variables, including firm-specific characteristics and realized stock returns. Trained deep learning models effectively predict the estimated abnormal return directly. We evaluate the effectiveness of detecting abnormal returns by comparing our deep learning models against three benchmark methods. When applied to a random dataset, deep learning models demonstrate a significant improvement in identifying abnormal returns within the induced range of -3% to 3%. Moreover, their performance remains consistent across non-random datasets classified by firm size and market conditions. 
In addition, a regression of abnormal return prediction errors on firm-based factors, market conditions, and periods reveals that deep learning models are less sensitive to variables like firm size, market conditions, and periods than the benchmarks. Essay Three assesses the performance of deep learning predictors in forecasting momentum turning points using the confusion matrix and comparing them to the benchmark model proposed by Goulding, Harvey, and Mazzoleni (2023). Tested on U.S. stocks from January 1990 to December 2023, deep learning predictors demonstrate higher accuracy in identifying turning points than the benchmark. Furthermore, our deep learning-based trading rules yield higher mean log returns and Sharpe ratios, along with lower volatility, compared to the benchmark. Two models achieve average monthly returns of 0.0148 and 0.0177, surpassing the benchmark’s 0.0108. These gains are both economically and statistically significant, with consistent annual results. Regression analysis also shows that our models respond more effectively to changes in stock and market return volatility than the benchmark. Overall, these essays expand the application of deep learning in finance research, demonstrating high predictive accuracy, enhanced trading profitability, and effective detection of long-term abnormal returns, all of which hold significant practical value.
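The risk-adjusted comparisons in these essays rest on standard metrics; a minimal sketch of the annualized Sharpe ratio and mean monthly log return follows, using made-up return numbers rather than the thesis's data.

```python
import math
import statistics

def annualized_sharpe(monthly_returns, rf_monthly=0.0):
    # mean excess return over its volatility, scaled monthly -> annual
    excess = [r - rf_monthly for r in monthly_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(12)

def mean_log_return(monthly_returns):
    # mean of log(1 + r), the quantity compared across trading rules
    return statistics.mean(math.log1p(r) for r in monthly_returns)

# hypothetical monthly returns for an illustrative strategy
strategy = [0.02, -0.01, 0.03, 0.01, 0.015, -0.005]
sharpe = annualized_sharpe(strategy)
```

Comparing such statistics between a deep-learning rule and a buy-and-hold benchmark is the kind of evaluation the essays report (e.g., monthly means of 0.0148 and 0.0177 versus the benchmark's 0.0108).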
  • Item
    Source attribution models using random forest for whole genome sequencing data : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics, School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand
    (Massey University, 2025-07-14) Smith, Helen
Foodborne diseases, such as campylobacteriosis, represent a significant risk to public health. Preventing the spread of Campylobacter species requires knowledge of sources of human infection. Current methods of source attribution are designed to be used with a small number of genes, such as the seven housekeeping genes of the original multilocus sequence typing (MLST) scheme, and encounter issues when presented with whole genome data. Higher resolution data, however, offers the potential to differentiate within source groups (i.e., between different ruminant species in addition to differentiating between ruminants and poultry), which is poorly achieved with current methods. Random forest is a tree-based machine learning algorithm which is suitable for analysing data sets with large numbers of predictor variables, such as whole genome sequencing data. A known issue with tree-based predictive models occurs when new levels of a variable are present in an observation for prediction which were not present in the set of observations with which the model was trained. This is almost certain to occur with genomic data, which has a potentially ever-growing set of alleles for any single gene. This thesis investigates the use of ordinal encoding of categorical variables to address the ‘absent levels’ problem in random forest models. Firstly, a method of encoding is adapted, based on correspondence analysis (CA) of a class by level contingency table, to be unbiased in the presence of absent levels. Secondly, a new method of encoding is introduced which utilises a set of supplementary information on the category levels themselves (i.e., the sequence information of alleles) and encodes them, as well as any new levels, according to their similarity or dissimilarity to each other via the method of principal coordinates analysis (PCO).
Thirdly, based on the method of canonical analysis of principal coordinates (CAP), the encoding information of the levels from the CA on the contingency table is combined with the encoding information of the levels from the PCO on the dissimilarity matrix of the supplementary levels information, via canonical correlation analysis (CCorA). Potential issues when using out-of-bag (OOB) data following variable encoding are then explored and an adaptation to the holdout variable importance method is introduced which is suitable for use with all methods of encoding. This thesis finishes by applying the CAP method of encoding to a random forest predictive model for source attribution of whole genome sequencing data from the Source Assigned Campylobacteriosis in New Zealand (SACNZ) study. The advantage of adding core genes and accessory genes as predictor variables is investigated, and the attribution results are compared to the results from a previously published study which used the asymmetric island model on the same set of isolates and the seven MLST genes.
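The 'absent levels' idea, assigning a numeric code to an allele never seen in training, can be sketched by giving an unseen allele the code of its most similar known allele under a sequence dissimilarity measure (here a simple Hamming distance; the thesis's PCO/CAP machinery is substantially richer than this). All alleles, sequences, and codes below are invented.

```python
def hamming(a, b):
    # proportion of mismatching positions between equal-length sequences
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def encode(allele, known_codes, sequences):
    # known allele -> its trained ordinal code;
    # absent allele -> code of the most similar known allele
    if allele in known_codes:
        return known_codes[allele]
    nearest = min(known_codes,
                  key=lambda k: hamming(sequences[allele], sequences[k]))
    return known_codes[nearest]

sequences = {"a1": "ACGTACGT", "a2": "ACGTACGA", "a3": "TTGTACGT",
             "a9": "ACGAACGT"}            # a9 was absent from training
known_codes = {"a1": 0.7, "a2": 0.9, "a3": 0.1}

code = encode("a9", known_codes, sequences)
```

Because the encoding is driven by sequence similarity rather than an arbitrary factor ordering, a never-before-seen allele lands near alleles it resembles, which is the property the thesis exploits.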
  • Item
    Modelling and mapping of subsurface nitrate-attenuation index in agricultural landscapes
    (Elsevier Ltd, 2025-06) Collins SB; Singh R; Mead SR; Horne DJ; Zhang L
Environmental management of nutrient losses from agricultural lands is required to reduce their potential impacts on the quality of groundwater and eutrophication of surface waters in agricultural landscapes. However, accurate accounting and management of nitrogen losses relies on robust modelling of nitrogen leaching and its potential attenuation – specifically, the reduction of nitrate to gaseous forms of nitrogen – in subsurface flow pathways. Subsurface denitrification is a key process in potential nitrate attenuation, but the spatial and temporal dynamics of where and when it occurs remain poorly understood, especially at catchment scale. In this paper, a novel Landscape Subsurface Nitrate-Attenuation Index (LSNAI) is developed to map the spatially variable subsurface nitrate attenuation potential of diverse landscape units across the Manawatū-Whanganui region of New Zealand. A large data set of groundwater quality across New Zealand was collated and analysed to assess spatial and temporal variability of groundwater redox status (based on dissolved oxygen, nitrate and dissolved manganese) across different hydrogeological settings. The Extreme Gradient Boosting algorithm was used to predict landscape unit subsurface redox status by integrating the nationwide groundwater redox status data set with various landscape characteristics. Applying hierarchical clustering and unsupervised classification techniques, the LSNAI was then developed to identify and map five landscape subsurface nitrate attenuation classes, ranging from very low to very high potential, based on the predicted groundwater redox status probabilities; soil drainage and rock type were identified as the key influencing landscape characteristics. Accuracy of the LSNAI mapping was further investigated and validated using a set of independent observations of groundwater quality and redox assessments in shallow groundwaters in the study area.
This highlights the potential for further research in up-scaling the mapping and modelling of the landscape subsurface nitrate-attenuation index to account accurately for spatial variability in subsurface nitrate attenuation potential when modelling and assessing water quality management measures at catchment scale in agricultural landscapes.
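The gradient boosting step can be illustrated with the core idea it shares with any such method, including Extreme Gradient Boosting: repeatedly fit a weak learner (here a one-split decision stump) to the current residuals and add a damped copy of it to the ensemble. This toy pure-Python version with squared loss stands in for XGBoost; the features and labels are invented (e.g., a 'soil drainage' and a 'rock type' score predicting a reduced/oxic redox label).

```python
from statistics import mean

def fit_stump(X, resid):
    # best single (feature, threshold) split minimizing squared error
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [r for x, r in zip(X, resid) if x[j] <= t]
            right = [r for x, r in zip(X, resid) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def boost(X, y, rounds=25, lr=0.3):
    # gradient boosting with squared loss: each stump fits the residuals
    pred, ensemble = [0.0] * len(y), []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, resid)
        ensemble.append((j, t, lm, rm))
        pred = [p + lr * (lm if x[j] <= t else rm) for p, x in zip(pred, X)]
    return ensemble

def predict(ensemble, x, lr=0.3):
    return sum(lr * (lm if x[j] <= t else rm) for j, t, lm, rm in ensemble)

# toy data: [drainage score, rock-type score] -> 1 = reduced (attenuating)
X = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.2, 0.7], [0.1, 0.9], [0.3, 0.8]]
y = [0, 0, 0, 1, 1, 1]
model = boost(X, y)
```

Thresholding the ensemble's continuous output then yields a predicted redox class, analogous to the class probabilities the paper clusters into attenuation classes.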
  • Item
    Novel machine learning-driven comparative analysis of CSP, STFT, and CSP-STFT fusion for EEG data classification across multiple meditation and non-meditation sessions in BCI pipeline.
    (BioMed Central Ltd, 2025-02-08) Liyanagedera ND; Bareham CA; Kempton H; Guesgen HW
This study focuses on classifying multiple sessions of loving kindness meditation (LKM) and non-meditation electroencephalography (EEG) data. This novel study uses multiple sessions of EEG data from a single individual to train a machine learning pipeline, and then uses data from a new session from the same individual for classification. Here, two meditation techniques, LKM-Self and LKM-Others, were compared with non-meditation EEG data for 12 participants. Among the many tested, three BCI pipelines produced promising results, successfully detecting features in meditation/non-meditation EEG data. While testing different feature extraction algorithms, a common neural network structure was used as the classification algorithm to compare the performance of the feature extraction algorithms. In two of those pipelines, Common Spatial Patterns (CSP) and Short Time Fourier Transform (STFT) were successfully used as feature extraction algorithms; both are significantly new for meditation EEG. As a novel concept, the third BCI pipeline used a feature extraction algorithm that fused the features of CSP and STFT, achieving the highest classification accuracies among all tested pipelines. Analyses were conducted using EEG data of 3, 4 or 5 sessions, totaling 3960 tests on the entire dataset. Considering all the tests, the overall classification accuracy using CSP alone was 67.1%, and it was 67.8% for STFT alone. The algorithm combining the features of CSP and STFT achieved an overall classification accuracy of 72.9%, more than 5% higher than the other two pipelines. At the same time, the highest mean classification accuracy across the 12 participants, 75.5%, was achieved by the fused CSP-STFT pipeline for LKM-Self/non-meditation in the case of 5 sessions of data.
Additionally, the highest individual classification accuracy of 88.9% was obtained by participant no. 14. Furthermore, the results showed that the classification accuracies for all three pipelines increased as the number of training sessions increased from 2 to 3 and then to 4. The study successfully classified a new session of meditation/non-meditation EEG data after training machine learning algorithms on a different set of sessions, an achievement that will be beneficial in the development of algorithms that support meditation.
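The fusion pipeline can be sketched in miniature: extract short-time Fourier magnitudes from an EEG-like signal, standardize them, and concatenate them with a precomputed CSP feature vector (assumed given here, since CSP itself requires an eigendecomposition over multichannel covariance matrices) to form the fused classifier input. The naive DFT below is for illustration only.

```python
import cmath
import math

def stft_magnitudes(signal, win, hop):
    # naive short-time Fourier transform: per-window DFT magnitudes
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = signal[start:start + win]
        frames.append([abs(sum(seg[n] * cmath.exp(-2j * cmath.pi * k * n / win)
                               for n in range(win)))
                       for k in range(win // 2)])
    return frames

def zscore(v):
    # per-feature-set standardization before fusion
    m = sum(v) / len(v)
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / len(v)) or 1.0
    return [(x - m) / sd for x in v]

def fuse(csp_features, stft_frame):
    # CSP-STFT fusion: standardize each feature set, then concatenate
    return zscore(csp_features) + zscore(stft_frame)

# a pure tone at DFT bin 2 should dominate its window's spectrum
tone = [math.sin(2 * math.pi * 2 * n / 16) for n in range(16)]
frame = stft_magnitudes(tone, win=16, hop=16)[0]
fused = fuse([0.4, 1.3, 0.7], frame)  # hypothetical CSP variance features
```

The fused vector would then feed the common neural network classifier that the study used to compare feature extraction algorithms.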
  • Item
    On the origin of optical rotation changes during the κ-carrageenan disorder-to-order transition
    (Elsevier Ltd., 2024-06-01) Westberry BP; Rio M; Waterland MR; Williams MAK
    It is well established that solutions of both polymeric and oligomeric κ-carrageenan exhibit a clear change in optical rotation (OR), in concert with gel-formation for polymeric samples, as the solution is cooled in the presence of certain ions. The canonical interpretation - that this OR change reflects a 'coil-to-helix transition' in single chains - has seemed unambiguous; the solution- or 'disordered'-state structure has ubiquitously been assumed to be a 'random coil', and the helical nature of carrageenan in the solid-state was settled in the 1970s. However, recent work has found that κ-carrageenan contains substantial helical secondary structure elements in the disordered-state, raising doubts over the validity of this interpretation. To investigate the origins of the OR, density-functional theory calculations were conducted using atomic models of κ-carrageenan oligomers. Changes were found to occur in the predicted OR owing purely to dimerization of chains, and - together with the additional effects of slight changes in conformation that occur when separated helical chains form double-helices - the predicted OR changes are qualitatively consistent with experimental results. These findings contribute to a growing body of evidence that the carrageenan 'disorder-to-order' transition is a cooperative process, and have further implications for the interpretation of OR changes demonstrated by macromolecules in general.
  • Item
    Nphos: Database and Predictor of Protein N-phosphorylation.
    (Oxford University Press, 2024-04-10) Zhao M-X; Ding R-F; Chen Q; Meng J; Li F; Fu S; Huang B; Liu Y; Ji Z-L; Zhao Y; Xue Y
Protein N-phosphorylation is widely present in nature and participates in various biological processes. However, current knowledge on N-phosphorylation is extremely limited compared to that on O-phosphorylation. In this study, we collected 11,710 experimentally verified N-phosphosites of 7344 proteins from 39 species and subsequently constructed the database Nphos to share up-to-date information on protein N-phosphorylation. Building on these substantial data, we characterized the sequential and structural features of protein N-phosphorylation. Moreover, after comparing hundreds of learning models, we chose and optimized gradient boosting decision tree (GBDT) models to predict three types of human N-phosphorylation, achieving mean area under the receiver operating characteristic curve (AUC) values of 90.56%, 91.24%, and 92.01% for pHis, pLys, and pArg, respectively. Meanwhile, we discovered 488,825 distinct N-phosphosites in the human proteome. The models were also deployed in Nphos for interactive N-phosphosite prediction. In summary, this work provides new insights and tools for both flexible and focused investigations of N-phosphorylation. It will also facilitate a deeper and more systematic understanding of protein N-phosphorylation modification by providing a data and technical foundation. Nphos is freely available at http://www.bio-add.org/Nphos/ and http://ppodd.org.cn/Nphos/.
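Feature construction for a site predictor like this typically encodes the residues flanking each candidate His/Lys/Arg; a minimal sketch follows (a plain one-hot sequence window, not Nphos's actual feature set, and an invented example sequence).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def candidate_sites(seq):
    # positions of residues that can carry N-phosphorylation (His/Lys/Arg)
    return [i for i, aa in enumerate(seq) if aa in "HKR"]

def window_features(seq, pos, flank=3):
    # one-hot encode the residues in a window around the candidate site;
    # out-of-range or non-standard positions encode as all zeros
    feats = []
    for i in range(pos - flank, pos + flank + 1):
        onehot = [0] * len(AMINO_ACIDS)
        if 0 <= i < len(seq) and seq[i] in AMINO_ACIDS:
            onehot[AMINO_ACIDS.index(seq[i])] = 1
        feats.extend(onehot)
    return feats

seq = "MKHACRLLHD"          # hypothetical protein fragment
sites = candidate_sites(seq)
```

Each candidate site's feature vector would then be scored by the trained GBDT model to decide whether it is a predicted N-phosphosite.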