Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 10 of 30
  • Item
    Essays on finance and deep learning : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Finance, School of Economics and Finance, Massey University
    (Massey University, 2025-07-25) Pan, Guoyao
    This thesis aims to broaden the application of deep learning techniques in financial research and comprises three essays that make meaningful contributions to the related literature. Essay One integrates deep learning into the Hub Strategy, a novel chart pattern analysis method, to develop trading strategies. Utilizing deep learning models, which analyze chart patterns alongside data such as trading volume, price volatility, and sentiment indicators, the strategy forecasts stock price movements. Tests on U.S. S&P 500 index stocks indicate that Hub Strategy trading methods, when integrated with deep learning models, achieve an annualized average return of approximately 25%, significantly outperforming the benchmark buy-and-hold strategy's 9.6% return. Risk-adjusted metrics, including Sharpe ratios and Jensen’s alpha, consistently demonstrate the superiority of these trading strategies over both the buy-and-hold approach and standalone Hub Strategy trading rules. To address data snooping concerns, multiple tests validate profitability, and an asset pricing model with 153 risk factors and Lasso-OLS (Ordinary Least Squares) regressions confirms its ability to capture positive alphas. Essay Two utilizes deep learning techniques to explore the relationships between the abnormal return and its explanatory variables, including firm-specific characteristics and realized stock returns. Trained deep learning models effectively predict the estimated abnormal return directly. We evaluate the effectiveness of detecting abnormal returns by comparing our deep learning models against three benchmark methods. When applied to a random dataset, deep learning models demonstrate a significant improvement in identifying abnormal returns within the induced range of -3% to 3%. Moreover, their performance remains consistent across non-random datasets classified by firm size and market conditions. 
In addition, a regression of abnormal return prediction errors on firm-based factors, market conditions, and periods reveals that deep learning models are less sensitive to variables like firm size, market conditions, and periods than the benchmarks. Essay Three assesses the performance of deep learning predictors in forecasting momentum turning points, using the confusion matrix to compare them against the benchmark model proposed by Goulding, Harvey, and Mazzoleni (2023). Tested on U.S. stocks from January 1990 to December 2023, deep learning predictors demonstrate higher accuracy in identifying turning points than the benchmark. Furthermore, our deep learning-based trading rules yield higher mean log returns and Sharpe ratios, along with lower volatility, compared to the benchmark. Two models achieve average monthly returns of 0.0148 and 0.0177, surpassing the benchmark’s 0.0108. These gains are both economically and statistically significant, with consistent annual results. Regression analysis also shows that our models respond more effectively to changes in stock and market return volatility than the benchmark. Overall, these essays expand the application of deep learning in finance research, demonstrating high predictive accuracy, enhanced trading profitability, and effective detection of long-term abnormal returns, all of which hold significant practical value.
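As an aside on the metrics cited above: annualized return and the Sharpe ratio are standard quantities. The following is a minimal, self-contained sketch, not the thesis's code; the 252-day trading year and the toy return series are assumptions for illustration.

```python
import math

def annualized_return(daily_returns, periods_per_year=252):
    """Geometric annualized return from a series of simple daily returns."""
    growth = 1.0
    for r in daily_returns:
        growth *= (1.0 + r)
    years = len(daily_returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)

# Toy example: a constant 0.1% daily return plus a small alternating wiggle.
rets = [0.001 + (0.002 if i % 2 == 0 else -0.002) for i in range(504)]
ann = annualized_return(rets)
sr = sharpe_ratio(rets)
```

On real data, these functions would be applied to the daily return series of a trading strategy and of the buy-and-hold benchmark to reproduce the kind of comparison the abstract reports.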
  • Item
    Source attribution models using random forest for whole genome sequencing data : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics, School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand
    (Massey University, 2025-07-14) Smith, Helen
Foodborne diseases, such as campylobacteriosis, represent a significant risk to public health. Preventing the spread of Campylobacter species requires knowledge of the sources of human infection. Current methods of source attribution are designed to be used with a small number of genes, such as the seven housekeeping genes of the original multilocus sequence typing (MLST) scheme, and encounter issues when presented with whole genome data. Higher resolution data, however, offers the potential to differentiate within source groups (i.e., between different ruminant species in addition to differentiating between ruminants and poultry), which is poorly achieved with current methods. Random forest is a tree-based machine learning algorithm suitable for analysing data sets with large numbers of predictor variables, such as whole genome sequencing data. A known issue with tree-based predictive models occurs when an observation to be predicted contains levels of a variable that were not present in the set of observations with which the model was trained. This is almost certain to occur with genomic data, which has a potentially ever-growing set of alleles for any single gene. This thesis investigates the use of ordinal encoding of categorical variables to address the ‘absent levels’ problem in random forest models. Firstly, a method of encoding based on correspondence analysis (CA) of a class-by-level contingency table is adapted to be unbiased in the presence of absent levels. Secondly, a new method of encoding is introduced which utilises a set of supplementary information on the category levels themselves (i.e., the sequence information of alleles) and encodes them, as well as any new levels, according to their similarity or dissimilarity to each other via the method of principal coordinates analysis (PCO).
Thirdly, based on the method of canonical analysis of principal coordinates (CAP), the encoding information of the levels from the CA on the contingency table is combined with the encoding information of the levels from the PCO on the dissimilarity matrix of the supplementary levels information, with a classical correspondence analysis (CCorA). Potential issues when using out-of-bag (OOB) data following variable encoding are then explored and an adaptation to the holdout variable importance method is introduced which is suitable for use with all methods of encoding. This thesis finishes by applying the CAP method of encoding to a random forest predictive model for source attribution of whole genome sequencing data from the Source Assigned Campylobacteriosis in New Zealand (SACNZ) study. The advantage of adding core genes and accessory genes as predictor variables is investigated, and the attribution results are compared to the results from a previously published study which used the asymmetric island model on the same set of isolates and the seven MLST genes.
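To make the ‘absent levels’ problem concrete, here is a deliberately crude frequency encoding; the CA-, PCO- and CAP-based methods of the thesis are far more sophisticated, and the function name, toy data and neutral fallback value below are illustrative assumptions only.

```python
def frequency_encode(train_levels, train_classes, neutral=0.5):
    """Encode each categorical level by the proportion of class-1 training
    observations carrying it (a crude stand-in for a correspondence-analysis
    score), with a neutral fallback for absent levels."""
    counts = {}
    for lvl, cls in zip(train_levels, train_classes):
        n1, n = counts.get(lvl, (0, 0))
        counts[lvl] = (n1 + (1 if cls == 1 else 0), n + 1)
    encoding = {lvl: n1 / n for lvl, (n1, n) in counts.items()}

    def encode(level):
        # Absent levels (alleles never seen in training) fall back to a
        # neutral score instead of breaking prediction.
        return encoding.get(level, neutral)

    return encode

# Alleles 'a'..'c' observed in training; 'd' is an absent level at prediction time.
enc = frequency_encode(['a', 'a', 'b', 'b', 'c'], [1, 0, 1, 1, 0])
```

A numeric encoding like this is what lets a random forest split on an ordered axis rather than on raw category labels; the thesis's contribution is choosing scores that remain unbiased, and informative, when new alleles appear.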
  • Item
    Enhancing statistical wind speed forecasting models : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Manawatū Campus, New Zealand
    (Massey University, 2022) Yousuf, Muhammad Uzair
In recent years, wind speed forecasting models have seen significant development and growth. In particular, hybrid models have been emerging since the last decade. Hybrid models combine two or more techniques from several categories, with each model utilizing its distinct strengths. Mainly, data-driven models that include statistical and Artificial Intelligence/Machine Learning (AI/ML) models are deployed in hybrid models for shorter forecasting time horizons (< 6 hrs). Literature studies show that machine learning models have gained enormous potential owing to their accuracy and robustness. On the other hand, only a handful of studies are available on the performance enhancement of statistical models, despite the fact that hybrid models are incomplete without statistical models. To address this knowledge gap, this thesis identifies the shortcomings of traditional statistical models and enhances their prediction accuracy. Three statistical models are considered for analysis: the Grey Model [GM(1,1)], Markov Chain, and Holt’s Double Exponential Smoothing models. Initially, the problems that limit the forecasting models' applicability are highlighted. Such issues include negative wind speed predictions, failure to meet predetermined accuracy levels, non-optimal estimates, and additional computational cost with limited performance. To address these concerns, improved forecasting models are proposed using wind speed data from Palmerston North, New Zealand. Several methodologies have been developed to improve model performance and fulfill the necessary and sufficient conditions. These approaches include a dynamic moving window adjustment, a self-adaptive state categorization algorithm, an approach similar to the leave-one-out method, and a mixed initialization method. With the application of hybrid methods in view, novel MODWT-ARIMA-Markov and AGO-HDES models are further proposed as secondary objectives.
Also, a comprehensive analysis is presented comparing sixteen models from three categories, each across four case studies, three rolling windows, and three forecasting horizons. Overall, the improved models showed higher accuracy than their traditional counterparts. Finally, future directions that need subsequent research to further improve forecasting performance are highlighted.
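Of the three statistical models studied, Holt's double exponential smoothing is the easiest to sketch. The snippet below is a minimal textbook implementation, not the thesis's improved variant; the initialization (level = first value, trend = first difference) and the smoothing parameters are assumptions.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
    """Holt's double (linear) exponential smoothing.
    Simple initialization: level = first value, trend = first difference."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)   # level update
        trend = beta * (level - prev_level) + (1 - beta) * trend # trend update
    return level + horizon * trend

# A perfectly linear wind-speed-like series: Holt's method tracks it exactly.
speeds = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
fc = holt_forecast(speeds, horizon=1)
```

On this linear series the level and trend lock onto the data, so the one-step forecast is 7.0; the mixed initialization method the thesis proposes targets the realistic cases where such naive initialization performs poorly.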
  • Item
    A quantitative situation analysis model for strategic planning in quantity surveying firms : a thesis presented in partial fulfilment of the requirements of the degree of Doctor of Philosophy (PhD) in Construction at Massey University, Albany Campus, New Zealand
    (Massey University, 2021) Frei, Marcel
Quantity Surveying (QS) firms, like all organisations, must continuously formulate and execute the strategies required to enable them to survive and succeed in a constantly changing business environment. Key challenges that firms are required to grapple with include the rapid pace of technological advances affecting professional practice, intense internal competition, and the struggle to attract and retain key talent. In the midst of these operational challenges, QS firm leaders must also dedicate resources to planning and executing strategy. Unfortunately, strategic planning in QS firms is often ad-hoc or neglected, and there is a distinct lack of frameworks and tools specific to the QS context. This study set out to redress this gap in literature and theory by providing, firstly, a framework of key factors to be considered in a situation analysis – the core activity of the Design School approach to strategic planning – and, secondly, a quantitative model based on that framework to enable firms to diagnose their Strategic Health – that is, their current performance and areas for improvement and optimisation – prior to formulating, selecting and executing strategic options to achieve their mission and vision. To achieve this, this study takes a multi-stage mixed methods approach. Firstly, following a review of the literature, in-depth semi-structured exploratory interviews were undertaken with key leaders in the Australian and New Zealand QS profession, leading to the development of a situation analysis (SA) framework of 28 External Factors and 26 Internal Factors. Two stages of descriptive survey were undertaken (in 2013 and 2020), which enabled the development of a quantitative Strategic Health model based on the framework Factors. Finally, the developed model was tested amongst five similar case study firms. Based on the case study results, the developed model correlates strongly with five self-reported measures of success.
The developed SA framework provides QS firms with empirically validated terms of reference when undertaking SA as part of their own strategic planning process. Due to the relatively small sample sizes involved, caution is urged in applying the developed Strategic Health model to situations outside of the population samples in the study. Further testing of the model in larger population samples or in associated industries is recommended for future research. Keywords: quantity surveying, situation analysis, strategic health, strategic planning, Australasia
  • Item
    Investigation of genotype and phenotype interactions using computational statistics : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, New Zealand
    (Massey University, 2021) Angelin-Bonnet, Olivia
Deciphering the precise mechanisms by which variations at the DNA level impact measurable characteristics of organisms, coined phenotypes, through the actions of complex molecular networks is a critical topic in modern biology. Such knowledge has implications spanning numerous fields, from plant or animal breeding to medicine. To this end, statistical methods must be leveraged to extract information from molecular measurements of different cellular scales, allowing us to reconstruct the regulatory networks mediating the impact of genotype variations on a phenotype of interest. In this thesis, I investigate the use of causal inference methods to infer relationships amongst a set of biological entities from observational data. More specifically, I tackled the reconstruction of multi-omics molecular networks linking genotype to phenotype. In the first part, I developed a simulator that generates benchmark gene expression data, i.e. RNA and protein levels, from synthetic gene regulatory networks. The originality of my work is that it includes transcriptional and post-transcriptional regulation amongst genes. I used the developed simulation tool to evaluate and compare the performance of state-of-the-art causal inference methods in reconstructing causal relationships between the genes. The evaluation focused on the ability of the methods to reconstruct relationships mediated by post-transcriptional regulations from observational transcriptomics data. I also evaluated the methods’ performance in detecting different types of causal relationships between genes via a catalogue of causal queries, and highlighted the shortcomings associated with using transcriptomics data alone in reconstructing gene regulatory networks. In the second part, I developed an analysis framework to shed light on the biological mechanisms underlying tetraploid potato tuber bruising.
I first integrated a GWAS analysis with a differential expression analysis on transcriptomics data, to uncover genomic regions in which variations affect the response of tubers to mechanical bruising. I then used a multi-omics integration tool to jointly analyse genomics, transcriptomics, metabolomics and phenotypic data and to identify molecular features across the omics datasets involved in tuber bruising, including some not identified with traditional differential expression analyses. Finally, I made use of causal inference tools to reconstruct a multi-omics causal network linking these features to decipher the molecular relationships involved in tuber bruising. I used causal queries to extract information from the reconstructed causal networks and interpret the uncovered relationships.
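To picture what a simulator with both transcriptional and post-transcriptional regulation might look like at its very simplest, here is a toy two-gene Euler integration; the network topology, regulation forms and all rate constants are invented for illustration and are not taken from the thesis's simulator.

```python
def simulate_two_gene(steps=10000, dt=0.01):
    """Euler integration of a toy two-gene system: protein 1 activates
    transcription of gene 2 (transcriptional edge), while RNA 1 represses
    translation of gene 2 (post-transcriptional edge, invisible in
    RNA-only data). All rates are illustrative."""
    r1 = p1 = r2 = p2 = 0.0
    for _ in range(steps):
        dr1 = 1.0 - 0.1 * r1                    # constitutive transcription, decay
        dp1 = 0.5 * r1 - 0.1 * p1               # translation of gene 1
        dr2 = p1 / (1.0 + p1) - 0.1 * r2        # transcription activated by protein 1
        dp2 = 0.5 * r2 / (1.0 + r1) - 0.1 * p2  # translation repressed by RNA 1
        r1, p1 = r1 + dt * dr1, p1 + dt * dp1
        r2, p2 = r2 + dt * dr2, p2 + dt * dp2
    return r1, p1, r2, p2

r1, p1, r2, p2 = simulate_two_gene()
```

In a system like this, transcriptomics alone observes only r1 and r2 and misses the repressive edge acting on p2, which is the kind of shortcoming the evaluation above highlights.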
  • Item
    Predicting spatiotemporal yield variability to aid arable precision agriculture in New Zealand : a case study of maize-grain crop production in the Waikato region : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Agriculture and Horticulture at Massey University, Palmerston North, New Zealand
    (Massey University, 2020) Jiang, Guopeng
Precision agriculture attempts to manage within-field spatial variability by applying suitable inputs at the appropriate time, place, and amount. To achieve this, delineation of field-specific management zones (MZs), representing significantly different yield potentials, is required. To date, the effectiveness of utilising MZs in New Zealand has potentially been limited due to a lack of emphasis on the interactions between spatiotemporal factors such as soil texture, crop yield, and rainfall. To fill this research gap, this thesis aims to improve the process of delineating MZs by modelling spatiotemporal interactions between spatial crop yield and other complementary factors. Data was collected from five non-irrigated field sites in the Waikato region, based on the availability of several years of maize harvest data. To remove potential yield measurement errors and improve the accuracy of spatial interpolation for yield mapping, a customised filtering algorithm was developed. A supervised machine-learning approach for predicting spatial yield was then developed using several prediction models (stepwise multiple linear regression, feedforward neural network, CART decision tree, random forest, Cubist regression, and XGBoost). To provide insights into managing spatiotemporal yield variability, predictor importance analysis was conducted to identify important yield predictors. The spatial filtering method reduced the root mean squared errors of kriging interpolation for all available years (2014, 2015, 2017 and 2018) in a tested site, suggesting that the method, implemented in R, was effective for improving the accuracy of the yield maps. For predicting spatial yield, random forest produced the highest prediction accuracies (R² = 0.08 - 0.50), followed by XGBoost (R² = 0.06 - 0.39). Temporal variables (solar radiation, growing degree days (GDD) and rainfall) proved to be salient yield predictors.
This research demonstrates the viability of these models to predict subfield spatial yield, using input data that is inexpensive and readily available to arable farms in New Zealand. The novel approach employed by this thesis may provide opportunities to improve arable farming input-use efficiency and reduce its environmental impact.
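The customised filtering algorithm itself is not given in the abstract; a deliberately simple global version of the idea (drop observations far from the field mean, with the threshold `k` and the toy yield values as assumptions) looks like:

```python
def filter_yield(points, k=3.0):
    """Drop yield observations more than k standard deviations from the
    field mean -- a crude, global stand-in for harvest-monitor error
    filtering ahead of kriging interpolation."""
    n = len(points)
    mean = sum(points) / n
    std = (sum((y - mean) ** 2 for y in points) / (n - 1)) ** 0.5
    return [y for y in points if abs(y - mean) <= k * std]

# One implausible spike (a typical monitor error) among normal maize yields.
raw = [10.1, 10.4, 9.8, 10.0, 10.3, 9.9, 10.2, 55.0, 10.1, 9.7]
clean = filter_yield(raw, k=2.0)
```

The thesis's filter is more refined and operates spatially, but even this global rule removes the implausible spike before interpolation, which is what drives the reported reduction in kriging errors.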
  • Item
    Statistical modelling for zoonotic diseases : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Palmerston North, New Zealand
    (Massey University, 2020) Liao, Sih-Jing
Preventing and controlling zoonoses through the design and implementation of public health policies requires a thorough understanding of epidemiology and transmission pathways. A pathogen may have complex transmission pathways that could be affected by environmental factors, different reservoirs and the food chain. One way to get more insight into a zoonosis is to trace back the putative sources of infection. Approaches to attribute the infection to sources include epidemiological observations and microbial subtyping techniques. To delineate source attribution along the pathways to human infection, this thesis proposes statistical modelling methods that integrate demographic variables with multilocus sequence typing data derived from human cases and sources. These models are framed in a Bayesian context, allowing for a flexible use of limited knowledge about the illness to make inferences about the potential sources contributing to human infection. These methods are applied to campylobacteriosis data collected from a surveillance sentinel site in the Manawatu region of New Zealand. A link between genotypes found in sources and human samples is considered in the modelling scheme, assuming genotypes from sources are equal or linked indirectly to those from human cases. Model diagnostics show that the assumption of equal prevalence of genotypes between humans and sources is not tenable, with a few types being potentially more prevalent in humans than in sources, or vice versa. Thus, a model that allows genotypes in humans to differ from those in sources is implemented. In addition, an approximate Bayesian model is also proposed, which essentially cuts the link between human and source genotype distributions when conducting inference. The final inference from these approaches is the probability that human cases are attributable to each source, conditional on the extent to which each case resides in a rural compared to urban environment.
Results from the effective models suggest that poultry and ruminants are important sources of human campylobacteriosis. The more rural the location of human cases, the higher the likelihood that they are ruminant-sourced; in contrast, cases are more poultry-associated when their locations are more urban. Little rurality effect is observed for water and other sources, owing to their smaller sample sizes compared with poultry and ruminants. In addition, animal faeces are believed to be the primary cause of water contamination via rainfall or runoff from farmland and pasture. When water is treated as a medium in the transmission, instead of an end point, water birds are suggested to be the most likely contributor to water contamination. These findings have implications for public health practice and food safety risk management. A risk management strategy has already been carried out in the poultry industry in New Zealand, leading to a marked decrease in urban case rates attributable to poultry. However, the findings of this thesis suggest a further step with a focus on rural areas, as rural case rates are observed to be relatively higher than urban rates. Further, exploring the role that water plays in transmission deepens our knowledge of the epidemiology of waterborne campylobacteriosis and highlights the importance of water quality. This opens a potential research direction: studying the association between water quality and environmental factors, such as higher global temperatures, for this disease.
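As a toy illustration of microbial-subtyping source attribution (a naive proportional rule, not the Bayesian models of the thesis; all genotype counts below are invented):

```python
def attribute_cases(human_genotypes, source_counts):
    """Naively attribute each human case to sources in proportion to the
    genotype's relative frequency in each source, then average over cases."""
    # Relative frequency of each genotype within each source.
    freqs = {}
    for src, counts in source_counts.items():
        total = sum(counts.values())
        freqs[src] = {g: c / total for g, c in counts.items()}
    attribution = {src: 0.0 for src in source_counts}
    for g in human_genotypes:
        weights = {src: freqs[src].get(g, 0.0) for src in source_counts}
        z = sum(weights.values())
        if z == 0:
            continue  # genotype unseen in all sources: uninformative
        for src in source_counts:
            attribution[src] += weights[src] / z
    n = len(human_genotypes)
    return {src: v / n for src, v in attribution.items()}

sources = {
    "poultry":  {"ST45": 8, "ST474": 2},
    "ruminant": {"ST61": 9, "ST45": 1},
}
human = ["ST45", "ST45", "ST61", "ST474"]
attr = attribute_cases(human, sources)
```

The Bayesian models described above go well beyond this: they place priors on genotype distributions, allow human and source distributions to differ, and condition the attribution on how rural or urban each case's location is.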
  • Item
    Statistical models for multihazards : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Palmerston North, New Zealand
    (Massey University, 2020) Frigerio Porta, Gabriele
Natural hazards such as earthquakes, floods and landslides threaten communities in every part of the world. Exposure to such perils can be reduced by mitigation and forward planning. These procedures require the estimation of event likelihoods, a process which is well understood for single hazards. However, spatio-temporal interaction between natural hazards, through triggering or simple coincidence, is not uncommon (e.g. Alaska 1964, the Armero tragedy, the Kaikoura earthquake), and can lead to more severe consequences than the simple sum of two separate events. Hence, single-hazard assessments may underestimate, or incorrectly estimate, the real risk through a lack of interaction analysis. In the existing research literature, multi-hazard assessments are most commonly approached qualitatively or semi-quantitatively, evaluating hazards via an interaction matrix, without formal quantification of the risk. This thesis presents a quantitative framework, using point processes as the key tool, to evaluate the interaction of primary hazards in the occurrence of secondary (triggered) ones. The concept of the ‘hazard potential’ is developed as a means of generalizing hazard interactions in space and time, allowing event outcomes to be simulated within a simple point process framework. Two particular examples of multiple hazard interactions are presented: rainfall- and/or earthquake-induced landslides, and the survival of landslide dams. In the first case, point processes are used to model the triggering influence of multiple factors in a large real dataset collected from various sources. By discretizing space and time to match the data resolution, a daily spatio-temporal hazard model is created to evaluate the relative and combined effects of earthquakes and rainfall on landslide triggering. The case study on the Italian region of Emilia-Romagna suggests that the triggering effects are additive.
In the second example, a Bayesian survival model is developed to forecast the time to failure of landslide dams, based on their characteristics and those of the potential reservoir. A case study on heterogeneous Italian events is presented, together with examples of potential results (forecasting) and possible generalizations of the model.
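The additive-triggering finding can be pictured with a toy daily intensity for a point process of landslide occurrences; the baseline rate and coefficients below are invented for illustration and the real model is spatio-temporal, not a single series.

```python
def daily_intensity(rainfall, quake, base=0.01, b_rain=0.05, b_quake=0.5):
    """Additive daily landslide-triggering intensity:
    lambda_t = base + b_rain * rainfall_t + b_quake * quake_t.
    Rainfall in mm/day; quake is a 0/1 earthquake indicator."""
    return [base + b_rain * r + b_quake * q for r, q in zip(rainfall, quake)]

rain = [0, 0, 20, 30, 0, 5]   # mm/day
eq   = [0, 0, 0, 1, 0, 0]     # earthquake indicator
lam = daily_intensity(rain, eq)
expected_total = sum(lam)     # expected number of landslides over the window
```

Daily counts could then be simulated as Poisson draws with these intensities; the day with both heavy rainfall and an earthquake has the highest intensity, and, being additive, the two contributions simply sum rather than amplify each other.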
  • Item
    Statistical inference for population based measures of risk reduction : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Palmerston North, New Zealand
    (Massey University, 2019) Pirikahu, Sarah
Epidemiologists and public health practitioners often wish to determine the population impact of an intervention to remove or reduce a risk factor. Population attributable type measures, such as the population attributable risk (PAR) and population attributable fraction (PAF), provide a means of assessing this impact in a way that is accessible for a non-statistical audience. To apply these concepts to real-world data, the calculation of estimates and confidence intervals for these measures should take into account the study design and any sources of uncertainty. We provide a Bayesian approach for estimating the PAR and its credible interval from cross-sectional data resulting in a 2 × 2 table, and assess its Frequentist properties. With the Bayesian approach proving superior, this model is then extended by incorporating uncertainty due to the use of an imperfect diagnostic test for exposure. The resulting model is under-identified, which causes convergence problems for common MCMC samplers, such as Gibbs and Metropolis-Hastings. An alternative importance sampling method performs much better for these under-identified models and can be used to explore the limiting posterior distribution. However, this comes at the cost of needing to identify an appropriate transparent parameterisation, which can be difficult. We provide an adaptation of the Metropolis-Hastings random walk sampler which, in comparison to other MCMC samplers, more efficiently explores the posterior ridge of an under-identified model for large sample sizes. Often, the data used to estimate these population attributable measures may include multiple risk factors in addition to the one being considered for removal. Uncertainty regarding the distribution of these risk factors in the population affects the inference for the PAR and PAF. To allow for this uncertainty, we propose a methodology where the uncertainty in the joint distribution of the response and the covariate is accommodated.
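A minimal sketch of the first step described above, Bayesian estimation of the PAR from a cross-sectional 2 × 2 table, can use a uniform Dirichlet prior and plain Monte Carlo; the prior choice, the cell counts and the number of draws are illustrative assumptions, and this ignores the imperfect-test extension entirely.

```python
import random

def par_posterior(a, b, c, d, draws=4000, seed=42):
    """Monte Carlo posterior for the population attributable risk (PAR)
    from a 2x2 table [exposed diseased a, exposed healthy b,
    unexposed diseased c, unexposed healthy d], with a Dirichlet(1,1,1,1)
    prior on the four cell probabilities. PAR = P(D) - P(D | unexposed)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Dirichlet draw via normalized Gamma variates.
        g = [rng.gammavariate(x + 1, 1.0) for x in (a, b, c, d)]
        t = sum(g)
        pa, pb, pc, pd = (x / t for x in g)
        p_d = pa + pc                      # P(disease)
        p_d_given_unexp = pc / (pc + pd)   # P(disease | unexposed)
        samples.append(p_d - p_d_given_unexp)
    samples.sort()
    median = samples[draws // 2]
    lo, hi = samples[int(0.025 * draws)], samples[int(0.975 * draws)]
    return median, (lo, hi)

# Cross-sectional toy data: exposure clearly raises disease risk.
med, (lo, hi) = par_posterior(a=40, b=60, c=10, d=90)
```

With these counts the sample PAR is 50/200 − 10/100 = 0.15, and the posterior median lands near it; the 95% credible interval conveys the uncertainty that plug-in estimates hide.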
  • Item
    Some diagnostic techniques for small area estimation : with applications to poverty mapping : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Palmerston North, New Zealand
    (Massey University, 2019) Livingston, Alison
Small area estimation (SAE) techniques borrow strength via auxiliary variables to provide reliable estimates at finer geographical levels. An important application is poverty mapping, whereby aid organisations distribute millions of dollars every year based on small area estimates of poverty measures. Therefore, diagnostics become an important tool to ensure estimates are reliable and funding is distributed to the most impoverished communities. Small area models can be large and complex; however, even the most complex models can be of little use if they do not have predictive power at the small area level. This motivated a variable importance measure for SAE that considers each auxiliary variable’s ability to explain the variation in the dependent variable, as well as its ability to distinguish between the relative levels in the small areas. A core question addressed is how candidate survey-based models might be simplified without losing accuracy or introducing bias in the small area estimates. When a small area estimate appears to be biased or unusual, it is important to investigate and, if necessary, remedy the situation. A diagnostic is proposed that quantifies the relative effect of each variable, allowing identification of any variables within an area that have a larger than expected influence on the small area estimate for that area. This highlights possible errors which need to be checked and, if necessary, corrected. Additionally, in SAE it is essential that the estimates are at an acceptable level of precision in order to be useful. A measure is proposed that takes the ratio of the variability in the small areas to the uncertainty of the small area estimates. This measure is then used to assist in determining the minimum level of precision needed in order to maintain meaningful estimates. The diagnostics developed cover a wide range of small area estimation methods, consisting of those based on survey data only and those which combine survey and census data.
By way of illustration, the proposed methods are applied to SAE for poverty measures in Cambodia and Nepal.
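The precision measure described above, a ratio of between-area variability to estimate uncertainty, can be sketched loosely as follows; the thesis's exact definition may differ, and the estimates and standard errors below are invented.

```python
def precision_ratio(estimates, std_errors):
    """Ratio of between-area variability to average estimate uncertainty.
    Values well above 1 suggest the estimates can genuinely distinguish
    areas; values near 1 suggest apparent differences may be noise."""
    n = len(estimates)
    mean = sum(estimates) / n
    between_sd = (sum((e - mean) ** 2 for e in estimates) / (n - 1)) ** 0.5
    avg_se = sum(std_errors) / n
    return between_sd / avg_se

# Poverty-rate estimates for six small areas with their standard errors.
ratio = precision_ratio(
    [0.12, 0.30, 0.22, 0.40, 0.18, 0.26],
    [0.02, 0.03, 0.025, 0.03, 0.02, 0.025],
)
```

Here the between-area spread is roughly four times the typical standard error, so the small area estimates remain informative for ranking areas; as the ratio approaches 1, funding decisions based on the ranking become unreliable.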