Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 9 of 9
  • Item
    Contributions to improve the power, efficiency and scope of control-chart methods : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Albany, New Zealand
    (Massey University, 2019) Adegoke, Nurudeen Adedayo
    Detection of outliers and other anomalies in multivariate datasets is a particularly difficult problem that spans a range of systems, such as quality control in factories, microarray or proteomic analyses, identification of features in image analysis, identifying unauthorized access in network traffic patterns, and detection of changes in ecosystems. Multivariate control charts (MCC) are popular and sophisticated statistical process control (SPC) methods for monitoring characteristics of interest and detecting changes in a multivariate process. These methods are divided into memory-less and memory-type charts, which are used to monitor large and small-to-moderate shifts in the process, respectively. For example, the multivariate χ2 is a memory-less control chart that uses only the most current process information and disregards any previous observations; it is typically used where any shifts in the process mean are expected to be relatively large. To increase the sensitivity of the multivariate process control tool for the detection of small-to-moderate shifts in the process mean vector, different multivariate memory-type tools that use information from both the current and previous process observations have been proposed. These tools have proven very useful for independent multivariate normal or "nearly" normally distributed processes. Like most univariate control-chart methods, when the process parameters (i.e., the process mean vector or covariance parameters, or both) are unknown, MCC methods are based on estimated parameters, and their implementation occurs in two phases. In Phase I (the retrospective phase), a historical reference sample is studied to establish the characteristics of the in-control state and evaluate the stability of the process. Once the in-control reference sample has been deemed stable, the process parameters are estimated from Phase I, and control chart limits are obtained for use in Phase II. Phase II then initiates ongoing, regular monitoring of the process. If successive observed values obtained at the beginning of Phase II fall within the specified in-control limits, the process is considered to be in control. In contrast, any observed values during Phase II that fall outside the specified control limits indicate that the process may be out of control, and remedial responses are then required. Although conventional MCC are well developed from a statistical point of view, they can be difficult to apply in modern, data-rich contexts. This serious drawback comes from the fact that classical MCC plotting statistics require the inversion of the covariance matrix, which is typically assumed to be known. In practice, the covariance matrix is seldom known and is often estimated empirically, using a sample covariance matrix computed from historical data. While the empirical estimate of the covariance matrix may be an unbiased and consistent estimator for a low-dimensional data matrix with an adequate prior sample size, it performs inconsistently in high-dimensional settings. In particular, the empirical estimate of the covariance matrix can lead to inflated false-alarm rates and decreased sensitivity of the chart to detect changes in the process. Also, the statistical properties of traditional MCC tools are accurate only if the assumption of multivariate normality is satisfied. However, in many cases, the underlying system is not multivariate normal, and as a result, the traditional charts can be adversely affected.
    The necessity of this assumption generally restricts the application of traditional control charts to monitoring industrial processes. Most MCC applications also typically focus on monitoring either the process mean vector or the process variability, and they require that the process mean vector be stable and that the process variability be independent of the process mean. However, in many real-life processes, the process variability depends on the mean, and the mean is not necessarily constant. In such cases, it is more appropriate to monitor the coefficient of variation (CV). The univariate CV is the ratio of the standard deviation to the mean of a random variable. As a measure of dispersion relative to the mean, it is useful for comparing the variability of populations having very different process means. More recently, MCC methods have been adapted for monitoring the multivariate CV. However, to date, studies of multivariate CV control charts have focused on power (the detection of out-of-control parameters in Phase II), while no study has investigated their in-control performance in Phase I. The Phase I data set can contain unusual observations, which are problematic as they can influence the parameter estimates, resulting in Phase II control charts with reduced power. Relevant Phase I analysis will guide practitioners in the choice of appropriate multivariate CV estimation procedures when the Phase I data contain contaminated samples. In this thesis, we investigate the performance of the most widely adopted memory-type MCC methods, the multivariate cumulative sum (MCUSUM) and the multivariate exponentially weighted moving average (MEWMA) charts, for monitoring shifts in a process mean vector when the process parameters are unknown and estimated from Phase I (chapters 2 and 3). We demonstrate that using a shrinkage estimate of the covariance matrix improves the run-length performance of these methods, particularly when only a small Phase I sample is available. In chapter 4, we investigate the Phase I performance of a variety of multivariate CV charts, considering both diffuse symmetric and localized CV disturbance scenarios, and using probability to signal (PTS) as a performance measure. We present a new memory-type control chart for monitoring the mean vector of a multivariate normally distributed process, namely the multivariate homogeneously weighted moving average (MHWMA) control chart (chapter 5). We present the design procedure and compare the run-length performance of the proposed MHWMA chart for the detection of small shifts in the process mean vector with that of a variety of existing MCC methods. We also present a dissimilarity-based, distribution-free control chart for monitoring changes in the centroid of a multivariate ecological community (chapter 6). The proposed chart may be used, for example, to discover when an impact may have occurred in a monitored ecosystem, and is based on a change-point method that does not require prior knowledge of the ecosystem's behaviour before monitoring begins. A novel permutation procedure is employed to obtain control-chart limits for the proposed charting statistic from a suitable distance-based model of the target ecological community through time.
    Finally, we propose enhancements to some classical univariate control-chart tools for monitoring small shifts in the process mean, for those scenarios where the process variable is observed along with a correlated auxiliary variable (chapters 7 through 9). We provide the design structure of the charts and examine their performance in terms of their run-length properties. We compare the run-length performance of the proposed charts with that of several existing charts for detecting a small shift in the process mean. We offer suggestions on the applications of the proposed charts (in chapters 7 and 8) for cases where the exact measurement of the process variable of interest or the auxiliary variable is difficult or expensive to obtain, but where the rank ordering of sampling units can be obtained at negligible cost. Thus, this thesis will, in general, aid practitioners in applying a wider variety of enhanced and novel control-chart tools for more powerful and efficient monitoring of multivariate processes. In particular, we develop and test alternative methods for estimating the covariance matrices of some useful control-chart tools (chapters 2 and 3), give recommendations on the choice of an appropriate multivariate CV chart in Phase I (chapter 4), present an efficient method for monitoring small shifts in the process mean vector (chapter 5), expand MCC analyses to cope with non-normally distributed datasets (chapter 6), and contribute to methods that allow efficient use of an auxiliary variable that is observed and correlated with the process variable of interest (chapters 7 through 9).
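    As an illustration of the shrinkage idea at the centre of chapters 2 and 3, the following minimal Python sketch builds a MEWMA chart whose in-control covariance is estimated from a small Phase I sample with the Ledoit-Wolf shrinkage estimator. It is a sketch under stated assumptions, not the thesis code: the data are synthetic, and the smoothing constant and control limit (lam and h) are illustrative values rather than calibrated run-length constants.

    # MEWMA chart with a shrinkage (Ledoit-Wolf) Phase I covariance estimate.
    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(1)
    phase1 = rng.normal(size=(30, 10))            # small Phase I sample, p = 10
    mu0 = phase1.mean(axis=0)                     # estimated in-control mean vector
    sigma = LedoitWolf().fit(phase1).covariance_  # shrinkage covariance estimate
    sigma_inv = np.linalg.inv(sigma)

    lam, h = 0.1, 22.0  # smoothing constant and an illustrative control limit
    z = np.zeros(10)
    for x in rng.normal(loc=0.3, size=(50, 10)):  # Phase II stream with a small shift
        z = lam * (x - mu0) + (1 - lam) * z
        # MEWMA plotting statistic: T^2 on the smoothed vector, scaled by the
        # asymptotic covariance of z, which is (lam / (2 - lam)) * sigma
        t2 = z @ sigma_inv @ z * (2 - lam) / lam
        if t2 > h:
            print("signal: possible shift in the mean vector")
            break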
  • Item
    Multivariate ranking and selection procedures with an application to overseas trade : a thesis presented in partial fulfillment of the requirements for the degree of Master of Arts in Statistics at Massey University
    (Massey University, 1983) Mendis, Mary Sharmila
    An overview of some recent work in the field of Ranking and Selection is presented, with emphasis on aspects important to experimenters confronted with Multivariate Ranking and Selection problems. Ranking and Selection procedures fall into two basic categories: 1) the Indifference Zone Approach and 2) the Subset Selection Approach. In both approaches, the multivariate parameters are converted to univariate parameters. Various procedures using these real-valued functions are given for both the Indifference Zone Approach and the Subset Selection Approach. A recently developed formulation that selects the best multivariate population without reducing populations to univariate quantities is also described; this method is a multivariate solution to the Multivariate Ranking and Selection problem. Finally, a real-life problem pertaining to New Zealand's overseas trade is discussed in the context of Multivariate Ranking.
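    The subset-selection idea translates directly into code once the multivariate parameters have been reduced to real-valued functions. The sketch below is a hedged illustration of a Gupta-style subset selection rule on synthetic univariate summaries; the selection constant d is an arbitrary illustrative value, whereas in practice it would be taken from tables so as to guarantee the required probability of correct selection.

    # Gupta-style subset selection: retain every population whose sample mean
    # lies within d of the largest sample mean (illustrative sketch only).
    import numpy as np

    rng = np.random.default_rng(0)
    k, n = 5, 20                                  # k populations, n observations each
    samples = rng.normal(loc=[0.0, 0.2, 0.5, 1.0, 1.1], size=(n, k))
    means = samples.mean(axis=0)

    d = 0.6  # illustrative constant; chosen from tables in a real application
    selected = np.flatnonzero(means >= means.max() - d)
    print("subset retained as possibly containing the best population:", selected)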
  • Item
    A multivariate planning model - city structure : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Statistics.
    (Massey University, 1972) Crawford, Peter
    The genesis of this study is postgraduate research in Urban Geography at Canterbury University in 1966. At that time, a crude multivariate centroid model of 95 New Zealand towns and cities was constructed. Based upon 60 socio-economic variables, two factors for each of the years 1951, 1956 and 1961 were extracted and compared. The present study, which is a considerable refinement of the earlier research, incorporates not only tremendous advances in multivariate design methodology and application, but also parallel advances in computing facilities over the last five years. The objective of this research is to construct a multivariate statistical planning model that is both statistically precise and meaningful in its application. Particular emphasis is placed upon the need to organise, in a systematic and meaningful manner, the increasingly greater variety of statistics that portray urban growth. Stress is placed upon the utility of the multivariate technique as a tool in the author's profession of Town Planning. [From Preface]
  • Item
    Contributions to high-dimensional data analysis : some applications of the regularized covariance matrices : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Albany, New Zealand
    (Massey University, 2015) Ullah, Insha
    High-dimensional data sets, particularly those where the number of variables exceeds the number of observations, are now common in many subject areas, including genetics, ecology, and statistical pattern recognition, to name but a few. The sample covariance matrix becomes rank deficient and is not invertible when the number of variables is greater than the number of observations. This poses a serious problem for many classical multivariate techniques that rely on an inverse of a covariance matrix. Recently, regularized alternatives to the sample covariance have been proposed, which are not only guaranteed to be positive definite but also provide reliable estimates. In this thesis, we bring together some of the important recent regularized estimators of the covariance matrix and explore their performance in high-dimensional scenarios via numerical simulations. We make use of these regularized estimators and attempt to improve the performance of three classical multivariate techniques in high-dimensional settings. In multivariate random effects models, estimating the between-group covariance is a well-known problem. Its classical estimator involves the difference of two mean square matrices and often results in negative elements on the main diagonal. We use a lasso-regularized estimate of the between-group mean square and propose a new approach to estimating the between-group covariance based on the EM algorithm. Using simulation, the procedure is shown to be quite effective, and the estimate obtained is always positive definite. Multivariate analysis of variance (MANOVA) faces serious challenges due to the undesirable properties of the sample covariance in high-dimensional problems. First, it suffers from low power and does not maintain an accurate type-I error rate when the dimension is large compared to the sample size. Second, MANOVA relies on the inverse of a covariance matrix and fails to work when the number of variables exceeds the number of observations. We use an approach based on lasso regularization and present a comparative study of the existing approaches, including our proposal. The lasso approach is shown to be an improvement in some cases, in terms of power of the test, over the existing high-dimensional methods. Another problem addressed in the thesis is how to detect unusual future observations when the dimension is large. The Hotelling T2 control chart has traditionally been used for this purpose. Its charting statistic relies on the inverse of a covariance matrix and is not reliable in high-dimensional problems. To get a reliable estimate of the covariance matrix, we use a distribution-free shrinkage estimator. We make use of an available baseline set of data and propose a procedure to estimate the control limits for monitoring individual future observations. The procedure does not assume multivariate normality and appears robust to violations of multivariate normality. A simulation study shows that the new method performs better than the traditional Hotelling T2 control chart.
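    The core difficulty this thesis addresses is easy to reproduce. The short sketch below (an illustration with synthetic data, not the thesis code) shows that when p > n the sample covariance matrix is rank deficient, while a shrinkage estimator such as Ledoit-Wolf remains positive definite and invertible.

    # Rank deficiency of the sample covariance when p > n, and a shrinkage fix.
    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))        # n = 20 observations, p = 50 variables
    S = np.cov(X, rowvar=False)
    print(np.linalg.matrix_rank(S))      # at most n - 1 = 19 < p: not invertible
    lw = LedoitWolf().fit(X).covariance_
    print(np.linalg.cond(lw))            # finite condition number: invertible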
  • Item
    A comparison of univariate and multivariate statistical and data mining approaches to the behavioural and biochemical effects of vestibular loss related to the hippocampus : a thesis submitted in partial fulfilment of the requirements of the MApplStat in Applied Statistics, Massey University, Manawatu
    (Massey University, 2013) Smith, Paul F
    Vestibular dysfunction is associated with a complex syndrome of cognitive and anxiety disorders. However, most studies have used simple univariate analyses of the effects of vestibular loss on behaviour and brain function. In this thesis, univariate statistical approaches and multivariate statistical and data mining approaches to the behavioural and neurochemical effects of bilateral vestibular deafferentation (BVD) were compared. Using linear mixed model analyses, including repeated measures analyses of variance and analyses with the covariance structure of the repeated measures specified, rats with BVD were found to exhibit increased locomotor activity, reduced rearing and reduced thigmotaxis. By contrast, there were no significant differences between BVD and sham control animals in the elevated plus maze, and the BVD animals exhibited a longer escape latency in the elevated T maze, with no change in avoidance latency. In the spatial T maze, the BVD animals demonstrated a significant decrease in accuracy compared to the sham control animals. Using linear discriminant analysis, cluster analysis, random forest classification and support vector machines, BVD animals could be distinguished from sham controls by their behavioural syndrome. Using multiple linear regression and random forest regression, the best predictors of performance in the spatial T maze were whether the animals had received a BVD or sham lesion, and the duration of rearing. In the neurochemical data set, the expression of 5-7 glutamate receptor subunits was measured in 3 different subregions of the rat hippocampus, at various times following BVD, using western blotting. In the 6 month group, half of the animals underwent training in a T maze. Using multivariate analyses of variance, there was no significant effect of surgery for any hippocampal subregion. Linear discriminant analysis could not determine a linear discriminant function that could separate BVD from sham control animals, and a random forest classification analysis was also unsuccessful in this respect. However, for the 6 month data set, T maze training had a significant effect independently of surgery. The results of these experiments suggest that BVD results in profound spatial memory deficits that are not associated with large changes in the expression of glutamate receptors in the hippocampus. The results of the multivariate statistical and data mining analyses, applied to both the behavioural and neurochemical data sets, suggest that research in this field of neuroscience would benefit from analysing multiple variables in relation to one another, rather than simply conducting univariate analyses. Since the different behavioural and neurochemical variables do interact with one another, it is important to determine the nature of these interactions in the analyses conducted. However, this will require researchers to design experiments in which multiple variables can be measured under one set of conditions.
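    The classification comparison described above can be sketched in a few lines. The example below is illustrative only (synthetic two-group data standing in for BVD versus sham behavioural profiles, and scikit-learn defaults rather than the thesis' settings): it cross-validates a linear discriminant analysis and a random forest on the same multivariate data.

    # Distinguishing two groups from a multivariate profile with LDA and a
    # random forest, scored by 5-fold cross-validated accuracy.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0.0, 1.0, size=(20, 6)),   # sham controls
                   rng.normal(0.8, 1.0, size=(20, 6))])  # lesioned animals
    y = np.repeat([0, 1], 20)

    for model in (LinearDiscriminantAnalysis(), RandomForestClassifier(n_estimators=200)):
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(type(model).__name__, round(acc, 2))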
  • Item
    A comparison of tree-based and traditional classification methods : a thesis presented in partial fulfilment of the requirements for the degree of PhD in Statistics at Massey University
    (Massey University, 1994) Lynn, Robert D
    Tree-based discrimination methods provide a way of handling classification and discrimination problems by using decision trees to represent the classification rules. The principal aim of tree-based methods is the segmentation of a data set, in a recursive manner, such that the resulting subgroups are as homogeneous as possible with respect to the categorical response variable. Problems often arise in the real world involving cases with a number of measurements (variables) taken from them. Traditionally, in such circumstances involving two or more groups or populations, researchers have used parametric discrimination methods, namely linear and quadratic discriminant analysis, as well as the well-known non-parametric kernel density estimation and Kth nearest neighbour rules. In this thesis, all the above types of methods are considered and presented from a methodological point of view. Tree-based methods are summarised in chronological order of introduction, beginning with the Automatic Interaction Detector (AID) method of Morgan and Sonquist (1963) through to the IND method of Buntine (1992). Given a set of data, the proportion of observations incorrectly classified by a prediction rule is known as the apparent error rate. This error rate is known to underestimate the actual or true error rate associated with the discriminant rule applied to a set of data. Various methods for estimating this actual error rate are considered. Cross-validation is one such method: it involves omitting each observation in turn from the data set, calculating a classification rule based on the remaining (n-1) observations, and classifying the observation that was omitted. This is carried out n times, that is, once for each observation in the data set, and the total number of misclassified observations is used as the estimate of the error rate. Simulated continuous explanatory data were used to compare the performance of two traditional discrimination methods, linear and quadratic discriminant analysis, with two tree-based methods, Classification and Regression Trees (CART) and the Fast Algorithm for Classification Trees (FACT), using cross-validation error rates. The results showed that linear and/or quadratic discriminant analysis is preferred for normal, less complex data and parallel classification problems, while CART is best suited to lognormal, highly complex data and sequential classification problems. Simulation studies using categorical explanatory data also showed linear discriminant analysis to work best for parallel problems and CART for sequential problems, while CART was also preferred for smaller sample sizes. FACT was found to perform poorly for both continuous and categorical data. Simulation studies involving the CART method alone identified certain situations where the 0.632 error rate estimate is preferred to cross-validation, and where the one standard error rule is preferred to the zero standard error rule. Studies undertaken using real data sets showed that most of the conclusions drawn from the continuous and categorical simulation studies were valid. Some recommendations are made, both from the literature and from personal findings, as to which characteristics of tree-based methods are best in particular situations. Final conclusions are given, and some proposals for future research regarding the development of tree-based methods are also discussed. A question worth considering in any future research into this area is the use of non-parametric tests for determining the best splitting variable.
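    The leave-one-out cross-validation estimate described in this abstract is straightforward to compute. The sketch below is a minimal illustration under assumptions (synthetic data, and scikit-learn's CART-style decision tree rather than the original CART or FACT software): each observation is held out in turn, a tree is grown on the remaining (n-1) observations, and the held-out case is classified.

    # Leave-one-out cross-validation error rate for a classification tree.
    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # a simple two-class rule

    errors = 0
    for train, test in LeaveOneOut().split(X):
        tree = DecisionTreeClassifier().fit(X[train], y[train])
        errors += int(tree.predict(X[test])[0] != y[test][0])
    print("LOO estimate of the actual error rate:", errors / len(X))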
  • Item
    Multivariate estimation of variance and covariance components using restricted maximum likelihood, in dairy cattle : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Animal Science at Massey University
    (Massey University, 1992) Sosa Ferreyra, Carlos Francisco
    The multivariate estimation of sire additive and residual variances and covariances by Restricted Maximum Likelihood (REML) is addressed. Particular emphasis is given to its application to dairy cattle data when all traits are explained by the same model and no observations are missing. Special attention is given to the analysis of new traits being included in a sire evaluation programme, for which a model has to be developed and no previous estimates of the population parameters exist. Results obtained using the multivariate Method 3 of Henderson, multivariate REML excluding the Numerator Relationship Matrix (NRM), or multivariate REML including the NRM were compared. When a large number of traits were fitted simultaneously, the variance-covariance matrix estimated by Method 3 was negative-definite (outside the allowable parameter space). REML estimates obtained while ignoring the NRM were biased. The number and sequence of traits fitted in the analysis affected the estimates at convergence. A canonical transformation of the variance-covariance matrix was undertaken to simplify the computation by means of an Expectation Maximisation (EM) algorithm. Approaches to choosing initial values for use in iterative methods were compared via their values at convergence and the number of iterations required to converge. To further simplify the use of multivariate REML, three transformations of the Mixed Model Equations (MME) were integrated: the absorption of proven sire effects taken as fixed, a triangular factorisation of the NRM, and the singular value decomposition of the coefficient matrix in the MME. One statistical algorithm (EM) and one mathematical algorithm (Scoring type) were developed to iteratively solve the REML equations on the transformed scale, such that the transformed coefficient matrix of the MME did not need to be inverted at each iteration and the quantities required to build the REML equations were obtained through vector operations. Traits other than Production (TOP) from New Zealand Holstein-Friesian dairy cows were analysed (4 management and 13 conformation characteristics), each trait scored using a linear scale from 1 to 9, with extreme values corresponding to extreme phenotypes. Mixed model methodology was used for the analysis of TOP as no significant departure from normality was observed. To model the TOP, the fixed effects of herd, inspector, age, stage of lactation (linear and quadratic) and breed of dam were tested for significance. Only the effects of inspector and herd were significant for all traits, with breed of dam significantly affecting adaptability to milking, shed temperament and stature. Estimates of phenotypic means and standard deviations, and heritabilities for TOP were: adaptability to milking 5.4 ± 1.7, 0.20; shed temperament 5.5 ± 1.6, 0.12; milking speed 5.7 ± 1.5, 0.11; farmer's overall opinion 5.7 ± 1.7, 0.14; stature 5.1 ± 1.0, 0.14; weight 4.4 ± 1.0, 0.37; capacity 5.3 ± 1.0, 0.40; rump angle 5.4 ± 0.7, 0.16; rump width 5.2 ± 0.7, 0.08; legs 5.2 ± 0.6, 0.34; udder support 5.3 ± 1.0, 0.63; fore udder 4.9 ± 1.1, 0.48; rear udder 4.9 ± 1.0, 0.33; front teat placement 4.2 ± 0.7, 0.22; rear teat placement 5.2 ± 0.8, 0.22; udder overall 4.8 ± 1.1, 0.42; and dairy conformation 5.3 ± 1.1, 0.32. Large positive phenotypic correlations among management traits were obtained, while the correlations of these traits with type were small and positive when significant. Large and positive correlations among udder traits were found.
All traits related to size were positively correlated amongst themselves. Most of the traits were positively correlated with dairy conformation. Estimated genetic correlations for stature and weight with other conformation traits were generally negative. With the exception of udder support, all udder traits were positively correlated amongst themselves. Dairy conformation was positively correlated with most traits, except stature, rump angle, legs, rear udder and udder overall. The estimates obtained in this study should be used in the evaluation of Holstein-Friesian sires and cows for TOP in New Zealand.
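    As a hedged, much-simplified illustration of the REML idea behind this thesis, the sketch below estimates sire (between-group) and residual variance components for a single synthetic trait using statsmodels' mixed linear model, which fits by REML. The thesis works multivariately, includes the numerator relationship matrix and uses canonical transformations; none of that is reproduced here.

    # Univariate REML estimation of sire and residual variance components.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    sires = np.repeat(np.arange(20), 10)               # 20 sires, 10 daughters each
    u = rng.normal(0.0, 1.0, size=20)                  # sire additive effects
    y = 5.0 + u[sires] + rng.normal(0.0, 2.0, size=200)
    df = pd.DataFrame({"y": y, "sire": sires})

    fit = smf.mixedlm("y ~ 1", df, groups=df["sire"]).fit(reml=True)
    print(fit.cov_re)   # REML estimate of the sire (between-group) variance
    print(fit.scale)    # REML estimate of the residual variance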
  • Item
    Analyzing volatile compound measurements using traditional multivariate techniques and Bayesian networks : a thesis presented in partial fulfillment of the requirements for the degree of Master of Arts in Statistics at Massey University, Albany, New Zealand
    (Massey University, 2009) Baldawa, Shweta
    The purpose of this project is to compare two statistical approaches, traditional multivariate analysis and Bayesian networks, for representing the relationships between volatile compounds in kiwifruit. Compound measurements were for individual vines which were progeny of an intercross. It was expected that groupings in the data (or compounds) would give some indication of the generic nature of the biochemical pathways. Data for this project were provided by the Flavour Biotech team at Plant and Food Research. The data contained many non-detected observations, which were treated as zero; to deal with them, we looked for an appropriate value of c for the data transformation log(x + c). The data follow the 'large p, small n' paradigm and have much in common with microarray data, although they are not as extreme. Principal component analysis was used to select a subset of compounds that retained most of the multivariate structure for further analysis. The reduced data set was analyzed using cluster analysis and Bayesian network techniques. A heat map produced by the cluster analysis and a graphical representation of the Bayesian network were presented to scientists for their comments. According to them, the two graphs complemented each other; both were useful in their own unique way. Along with clusters of compounds, clusters of genotypes were represented by the heat map, which showed how much of a particular compound was present in each genotype, while the relationships among different compounds were seen from the Bayesian network.
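    The preprocessing and dimension-reduction steps described here are easy to sketch. The example below is illustrative only: the data are synthetic, and c = 1 is an assumed constant rather than the value chosen in the project. Non-detects recorded as zero are kept finite by the log(x + c) transform, and PCA then suggests how many components (and hence which compounds) carry most of the multivariate structure.

    # log(x + c) transform for zero-inflated compound measurements, then PCA.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(5)
    X = rng.lognormal(size=(40, 100))       # 'large p, small n': 40 vines, 100 compounds
    X[rng.random(X.shape) < 0.3] = 0.0      # non-detected observations stored as zero
    c = 1.0                                 # assumed transformation constant
    Xt = np.log(X + c)

    pca = PCA(n_components=5).fit(Xt)
    print(pca.explained_variance_ratio_)    # a guide for selecting a compound subset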
  • Item
    Characterization of traffic-induced compaction in controlled traffic farming (CTF) and random traffic farming (RTF) - A multivariate approach
    Raveendrakumaran B; Grafton MC; Jeyakumar P; Bishop P; Davies CE; Horne D; Singh R
    A field-scale experiment was carried out in Pukekohe in 2020, under an annual grass crop season, to characterize subsoil compaction in controlled traffic farming (CTF) and random traffic farming (RTF) systems. Soil penetration resistance (PR) measurements were taken in each field using a cone penetrometer fitted with a 100 mm2, 60° top-angle cone. Multivariate analysis was performed to characterize penetration resistance by depth through cluster analysis and principal component analysis (PCA). Repeated measures ANOVA was performed on the penetration data using a mixed model procedure to determine the treatment effects. In RTF, penetrometer values increased more rapidly with depth, resulting in higher values being recorded from 20 cm compared with CTF. In contrast, PR was greater in CTF than in RTF in the subsurface (55-60 cm) layer. The differences in PR declined beyond 55 cm depth at both sites. Differences in soil PR were most apparent at 5-40 cm depth, with significant differences between CTF and RTF (P < 0.0001). This shows that traffic management at both the CTF and RTF sites caused significant changes at 5-40 cm depth. However, there were no differences in PR between CTF and RTF below 40 cm or at 0-5 cm depth (P > 0.05), showing that the soil layers were homogeneous in both systems beyond 40 cm depth. The propagation of subsurface compaction was identified in the deeper layer (40-60 cm) in the CTF system, whereas it was identified from shallower depths (25-55 cm) in the RTF system.
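    A hedged sketch of the depth-profile analysis described in this abstract is given below, using synthetic penetration-resistance profiles rather than the Pukekohe data: each row is a PR-by-depth curve, PCA reduces the curves to two scores, and k-means clustering recovers the two traffic systems.

    # PCA plus k-means clustering of penetration-resistance-by-depth profiles.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(6)
    depths = np.arange(5, 65, 5)                              # 5-60 cm in 5 cm steps
    ctf = 1.0 + 0.02 * depths + rng.normal(0, 0.1, (30, 12))  # shallower PR increase
    rtf = 1.0 + 0.04 * depths + rng.normal(0, 0.1, (30, 12))  # steeper PR increase
    X = np.vstack([ctf, rtf])                                 # rows are PR profiles (MPa)

    scores = PCA(n_components=2).fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(scores)
    print(labels[:30].mean(), labels[30:].mean())             # clusters track the systems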