Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

Statistical Methods for Cricket Team Selection

A THESIS PRESENTED IN PARTIAL FULFILMENT OF THE REQUIREMENT OF THE DEGREE OF MASTER OF APPLIED STATISTICS AT MASSEY UNIVERSITY, ALBANY, NEW ZEALAND

Paul J. Bracewell
1999

Abstract

Cricket generates a large amount of data for both batsmen and bowlers. Methods for using this data to select a cricket team are examined. Utilising the assumption that an individual's natural ability is expressed via performance outputs, this thesis seeks to describe and understand the underlying statistical processes of player performance. Randomness is tested for and then the distributional properties of the data are sought. This information is then used to monitor the estimate of natural ability via widely accepted control methods, such as Shewhart control charts, CUSUM, EWMA and multivariate versions of these procedures. To accommodate the distribution presented by batting scores, a new control chart based on quartiles is also studied. Further, ranking and selection procedures employ the estimates of individual ability to select the best individuals and note the probability of correct selection.

Major contributions of this study include:
a) Development of performance measures for cricket.
b) A 2-dimensional runs test, with further applicability outside cricket.
c) Statistical interpretation specific to cricket:
   • Outliers are very important
   • Form is autocorrelation
   • Zone rules for cricket are needed to detect good/poor performance
   • Relatively short nominal ARLs
d) A control chart based on quantiles to preserve outlier influences in a non-parametric procedure.
e) The recommendation of appropriate tools for monitoring batting, bowling and all-rounder performance and also choosing man of the match.
f) Discrimination between different types of bowlers using the consistency of their performance measures.
g) Evaluation of the members of a team relative to potential contenders.

Contents

Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 An Overview
  1.1 Introduction
  1.2 General Overview of Cricket Statistics
  1.3 Statistics and Team Selection
  1.4 The Application of Quality Control to Cricket
  1.5 Major Contributions of this Study
Chapter 2 Analysing Individual Data Characteristics
  2.0 Introduction
  2.1 Literature Review
    2.1.0 Performance Output Measures
    2.1.1 Investigation of Bowling Measures
    2.1.2 Randomness
    2.1.3 Distribution of Performance Measures
  2.2 Data
    2.2.1 Calculation of Batting Measures
    2.2.2 Calculation of Bowling Measures
  2.3 Methods
    2.3.0 Introduction
    2.3.1 Tests for Randomness
    2.3.2 Distribution Fitting
  2.4 Results
    2.4.0 Introduction
    2.4.1 Batting Results
    2.4.2 Bowling Results
[...]

List of Figures

Figure 1. Relative Effectiveness of the Attack Index
Figure 2. Relative Effectiveness of the Economy Index
Figure 3. Distributional Comparison of Contribution and Score
Figure 4. Non-parametric EWMA for B.R. Hartland
Figure 5. Non-parametric EWMA for M.J. Horne
Figure 6. Distribution-free CUSUM for B.R. Hartland
Figure 7. Distribution-free CUSUM for M.J. Horne
Figure 8. Control Chart with Warning Lines
Figure 9. Control Chart with Warning Lines for Cricket
Figure 10. Shewhart Control Chart of M.J. Horne's Transformed Batting Contribution With Zone Run Rules
Figure 11. Shewhart Control Chart of B.R. Hartland's Transformed Batting Contribution With Zone Run Rules
Figure 12. CUSUM Control Chart of M.J. Horne's Transformed Batting Contribution
Figure 13. CUSUM Control Chart of B.R. Hartland's Transformed Batting Contribution
Figure 14. EWMA Control Chart of M.J. Horne's Transformed Batting Contribution With Zone Run Rules
Figure 15. EWMA Control Chart of B.R. Hartland's Transformed Batting Contribution With Zone Run Rules
Figure 16. Fitted Line Plot of Mean Contribution Vs Mean Score
Figure 17. Quartile Control Chart
Figure 18. Establishing Number of Consecutive Increasing Points for Alarm
Figure 19. Histograms of Simulated Score Data with and without Transformation
Figure 20. Histograms of Simulated Contribution Data with and without Transformation
Figure 21. Quartile Control Chart for M.J. Horne
Figure 22. Quartile Control Chart for B.R. Hartland
Figure 23. T² Control Chart of C.M. Brown for Bowling Indices
Figure 24. T² Control Chart of P.J. Wiseman for Bowling Indices
Figure 25. Bivariate Control Chart of C.M. Brown for Bowling Indices
Figure 26. Bivariate Control Chart of P.J. Wiseman for Bowling Indices
Figure 27. T² Control Chart of A.C. Barnes
Figure 28. MEWMA Control Chart of Bowling Indices for C.M. Brown
Figure 29. MEWMA Control Chart of Bowling Indices for P.J. Wiseman
Figure 30. Plot Showing Ranked Nature of Population Batting Order

List of Tables

Table 1. Tests for Randomness in [...] Performance Measures
Table 2. Distribution for Individual Scores
Table 3. Distribution for Individual Contribution
Table 4. Tests for Randomness in [...] Performance Measures
Table 5. [...] Test for Individual Indices
Table 6. [...] in Consecutive Points
Table 7. [...] ARLs
Table 8. [...] Ratios Run From In-Control to Out-of-Control
Table 9. Theoretical Quartile Limits for Horne and Hartland
Table 10. [...] of the Number of Points to First Alarm for Horne and Hartland for Different Methods
Table 11. P-Values from Test for [...] of Variance for Bowlers
Table 12. Probability of Correct Selection for NZ Statistical XI

Chapter 1. An Overview
1.1 Introduction

Cricket is a game of numbers. The very core of the sport is entwined with numerical values that translate ultimately to a match result. These sport statistics are a natural by-product of competitive sport and have been around as long as contested sport has existed. Currently sport reporters and commentators bombard observers with a vast array of numerical values designed to describe an individual's performance at a particular skill. These added extras contribute to the entertainment value of professional sport. However, is this information of use to coaches and selectors of cricket teams?

This thesis is aimed primarily at the selectors of top-level cricket teams. An attempt is made to keep the statistics involved as simple as possible so that all levels of selectors may apply this methodology to their teams.

There are several key reasons for measuring and evaluating performance in team sport. Organisational Behaviour Theory proves particularly useful in drawing together sport statistics and selection. According to Greenberg and Baron (1997), appropriate performance measures are required to build high performance teams. Tests and measurements are tools that can be used for evaluation of an individual's performance (Franks & Deutsch, 1973). Having found suitable measures of performance, these indicators can be used in the selection process. For a high performance team, the right team members need to be selected (Greenberg & Baron, 1997). This means combining all available evidence, quantitative and qualitative, to make correct selection decisions.

While few really important decisions are made purely on the basis of objective evidence (Franks & Deutsch, 1973), selection decisions cannot be based upon subjective evidence alone. The correct balance needs to be struck. In order to use sport statistics successfully, a deeper understanding of the numerical values involved is necessary.
Firstly, the nature of cricket statistics is discussed.

1.2 General Overview of Cricket Statistics

Cricket statistics are meticulously collated ball by ball. The vocabulary of the game continually refers to abstract statistical concepts such as average, aggregate and form, without divulging the secrets of what these mystical values contain. For the sake of simplicity, all values involved are reduced to one dimension. However, this leaves the cricket observer to assume and speculate as to the base values involved. The written media has recently taken to describing bowling performance by listing the number of wickets taken followed by the bowler's average. This form is limited; the basis behind this statement is discussed later in the evolution of the bowling indices.

In recent times, an increasing number of studies have been undertaken to understand the statistical processes at work in the game of cricket. G.H. Wood and W.P. Elderton started the ball rolling in 1945, analysing individual batsmen in an attempt to find a general model that would describe individual scores. This is in accordance with the general trend, where most work to date has revolved around batsmen. This seems a contradiction, as the first skill taught to junior cricketers is how to bowl, for without bowlers the game cannot be played. However, with the advent of one-day cricket and now Cricket Max, both geared towards entertainment, the game is becoming increasingly batsman-orientated. "Batsmen have always received the highest accolades. Most histories of cricket are written around them, with the bowlers regarded merely as a necessary evil." (Nigel Smith, 1994, p.177). The domination of batsmen in statistical papers may be due to the perceived ease of evaluation.

This leads to the definition of the statistics utilised in analysis of player performance, enabling a better understanding of the statistics involved.
Sport statistics can be separated into two broad categories: Performance Indicators and Performance Outputs.

A Performance Indicator is a quantitative measure that indicates individual performance in a particular facet of the game. These values are collated while the game is in progress. Effectively, the game is dissected into small manageable slices, such that a numerical value can be assigned as a descriptive measure. An example of a Performance Indicator, associated with fielding performance, is Ground Ball Efficiency, defined as the number of times the ball is fielded cleanly divided by the total number of times fielded. These values do not have a direct impact on the match figures.

In contrast, a Performance Output is a numerical expression detailing the direct result of participation in an event. For cricket these are summary measures detailed in a score book at the completion of an innings, such as score, wickets taken, overs bowled and so forth. As a consequence these values have a direct impact on the match figures.

It stands to reason that these two categories are related in some manner. However, only performance outputs are examined in this study, due to the ease of data collection and availability. The possible relationship between performance indicators and performance outputs is left for future research.

Statistically, assessing the performance of a batsman is relatively simple, as this can be given by a single variable: runs scored in an innings, aggregate, average, or average contribution. 'Aggregate' refers to the total number of runs scored by the individual over a specified period of time. A player's 'Average' is then calculated by dividing the aggregate by the number of times the individual was dismissed during the specified time. 'Contribution' is the percentage of the team total in an innings that the individual provides. Each value on its own can effectively describe performance.
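As a minimal sketch of the three batting measures just defined, the following Python computes aggregate, average and contribution from a list of innings. The `Innings` record, the function names and the sample figures are hypothetical, introduced only for illustration; they are not data from this study. Returning `None` when a measure is undefined (no dismissals, or a team total of zero) is likewise an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Innings:
    """One batting innings (hypothetical record, for illustration only)."""
    score: int        # runs credited to the batsman
    team_total: int   # runs scored by the team in the same innings
    dismissed: bool   # False for a 'not out'


def aggregate(innings: List[Innings]) -> int:
    """Total runs scored by the individual over the period."""
    return sum(i.score for i in innings)


def average(innings: List[Innings]) -> Optional[float]:
    """Aggregate divided by the number of dismissals.

    Returns None when the player was never dismissed (assumption of
    this sketch; the measure is undefined in that case).
    """
    dismissals = sum(1 for i in innings if i.dismissed)
    if dismissals == 0:
        return None
    return aggregate(innings) / dismissals


def contribution(i: Innings) -> Optional[float]:
    """Percentage of the team total provided by the individual."""
    if i.team_total == 0:
        return None  # undefined when the team scored no runs
    return 100.0 * i.score / i.team_total


# Invented example: three innings, one of them not out.
sample = [Innings(50, 200, True), Innings(30, 150, False), Innings(0, 100, True)]
print(aggregate(sample))                   # 80
print(average(sample))                     # 40.0
print([contribution(i) for i in sample])   # [25.0, 20.0, 0.0]
```

Keeping the not-out flag on each innings makes the dependence of the average on dismissals explicit.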
Describing performance by bowlers is more complicated. A typical bowling analysis gives four values: runs, maidens, overs and wickets. 'Runs' corresponds to the number of runs conceded by the individual. 'Maidens' refers to the number of completed overs in which no runs are charged against the bowler (leg byes and byes are not added to a bowler's total). 'Overs' refers to the number of six-ball sets a bowler has delivered. Finally, 'wickets' is the number of dismissals credited to the bowler. Alone, these individual factors give little insight into how well a bowler performed. Together they are more meaningful, but not until compared to a full score card can the value of the performance be evaluated.

The bowling average attempts to describe performance in one dimension. It is found by dividing the number of runs conceded by the number of wickets taken. However, no time frame is suggested by this value. Essentially it is assumed that a bowler will concede 3-4 runs per over. Over a long period of time this assumption becomes more valid, but it is not suitable for a game by game situation.

The Deloitte Ratings create a one-dimensional measure of performance in Test Cricket. This involves an algorithm that takes into consideration several factors and weightings. This is currently the method of determining the best players in the world. Whilst the formulae involved are extremely thorough, the histories of all players need to be known and to be equally thorough. An attempt to create a one-dimensional index, using both factor analysis and principal component analysis, failed to provide meaningful results (Bracewell (1), 1998). Intuitively this is obvious, as two basic concepts are involved. Ideally, two dimensions need to be considered, one involving the player's attacking ability, the other involving the ability to restrict runs.

Kimber (1993) gives a graphical method for comparing bowlers.
This utilises two dimensions: the attacking ability (strike rate) and the ability to restrict runs (economy rate). Bracewell (2)(1998) proposed two independent normally distributed indices, based upon strike rate and economy rate, to describe performance. The first index deals with a bowler's ability to take wickets, the second with the ability to restrict runs. Both indices are evaluated using simple variations of formulae that are already used, taken relative to the team performance. The section dealing with assessing bowlers relies heavily on these indices.

Having defined the performance outputs to be assessed, it is necessary to discuss their relevance in a selection situation.

1.3 Statistics and Team Selection

With the wealth and quality of data available in cricket, it makes sense to utilise this quantitative information in the selection of individuals to maximise the formation of a collective unit (the team). The main assumption underpinning the work in this thesis is that a player's natural ability is expressed by individual performance outputs collated following the completion of a match.

Statistics are not the only factors considered when selecting a team. However, former New Zealand coach Glenn Turner (1998) discusses the importance of statistics in choosing players in his book Lifting the Covers. In particular, the second chapter reveals the emphasis placed on statistics in comparing and selecting individuals. In this instance statistics are used particularly to justify the non-selection of players (Andrew Jones and Ken Rutherford) and then to defend the selection of Lee Germon.

"Late in 1995 Francis Payne, cricket author and statistician, provided me with statistics which mostly confirmed what we had known before we picked our first test team." (p42). Glenn Turner (1998), Former New Zealand Coach
Since statistics are used to make and confirm selection decisions, it is necessary to attempt to understand the nature of the data being generated by participation in sport. A greater understanding leads directly to better implementation and hopefully a competitive edge for the selected team.

Former Australian captain Richie Benaud remarked on the simplistic nature of selection and the use of statistics, stating, "All a captain needs is the confidence that his bowlers are each capable of taking five wickets in an innings, his batsmen are capable of scoring a century and that everyone can field like Viv Richards" (Benaud, 1995, p169). Obviously the captain deals with the players on the field and is not responsible for those selected to take the field; this lies in the hands of the selectors. The captain must believe that he has been given the best men to compete. It then becomes the job of the selectors to ensure that the best combination of players available takes the field.

If statistics are to be used in the selection process they must be meaningful, and they must be used in an appropriate manner. This means that a relevant application of statistical methodology is that of monitoring individual ability.

1.4 The Application of Quality Control to Cricket

The idea of monitoring performance is as useful to the selector and the player as it is to the armchair critic. An ideal method for monitoring an individual's performance is the control chart, a useful tool in statistical process control. First developed by W.A. Shewhart, Shewhart charts are widely accepted as standard tools for monitoring processes of univariate, independent and nearly normal measurements (Liu & Tang, 1996). Control charts have found frequent applications in both manufacturing and non-manufacturing settings (Montgomery, 1997). With slight adjustments Shewhart charts can be applied to cricket.
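As a rough illustration of the Shewhart idea referred to above, the sketch below flags observations that fall outside conventional three-sigma control limits. It assumes the in-control mean and standard deviation are already known (for example 0 and 1 for a standardised measure); the function names and the sample values are hypothetical, and the adjustments needed for cricket are not reproduced here.

```python
from typing import List, Tuple


def shewhart_limits(mean: float, sigma: float, k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits at mean -/+ k standard deviations."""
    return mean - k * sigma, mean + k * sigma


def out_of_control(values: List[float], mean: float, sigma: float,
                   k: float = 3.0) -> List[int]:
    """Indices of the observations lying outside the control limits."""
    lcl, ucl = shewhart_limits(mean, sigma, k)
    return [i for i, x in enumerate(values) if x < lcl or x > ucl]


# Invented series of a standardised performance measure (mean 0, sd 1).
series = [0.2, -1.1, 3.4, 0.5, -3.2]
print(shewhart_limits(0.0, 1.0))         # (-3.0, 3.0)
print(out_of_control(series, 0.0, 1.0))  # [2, 4]
```

The multiplier `k` is left as a parameter because narrower limits shorten the average run length, at the cost of more false alarms.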
Provided the measurements of the individual's performance are reflective of quality, function, or performance, the nature of the 'thing' being measured has no bearing on the general applicability of control charts (Montgomery, 1997).

Montgomery (1997) discloses several reasons for the popularity of control charts. At least three draw direct parallels to cricket. Possibly most important is that control charts provide diagnostic information. This can identify flaws in technique, or the tendency for a player to struggle under certain conditions. Control charts are also proven at improving productivity, which translates to pushing a player and not allowing complacency.

In cricket we are interested in selecting individuals that will maximise team performance and ensure the best chance of victory. Whilst cricket is a team sport, the nature of the game allows for individual aspects to stand out. Indeed, when we look at the possible selection of an individual, it is the performance outputs of the individual that are of primary concern. Therefore, to ensure the right selections are made, it is important the right statistics are used. The awkward nature of bowling performance outputs leads to the evolution of the bowling indices. These two independent, random, standard normal indices are a simple and effective way of allowing bowling performance to be measured from the post-match statistics. They are more useful than the current convention used in the written media of quoting the number of wickets taken and the bowling average of an individual.

Utilising the assumption that an individual's worth is expressed via performance outputs, the second chapter of this thesis seeks to describe and understand the underlying statistical processes that shape our impression of player performance. Randomness is tested for and then distributional properties of the data are sought.
Armed with information generated in the second chapter, the third chapter assesses methods for monitoring the estimate of natural ability. Widely accepted control methods, such as Shewhart control charts, CUSUM, EWMA and multivariate versions of these procedures, are implemented and the performance for both batting and bowling is discussed. To accommodate the distribution presented by batting scores, a new control chart based on quartiles is also studied.

Further, in chapter four, ranking and selection procedures utilise the estimates of individual ability to select the best individuals and note the probability of correct selection. Chapter five then details how this information can be drawn together and applied in selecting a side with the assistance of statistics based upon performance outputs.

1.5 Major Contributions of this Study

A number of new and novel approaches are presented in this thesis. These include:
a) The further development of individual performance measures for the main disciplines of batting and bowling for cricket.
b) A 2-dimensional runs test, utilising the T² statistic, with further applicability outside cricket.
c) Statistical interpretation of assumptions and results specific to cricket, namely:
   • Outliers are very important in determining the estimate of ability for an individual.
   • Form is autocorrelation.
   • Zone rules for cricket are needed to detect good/poor performance.
   • Relatively short nominal ARLs are needed to accommodate the restricted number of sampling opportunities presented in a season.
d) A new control chart based on quantiles to preserve outlier influences in a non-parametric procedure.
e) The recommendation of appropriate tools for monitoring batting, bowling and all-rounder performance and also choosing man of the match.
f) A selection procedure for bowlers that discriminates between different types of bowlers using the consistency of their performance measures.
g) Following selection, an evaluation of the probability of correct selection of individuals to a team, relative to potential contenders.

Chapter 2. Analysing the Characteristics of Individual Data

2.0 Introduction

[...]

Previous work involving the analysis of both batsman and bowler performance outputs has assumed random performance. This assumption needs to be clarified before further progress can be made. The initial thrust of the thesis is the identification of what constitutes form. Form can be likened to autocorrelation, in that an individual displays patterns or trends in performance over time. It is expected that two extremes may exist: either form exists, or performance is random. If autocorrelation is present, then form exists.

Intuitively performance would be considered random, due to the apparent lack of predictability of such a sport. "Uncertainty plays a large role in sports, and one can argue that the uncertainty associated with sports outcomes is one reason that sports are so popular" (Stern, 1997, p19). It has been shown that baseball is a game of chance (Cook, 1977). An analysis of team tactics as related to the game of baseball, and of the annual World Series competition, revealed that results were subject more to the laws of chance than to the relative calibre of the competing teams. Taking a simplistic view of competitive sports suggests this may also be the case in cricket (assuming everyone is equally able to compete, and that natural ability will differ, dependent on the pool of talent available).

Logistically it would be ideal if performance is random. If this is the case then it is a relatively simple task to select the best individuals, provided that the sampling distributions to which the data belong are known.
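If form is treated as autocorrelation, one simple way to look for it is the sample autocorrelation of an innings-by-innings series at lag 1. The sketch below is illustrative only: the function name and both example series are hypothetical, and it is not the test procedure applied in this thesis.

```python
from typing import Sequence


def lag_autocorrelation(x: Sequence[float], lag: int = 1) -> float:
    """Sample autocorrelation of x at the given lag.

    Uses the usual estimator: the sum of lagged cross-products of
    deviations from the mean, divided by the total sum of squared
    deviations.
    """
    n = len(x)
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom


# A steadily improving series gives a positive lag-1 value,
# a strictly alternating one gives a negative value.
trending = [1, 2, 3, 4, 5]
alternating = [1, -1, 1, -1, 1, -1]
print(lag_autocorrelation(trending))      # 0.4
print(lag_autocorrelation(alternating))   # about -0.83
```

A value near zero is what would be expected if successive performances were unrelated; how far from zero counts as evidence of form depends on the series length.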
In order to fully understand the summary statistics presented, and make effective use of the available information, the statistical distribution for each of the performance outputs needs to be known. Fulfilment of this requirement and that of randomness satisfies the most important assumptions regarding inference and quality control. An overview of previous research on performance output measures in cricket is presented in the next section.

2.1 Literature Review

[...]

2.1.0 Performance Output Measures

[...]

Individual Score is the number of runs credited to an individual during an innings, and Team Total is simply the total number of runs amassed by that team whilst batting in that innings. Over a period of time, a batsman's worth is investigated via their batting average. The traditional batting average is expressed below.

\[ \text{Traditional Batting Average} = \frac{\sum \text{Individual Scores}}{\text{dismissals}} \]

However, our interest is with what an individual is expected to score in a given innings, as measured by a batting average.

\[ \text{Batting Average} = \frac{\sum \text{Individual Scores}}{\text{innings}} \]

Thus the aggregate score is divided by the total number of times batted to provide the batting average. This performance measure circumvents the debate surrounding the handling of 'not outs' by only considering the average score per innings. This is different from the traditional batting average, shown above, which estimates the runs scored between dismissals. That method seems redundant given the time constraints placed upon the game, especially as we are considering the expected number of runs in an innings. A further discussion is included in Appendix B.

Finally, average contribution can be used as a measure of batting performance, as shown below.

\[ \text{Average Contribution} = \frac{\sum \text{Contribution}}{\text{innings}} \]

It is defined in a manner similar to that of batting average. The sum of individual contributions is divided by the total number of innings.

b) Bowling Performance for Individual Bowlers

Of the individual disciplines, bowling is perhaps the hardest to evaluate quantitatively. A typical bowling analysis consists of four variables: runs conceded, maidens bowled, overs bowled and wickets taken. There is no easy way of interpreting these values independently. History plays a large part in how these statistics are perceived, as does the game situation. This section briefly reviews the statistical methods for evaluating an individual's bowling performance.

Kimber (1993) proposed a two-dimensional graphical display for comparing bowlers in cricket based on strike rate (SR) and economy rate (ER), taking advantage of the relationship that these two values have with the Bowling Average (AV).

\[ SR \times ER = 100\,AV \]

These values are traditionally calculated as follows. The Economy Rate (ER) is defined as the runs conceded per ball.

\[ \text{Economy Rate} = \frac{\text{Total Runs Conceded}}{\text{Total Balls Bowled}} \]

The Strike Rate (SR) is defined as the number of balls bowled per wicket taken.

\[ \text{Strike Rate} = \frac{\text{Total Balls Bowled}}{\text{Wickets Taken}} \]

(Kimber, 1993)

Note that with ER expressed in runs per ball, as in the formula above, the product SR × ER equals AV itself; the factor of 100 corresponds to expressing ER in runs per 100 balls.

However, this relationship does not take into consideration the team situation and other confounding variables that confront a bowler, such as the state of the game combined with environmental factors, as these can have an impact on how the specific individuals involved, batsman and bowler, approach each delivery. As a brief example, a batsman is more likely to attack the bowler towards the end of the innings, with wickets remaining in a run chase, than a batsman trying to save the match by remaining not out in a last wicket partnership when a run chase is no longer viable.
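The relationship between strike rate, economy rate and bowling average can be checked with a short calculation. The sketch below uses hypothetical function names and invented figures (60 runs conceded from 120 balls for 3 wickets). With ER measured in runs per ball the product SR × ER equals AV; measuring ER in runs per 100 balls gives 100 AV. Returning `None` when no wicket has been taken is an assumption of the sketch.

```python
from typing import Optional


def economy_rate(runs_conceded: int, balls_bowled: int) -> float:
    """Runs conceded per ball bowled."""
    return runs_conceded / balls_bowled


def strike_rate(balls_bowled: int, wickets: int) -> Optional[float]:
    """Balls bowled per wicket taken; None when no wicket was taken."""
    if wickets == 0:
        return None
    return balls_bowled / wickets


def bowling_average(runs_conceded: int, wickets: int) -> Optional[float]:
    """Runs conceded per wicket taken; None when no wicket was taken."""
    if wickets == 0:
        return None
    return runs_conceded / wickets


# Invented figures: 60 runs conceded from 120 balls, 3 wickets taken.
er = economy_rate(60, 120)      # 0.5 runs per ball
sr = strike_rate(120, 3)        # 40.0 balls per wicket
av = bowling_average(60, 3)     # 20.0 runs per wicket

print(sr * er == av)                  # True: per-ball ER gives AV
print(sr * (100 * er) == 100 * av)    # True: per-100-ball ER gives 100 AV
```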
Strike Rate has an additional problem: if a player fails to take a wicket, a value for SR is not returned, as the divisor is zero. Thus SR is not suitable for evaluation on an innings by innings basis. This measure could be calculated using all the match results for a season, but in terms of selection and monitoring a player's performance, it is too late to address an individual's worth at the end of the season. Thus only players who have taken wickets can have strike rate as a performance measure.

Bracewell (3)(1998) detailed a novel way to evaluate individual bowling performance, incorporating SR and ER into two separate indices that considered performance relative to the team. This involved an attempt to form ratios that took into account an individual's performance in relation to the team performance. The Attack Ratio involved inverting SR for both team and individual so that wickets taken was no longer the denominator. The ratios are defined as follows:

\[ \text{Economy Ratio} = \frac{\text{Opposition Total}}{\text{Total Overs}} - \frac{\text{Runs Conceded}}{\text{Overs}} \]

\[ \text{Attack Ratio} = \frac{\text{Wickets}}{\text{Overs}} - \frac{\text{Total Wickets}}{\text{Total Overs}} \]

(Bracewell (3), 1998)

However, it was found that as the number of overs bowled by an individual increased, the score for both indices tended to zero. This is because a player who bowls more and more overs (approaching 50%) has a huge influence on the team performance. His performance therefore reflects the team performance very closely.

The final evolution of performance measures for bowlers involved multiplying the ratios by a weighting factor related to time (overs). The problem described earlier was removed in this way. In addition, it was found that the Attack Index needed to be multiplied by a wicket weighting factor, defined in terms of w, the number of wickets taken in any innings. This index is therefore innings specific, whereas the other measures are more general.
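The two unweighted ratios above can be sketched directly from their definitions. The function names and the match figures below are hypothetical, and overs are assumed to be supplied as ordinary numbers (a part-over would need converting to a fraction first). The last two lines illustrate why the ratios shrink towards zero: a bowler who delivered every over would reproduce the team figures exactly.

```python
def economy_ratio(opposition_total: float, total_overs: float,
                  runs_conceded: float, overs: float) -> float:
    """Team runs per over minus the bowler's runs per over.

    Positive values mean the bowler was more economical than the team.
    """
    if total_overs <= 0 or overs <= 0:
        raise ValueError("overs must be positive")
    return opposition_total / total_overs - runs_conceded / overs


def attack_ratio(wickets: float, overs: float,
                 total_wickets: float, total_overs: float) -> float:
    """The bowler's wickets per over minus the team's wickets per over."""
    if total_overs <= 0 or overs <= 0:
        raise ValueError("overs must be positive")
    return wickets / overs - total_wickets / total_overs


# Invented innings: opposition all out for 200 in 50 overs;
# the bowler took 4 for 30 from 10 overs.
print(economy_ratio(200, 50, 30, 10))   # 1.0  (4.0 - 3.0 runs per over)
print(attack_ratio(4, 10, 10, 50))      # 0.2  (0.4 - 0.2 wickets per over)

# A bowler who delivered all 50 overs matches the team exactly.
print(economy_ratio(200, 50, 200, 50))  # 0.0
print(attack_ratio(10, 50, 10, 50))     # 0.0
```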
The wicket weighting factor in the Attack Index is given by p(w), the probability of taking a certain number of wickets in an innings. Standardisation allows the indices to be compared on similar scales. The indices are therefore defined as follows:

\[ \text{Economy Index} = \text{Economy Ratio} \times \sqrt{\text{Overs}} \]

\[ \text{Attack Index} = \frac{\text{Attack Ratio} \times \sqrt{\text{Overs}}}{1 - p(w)} \]

(Bracewell (3), 1998)

2.1.1 Investigation of Bowling Measures

Of the statistical analyses performed using cricket data, bowling is an area deficient in research. Only Kimber (1993) and Bracewell (3)(1998) have examined how to measure an individual's bowling performance. Kimber sought to do this via a graphical display based on Strike Rate and Economy Rate, whereas Bracewell tried extending these values relative to the team.

2.1.2 Randomness

Very little research has been done on the aspect of randomness in an individual's performance in cricket. A distantly related team sport, baseball, was found to be essentially random (Cook, 1977). There is anecdotal evidence supporting the claim that the role of an individual within a game is random, generally commenting on the apparent lack of predictability of cricket. Berkmann (1990), Brittenden (1994) and Turner (1998) are just a small selection of cricket observers that subscribe to the view that cricket is unpredictable. Hunting through player biographies also reveals that those who play the game express this view.

Danaher (1989) applied a runs test to 6 English County Cricketers and found that none showed a significant runs pattern at the 5% significance level. The batsmen chosen were of varying batting ability, but were chosen because they were either top, or close to the top, of their team's batting averages list.

Kumar (1996) suggested that cricket is not governed by chance. However, this assertion was based solely upon run rates per over in one-day cricket.
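For readers unfamiliar with the runs test mentioned above, a minimal sketch of one common form follows: the Wald-Wolfowitz runs test about the median, with a normal approximation for the p-value. This is an assumption about the general technique rather than a reproduction of the procedure used by Danaher (1989); the function name and the example series are hypothetical, and the normal approximation is crude for series this short.

```python
import math
from statistics import median
from typing import Sequence, Tuple


def runs_test(values: Sequence[float]) -> Tuple[int, float, float]:
    """Runs test about the median.

    Returns (number of runs, z statistic, two-sided p-value from the
    normal approximation). Observations equal to the median are dropped.
    """
    med = median(values)
    signs = [v > med for v in values if v != med]
    n1 = sum(signs)              # observations above the median
    n2 = len(signs) - n1         # observations below the median
    if n1 == 0 or n2 == 0:
        raise ValueError("need observations on both sides of the median")
    # A new run starts wherever consecutive signs differ.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if variance == 0:
        raise ValueError("series too short for the normal approximation")
    z = (runs - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p_value


# A steadily rising series has too few runs to look random ...
print(runs_test([1, 2, 3, 4, 5, 6, 7, 8]))
# ... and a strictly alternating series has too many.
print(runs_test([1, 8, 2, 7, 3, 6, 4, 5]))
```

Too few runs suggests clustering of good or poor scores, while too many suggests alternation; either departure counts against randomness.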
The implications of this are manifested in the troublesome interrupted match rules. If run rates per over were random, then the simple Average Run Rate (ARR) rule would suffice, as this is based on the assumption that the run rate of the batting side does not change during the innings. Instead, the resources available to a team play an important role in determining the outcome of a one-day match. One only needs to look to the Duckworth-Lewis model (1996) to see the effect that time (overs in hand) and wickets in hand have in determining a batting side's capacity for team total. Team strategies also illustrate this point.

As a simple illustration of batting capacity, this model accepts that, with 10 overs remaining, a side is more capable (or daring) of scoring runs when only 2 wickets have been lost than when it is 8 down. The reasoning is that with the loss of only 2 wickets, presumably the better batsmen are still available and there are plenty of individuals remaining. Thus batsmen are more able to go after their shots, as the consequences to the team of their dismissal are not as great. A batting side with 8 wickets down, by contrast, needs to adopt a more cautious approach, as once a team is dismissed there is no further chance of adding to the team total.

2.1.3 Distribution of Performance Measures

scores scores were 4 he an (1 season an Reep, scores once batting scores had been (Pollard, 1977). scores began when to model the a this more so distribution to the over one some a the scores

Two world-class players, Geoff Boycott and Ian Botham, were the centre of Burrows and Talbot's (1985) study regarding the exponential distribution. They found an adequate fit to Boycott's 77 innings and the 50 played by Botham. This study also considered the handling of 'not outs'.
It was found that adding the mean of the exponential distribution to a not out score gives an estimate of what the individual was likely to score in that particular innings. As exponential random variables have no memory, this is a valid estimate. Furthermore, using this information to establish a player's compensated batting average through a set of iterative equations, and solving the first order difference equation, simply resulted in the traditional batting average normally quoted: the total number of runs scored divided by the number of times dismissed.

However, the nature of the competitive game is glossed over. This study "excluded limited overs games since innings in these games are necessarily restricted (p46)." To some extent all cricket played has some time restriction, whether 50 overs, 3 days or 5 days, hence the need for declarations in the pursuit of victory. Further discussion on the effect of time limitations is provided in Appendix B.

Pollard (1977) conceded "that a more elaborate model needs to be developed to describe the distribution of batsman's scores (p129)." This was because previous results did not cater for the higher than expected frequencies of failures to score, compared to the theoretical models.

Bracewell (2)(1998) suggested a discrete version of a mixed exponential distribution for score and for contribution, a relatively new concept in cricket statistics, based upon 5609 observations of individuals in the top 6 of the batting order from New Zealand domestic first class cricket. This involved separating the occurrence of zero and recalculating the mean to find the parameters of the distribution involving the non-zero values.

Analysing the 0 Data used study refers to full score cards of cricket obtained from The Shell Cricket the Imperial Cricket or more day's duration nUl!\J\Jc.;>n , & Smith I., 1996). 1997-98 season. four Only to entered 20 of were were each scores. If was runs an a or
\[ f_X(x) = \begin{cases} p_0 & \text{if } x = 0; \\ (1 - p_0)\,\frac{1}{\beta}\, e^{-x/\beta} & \text{if } x > 0. \end{cases} \]
(Smith, 1993)

The probability of a certain score is given by the area of the corresponding interval of the probability density function. For both individual scores and contribution, "not scoring" is the failure to score a run. Analysing scores from 5609 individuals from only the top six of the batting order yielded the following models.

1) The fitted model for individual scores:
\[ f_X(x) = \begin{cases} 0.015 & \text{if } x = 0; \\ 0.032\, e^{-x/28.955} & \text{if } x > 0. \end{cases} \]

2) The fitted model for individual contribution:
\[ f_X(x) = \begin{cases} 0.092 & \text{if } x = 0; \\ 0.066\, e^{-x/13.686} & \text{if } x > 0. \end{cases} \]

A chi-square goodness-of-fit test indicated that both models fit adequately at the 5% significance level. Obviously the ability of batsmen in the top six differs, so the suggested models are contaminated. However, the nature of the distributions gives an insight into how the individual performance outputs for batting are distributed.

the Characteristics of Data of Indices use measure of performance for 998). These measure for use on an by were duration of re-evaluated for cases. are is w an an a w = a-b. w per as a= 11.9 (11.42,12. b 3 (1 1 17) the and is no It was c an

Using an iterative approach it was found that c was 0.25. Using this factor to linearize the equation, a regression analysis was performed, allowing the final estimation of a and b. The adjusted r² value of the regression was 99.5%, indicating that the proposed regression line explained almost all of the variation in the data. Estimates for both a and b were acquired, and are given below with 95% confidence intervals: a = 11.1 (10.61, 11.59) and b = 14.2 (13.19, 15.21).

The value for a in the model must always be greater than ten. If it is equal to ten, then this assumes that taking all ten wickets in an innings is impossible.
Evidence proves this not to be the case: A. E. Moss, in the 1889/90 season, took all 10 wickets in an innings for Canterbury against Wellington at Christchurch on debut (Payne & Smith, 1996).

The interval for 'a' does not contain Bracewell's (3)(1998) estimate. This suggests that the probabilities change slightly given the extra allowable day. However, the changes make only a minimal difference. When 5 or more wickets are taken in an innings the probabilities are approximately equal. The most noticeable difference is the probability that no wickets are taken in an innings. The revised estimate is less than the initial probability from the three-day game. This suggests that in an extended match a player is less likely to go without a wicket, possibly due to the chance of having a prolonged bowling spell. However, the interval for 'b' contains both the original estimate and its confidence interval.

The resultant χ² value of 12.855 indicated a suitable fit, as it is less than the critical χ² value of 18.307 at the 5% level of significance with 10 degrees of freedom.

the probability of a given Attack I As number Standardising lies interval zero. p of a new w=11.1-1 taking a ( 11.1 - w \ 1 ) in the 1 -0.0912. As zero is as an as w= 1, .. ,10 is I (1 - overs and a mean it probably

Utilising the fact that $(n-1)s^2/\sigma^2$ is χ² with n-1 degrees of freedom allows a confidence interval to be formed for the standard deviation. As a result, a 95% confidence interval for the standard deviation shows it probably falls between 0.3855 and 0.3858. As 1 does not fall in this interval, the value given for the standard deviation needs to be used to standardise the attack index.

Both these intervals contain the corresponding Bracewell (3)(1998) estimates. This suggests that the attack index provides similar results for 3 and 4 day matches. Most importantly, this index is not sensitive to match duration.
The final formula for the Attack Index is given below.

STANDARDISED ATTACK INDEX = ([(Attack Ratio × √Overs) / (1 - p(w))] + 0.07883) / 0.37461

c) Economy Index

As indicated in 2.1.0, Bracewell's (2)(1998) economy index is:

ECONOMY INDEX = [Economy Ratio × √Overs]

where Economy Ratio has been defined previously and Overs is the number of overs bowled by the individual in the innings. Standardising the above equation by first subtracting the mean (0.3657) and then dividing by the standard deviation (3.2016) gives a value comparable to the standardised attack index. A 95% confidence interval for the mean reveals that the population mean probably lies between 0.3851 and 0.3463. As zero is not contained within this confidence interval, the value for the mean cannot be ignored and must be included in the standardisation. Similarly, a 95% confidence interval for the population standard deviation shows that it probably falls between 3.2001 and 3.2035. As 1 does not fall in this interval, the value given for the standard deviation needs to be used to standardise the economy index.

Analysing the Characteristics STANDARDISED ECONOMY INDEX = overs the Rate 2.1.0, a low is good attacking abilities taking wickets quickly. arises when no wickets are taken, and a is a reason an delivered 10 overs. 26 a are score is

Total Wickets = 10. Total Overs Bowled = 50. Overs Bowled = 10.
Figure 1. Relative Effectiveness of the Attack Index

The first graph shows the resultant value of the Attack Index for various numbers of wickets taken in an innings. The number of wickets taken correlates positively with the index score, as expected. It can be seen that in the given circumstances, taking 5 wickets in an innings corresponds to an index score of approximately 3. When examined in context the value of the indices is strengthened.
Five wickets were taken in an innings only twice in the 97/98 Shell Cup season, which included 34 matches (Blake (Central Districts vs Northern Districts) and Maxwell (Canterbury vs Auckland)). Five was also the maximum number of wickets taken by an individual in the 97/98 Shell Cup competition. This puts the y-axis parameters in perspective. Attack indices above 3 are relatively rare.

Below, the Economy Index is evaluated similarly.

Opposition Total = 220. Total Overs Bowled = 50. Overs Bowled = 10. (x-axis: Runs Conceded.)
Figure 2. Relative Effectiveness of the Economy Index

The y-axis depicts the relative score for the individual's performance given differing values of runs conceded. Obviously the fewer runs conceded the better, and thus this corresponds to higher scores for the economy index. The above graph is much simpler to interpret. In this case there exists a negative correlation between runs conceded and index score, as expected. Essentially, given an opposition total of 220 (run rate of 4.4), an individual will concede between 10 (1 run per over) and 70 (7 runs per over) almost all of the time. That is, an individual is most unlikely to concede more than 70 runs in a 10 over spell.

The preceding graphs depict the value of the indices obtained from varying numbers of wickets taken and runs conceded. Having shown that the indices perform to expectation, they can be used effectively as a performance measure for bowling outputs.

2.3 Methods

2.3.0 Introduction

"The objective of statistical inference is to draw conclusions or make decisions about a population based on a sample selected from the population (Montgomery, 1997, p78)."
Considering cricket, each time an individual participates in a match, a sample of their true ability is revealed. Can a series of scores be used to make inferences about an individual's ability? If the probability distribution of the population from which the sample is gathered is known, then the probability distribution of the various statistics computed from the sample data can be determined (Montgomery, 1997). More importantly, it can be established what a player is expected to score and thus their progress can be monitored, which is especially relevant for team selection.

A population is a set of measurements that can be described by a set of numerical measures called parameters (Ott, Mendenhall, 1985). In most applications of statistics the parameters are not known, but inferences about them are made using information contained in a sample.

For time series analysis it is assumed that for each time point t, Z_t is a random variable. Thus the behaviour of Z_t will be determined by a probability distribution (Cryer). In this instance time t refers to each innings and Z_t refers to a performance output.

Previous studies have assumed that the data are independent, in that for each individual bowler the previous match result does not have a direct impact on the following match result. At first class level this is a safe assumption, as it is presumed that players who reach this level have developed the necessary mental skills.

gets were it job "Cricket, fortunately, less section attempts to find if assumption

2.3.1 Tests for Randomness

Runs popular a a zeros runs :::: testing ones is based on n2 = above u = of runs. it does not is 1 Data If 1990)." or IS runs test an is is U. are as if

Values for $u'_{\alpha/2}$ and $u_{\alpha/2}$ are to be found in Table XI of Freund (1992). If n1 and n2 are both greater than 15, then u is approximately normally distributed with:

\[ \mu_u = \frac{2 n_1 n_2}{n_1 + n_2} + 1, \qquad \sigma_u^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}. \]

Consider the scores of Adam C.
Parore in the data set: 6, 8, 87, 4, 6, 40, 26, 133, 0, 84, 14, 91, 0, 63, 111, 87. This series of scores has a median of 33. Thus the re-coded binary data is as follows: 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1. In the given example n1 = n2 = 8 and u = 12. Table XI of Freund indicates $u'_{0.025} = 4$ and $u_{0.025} = 14$. Thus the null hypothesis of randomness is rejected if u ≥ 14 or u ≤ 4 at the 5% level of significance. As the obtained value of u does not violate the limits, there is insufficient evidence to reject the hypothesis of randomness. MINITAB performs this calculation, requesting only the median and the column in which the data is stored.

In this study we need to test for randomness using two measures of performance simultaneously, that is, the Attack and Economy Indices for bowling, in order to determine whether performance is consistent (not form dependent). This means we need to extend the runs test to a 2-dimensional test. Consider a 2-dimensional graph showing the performance of a player for a series of innings using two appropriate indices.

2. Analysing of Data 2 are a array. points is performing consistently tend to be neighbouring for randomness :::: 1 the the any patterns close together. Hence the distance As a consequence this, by the as a nature k performances it is standard normal populations are a our analysis the a of patterns in generates a detailed zero 1 is are runs mean or as correlation series, but of is

The default lag (n/4) was used for k, where n is the number of observations in the series. The Ljung-Box Q statistic acts as a safeguard against the explosion of the probability of Type I errors by testing the null hypothesis that the autocorrelations for all lags up to k equal zero (MINITAB, 1996). If the majority of time series do exhibit autocorrelation, then the job of the selector is much harder.
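The two randomness checks described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than a reproduction of the MINITAB procedures: the function names are hypothetical, values equal to the median are coded as 0 (an assumption that does not matter for Parore's scores, none of which equals 33), and the lag k = n/4 follows the default quoted above.

```python
from statistics import mean, median


def runs_about_median(x):
    """Code each value 1 if above the median, else 0, and count the runs."""
    med = median(x)
    coded = [1 if v > med else 0 for v in x]
    n1 = sum(coded)                  # observations above the median
    n2 = len(coded) - n1             # observations at or below the median
    u = 1 + sum(a != b for a, b in zip(coded, coded[1:]))
    return med, n1, n2, u


def ljung_box_q(x, k):
    """Ljung-Box statistic Q = n(n+2) * sum_{j=1..k} r_j^2 / (n - j)."""
    n = len(x)
    xbar = mean(x)
    dev = [v - xbar for v in x]
    denom = sum(d * d for d in dev)
    total = 0.0
    for j in range(1, k + 1):
        r_j = sum(dev[t] * dev[t - j] for t in range(j, n)) / denom
        total += r_j * r_j / (n - j)
    return n * (n + 2) * total


# Scores of Adam C. Parore as listed in the text.
parore = [6, 8, 87, 4, 6, 40, 26, 133, 0, 84, 14, 91, 0, 63, 111, 87]

med, n1, n2, u = runs_about_median(parore)
lower, upper = 4, 14                 # critical values quoted from Freund, Table XI
reject_randomness = u <= lower or u >= upper
print(f"median={med}, n1={n1}, n2={n2}, u={u}, reject={reject_randomness}")

k = len(parore) // 4                 # default lag n/4
q = ljung_box_q(parore, k)
print(f"Ljung-Box Q at lag {k}: {q:.3f} (compare with chi-square on {k} df)")
```

The runs count agrees with the worked example (u = 12 with n1 = n2 = 8), so randomness is not rejected. The Q statistic would be compared with the upper 5% point of a χ² distribution on k degrees of freedom.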
No longer is the average estimate of an individual's ability sufficient. From the historical data, performance predictions need to be made using prior performance. In terms of recording results from this analysis, when the 95% confidence limits have been crossed, the player in question is noted as displaying significant autocorrelation, thus failing the assumption of randomness.

2.3.2 Distribution Fitting

In the next section we determine which of the following distributions best fits the performance output data for an individual player.

Three distributions are investigated for the batting data: Exponential, Negative Binomial and Geometric. As discussed in 2.1.0, these distributions have been used to model individual player performance by other authors. Below, the properties of each distribution are listed. In these formulae p denotes the probability of a success and (1-p) denotes the probability of a failure for independent Bernoulli trials.

Exponential
  Pdf:                f(x|β) = (1/β) e^{-x/β}, 0 ≤ x < ∞, β > 0
  Mean and variance:  E(X) = β, VAR(X) = β²

Negative Binomial
  Pmf:                P(X=x|r,p) = C(r+x-1, x) p^r (1-p)^x; x = 0,1,2,...; 0 ≤ p ≤ 1
  Mean and variance:  E(X) = r(1-p)/p, VAR(X) = r(1-p)/p²
  For r = 1:          E(X) = (1-p)/p, VAR(X) = (1-p)/p²

Geometric
  Pmf:                P(X=x|p) = p(1-p)^{x-1}; x = 1,2,...; 0 ≤ p ≤ 1
  Mean and variance:  E(X) = 1/p, VAR(X) = (1-p)/p²

(Casella, Berger, 1990)

These distributions are waiting time distributions. They are suitable because our interest is in the number of runs scored until completion of the innings. The geometric distribution is the simplest of the waiting time distributions, and is also a special case of the negative binomial distribution when r is set at 1 (Casella, Berger, 1990). Note that in the formulae above X denotes the number of failures before the rth success for the negative binomial, while for the geometric X denotes the trial on which the first success occurs.
The negative binomial with r = 1, and thus the geometric, are discrete analogues of the exponential distribution. The exponential distribution is continuous, whereas the data here are discrete.

Initially it may seem redundant to include both the geometric and the negative binomial with r set at 1. However, there is a key difference in the estimator obtained from the sample mean, as seen in the above table. Previous research suggests that these are the most likely distributions to model individual batting scores. Hence the inclusion of the negative binomial with r = 1.

In this study parameter estimation is done using the method of moments, one of the oldest methods for parameter estimation. This method consists of equating the first few moments of a population to the corresponding moments of a sample, obtaining as many equations as are needed to solve for the unknown parameters of the population (Freund, 1992). Given that a population has r parameters, the method of moments consists of solving the system of equations

\[ m_k' = \mu_k', \qquad k = 1, 2, \ldots, r, \qquad \text{where } m_k' = \frac{1}{n} \sum_{i=1}^{n} x_i^k, \]

for the r parameters.

All three distributions dealt with here require only one parameter to be estimated (for the Negative Binomial, r is set to 1). Thus $m_1' = \mu_1'$ is used.

Exponential

The mean is given by $E(X) = \beta = \mu_1'$, and the corresponding sample moment is the sample mean, $m_1' = \bar{x}$. Therefore the method of moments estimator for the exponential parameter is simply:

\[ \hat{\beta} = \bar{x} \]

Negative Binomial

$E(X) = (1-p)/p = \mu_1'$ and $\bar{x} = m_1'$. Therefore setting:

\[ \bar{x} = \frac{1-p}{p} \]

Rearranging gives the method of moments estimator of p for the Negative Binomial distribution:

\[ \hat{p} = \frac{1}{\bar{x} + 1} \]

Geometric
$E(X) = 1/p = \mu_1'$. Setting $m_1' = \mu_1'$ provides:

\[ \bar{x} = \frac{1}{p} \]

Subsequent rearrangement yields the method of moments estimator:

\[ \hat{p} = \frac{1}{\bar{x}} \]

Using the estimates for the parameters of the given probability distributions, the data can be modelled and the fit evaluated using the chi-square goodness-of-fit test.

Fitting a Mixed Distribution

As previously discussed in 2.1.3, a mixed model may be a better fit, due to the higher than expected number of zeros (Smith, 1993; Bracewell (2), 1998). As a result, a separate component needs to be built into the probability model to cater for the number of zeros. The second component of the 'ducks and runs' distribution deals with the non-zero portion. Let $p_0$ be the probability of a zero score. In order to fit a mixed distribution it is necessary to multiply the probability model by $(1 - p_0)$ so that the total probability is equal to one, that is, $(1 - p_0) P_X(x)$ for x > 0. For a geometric distribution, the sum of the probabilities for all possible scores, shown below, clearly converges to one as n approaches infinity:

\[ p_0 + (1 - p_0) \sum_{x=1}^{n} p (1 - p)^{x-1} \]

For calculation of the parameters of a mixed distribution ($p_0$ and p or β), the fraction of the data set at 0 is separated out and the mean recalculated. The new sample mean is then used as the parameter estimate for the probability distribution (p or β).

The probability mass function of an individual batsman's contribution can be presented in the following form:

\[ P(X = x \mid p_c) = p_c (1 - p_c)^{x}; \qquad 0 \le x \le 100; \quad 0 \le p_c \le 1 \]

where $p_c$ represents the reciprocal of the mean contribution and x corresponds to the random variable for percentage of the team total. It then follows that the probability mass function of individual batsmen's scores can be represented as follows:

\[ P(X = x \mid p_c, p_s) = \begin{cases} p_c & \text{if } x = 0; \quad 0 \le p_c \le 1 \\ (1 - p_c)\, p_s (1 - p_s)^{x-1} & \text{if } x = 1, 2, \ldots; \quad 0 \le p_s \le 1 \end{cases} \]

Once again, $p_c$ represents the reciprocal of the mean contribution and x corresponds to the random variable for individual total. Similarly, $p_s$ is the reciprocal of the mean score.

Normality Test

Bracewell (3)(1998) hypothesized that the bowling indices for individuals are normally distributed. To test this hypothesis a normality test needs to be performed. The normality test for the bowling indices involved the generation of a normal probability plot. The probability for the x-values (index) is calculated and then plotted against a standard normal probability score. A least-squares line is fitted to the points. This forms an estimate of the cumulative distribution function of the population from which the data are drawn. The Anderson-Darling test for normality is used, which is an ECDF (empirical cumulative distribution function) based test.

2.4 Results

2.4.0 Introduction

To enable a chi-square goodness-of-fit test to be performed on the batting outputs, the data needed to be broken into manageable segments, displayed as follows:

Score:         0   1-10   11-20   21-30   31-40   41-50   51-100   100+
Contribution:  0   1-5    6-10    11-15   16-20   21-25   26-30    31+

All tests were performed at a 5% significance level.

2.4.1 Batting Results

Considering only individuals who batted in 20 or more innings yielded 66 individuals for analysis. A brief summary of the results follows. Appendix D contains the full results.

                                  Pass     Fail
Autocorrelation   Score           62/66    4/66
                  Contribution    62/66    4/66
Runs Test         Score           64/66    2/66
                  Contribution    64/66    2/66

Table 1: Tests for randomness in Batting Performance Measures
Score                       Pass     Fail     Obtained χ²   Critical χ²
Exponential                 49/66    17/66    739.03        581.51
Geometric                   52/66    14/66    705.04        581.51
Negative Binomial           53/66    13/66    711.85        581.51
Mixed Exponential           65/66    1/66     336.47        512.06
Mixed Geometric             65/66    1/66     335.04        512.06
Mixed Negative Binomial     65/66    1/66     397.14        512.06

Table 2: Distribution Fitting for Individual Batting Scores

The obtained χ² and critical values shown in Tables 2 and 3 refer to the fit of the model over the entire population. That is, the χ² values for all individuals are summed and compared to the critical χ² value. For the standard distributions this was on 527 degrees of freedom (66×8-1), and for the mixed distributions on 461 degrees of freedom (66×7-1). This helped confirm the best model.

Contribution                Pass     Fail     Obtained χ²   Critical χ²
Exponential                 62/66    4/66     482.38        581.51
Geometric                   56/66    10/66    645.29        581.51
Negative Binomial           62/66    4/66     449.44        581.51
Mixed Exponential           57/66    9/66     470.55        512.06

Table 3: Distribution Fitting for Individual Batting Contribution

2.4.2 Bowling

Similarly, only individuals who bowled in 20 or more innings were considered, providing 35 individuals for examination. A brief overview of the results follows. Appendix C gives the results in full.

                              Pass     Fail
Autocorrelation   Economy     34/35    1/35
                  Attack      32/35    3/35
Runs Test         Economy     32/35    3/35
                  Attack      34/35    1/35
                  Bivariate   32/35    3/35

Table 4: Tests for randomness in Bowling Performance Measures

              Pass     Fail
Economy       31/35    4/35
Attack        33/35    2/35

Table 5: Normality test for Individual Bowling Indices
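The pooled comparison used in Tables 2 and 3 can be checked with a short Python sketch. The standard library has no chi-square quantile function, so the Wilson-Hilferty approximation is used here as a stand-in (an assumption of this sketch, not the method used in the thesis); at these large degrees of freedom it reproduces the tabulated critical values to two decimal places. The function name and the dictionary layout are hypothetical.

```python
from statistics import NormalDist


def chi2_critical(df, alpha=0.05):
    """Approximate upper-alpha chi-square quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


N_PLAYERS = 66
df_standard = N_PLAYERS * 8 - 1      # 527, as stated in the text
df_mixed = N_PLAYERS * 7 - 1         # 461, as stated in the text

# Summed chi-square statistics for individual scores, copied from Table 2.
score_fits = {
    "Exponential": (739.03, df_standard),
    "Geometric": (705.04, df_standard),
    "Negative Binomial": (711.85, df_standard),
    "Mixed Exponential": (336.47, df_mixed),
    "Mixed Geometric": (335.04, df_mixed),
    "Mixed Negative Binomial": (397.14, df_mixed),
}

verdicts = {}
for name, (obtained, df) in score_fits.items():
    critical = chi2_critical(df)
    verdicts[name] = obtained < critical   # True = pooled fit not rejected
    print(f"{name:<24} obtained={obtained:7.2f} critical={critical:7.2f} "
          f"{'adequate' if verdicts[name] else 'rejected'}")
```

On the pooled statistic, only the mixed models fall below the critical value for scores, with the mixed geometric giving the smallest obtained χ².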
2.5 Discussion

The evidence provided by the analyses performed in this chapter clearly suggests that individual performance in the primary disciplines of first class cricket in New Zealand is random.

Consider first the case of batting. Only 2 individuals from the sample of 66 failed the runs test. These two individuals, Mark Haslam and Shayne O'Connor, failed the test for both contribution and score. As both are primarily selected as bowlers (Haslam SLA and O'Connor LFM) and have low medians, it could be argued that the non-random behaviour arises because their skill level with the bat is not sufficient. Four individuals failed the test for autocorrelation. Given the low numbers violating the assumption of randomness, expressed through the runs test and the test for autocorrelation, this is considered sufficient evidence to claim that batting is random in New Zealand first class cricket.

A similar situation applies to the bowling results. From the sample of 35, at most 3 failed the runs test or the test for autocorrelation. Once more the majority exhibit random behaviour, and this is taken as sufficient evidence for stating that bowling performance outputs are random in New Zealand first class cricket.

According to the analyses performed, individual batting scores are best modelled by a mixed geometric distribution, mixed in two parts: the zero portion and the non-zero portion.

Batting contribution is best modelled by the negative binomial distribution (with r set equal to 1). The zero component of this distribution represents the zero portion of the score distribution. It is important to note that the negative binomial distribution, and the parameter from the contribution distribution, model the occurrence of zero amongst individual scores. This is an interesting phenomenon.
Obviously either score or contribution has to be mixed, as both share the same number of zeros, unless the team continually scores exactly 100 in each innings, in which case the two distributions will be equivalent. That is, the score is effectively the percentage contribution, as score is continually divided by 100 runs (the team total).

Considering the case of scoring 0, the probability of this occurring is the same for both contribution and score. This is because for contribution 0/y = 0, where y is the team score.

Because the sample mean determines the shape parameter, the shapes of the score and contribution distributions differ. This is shown in Figure 3, which details the probability mass function for differing values of score and contribution. The means for contribution and score are 14% and 29 respectively. These were the population values for top 6 batsmen obtained from Bracewell (2)(1998).

(Probability against Scoring Performance; key: Contribution, Score.)
Figure 3. Distributional Comparison of Contribution and Score

The distribution for scores confirms the high likelihood of being dismissed early. The fact that the distributions involved are memoryless, as discussed in Appendix B, is also of interest. This harks back to the old adage that it only takes one ball, referring to the fact that only one ball is needed to dismiss a batsman, no matter what score the individual is on.

The results from the normality test showed that Bracewell's (3)(1998) initial hypothesis of normality for the Bowling Indices is supported, as an overwhelming majority exhibited this property (33/35 for Attack and 31/35 for Economy).

The above results confirm the beliefs discussed in the literature review.
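The shape comparison of Figure 3 and the memoryless property mentioned above can be illustrated with a minimal Python sketch. It assumes a geometric form with p equal to the reciprocal of the quoted means (29 for score, 14 for contribution). The zero probability p0 = 0.08 is a purely hypothetical value used to show that the 'ducks and runs' mixture still sums to one; it is not an estimate from the thesis.

```python
def geometric_pmf(x, p):
    """P(X = x) = p (1 - p)^(x - 1) for x = 1, 2, ..."""
    return p * (1 - p) ** (x - 1)


def mixed_geometric_pmf(x, p0, p):
    """'Ducks and runs': mass p0 at zero, scaled geometric on x >= 1."""
    return p0 if x == 0 else (1 - p0) * geometric_pmf(x, p)


def geometric_tail(x, p):
    """P(X > x) = (1 - p)^x."""
    return (1 - p) ** x


p_score = 1 / 29          # reciprocal of the mean score quoted in the text
p_contribution = 1 / 14   # reciprocal of the mean contribution quoted in the text
p0 = 0.08                 # hypothetical zero probability, for illustration only

# The mixture sums to one (the tail beyond 2000 is negligible).
total = sum(mixed_geometric_pmf(x, p0, p_score) for x in range(2000))

# Shape difference: contribution is more concentrated at low values.
low = (geometric_pmf(1, p_contribution), geometric_pmf(1, p_score))
high = (geometric_pmf(50, p_contribution), geometric_pmf(50, p_score))

# Memoryless property: P(X > s + t | X > s) = P(X > t).
s, t = 40, 10
conditional = geometric_tail(s + t, p_score) / geometric_tail(s, p_score)
unconditional = geometric_tail(t, p_score)

print(f"sum of mixed pmf = {total:.6f}")
print(f"pmf at 1:  contribution={low[0]:.4f}, score={low[1]:.4f}")
print(f"pmf at 50: contribution={high[0]:.4f}, score={high[1]:.4f}")
print(f"P(X>50 | X>40) = {conditional:.4f}, P(X>10) = {unconditional:.4f}")
```

The last line is the "it only takes one ball" adage in numbers: under this model a batsman already on 40 is exactly as likely to add more than 10 further runs as a batsman starting from scratch is to pass 10.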
Having shown that performance outputs are random, and having gained knowledge of the distributions followed by the data, sound statistical methodology can be applied to the performance outputs. A natural extension of this knowledge is to monitor individual ability through statistical process control. This is approached in the next chapter.

Chapter 3 Monitoring Player Performance

3.1 Introduction

implementation of quality control procedures is in it of the most statistically successfully players, IS Furthermore, it enables the mon any change the a of certain coaching A arises with the study of sports data to 44 use standard it is preferable to be better than the average. It is therefore an 'out-of-control' on the of portant to are how an data it is important to note deviation), as these are an

Performance measures are initially standardised with respect to the population, that is, by subtracting the population mean and then dividing by the population standard deviation, giving estimates of the individual's ability relative to those competing in the same competition. In order for charts to be based upon the standard normal distribution, these indices are then standardised using personal means and standard deviations. When the data are standardised and tested with the quality control tools, the test is of how reliable our estimate of the individual's ability is.

Chapter Two sought to establish the fundamental assumptions involved with statistical process control, namely independence and normality, in the context of performance evaluation in cricket.

Following on from the findings of Chapter Two, it is relatively easy to apply control charts to the bowling indices and contribution, as the assumptions of normality and independence are upheld (normality for contribution is achieved by transformation). Thus,
the application of conventional parametric quality control methods is assessed in this chapter.

Initially, possible techniques for the situation presented by individual scores are reviewed and a selection applied to real data. Standard procedures for use with normal data are then discussed. For the univariate case, three methods are examined, namely the Shewhart control chart for individual observations with run rules, CUSUM and EWMA. Then a new type of non-parametric control chart based on quartiles is proposed to deal with the mixed distribution of 'ducks and runs' presented by individual batting scores for an innings. The theoretical quartiles of this distribution are used, maintaining the integrity of the distribution.

Overall, we are interested in a control method that picks up a change in performance within a season. Thus the control chart must pick up changes rapidly. In designing charts and rules, consideration must be given to the number of 'samples' per season. The need for a short nominal ARL is determined by the structure of the Shell Trophy competition and also by the relative lack of sampling opportunities supplied by cricket in general.

a is a are in area are scores are on was It is a case can IS an team scores are no sense a

As cricket is played under varying conditions against varying opposition, it is not suitable to artificially create subgroups. Information relating to an individual's tendency to struggle under differing conditions is potentially lost. Also, the special nature of an outlying score can be forfeited.

It is well known that an outlier can influence our estimate of an individual's ability. In this case, the effect will be to inflate the mean. However, we want to retain that influence, as it indicates the individual is more capable than suggested by the bulk of the data. By the reduction to ranks the nature of very large scores is removed.
Our estimate of an individual's performance indicates what a person is capable of scoring. The presence of an outlier can reveal that our estimate is possibly wrong and that the player in question is capable of much more. Outliers are usually regarded as aberrations or errors, but in cricket outliers must be regarded in a totally different light: high scores are valuable observations for performance measures.

Another problem with the use of ranks is the presence of ties, which makes the use of ranks inexact (Rossini, 1997). The probability of ties is quite high due to the likelihood of an individual not scoring.

Hackl and Ledolter (1992) proposed a non-parametric technique utilising sequential ranking. This method uses the sequential ranks of observations in association with an EWMA control chart. The method is outlier resistant, as all rank-based charts are. Hence, the importance of the extreme score described previously is ignored.

After their early attempt with the Wilcoxon signed rank statistic, Reynolds and Bakir joined Amin (1995) in investigating a method based on the sign statistic gathered from within artificially created groups. As discussed previously, it is inappropriate to deal with subgroups in first-class cricket as they are not always present.

McGilchrist and Woodyer (1975) applied a distribution-free CUSUM procedure. However, this method also reduces scores to being above or below the median and thus ignores the impact of high scores.

[...]

In addition, previous work has considered subgroups of data; where natural subgroups do not occur, the authors recommend the creation of artificial groupings of innings results for an individual player.
In this thesis it is argued that this approach is also not appropriate because:
• Considering the match context, subgroups of size 2 would be reasonable except that a player may bat in only one innings.
• Cricket is played in variable conditions; playing surfaces or environmental conditions may vary and as a result artificial subgroups are meaningless.
• Outliers are lost.

In this section three of the methods outlined in the literature review are applied to the batting scores of Hartland and Horne, and are compared to the Quartile chart later.

a) Non-Parametric EWMA (Hackl, Ledolter, 1992)

This control chart is based on an EWMA of sequential ranks, where the sequential rank, $R_t$, is an observation's rank amongst the most recent g observations. This chart performs well with slowly trending process levels. Once again a short ARL is required and this is obtained approximately from Figure 1 and Table 1 (Hackl et al, 1992). An immediate problem arises with the selection of g. As we require a short nominal ARL, we also require a relatively small g, which can lead to correlations among successive ranks.

The control statistic is defined as follows:

$$T_t = (1 - \lambda) T_{t-1} + \lambda R_t, \quad t = 1, 2, \ldots$$

The initial value, $T_0$, is set at zero. Three parameters are required, obtained from tabulated values where the group size is taken at the smallest available value. Thus the parameters are set as follows to give a nominal ARL ≈ 19.8: $\lambda = 0.25$, g = 4, h = 0.2980. The resultant statistics, $T_t$, can be examined in normal chart form, where an alarm for an out-of-control situation is signalled if $|T_t| > h$. As this is the most recent non-parametric technique it is also applied to the simulation data in section 3.4.

Figure 4. Non-parametric EWMA for B.R. Hartland

Two alarms are signalled for Hartland, one very early on and the other towards the end of the series. Both are associated with signals for inferior form.
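The sequential-rank EWMA described above can be sketched in a few lines. This is a minimal illustration rather than the procedure as run in this study: it assumes the rank is standardised as (2/g)(R - (g+1)/2) so that it is centred on zero, that ties (for example repeated ducks) receive midranks, and that charting starts once g scores are available. The function names `sequential_rank` and `rank_ewma` are hypothetical.

```python
def sequential_rank(window):
    """Midrank of the most recent value among all values in the window."""
    current = window[-1]
    below = sum(1 for v in window if v < current)
    ties = sum(1 for v in window if v == current)  # counts the value itself
    return below + (ties + 1) / 2.0


def rank_ewma(scores, lam=0.25, g=4, h=0.2980):
    """Return (innings number, T_t, alarm flag) for each charted innings.

    Assumption: the sequential rank is standardised to lie in (-1, 1)
    before smoothing, and T_0 = 0.
    """
    t_stat = 0.0
    results = []
    for i in range(g - 1, len(scores)):
        window = scores[i - g + 1: i + 1]          # the g most recent scores
        rank = sequential_rank(window)
        standardised = (2.0 / g) * (rank - (g + 1) / 2.0)
        t_stat = (1 - lam) * t_stat + lam * standardised
        results.append((i + 1, t_stat, abs(t_stat) > h))
    return results


if __name__ == "__main__":
    for innings, t_stat, alarm in rank_ewma([10, 20, 30, 40, 50, 60]):
        print(innings, round(t_stat, 4), alarm)
```

Because only ranks enter the statistic, a score of 60 and a score of 260 at the top of a window contribute identically, which is the loss of outlier information argued against in this chapter.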
Figure 5. Non-parametric EWMA for M.J. Horne

No alarms are signalled for Horne, suggesting that he is consistently playing to the level his natural ability implies.

b) Non-Parametric CUSUM (McGilchrist, Woodyer, 1975)

To allow detection of changes in the extreme distribution posed in hydrology, McGilchrist et al developed a distribution-free CUSUM. Using an even number of observations, the control statistic $V_t$ is defined as follows:

$$V_t = \sum_{j=1}^{t} q(X_j - k)$$

where $q(x) = 1$ for $x \geq 0$ and $q(x) = -1$ for $x < 0$.

[...]

c) A non-parametric CSCC procedure based on Within Group Rankings (Bakir, Reynolds, 1979)

This non-parametric procedure was developed to quickly detect any shift in the mean process level. Using Wilcoxon signed rank statistics and within-group ranking, a CUSUM-type procedure is implemented. The CSCC named in the sub-title, as defined by Bakir et al, is better known as the CUSUM (Cumulative Sum Control Chart). Within-group ranking requires subgroups. Where these do not occur naturally, they must be created artificially. In this instance artificial subgroups of size 4 are created.

To detect shifts in the process level on the positive side the following control value is applied:

$$\sum_{i=1}^{n} (SR_i - k) - \min_{0 \leq m \leq n} \sum_{i=1}^{m} (SR_i - k) \geq h$$

Similarly, to detect deviations in the process level on the negative side the formula below is implemented:

$$\max_{0 \leq m \leq n} \sum_{i=1}^{m} (SR_i - k) - \sum_{i=1}^{n} (SR_i - k) \geq h$$

The Wilcoxon signed rank statistic is defined as follows:

$$SR = \sum_{X_j - \mu_0 > 0} \mathrm{rank}(|X_j - \mu_0|)$$

(Smith, P. 1993)

Signed rank has the advantage of considering the relative rankings of the magnitudes of the data points. A problem associated with the signed rank test in this instance is the assumption of a symmetrical distribution.
Clearly the mixed distribution presented by individual batting scores is not symmetrical. To circumvent this problem, the mean $\mu_0$ has been replaced by the median. Effectively the median is subtracted, then the values are ranked. Finally, those ranks associated with points greater than or equal to the median are added to provide SR.

From the tabulated values of Bakir et al the parameters are chosen to provide a short nominal ARL in the vicinity of 17. Choosing k = 0, h = 6 and g = 4 provides a nominal ARL = 17.07. Applying the above procedure to the score series of Horne and Hartland, an alarm is signalled in both cases after 8 innings (2 groups), indicating that the estimate of batting ability for both individuals has changed. For each individual this alarm was associated with a decrease in performance.

Bracewell (3)(1998) showed how Shewhart control charts could be used to monitor bowling performance. It was also shown that the interpretation of zone run rules could be modified to accommodate the example presented by cricket. A brief demonstration of multivariate control charts using Hotelling's $T^2$ statistic was also provided in the same study. These approaches are described and extended in the rest of this chapter.

3.3 Univariate Quality Control Methodology

It is appropriate to monitor an individual's performance with control charts. Provided the measurements of the 'product' are reflective of quality, function, or performance, the nature of the 'product' has no bearing on the general applicability of control charts (Montgomery, 1997).

The control chart is a useful tool in statistical process control. First developed by W.A. Shewhart, the Shewhart charts are widely accepted as standard tools for monitoring processes of univariate, independent and nearly normal measurements (Liu & Tang, 1996).
Control charts have three fundamental uses:
1 Reduction of process variability
2 Monitoring and surveillance of a process
3 Estimation of product or process parameters
(Montgomery, 1997).

It is the second use that is of the essence in the application to cricket, and possibly other sports. Process in industry is the parallel term to player performance. Control charts have found frequent applications in both manufacturing and non-manufacturing settings (Montgomery, 1997). The third use is also of relevance when dealing with team selection, because of the interest in the estimate of an individual's ability in relation to other players available for selection.

Before standardising the data it is important to note the parameter values, as these are the estimates of the player's latent ability, in particular those applied to the bowling indices. These are initially standardised with respect to the population, by the nature of the indices, giving estimates of the individual's ability relative to those competing in the same competition. For charts that assume a standard normal distribution, these indices are standardised again, for within-person evaluation, using individual means and standard deviations. When the data are standardised and tested with the quality control tools, the test is of how reliable the mean is as our estimate of the individual's ability.

3.3.1 Shewhart Control Charts

Shewhart charts are strongly dependent on the assumptions of normality and independence. Also assumed is the absence of between-subgroup variation when the process is in control (61.325 Study Guide). However, this assumption is irrelevant in this study, as only individual innings observations are taken, that is, only subgroups of size one exist.

A Shewhart control chart with only action limits is slow to detect small shifts in process level (61.325 Study Guide).
However, the Shewhart chart can be sensitised by utilising zone rules.

[...]

However, minor alterations need to be made to the labelling of the zones and to the rules to make them compatible with the evaluation of an individual's performance. Whilst similar, there are fundamental differences between testing for quality in a product and in sport. Typically quality control monitors the maintenance of certain control limits and deviation from a common mean.

Figure 8. Control Chart with Warning Lines
Where UCL = Upper Control Limit, UWL = Upper Warning Limit, LWL = Lower Warning Limit, LCL = Lower Control Limit.

In a sporting context, which side of the mean a point falls on is important and needs to be built into any control chart.

[...]

Zone Rules for Cricket

Bracewell (3)(1998) proposed the following interpretations of the zone rules in a cricketing context. Understanding the intrinsic differences between product attributes and sports performance allows the zone rules for Shewhart control charts to be manipulated to indicate 'out-of-control' conditions. This effectively means that the individual's performance relative to the team is no longer random in most instances. The tests given by Montgomery (1997) are modified to suit the situations presented in cricket.

Test 1. Extreme Points
Points that fall outside the control limits. Falling outside of zone A indicates an exceptionally brilliant performance relative to the team. Conversely, a point falling outside zone F indicates an exceptionally awful performance relative to the team.

Test 2.
Two Out of Three Points in Zones A or F and Beyond
Two of three performances in or beyond zone A show continued excellent performance, whereas two of three performances in or beyond zone F show continued inferior performance.

Test 3. Four Out of Five Points in Zones B or E and Beyond
Four out of five successive points in zone B or beyond indicate continued good performance. The same situation for zone E and beyond reveals continued bad performance.

Test 4. Runs Above or Below the Centreline
This test considers long runs (eight or more successive innings) either strictly above or strictly below the centreline. This indicates either consistently above expected performance (above the centreline) or consistently below expected performance (below the centreline).

[...]

The numbers identifying the tests differ from those used in MINITAB. As mentioned earlier, the Shewhart chart can be set up in two ways. The first uses only control limits set at 2 standard deviations away from the process mean. The second applies all the run rules along with the traditional 3-sigma limits. Both cases can be shown simultaneously.

Figure 10. Shewhart Control Chart of M.J. Horne's Transformed Batting Contribution With Zone Run Rules.

The initial graph (zone rules, 3-sigma limits) reveals no signals, indicating that Horne is performing to expectations in terms of contributing to the team total. This also indicates that the impressive estimates of his natural ability are adhered to. If the 2-sigma limits are applied, an alarm is signalled at point 19. However, this corresponds to a score of zero, which is not necessarily an indication of form change.
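Tests 1 to 4 above can be sketched as a small checker on standardised values. This is an illustrative sketch only: it assumes zones A to F are the one-sigma bands running from +3 down to -3 standard deviations, that the point completing a pattern must itself lie in the relevant zone, and it covers only the four tests legible above. The function name `zone_tests` and the labels "good" and "poor" are hypothetical.

```python
def zone_tests(z):
    """Apply Tests 1-4 to standardised values z.

    Returns a list of (innings number, test number, direction), where
    direction is 'good' (above the centreline) or 'poor' (below it).
    """
    signals = []
    for i in range(len(z)):
        for sign, label in ((1, "good"), (-1, "poor")):
            # Reflect the series so that 'beyond' always means 'larger'.
            s = [sign * v for v in z[: i + 1]]
            # Test 1: a single point beyond the 3-sigma control limit.
            if s[-1] > 3:
                signals.append((i + 1, 1, label))
            # Test 2: two of the last three points beyond 2 sigma.
            if len(s) >= 3 and s[-1] > 2 and sum(v > 2 for v in s[-3:]) >= 2:
                signals.append((i + 1, 2, label))
            # Test 3: four of the last five points beyond 1 sigma.
            if len(s) >= 5 and s[-1] > 1 and sum(v > 1 for v in s[-5:]) >= 4:
                signals.append((i + 1, 3, label))
            # Test 4: eight successive points on one side of the centreline.
            if len(s) >= 8 and all(v > 0 for v in s[-8:]):
                signals.append((i + 1, 4, label))
    return signals


if __name__ == "__main__":
    print(zone_tests([0.5] * 8))
    print(zone_tests([2.5, 0.0, 2.5]))
```

Reporting the direction with each signal reflects the point made above that, in sport, the side of the mean on which a signal occurs changes its interpretation.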
As it is impossible to achieve a negative score with cricket data as applied to this type of control chart, negative control limits should be set at zero. Nevertheless, it is useful to see the impact control limits have in this type of situation. Effectively, every time an individual fails to score an alarm is signalled.

[Figure 11. Control chart with centre line 1.661, 2-sigma limits 0.1793 and 3.144, and upper 3-sigma limit 3.885.]

[...]

3.3.2 CUSUM

The British statistician Page first proposed cumulative sum (CUSUM) charts in 1954. This procedure involves cumulating sums, such that past values have an impact on the control statistic.

All measures of performance involved in this study are standardised, to make computations somewhat easier for setting up general standards and relating these to player performances. Also, the implementation of the charts is designed to monitor the given estimate of ability.

It is preferable to work with one-sided standardised CUSUMs for the case presented by cricket. As mentioned earlier, a special situation arises in the application of quality control methodology to sport. If a player's performance process changes such that an alarm is signalled, it is necessary to note whether the alarm was due to superior or inferior performance. Obviously, if an individual's performance is improving, the desired situation has occurred. Standardised values are used to compare within-player performance on relative scales.

As we are dealing with individual observations, the statistic required for the CUSUM scheme is given as:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$

For detecting shifts on the upper side of the mean the procedure is defined as:

$$S_{H,i} = \max\{0, (z_i - k) + S_{H,i-1}\}$$

with $S_{H,0}$ set at zero. The slack constant, k, must be less than 3 or a situation akin to a Shewhart chart is implemented, which detects only large shifts. Generally this value is 0.5, designed to detect smaller shifts.
The next part of the chart is the adoption of a threshold value which, when crossed, indicates an out-of-control situation. This value is referred to as h.

A similar procedure is used for detecting shifts on the lower side of the mean:

$$S_{L,i} = \max\{0, (-z_i - k) + S_{L,i-1}\}$$

with $S_{L,0}$ set at zero.

The optimal values for the parameters of the CUSUM procedure can be found from Gan's (1991) nomograph. To achieve a small nominal ARL of approximately 17, h is set at 0.25 and k at 1.7.

Whenever a signal is given, this implies an out-of-control situation, that is, the estimate of the player's natural ability has changed. If a cause for the alarm is found, the CUSUM is reset to zero.

Figure 12. CUSUM Control Chart of M.J. Horne's Transformed Batting Contribution.

A signal is given at point 19, once again corresponding to a score of zero. Apart from the one 'duck', no other signals are given, indicating that Horne is performing to expectations in terms of contributing to the team total.

Figure 13. CUSUM Control Chart of B.R. Hartland's Transformed Batting Contribution.

A signal is given at each point where Hartland failed to score. This is the same as the Shewhart chart with only the 2-sigma limits in operation. It is not necessarily confirmation of deterioration in form. However, three alarms in a short space of time suggest that the true mean batting ability of Blair Hartland is actually significantly less than noted.
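The one-sided standardised CUSUM scheme of this section can be sketched as follows. The sketch assumes the standard recursions, with the upper sum accumulating z - k and the lower sum accumulating -z - k, both floored at zero and started at zero, and it uses the parameter values quoted above (k = 1.7, h = 0.25) as defaults. It does not reset after an alarm, since that depends on whether a cause is found. The function name `one_sided_cusums` is hypothetical.

```python
def one_sided_cusums(x, mean, sd, k=1.7, h=0.25):
    """Return (upper sum, lower sum, upper alarm, lower alarm) per observation.

    Assumption: observations are standardised with the player's own mean
    and standard deviation before the sums are updated.
    """
    s_upper = 0.0
    s_lower = 0.0
    rows = []
    for value in x:
        z = (value - mean) / sd
        s_upper = max(0.0, (z - k) + s_upper)    # superior performance
        s_lower = max(0.0, (-z - k) + s_lower)   # inferior performance
        rows.append((s_upper, s_lower, s_upper > h, s_lower > h))
    return rows


if __name__ == "__main__":
    for row in one_sided_cusums([0.0, 2.0, -2.0], mean=0.0, sd=1.0):
        print(row)
```

With h this small and k this large, a single observation roughly two standard deviations from the mean is enough to signal, which is consistent with the remark that the chart behaves like a Shewhart chart with only 2-sigma limits.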
3.3.3 EWMA

An alternative to the Shewhart control chart is the Exponentially Weighted Moving Average (EWMA) control chart developed by Roberts (1959). The performance of the EWMA chart is similar to that of the CUSUM scheme, but it is easier to set up and operate (Montgomery, 1997). The EWMA chart proves useful when it is not practical to take more than a single observation per sample, as is the situation presented by cricket. An advantage is the effect of averaging, which detects process level shifts while damping out some of the effect of random errors on individual observations, due to the reliance on past observations.

The exponentially weighted moving average is defined as follows:

$$z_i = \lambda x_i + (1 - \lambda) z_{i-1}$$

(Montgomery, 1997)

where $\lambda$ is a constant greater than zero, but no larger than one. The starting value (required when time, i, is one) is the process target and hence equal to the population mean.

The control limits are defined as follows:

$$UCL = \mu_0 + L\sigma\sqrt{\frac{\lambda}{2 - \lambda}\left[1 - (1 - \lambda)^{2i}\right]}$$

$$CL = \mu_0$$

$$LCL = \mu_0 - L\sigma\sqrt{\frac{\lambda}{2 - \lambda}\left[1 - (1 - \lambda)^{2i}\right]}$$

The factor L represents the width of the control limits. $\lambda$ sets the weighting: the current observation receives weight $\lambda$ and the previous EWMA value receives weight $1 - \lambda$. To obtain an appropriate ARL of approximately 17, values are taken from Crowder's (1989) nomograph to detect a shift of one standard deviation. L and $\lambda$ are set at 2 and 0.25 respectively.

Figure 14. EWMA Control Chart of M.J. Horne's Transformed Batting Contribution With Zone Run Rules.

As no signals are given, the above EWMA chart provides further confirmation that Horne is performing to expectations in terms of contributing to the team total.

Figure 15. EWMA Control Chart of B.R. Hartland's Transformed Batting Contribution With Zone Run Rules.
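The EWMA recursion and its time-varying control limits can be sketched as below. This is a minimal sketch assuming the standard form of the limits, with the process target and standard deviation supplied by the caller and the values L = 2 and lambda = 0.25 quoted above used as defaults. The function name `ewma_chart` is hypothetical.

```python
import math


def ewma_chart(x, mu0, sigma, lam=0.25, width=2.0):
    """Return (EWMA value, LCL, UCL, alarm flag) for each observation.

    Assumption: the EWMA starts at the process target mu0, and the limits
    use the exact variance for observation i rather than the asymptote.
    """
    z = mu0
    rows = []
    for i, value in enumerate(x, start=1):
        z = lam * value + (1 - lam) * z
        spread = lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))
        half_width = width * sigma * math.sqrt(spread)
        rows.append((z, mu0 - half_width, mu0 + half_width,
                     abs(z - mu0) > half_width))
    return rows


if __name__ == "__main__":
    for row in ewma_chart([1.0, 0.0, -1.0], mu0=0.0, sigma=1.0):
        print(row)
```

Because each new score enters with weight 0.25 only, a single duck moves the statistic a quarter of the way towards zero, which is why isolated zero scores are less likely to signal here than on the Shewhart or CUSUM charts.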
In contrast to the Shewhart and CUSUM schemes, the EWMA chart only signals after the third zero score, suggesting that there is no significant change in form. As the EWMA scheme is not readily influenced by the zero scores, it is the preferred option for assessing batting contribution.

3.4 Proposed Control Chart Based on Quartiles

The performance of a control chart in the traditional sense is very sensitive to the assumptions when dealing with individual scores, due to the relatively high likelihood of failure. When the assumptions of either independence or normality fail, traditional methods introduce high probabilities of false alarms (Type I errors) (Vasilopoulos, Stamboulis, 1978). Problems arise in that the distribution for individual batting scores is mixed geometric, as indicated in Chapter 2, and thus transformation will not yield an approximately normal distribution. However, we want to preserve the influence of extreme scores, as these are an indication of an individual's capability, which non-parametric methods negate. A potential problem with standard control charts is that a single extreme outlier can trigger an out-of-control situation.

We want a technique that is free of distributional assumptions, or that enhances or is adaptable to a given distribution, and that makes use of extreme values for individual observations. The approach introduced here is based on the simple concept of quartiles. As sample sizes are generally small, theoretical quartiles are used rather than the observed values, which also allows parametric influences. To maintain the attachment to a given distribution, the theoretical quartiles are obtained using the estimate of the mean. Hence, outliers potentially influence these values and as a result information pertaining to outliers is not lost. How these values were obtained is discussed later.
Essentially, due to the nature of the 'ducks and runs' distribution, specifically the number of zeros occurring, the use of a Shewhart-type scheme is inappropriate. Moreover, the LCL is incompatible with the number of zeros that are generated, effectively causing an alarm every time a zero is recorded.

From Chapter 2 there is sufficient evidence to imply that individual batting scores are from a mixed geometric distribution. As this distribution is discrete, finding the theoretical quartiles involved investigating the cumulative probabilities. Values for the theoretical quartiles were computed based on the mean score and mean contribution.

Using EXCEL, a table was produced listing the probability of a given score for varying batting parameters, mean score and mean contribution. A scatterplot of the mean contribution and mean score revealed a linear relationship between score and contribution. Consequently a 99% prediction interval, shown below, from a simple linear regression gives the most likely region of interest, giving general bounds on which to form the table.

The spreadsheet was designed to sum all previous values. Where the cumulative probability was closest to the values of the quartiles (0.25, 0.5, 0.75), the corresponding score was taken. Invariably the quartile values fell between integer scores. As individual scores are represented only by positive integers and zero, the precise value at which the quartile is attained is not needed, so the values given are of the form (m+n)/2, where m and n are the two immediately neighbouring points that surround the quartile value of interest.

Figure 16. Fitted Line Plot of Mean Contribution Vs Mean Score
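The cumulative-probability search for theoretical quartiles can be sketched as follows. The exact parameterisation of the mixed geometric distribution is developed in Chapter 2 and is not reproduced here, so this sketch assumes a simple form purely for illustration: a point mass `p_duck` at zero, with non-zero scores following a geometric distribution on 1, 2, 3, ... whose mean is `mean_nonzero`. Both parameter names and both function names are hypothetical.

```python
def ducks_and_runs_cdf(score, p_duck, mean_nonzero):
    """P(X <= score) under the assumed mixture (illustrative form only)."""
    if score < 0:
        return 0.0
    theta = 1.0 / mean_nonzero  # geometric success probability on 1, 2, ...
    return p_duck + (1.0 - p_duck) * (1.0 - (1.0 - theta) ** score)


def theoretical_quartile(q, p_duck, mean_nonzero):
    """Midpoint (m + n) / 2 of the two integer scores that bracket quantile q.

    n is the smallest score whose cumulative probability reaches q and
    m = n - 1. If the duck probability alone reaches q, this returns -0.5,
    which is an artefact of the sketch, not a rule from the text.
    """
    n = 0
    while ducks_and_runs_cdf(n, p_duck, mean_nonzero) < q:
        n += 1
    return n - 0.5


if __name__ == "__main__":
    for q in (0.25, 0.5, 0.75):
        print(q, theoretical_quartile(q, p_duck=0.2, mean_nonzero=10.0))
```

Placing each control line midway between two integer scores means an observed score can never fall exactly on a line, so every innings is assigned to exactly one zone.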
Three control lines are drawn corresponding to the lower quartile, median and upper quartile. These theoretical values are observed from tables listed in Appendix E, utilising the average score of the non-zero component and mean contribution. The exact reasoning behind this is given in the development of the mixed model described earlier. The mean score and mean contribution are then adapted from being distribution free to being based upon mixed geometric.

To signal an alarm, rules are developed based on the probabilities of a point appearing in a certain zone or pattern of zones, which can be used to identify performance. These rules try to emulate the zone rules that supplement the Shewhart charts.

Due to the nature of cricket, it takes a number of observations to effectively estimate a player's ability. It is logical to infer that this also holds for a change in ability. The quartile chart with the zone rules described below will give a false signal on average every 16.8 innings. If five rounds of Shell Trophy matches (five matches, ten innings) are played, this equates to approximately one signal in two seasons. For an out-of-control situation, either through loss of form or an improvement, a signal will be picked up in approximately one season.

The run rules designed for use with this chart are set at a 95% confidence limit and are [...] Altering this to a stricter plan, 98.5%, would correspond to larger [...]

The figure below details the positioning of the zones about the quartiles.

Figure 17. Quartile Control Chart
(From top to bottom: Zone 4, Upper Quartile, Zone 3, Median, Zone 2, Lower Quartile, Zone 1. Horizontal axis: Observation.)

In creating a set of zone rules for the Quartile chart, the aim was to have run rules comparable to those implemented for the Shewhart control chart. Obviously Shewhart charts assume normality, which is not found in this situation.
A situation akin to a uniform distribution is created. Four zones with equal probability (0.25) are created based upon the median, upper quartile and lower quartile.

Knowing the probability that a point falls into a certain zone enables improbable run lengths to be established. That is, certain patterns of data, similar to those used in the zone rules for Shewhart control charts, that are unlikely to occur in an in-control situation can be established. The run lengths are given where the probability of a given string of values drops below a given error setting. In this instance run lengths are given for an error setting of 0.05. That is, there is less than a 5% chance of a given pattern occurring when the player is performing to natural ability. This setting is chosen as it is sufficiently tight for the context presented. Six rules are proposed as follows.

Zone Rules for Quartile Chart

An 'alarm' in this instance refers to a change-in-form signal.

$H_0$: Playing to natural ability

1 Runs in Extremities
This involves a run of points exclusively in zones 1 and 4. As there is a 0.5 probability of a score occurring in these zones, under $H_0$ an alarm will be sounded after 5 points.
P[False Alarm] = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125
In terms of a player's performance this signal suggests that a player is either failing or going on to make a big total, relative to the estimate of their natural ability.

2 Runs in Central Zones
Similarly,
this involves a run of points exclusively in zones 2 and 3. As there is a 0.5 probability of a score occurring in these zones, under $H_0$ an alarm will be sounded after 5 points.
P[False Alarm] = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125
In terms of a player's performance this signal suggests that a player is getting starts but not going on to make a big total, relative to the estimate of their natural ability.

3 Runs in one Zone
As there is a 0.25 chance of falling in any given zone under $H_0$,
P[False Alarm] = 0.25 × 0.25 × 0.25 = 0.015625
so 3 successive points in one zone result in an alarm. The interpretation of the signal depends on the zone in which the alarm originates. An alarm from zone 4 indicates an extremely good series of scores, whereas 3 points in zone 1 show poor form.

4 Runs above or below the median
This involves a series of scores in either zones 1 and 2 or zones 3 and 4. As there is a 0.5 probability of a score occurring in these zones, under $H_0$ an alarm will be sounded after 5 points.
P[False Alarm] = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125
Again, the interpretation depends on which region has caused the alarm. Runs above the median are highly favourable and thus indicate good form. Conversely, a signal below the median indicates inadequate performance.

5 Points increasing/decreasing

Figure 18. Establishing Number of Consecutive Increasing Points for Alarm

The number of strictly increasing points can be found by solving an integral of the following form such that the resultant probability is below 0.05:

$$\int_0^1 \int_0^{P_4} \int_0^{P_3} \int_0^{P_2} dP_1 \, dP_2 \, dP_3 \, dP_4 = \frac{1}{4!}$$

This is because for the sequence to be increasing $P_1$ must lie in the range $(0, P_2)$; $P_2$ must fall in the range $(0, P_3)$ and so forth. Solving the above integral yields a probability of 0.0417 (4 d.p.), the first sequence (4 consecutive points) to fall below the defined 5% Type I error.
A similar situation exists for points strictly decreasing, except that the integration takes place over the ranges $(P_n, 1)$ instead of $(0, P_n)$. Therefore, for increasing or decreasing points, P[False Alarm] < 0.05 occurs after 4 points (0.0417). If the points are increasing this suggests a player is improving; conversely, a decrease suggests poor form.
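The five rules listed so far can be sketched as a single checker. This is an illustration under stated assumptions, not the implementation used in this study: the three quartile lines are supplied by the caller, a score below a line is placed in the lower zone, and the run length for rule 5 is taken as the smallest n for which 1/n! falls below the error setting, which treats scores as continuous and ignores ties. The function names `zone`, `min_monotone_run` and `quartile_chart_alarms` are hypothetical.

```python
import math


def zone(score, lower_q, median, upper_q):
    """Zone 1 (below the lower quartile) up to zone 4 (above the upper)."""
    if score < lower_q:
        return 1
    if score < median:
        return 2
    if score < upper_q:
        return 3
    return 4


def min_monotone_run(alpha=0.05):
    """Smallest n with P(n points strictly increasing) = 1/n! below alpha."""
    n = 1
    while 1.0 / math.factorial(n) >= alpha:
        n += 1
    return n


def quartile_chart_alarms(scores, lower_q, median, upper_q):
    """Return (innings number, rule number) for every alarm from rules 1-5."""
    zones = [zone(s, lower_q, median, upper_q) for s in scores]
    alarms = []
    for i in range(len(scores)):
        last5 = zones[i - 4: i + 1] if i >= 4 else []
        last3 = zones[i - 2: i + 1] if i >= 2 else []
        last4 = scores[i - 3: i + 1] if i >= 3 else []
        if last5 and all(z in (1, 4) for z in last5):
            alarms.append((i + 1, 1))          # runs in extremities
        if last5 and all(z in (2, 3) for z in last5):
            alarms.append((i + 1, 2))          # runs in central zones
        if last3 and len(set(last3)) == 1:
            alarms.append((i + 1, 3))          # runs in one zone
        if last5 and (all(z <= 2 for z in last5) or all(z >= 3 for z in last5)):
            alarms.append((i + 1, 4))          # runs one side of the median
        if last4 and (all(a < b for a, b in zip(last4, last4[1:]))
                      or all(a > b for a, b in zip(last4, last4[1:]))):
            alarms.append((i + 1, 5))          # strictly increasing/decreasing
    return alarms


if __name__ == "__main__":
    print(min_monotone_run())
    print(quartile_chart_alarms([0, 20, 0, 30, 0], 0.5, 4.5, 11.5))
```

The rule number alone does not say whether a signal is favourable; as described above, that depends on which zones produced it, so a fuller version would also report the zones involved.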