Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

A Comparison of Task-Specific and Dimension-Specific Assessment Centres

Duncan J. R. Jackson

Members of the Supervisory Panel
Dr. Stephen G. Atkins (Chair)
Dr. Jennifer A. Stillman
Dr. Douglas Paton
Dr. Phillip E. Lowry

The real voyage of discovery consists not in seeking new landscapes, but in having new eyes
- Marcel Proust

Massey University
COLLEGE OF HUMANITIES & SOCIAL SCIENCES
School of Psychology
Private Bag 102 904, North Shore MSC, Auckland, New Zealand
Telephone: 64 9 443 9799 extn 9180
Facsimile: 64 9 441 8157

To Whom It May Concern:

This is to state that, with respect to the research conducted for the Doctoral thesis entitled "A Comparison of Task-Specific and Dimension-Specific Assessment Centres" carried out by Duncan John Ross Jackson, the following statements are true:

i) Reference to work other than that of the candidate has been appropriately acknowledged.
ii) The research practice and ethical policies approved by Massey University have been complied with.
iii) Although the current thesis guidelines request a word limit of 100,000, the current thesis was substantially completed prior to the introduction of this limit. (It consists of approximately 117,000 words.)

D.J.R. Jackson
Candidate

Te Kunenga ki Pūrehuroa
Inception to Infinity: Massey University's commitment to learning as a life-long journey
To Whom It May Concern:

This is to state that the research carried out for my Doctoral thesis entitled "A Comparison of Task-Specific and Dimension-Specific Assessment Centres" in the School of Psychology, Massey University, Albany Campus, New Zealand, is all my own work. This is also to certify that the thesis material has not been used for any other degree.

D.J.R. Jackson
Candidate

"... that I am capable of being successful at this extended interview". This was followed by a series of items that related to performance on the specific exercises in the AC. The first item in this series read: "I believe I will be successful in particular on the following exercises". This was followed by a list of the assessment exercises, creating an 8-item measure comprising one general item and seven exercise-related items on a 7-point scale ranging from strongly disagree to strongly agree. Tovey had not collected data with her scale at the time she was contacted, and as such, no psychometric information pertaining to the scale was available. Tovey's scale was, however, intuitively appealing as a face-valid scale framework that could easily be adapted to different ACs. As a result, Tovey's framework was employed in this sample.

General Self-Efficacy: Several researchers have hypothesised that a global sense of self-efficacy could result from several self-efficacy fostering or diminishing experiences across different domains. Labelled general self-efficacy, this construct asserts that a collection of experiences related to varying levels of self-efficacy in the past could carry into perceived self-efficacy expectations in new situations for an individual.
Most of the current research into general self-efficacy has focused on a scale developed by Sherer, Maddux, Mercandante, Prentice-Dunn, Jacobs, and Rogers (1982) and later researched and revised by Woodruff and Cashman (1993) and Bosscher and Smit (1998). The present study utilised the version of the Sherer et al. scale that was presented in Bosscher and Smit (1998), which comprised a 12-item general self-efficacy scale (GSES-12). The scale breaks down general self-efficacy into three sub-constructs (initiative, effort and persistence) and also purports to measure a higher-order general self-efficacy construct composed of the combination of these three components. Studies have, in general, found acceptable levels of internal consistency for the general self-efficacy scale. Minor changes were made to some of the items in the scale across the different studies. For the overall general self-efficacy scale, Cronbach alpha reliability coefficients of .86 and .69 were reported by Sherer et al. (1982) and Bosscher and Smit (1998) respectively. For the subscales of the general self-efficacy scale, Woodruff and Cashman (1993) and Bosscher and Smit (1998) found the following Cronbach alpha coefficients for the three scales respectively: Initiative: .74; .64, Effort: .75; .63, Persistence: .64; .64. These inter-item consistency coefficients fall within the limits of moderate acceptability as suggested by Nunnally and Bernstein (1994). Internal consistency for the Bosscher and Smit (1998) study was slightly lower than in the other studies. This may have been because Bosscher and Smit excluded 5 items that were found in a pilot study to have low item-total correlations and ambiguous wording. The alpha differences might also have been due to Bosscher and Smit's use of elderly people as a sample, while the studies by Sherer et al. and Woodruff and Cashman employed student participants.
In any case, it was decided that the internal consistency estimates for the Bosscher and Smit version of the scale were still within the limits for acceptability, and that a slightly lower number of items might help to maximise return rates. Note that only the unitary scale was employed in the present study. In this study, general self-efficacy was measured on a 7-point scale ranging from 1 (disagree strongly) to 7 (agree strongly). Convergent validity evidence has been reported by Sherer et al., who found that general self-efficacy, as measured by the general self-efficacy scale, correlated positively with the likelihood that a given individual was in current employment, with quitting from fewer jobs and being fired from fewer jobs, and with educational level and military rank. General self-efficacy was also found to correlate with an internal locus of control and self-esteem. Woodruff and Cashman (1993) found a similar pattern of correlational data, with positive relationships found between general self-efficacy and personal mastery, task-specific self-efficacy, and expectations of receiving higher grades.

Tacit Knowledge: Klimoski and Brickner (1987) specifically theorised that a type of managerial intelligence might be related to the extent to which an individual could be successful in an AC, and on later criterion measures of performance and promotability. As such, the present study employed the Tacit Knowledge Inventory for Managers (TKIM) (Wagner, 1985). Although the participants were not managers themselves, the current measurement could be viewed as an indication of an individual's aptitude for being a successful manager. The measure could also be viewed as a gauge of the extent to which individuals already held the characteristics that may be conducive to holding managerial intelligence, which theoretically may in turn assist them towards success in the AC and in the job.
Using the TKIM on non-managerial samples is certainly not unprecedented, and it has been used successfully for the assessment of non-managerial individuals in past research (Wagner & Sternberg, 1985). For the entire TKIM scale, Colonia-Willner (1998) reported Cronbach alpha coefficients of .85, .83 and .85, and Wagner and Sternberg (1991) reported coefficients of .74 and .80 for separate samples. The theory relating to managerial tacit knowledge delineates the construct into various components relating to managerial intelligence concerning self, others, and tasks. Colonia-Willner found moderate internal consistency coefficients for these sub-constructs, with respective Cronbach alphas of .74, .67 and .64 in her first study, .70, .64 and .60 in her second study, and .74, .68 and .65 in her third study. This might suggest that the TKIM may be better employed as a unitary scale. In the interests of maximising measurement precision, it was decided, on the basis of the previously mentioned study, to employ the unitary conceptualisation of this construct in Study One. A substantial body of evidence exists for the convergent and discriminant validity of the TKIM, some of which has already been discussed in the previous section, and as such, this will be only briefly mentioned here. Discriminant validity studies suggest that tacit knowledge is independent of academic performance and cognitive ability test scores (Colonia-Willner, 1998; Wagner & Sternberg, 1991), and convergent evidence suggested that scores on the TKIM were related to job performance (Wagner & Sternberg, 1985). Colonia-Willner found that the best scorers on the TKIM were more experienced managers, which corresponds to the theory that tacit knowledge is gleaned from experience. Two versions of the TKIM exist. One of these employs expert samples to create deviation scores for scoring participants.
The present study employed a version of the TKIM that does not require the use of an expert sample (Wagner, 1985), in the interests of time and available resources. The number of items in the total scale for this version was 39. These 39 'real' items were embedded within another 127 dummy items that were not scored. The items related to a set of 12 managerial scenarios, each presented in a vignette. These items came as a booklet sent directly from the author (Wagner, 1985) and were presented on a 7-point scale ranging from 1 (not important) to 7 (extremely important). The scale broke managerial tacit knowledge into the areas of tacit knowledge related to managing one's career, managing self, managing others, and "other" items that were described as discriminating between those who had higher levels of tacit knowledge, but did not fit the theory. Wagner replaced the 'career' scale with tacit knowledge relating to 'tasks' in a later version of the tacit knowledge inventory for managers (a version that requires the use of expert samples) in response to a subtle development in the theory of the tacit knowledge concept (Wagner & Sternberg, 1991). In any case, just as the evidence suggests for the version of the TKIM that employs expert samples, the results of Wagner and Sternberg's (1985) article suggested that the non-expert sample version of the scale should be viewed as a measure of a unitary tacit knowledge construct, with one study showing evidence of moderate levels of acceptable internal consistency for the entire measure (at .68).

Self-monitoring: The present study employed the 12-item O'Cass (2000) revision of the Lennox and Wolfe (1984) Revised Self-Monitoring Scale.
The O'Cass revision was a subtle modification of the scale, whereby one item was dropped from the original measure because a pilot study revealed poor reliability and item-total correlations, and the scale poles were changed from a 6-point scale ranging from 1 (certainly always false) to 6 (certainly always true) to a new 6-point scale ranging from 1 (strongly disagree) to 6 (strongly agree). It was decided to use the latter of these poles, as O'Cass found that participants were better able to interpret the modified scale. Lennox and Wolfe (1984) conceptualised self-monitoring as being composed of two underlying factors: self-monitoring ability and self-monitoring sensitivity. The revised self-monitoring scale reflects this theory by attempting to tap both of these factors. O'Cass found Cronbach alpha coefficients of .86 and .85 for the two subscales measuring self-monitoring ability and self-monitoring sensitivity respectively. For the entire scale, the reported Cronbach alpha was .87. This study also found convergent relationships between high scores on the self-monitoring scale and concern for personal image.

OARs: OARs were derived from the average of two separate OARs specified by two independent senior assessors. To elaborate, according to Air Force policy, upon completion of the assessment exercises, two senior officers decided upon two independent OARs based on their judgement, the assessment ratings, and their observations during the entire assessment process. The OARs themselves were on a four-point scale with the anchors A (strongly recommended), B (recommended), C (marginal), and D (not recommended). As these categories were intended, according to Air Force officials, to graduate from high to low, they were treated numerically as A (4), B (3), C (2), and D (1) for the purposes of analysis.
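The OAR derivation described above can be illustrated with a minimal sketch. The letter-to-number coding is taken from the text; the function name `combined_oar` and the example candidate are hypothetical and serve only to show the arithmetic, not the procedure actually used to process the data.

```python
from statistics import mean

# Numeric coding stated in the text: A (4), B (3), C (2), D (1).
OAR_CODES = {"A": 4, "B": 3, "C": 2, "D": 1}


def combined_oar(rating_one: str, rating_two: str) -> float:
    """Average two assessors' letter grades after numeric coding (hypothetical helper)."""
    return mean(OAR_CODES[r] for r in (rating_one, rating_two))


# Hypothetical candidate graded B by one senior assessor and C by the other.
print(combined_oar("B", "C"))  # 2.5
```

Averaging two 4-point ratings yields a 7-step scale (1, 1.5, ..., 4), which gives slightly finer gradation than either assessor's rating alone.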

Procedure

As the theory suggested that the three constructs of self-efficacy, tacit knowledge and self-monitoring may, in combination, contribute to effective performance in ACs, the design was set up so that measurements of the constructs were taken before the AC. Potential participants were invited to take part in the present research prior to their arrival at the AC: questionnaires were sent out with the information packs that the RNZAF administered through the post. All questionnaires were coded, and were sent via the post directly back to the researcher. Ratings of individuals' performance on the AC were also collected once the AC had been completed. The codes for pre-measure constructs and AC measures were then matched for subsequent analysis. Note that this procedure was repeated in the same manner for the sample described below.

Organisational Sample

The organisational sample for Study One was a repeat of the study described above for the military sample. As such, the measures and procedure were identical across both samples. The participants, assessors, and key aspects of the AC in the organisational sample are described below.

Participants

For Study One, data were collected from an AC that was already in existence and was being used for recruitment and selection purposes by a large retail company in Bayfair, Tauranga, New Zealand. Data were collected from the AC, which ran for one week, beginning on the 14th of August and ending on the 21st of August, 2001. AC ratings were collected from 87 potential recruits. Demographic information on this sample is presented in Table 6.

Table 6
Demographic Statistics, Candidates, Study One Organisational Sample (N = 87)

                                 Frequency     %
Gender
  Male                               21        24
  Female                             66        76
Ethnicity
  Caucasian                          54        62
  Asian                               6         7
  Maori                              18        21
  Other                               3         3
  Non Responders                      6         7
Age
  15-20                               8         9
  21-25                              14        16
  26-30                               5         6
  31-35                               9        10
  36-40                              10        11
  41-45                               7         8
  46-50                              12        14
  51-55                              13        15
  56-60                               7         8
  66-70                               1         1
  Non Responders                      1         1
Education
  No formal education                13        15
  School Certificate                 23        26
  Sixth Form Certificate             16        18
  Bursary                             2         2
  Bachelor's Degree                   5         6
  Higher University Degrees           3         3
  Other                              25        30

Assessors

Assessors included 17 managerial-level staff members of the retail organisation from various parts of New Zealand. Only partial demographic information was available from the assessor group due to non-response. Of the seven who responded, three were male, four were female and their mean age was 21.86 (SD = 2.80). All were located in Auckland, New Zealand. According to information subsequently obtained from the organisation, the non-responding assessors were older and more experienced than those who did respond to the demographic items. Also according to information obtained from the organisation, all assessing participants had previous experience in assessing ACs for the retail store under scrutiny. Only one of the assessors had previously received any training in psychology, having completed a Bachelor's degree. All participants had over two years' experience in their positions, and were regarded as subject matter experts of the position being assessed.

The AC

An external multi-national consulting company constructed the AC under scrutiny for the purposes of recruitment and selection. Rather than taking a bespoke approach, the consulting company that designed the AC selected 'off-the-shelf' competencies that were deemed relevant, and assessed these through 'off-the-shelf' exercises that were also deemed relevant for assessment.

AC Dimensions

Candidates were rated on the following 9 dimensions: Interpersonal Skills; Social Confidence/Assertiveness; Problem Solving/Decision Making; Decisiveness; Results Focused/Perseverance; Customer Focus; Team Player; Sales; Mentoring. One other dimension, called Numeracy, was also assessed through an external paper-and-pencil test and a single dichotomous pass or fail rating. Two other dimensions, named Availability and Personal Presentation, were again dichotomous items, which probed whether the individual was available to perform the position, and whether their personal presentation was up to standard, respectively. These last three dimensions were not included in the present study as they did not form part of the psychological assessment process of the AC, and these factors did not utilise multitrait-multimethod assessment methodology (see Figure 1). Different dimensions were assessed across exercises as outlined in the exercise competency matrix in Figure 1. Note that blackened areas in Figure 1 indicate where a dimension was not assessed. The following definitions were provided for the other dimensions in the AC:

Problem Solving/Decision Making: solves difficult problems with effective solutions; asks good questions and probes for answers; looks beyond the obvious; able to consider information from a variety of sources; exercises good judgement when making decisions; comes up with new and innovative ideas; sees the long-term impact of decisions; has good sound judgement about which creative ideas and suggestions will work; brings creative ideas of others to the fore.

Figure 1. Competency/Exercise Matrix for Study One, Organisational Sample.

Decisiveness: makes timely business decisions based on assessment of facts, assumptions and implications; makes timely decisions, sometimes with incomplete information and under tight time pressure; most solutions turn out to be correct and accurate when judged over time; has a bias for action.

Goal Orientation: can be counted on to reach goals successfully; very bottom line orientated; pushes self and others to achieve results; pursues goals with energy and drive; seldom gives up without finishing, especially in the face of setbacks; is resourceful and tenacious in finding an alternative means to reach a goal.

Interpersonal Skills: communicates well with all kinds of people internally and externally; builds appropriate rapport; builds constructive and effective relationships; uses diplomacy and tact; practices active listening; has the patience to hear people out; is easy to approach and talk to; puts others at ease; genuinely cares about others; is available and ready to help; acknowledges others' concerns; is co-operative; gains the trust and respect of peers; works with others, sharing tasks and accountabilities.

Social Confidence/Assertiveness: seeks out social situations and interacts confidently in group situations; can challenge others' views appropriately; comfortable sharing own perspective with managers and peers in a group situation; stands up for what he or she believes in and holds own ground, even in the face of opposition.

Team Player: invites input from each person and shares ownership and visibility; makes each individual feel as though their work is important; is someone people like working with; creates strong morale and spirit in the team; shares successes; fosters open dialogue; creates a feeling of belonging in the team; works co-operatively with others.

Customer Service: is dedicated to meeting the expectations and requirements of internal and external customers; gets first-hand customer information and uses it for improvements in products and services; talks and acts with customers in mind; establishes and maintains effective relationships with customers and gains their respect and trust.

Sales: understands and can describe the steps in the sales process; understands the importance of sales; acts with the customer in mind at all times.

Mentoring: first identified how much the subject knew; created a plan (not necessarily written) to use to develop the person; used an appropriate approach(es); identified follow-up action.

OARs: AC overall ratings constituted the average ratings across all of the dimensions assessed. The mechanical integration of ratings was used in congruence with the practice of the organisation whose AC was under study. Such methods of integration have been deemed acceptable according to the latest international guidelines for ACs (International Task Force on Assessment Center Guidelines, 2000). Each dimension was rated on the following scale: 1 (The person does not have the competency), 2 (Individual does not have the competency level required), 3 (Individual does not quite have the competency required), 4 (Individual has the required competency level), 5 (Level of competency is beyond that which the position requires).

AC Exercises

The following three exercises, designed by an external consulting company, were employed to assess the nine dimensions in the AC, along with their descriptions:

Egg Simulation Exercise: This exercise comprised a low-fidelity teamwork activity, where participants set about constructing a framework composed of certain stationery items (e.g., paper, paper-clips, a balloon, string).
The object of the activity was for the group to construct a framework that would allow an egg to be dropped from a height of approximately two metres onto a hard surface, such that the egg did not break.

Group Interview: This comprised a low-fidelity individual exercise, contextualised within a group setting. Each individual in a group was asked a series of questions to which they had to formulate an answer. Example questions included 'What is the most challenging thing you have done, and what did you learn from it?' and 'What is the most rewarding experience you have had in a team? Why was this rewarding and what made it different from other team experiences?'

Lost At Sea Simulation Exercise: This comprised a low-fidelity teamwork exercise, where participants were requested to imagine that they were adrift in a private yacht, irreparably damaged by a fire of unknown origin. The group were told that certain items had remained intact, and that, as a group, they were to rank these items in terms of their overall importance to survival.

Note that these were all low-fidelity simulation exercises; however, this is typical of many of the AC simulations offered by consulting companies worldwide (Muchinsky, 2000). This is also in agreement with the international guidelines for AC development, as these guidelines stipulate that the fidelity of simulation exercises may be relatively low if the centre is used for early identification and selection programs and for non-managerial personnel (International Task Force on Assessment Center Guidelines, 2000). The AC employed in Study Three also fitted both of these criteria.

Results

The data from two separate samples, one from the Royal New Zealand Air Force (RNZAF) section of the New Zealand military, the second taken from a large New Zealand based departmental retail chain, were explored as outlined below. The following statistical considerations were applied in the analysis.
Power analyses were conducted, followed by the calculation of relevant descriptive statistics, comprising means and standard deviations for each measure. Bivariate correlations between variables and internal consistencies for each measure were calculated. Multiple regression analyses were conducted to investigate the extent to which the composite of the variables under study (self-efficacy, self-monitoring and tacit knowledge) explained meaningful variance in OARs. In the military sample, the amount of variance associated with the full composite theoretically explained approximately 16% of the variance in OARs in the population. When correcting for validity shrinkage for generalisation across samples, the full composite explained approximately 1% of the variance in OARs. The strongest predictor in this composite was tacit knowledge, despite the fact that the tacit knowledge measure held low internal consistency in this sample. In the organisational sample, the results suggested that the composite measures explained very little variance in OARs (approximately 4% of population variance).

Military Sample

Practical problems occurred with the measurement of tacit knowledge in the military sample due to non-response on the tacit knowledge inventory. The sample was divided into two separate runs of the AC over two time periods. Initially, the participants were administered a version of the inventory that required the use of an expert sample from which deviation scores would be calculated. Unfortunately, the expert sample had such a high non-response rate that the questionnaire had to be abandoned. On the second run of the AC, the sample was administered a version of the questionnaire that did not require the use of an expert sample. Thus, the analyses will be divided into two sets. Set One will show the entire sample with the measure of tacit knowledge removed (N1 = 100).
Set Two will show the second run of the AC only, where the tacit knowledge inventory is included (N2 = 44).

Set One

Set One utilised the full military sample (N = 100). The total pool of people who applied for the group of positions was 116, thus 100 was a high response rate at roughly 86%. Data were imputed for missing values using EM (expectation maximisation), which employs an iterative process by which to estimate missing values. This method was recommended by Gold and Bentler (2000) for optimal data substitution, regardless of sample size, proportion of missing data and distributional characteristics. Imputations were required for 16% of OARs, 14% of specific self-efficacy ratings, 0.% of general self-efficacy ratings and 26% of self-monitoring ratings. In general, the response rates for the questionnaires were reasonably high, except perhaps for the self-monitoring ratings. After imputing the OARs with EM (as stated above), the two sets of ratings provided by the senior assessors were found to correlate at r = .93, p < .01. This suggests that the assessors were generally in agreement with one another on their derivation of OARs. Note that all power analyses in this study were conducted using GPOWER version 2.0 (Faul & Erdfelder, 1992). An a priori power analysis was performed for multiple regression analyses for three predictors. This analysis revealed that the number of cases in this study was more than the 77 cases necessary for a 2-tailed test at the .05 level of significance, at a power level of .80 for medium effect sizes. The current analysis therefore achieved acceptable power, contingent on obtaining medium-level effect sizes. Note that GPOWER converts the Cohen (1988) measure of effect size (f²), which is, by convention, set at 0.15 for a medium effect size, into an estimate of multiple R² medium effect size (see Murphy & Myors, 1998 for a summary on effect size conventions).
This principle applies to all studies within this chapter. Table 7 shows the means and standard deviations for the measures used in the study. Particularly with respect to the specific self-efficacy scale (SSE), small standard deviations could represent range restriction problems, as a small amount of systematic variance in scores can make it difficult for correlations to manifest. Table 8 shows the bi-variate correlations and internal consistencies for the measures used in the study. All of the internal consistency coefficients were within the limits suggested by Nunnally and Bernstein (1994, p. 252). Significant correlations were found between GSE and SSE (r = .48, p < .01). This was the strongest relationship with respect to magnitude. Similarly, it was again found that SSE was related to SM (r = .23, p < .05). The pattern of correlations and lack of significance between the OAR and the set of presumed predictors suggests no bi-variate relationship. As the strongest correlations were among the set of presumed predictors, the possibility of confounding exists, and therefore multivariate analysis was employed. The bi-variate correlations may indicate a violation of multicollinearity assumptions, particularly with respect to the relationship between GSE and SSE. The reader is cautioned that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

Table 7
Overall Means and Standard Deviations for Measures Employed in Set One of the Military Sample

Scale                                    M       SD
Overall Assessment Ratings (OAR)        2.32    0.94
AC Specific Self-Efficacy (SSE)         5.84    0.10
General Self-Efficacy (GSE)             6.10    1.11
Self-Monitoring (SM)                    4.46    1.00

Table 8
Bivariate Correlations Between Measures Employed in Set One of the Military Sample

Scale        1       2       3       4
1. OAR
2. SSE      .10    (.79)
3. GSE     -.07    .48**   (.76)
4. SM       .17    .23*    .13    (.73)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
ᵃ While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to two items that made up the overall rating. The reader is cautioned that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

For the reasons detailed earlier, taken together with the restrictive sample size of the present study, standard all-in regression was selected as the multivariate technique that would be most appropriate. The summary statistics in Table 9 display two indices of R² adjusted for validity shrinkage (Rosenthal & Rosnow, 1991). For a detailed account of these indices and their respective formulae, the reader is directed to Bobko (1990). The first index, labelled 'adjusted R²', estimates what would happen if the sample, in a given study, were to be the population in its entirety. The second index is labelled 'shrunken R²', and estimates how well a given model would predict in other future samples on average (i.e., in the population of samples) (Bobko, 1990). The adjusted R² suggested that if the sample were the population, the set of predictors theoretically accounted for 2.1% (ns) of the variance in the OARs (see Table 9) (Licht, 1995). The shrunken R² suggested that in other samples, the set of predictors would account for .02% (ns) of the variance in OARs. None of the predictors displayed significant partial relationships with the criterion.
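As a worked illustration of the first shrinkage index, the sketch below applies the widely used degrees-of-freedom adjustment, 1 - (1 - R²)(n - 1)/(n - k - 1), to the Set One figures (R² = .051, n = 100, k = 3 predictors). That this is the exact formula used in the thesis is an assumption; the formulae themselves are given in Bobko (1990). The function name `adjusted_r2` is hypothetical. The shrunken R² is not reproduced here because its formula is not stated in the text.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Common degrees-of-freedom adjustment of R^2 (assumed, not confirmed by the source).

    r2: sample squared multiple correlation
    n:  number of cases
    k:  number of predictors
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Set One of the military sample: R^2 = .051, N = 100, three predictors.
print(round(adjusted_r2(0.051, n=100, k=3), 4))  # 0.0213
```

Under this assumption the result agrees with the adjusted R² of .0213 reported in Table 9, that is, roughly 2.1% of the variance in OARs.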
Note that a post-hoc power analysis revealed that, for an R² = .051, a sample size of 207 would be needed to achieve power of .80 with three predictors in a regression model. In the case of this study, power was equal to 0.45, and thus the probability of making a type II error was .55 (Rosenthal & Rosnow, 1991). The present study, therefore, stacked the odds in favour of the null hypotheses, given the attenuated effect size and small sample size.

Table 9
Multiple Regression Analysis for the Prediction of OARs in Set One of the Military Sample

                 Partial Regression Weights
Predictors      Raw (B)     Standardised Beta     95% Confidence intervals
SSE              0.02             .14               -.02 < B < .07
GSE             -0.02            -.15               -.05 < B < .01
SM               0.03             .16               -.01 < B < .06
Intercept        1.17

Summary: R = .225 (ns), R² = .051, Adjusted R² = .0213, Shrunken R² = .0002

A residual analysis revealed no clear threat to homoscedasticity assumptions. Evidence assuaging multicollinearity concerns was found, with variance inflation factor (VIF) indices being less than 10 (maximum = 1.35, minimum = 1.05) (Chatterjee, Hadi & Price, 2000), and tolerance indices did not approach zero (minimum = 0.74, maximum = 0.95) (Tabachnick & Fidell, 1983). Additionally, eigenvalues did not differ greatly (maximum = near zero, minimum = near zero) (Belsley et al., 1980). The scores were not normally distributed, with a large cluster to the negative side and two clusters toward the centre of the distribution. This may reflect an overuse of central gradings. The residual plots also showed evidence of several outliers. Reconsideration of these distributional problems did not alter conclusions regarding the non-significant outcomes. Nonparametric significance tests yielded similar outcomes regarding the view of linear relations between all measures (see Table 10), and inspections of scatterplots revealed no reason to suspect curvilinear outcomes.
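The power analyses in this chapter rest on Cohen's f² effect size for multiple regression, which is related to R² by f² = R²/(1 - R²). The sketch below shows the conversion in both directions for the medium-effect convention (f² = 0.15) and for the observed R² of .051. The function names are hypothetical, and the sketch does not reproduce GPOWER's power or sample-size computations, only the effect-size conversion.

```python
def f2_from_r2(r2: float) -> float:
    """Cohen's f^2 for a multiple regression with squared multiple correlation r2."""
    return r2 / (1.0 - r2)


def r2_from_f2(f2: float) -> float:
    """Inverse conversion: the R^2 implied by a given f^2."""
    return f2 / (1.0 + f2)


# Medium effect by convention (f^2 = 0.15) expressed as R^2.
print(round(r2_from_f2(0.15), 4))   # 0.1304

# Effect size implied by the observed R^2 = .051 in Set One.
print(round(f2_from_r2(0.051), 4))  # 0.0537
```

On this conversion, the observed effect (f² of about 0.05) is roughly a third of the medium effect assumed in the a priori analysis, which is consistent with the larger sample size the post-hoc analysis calls for.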
Note that the bivariate correlations between the presumed predictors and OARs, corrected for attenuation due to unreliability^b (Schmidt & Hunter, 1996, p. 201), were as follows: SSE and OAR (r = .11, ns); GSE and OAR (r = -.08, ns); SM and OAR (r = .20, ns). The corrected correlations were considered to be similar to the uncorrected bivariate correlations, thus no further correctional analyses were conducted.

^b Bobko (2001) asserts that it is "customary to test the original, uncorrected Pearson r for statistical significance and then report corrected r as the best point estimate of the true relationship between the variables" (p. 82). This is because the t tests associated with Pearson's r assume that the sample-based r is computed. This principle is applied to all corrected correlations within this chapter.

Table 10
Spearman's Rho Between Measures Employed in Set One of the Military Sample

Scale        1          2         3         4
1. OAR      (.96)a
2. SSE       .08       (.79)
3. GSE      -.07        .48**    (.76)
4. SM        .13        .22*      .17      (.73)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to two items that made up the overall rating.
The reader is cautioned that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

Note that the corrected coefficients may have been higher than reported here, had a proper index of interrater reliability been available for OARs. Schmidt and Hunter (1996, p. 209) describe Cronbach's alpha as an estimate of intrarater reliability in contexts such as these.
Such measures tend to give higher estimates when compared to what might be expected from other indices of interrater reliability. Unfortunately, traditional intraclass correlation-based indices of interrater reliability (e.g., Shrout & Fleiss, 1979) could not be employed in this sample because specific information relating to the allocation of assessors was not provided by the organisation under study. The other samples in Study One were also afflicted with this potential limitation.

Set Two

Set Two used the same data as above, but selected only those cases that included the TKIM (n = 44). Bearing in mind that the total number of people who applied for the group of positions in this sample was 46, a sample of 44 was thought to constitute a high response rate at roughly 96%. Considerations regarding OARs were the same for Set Two as for Set One. An a priori power analysis was performed for multiple regression analyses with four predictors. This analysis revealed that the number of cases was less than the 85 cases necessary for a 2-tailed test at the .05 level of significance, at a power level of .80 for medium effect sizes. The current analysis therefore stacked the odds in favour of the null hypothesis, contingent on obtaining medium level effect sizes.

Table 11 shows the means and standard deviations for the measures used in the study. Restricted range may have been a problem, particularly for the OAR in this sample, which yielded a relatively small standard deviation. This may have restricted opportunities for correlations to manifest.

Table 11
Overall Means and Standard Deviations for Measures Employed in Set Two of the Military Sample

Scale                                                M        SD
OAR                                                 2.38     0.10
SSE                                                 5.77     1.16
GSE                                                 5.63     1.45
SM                                                  4.48     0.94
Tacit Knowledge Inventory for Managers (TKIM)       3.95     1.73

Table 12 shows the bivariate correlations and internal consistencies for the measures.
The internal consistency coefficient for SM bordered on the lower end of the limits suggested by Nunnally and Bernstein (1994, p. 252), and the coefficient for the TKIM was well below the suggested limits. The measurement of tacit knowledge in this portion of the study therefore lacked internal consistency. The difference in terms of internal consistency and bivariate correlation between the TKIM as measured in the organisational sample (discussed later) and the military sample was probably due to sampling error (at n = 44); however, it may be due to real differences between the samples. This is discussed further in the discussion section.

Table 12
Bivariate Correlations Between Measures Employed in Set Two of the Military Sample

Scale        1          2         3         4         5
1. OAR      (.95)a
2. SSE       .12       (.80)
3. GSE      -.04        .47**    (.73)
4. SM        .36*       .24      -.07      (.62)
5. TKIM      .36*      -.01       .04       .09      (.41)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to two items that made up the overall rating.
The reader is cautioned that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

The predictive validity coefficient relating to the TKIM, that is, the correlation between the presumed predictor TKIM and the DV OAR, is possible despite the low reported internal consistency. According to Bobko (2001), a rule of thumb concerning correlations and their relationship to internal consistency is that the predictive validity of a measure can be no greater than the square root of its reliability.
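The rule of thumb can be checked with a line of arithmetic. The sketch below takes the TKIM's Cronbach's alpha from Table 12 (.41) as the reliability estimate and compares the resulting ceiling with the observed TKIM and OAR correlation (.36). The function name is hypothetical, and treating alpha as the reliability in this bound is the same simplification the text makes.

```python
import math


def validity_ceiling(reliability: float) -> float:
    """Upper bound on a validity coefficient: the square root of the reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(reliability)


tkim_alpha = 0.41      # Cronbach's alpha for the TKIM, Table 12
observed_r = 0.36      # TKIM and OAR correlation, Table 12

ceiling = validity_ceiling(tkim_alpha)
print(f"ceiling = {ceiling:.2f}, observed r = {observed_r:.2f}")
print("within bound" if observed_r <= ceiling else "exceeds bound")
```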
The reported correlation between TKIM and OAR, at .36, is less than the square root of the internal consistency estimate of the TKIM's reliability (√.41 = .64). All conclusions from here on must be drawn bearing in mind the implications of low reliability in the measure of tacit knowledge. These include that it is questionable whether the measure was actually measuring a unitary concept, and it may be that its respective components were sufficiently unrelated to call their union into question, for this particular sample. Thus, the respective components of the TKIM did not appear to share sufficient dimensionality in this sample.

Given this caution, Table 12 showed significant correlations between GSE and SSE (r = .47, p < .05), as across all runs of this study. Of greater interest was the finding that SM (r = .36, p < .05) and TKIM (r = .36, p < .05) both displayed significant correlations with the DV, OAR. The reader is cautioned further that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05). The strongest bivariate relationship here was between two of the presumed predictors. On this occasion, however, two of the presumed predictors displayed relationships with the DV. This would suggest, again, the need for multivariate analysis. For this reason, and with respect to the restrictive sample size, standard all-in regression was selected as the appropriate technique.

Table 13
Multiple Regression Analysis for the Prediction of OARs in Set Two of the Military Sample

                        Partial Regression Weights
Predictor     Raw (B)     Standardised Beta     95% Confidence Interval
SSE            0.01         .01                 -.04 < B < .07
GSE           -0.01        -.07                 -.06 < B < .04
SM             0.06         .30                  .00 < B < .13
TKIM          -0.03         .33*                 .01 < B < .06
Intercept     -5.48

Summary: R = .489*, R² = .239, Adjusted R² = .1609, Shrunken R² = .0108
* p < .05 (2-tailed)

The adjusted R² suggested that the set of predictors in Table 13 theoretically accounted for 16% (p < .05) of the variance in OARs in the population. This result was, with respect to overall magnitude, seemingly stronger than the effects found in the previous example. The TKIM was the strongest predictor in this regard, which was notably intriguing, given its lack of internal consistency. The shrunken R² suggested that across different samples, the set of predictors would account for 1.08% (p < .05) of the variance in OARs. Note that a post-hoc power analysis revealed that, for an effect size of R² = 0.239, a sample size of 44 would be needed to achieve power of .80 with four predictors in a regression model. In the case of this study, power was equal to 0.81, and thus the probability of making a type II error was 0.19 (Rosenthal & Rosnow, 1991).

A residual analysis revealed no salient threat to homoscedasticity assumptions. Evidence assuaging multicollinearity concerns was found, with VIF indices being less than 10 (maximum = 1.43, minimum = 1.02) (Chatterjee et al., 2000), and tolerance indices that did not approach zero (minimum = 0.70, maximum = 0.99) (Tabachnick & Fidell, 1983). Additionally, eigenvalues did not differ greatly (maximum = near zero, minimum = near zero) (Belsley et al., 1980). The distribution of scores did not fit a perfect normal curve, with a large cluster of scores toward the positive end of the distribution. This may again reflect an overuse of mid to upper gradings. The residual plots also showed evidence of minimal outliers. Reconsideration of these distributional problems did not alter conclusions regarding the non-significant outcomes. Nonparametric significance tests yielded similar outcomes regarding the view of linear relations between all measures (see Table 14), and inspections of scatterplots revealed no reason to suspect curvilinear outcomes.
Table 14
Spearman's Rho Between Measures Employed in Set Two of the Military Sample

Scale        1          2         3         4         5
1. OAR      (.95)a
2. SSE       .13       (.80)
3. GSE      -.03        .51**    (.73)
4. SM        .25        .29       .03      (.62)
5. TKIM      .34*      -.08       .03      -.01      (.41)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to two items that made up the overall rating.
The reader is cautioned that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

The bivariate correlations between the presumed predictors and OARs, corrected for attenuation due to unreliability (Schmidt & Hunter, 1996, p. 201), were as follows: SSE and OAR (r = .14, ns); GSE and OAR (r = -.05, ns); SM and OAR (r = .47, p < .05); TKIM and OAR (r = .66, p < .05). The correlations between OARs and SM and between OARs and TKIM were stronger than the corresponding uncorrected bivariate correlations. This was because the corresponding measures, particularly the TKIM, were afflicted with low internal consistency. As previously mentioned, the corrected coefficients may have been higher than reported here, had a proper index of interrater reliability been available for OARs (Schmidt & Hunter, 1996, p. 209).

Supplementary Analysis for Set Two of the Military Sample

Given the enlarged bivariate correlations observed when correcting for unreliability in Set Two, it was decided that, as a supplementary analysis, problematic items would be removed from the TKIM in order to improve its internal consistency, whilst preserving its construct domain coverage.
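The correction for attenuation referred to above can be sketched as follows. The code assumes the classical double correction, in which the observed correlation is divided by the square root of the product of the two reliabilities, with Cronbach's alpha standing in for reliability. That this is the exact variant applied in the thesis is an assumption; it is used here because, with the Table 12 values, it reproduces the corrected coefficients reported for SSE, GSE and SM. The function name is hypothetical.

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical correction: r_xy / sqrt(rel_x * rel_y)."""
    if rel_x <= 0 or rel_y <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(rel_x * rel_y)


oar_alpha = 0.95  # Cronbach's alpha for the OAR, Table 12

# (observed r with OAR, predictor alpha), taken from Table 12
observed = {
    "SSE": (0.12, 0.80),
    "GSE": (-0.04, 0.73),
    "SM": (0.36, 0.62),
}

corrected = {
    name: correct_for_attenuation(r, alpha, oar_alpha)
    for name, (r, alpha) in observed.items()
}

for name, value in corrected.items():
    print(f"{name}: corrected r = {value:.2f}")
```

The lower a measure's reliability, the larger the correction, which is why the SM coefficient moves further from its observed value than the SSE coefficient does.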
Murphy and Davidshofer (2001) suggest that test items should be representative of the domain of attributes being measured. As mentioned in the method section, the TKIM covers tacit knowledge relating to managing career, self, other people, and 'other' discriminating items. Item analyses revealed several negative item-total correlations in the data for Set Two. Negative item-total correlations indicate divergence between particular items and test scores or, of course, the possibility of encoding errors or the possible need for reverse coding, et cetera (Murphy & Davidshofer, 2001). Items with negative or low item-total correlations were removed selectively to maintain the theoretical framework of the TKIM to the greatest degree possible.

In the original scale, fifteen items related to managing career, sixteen related to managing self, four related to managing other people and four were classified as 'other'. In order to assist in maintaining construct domain coverage, items were not removed from the managing other people and the 'other' scales. From the managing career scale, four items with negative item-total correlations were removed (most divergent item-total correlation = -.42, least divergent item-total correlation = -.11). From the managing self scale, nine items with negative or low item-total correlations were removed (most divergent item-total correlation = -.21, least divergent item-total correlation = .08). Of the thirteen items removed in this military sample, nine also loaded negatively in the organisational sample (described later) in Study One.

The analyses for Set Two of the military sample in Study One were repeated with the altered version of the TKIM. The grand mean for the revised TKIM was 3.82 (SD = 1.71). This lack of variation may have led to problems related to range restriction, and thus may have restricted the extent to which correlations manifested in this sample.
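The item analysis described above rests on two quantities: Cronbach's alpha for the scale, and each item's correlation with the total of the remaining items. The following minimal sketch computes both. The response matrix is hypothetical (six respondents, four items, the fourth item deliberately running against the others) and is not TKIM data; the function names are likewise illustrative, and the use of the corrected (item-rest) form of the item-total correlation is an assumption.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(rows):
    """Alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def item_rest_correlations(rows):
    """Correlate each item with the sum of the other items (corrected item-total)."""
    k = len(rows[0])
    return [
        pearson([r[j] for r in rows], [sum(r) - r[j] for r in rows])
        for j in range(k)
    ]


# Hypothetical responses: rows are respondents, columns are items.
responses = [
    [1, 2, 1, 5],
    [2, 2, 3, 4],
    [3, 4, 3, 3],
    [4, 4, 5, 2],
    [5, 5, 4, 1],
    [2, 3, 2, 4],
]

alpha_before = cronbach_alpha(responses)
item_rest = item_rest_correlations(responses)

# Drop items whose item-rest correlation is negative, then recompute alpha.
keep = [j for j, r in enumerate(item_rest) if r >= 0]
trimmed = [[row[j] for j in keep] for row in responses]
alpha_after = cronbach_alpha(trimmed)

print("item-rest correlations:", [round(r, 2) for r in item_rest])
print(f"alpha before = {alpha_before:.2f}, after = {alpha_after:.2f}")
```

In practice a negative item-rest correlation would first prompt a check for reverse coding or data entry errors, as the text notes, before an item is removed.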
Table 15 shows the bivariate correlations and internal consistencies for the measures employed. Note that the correlations are identical to those displayed in Table 12, except that, most notably, the relationship between TKIM and OAR increased from .36 (p < .05) to .42 (p < .01) in this supplementary analysis. This correlation was between a presumed predictor and the DV, OAR. Correlations were observed between the set of presumed predictors, and thus the possibility of confounding existed. The reader is cautioned, as with the initial analysis of Set One, that the bivariate correlations may indicate a multicollinearity problem, particularly with respect to the relationship between GSE and SSE. The reader is cautioned that the correlations between GSE and SSE, and between TKIM and OAR, manifest as statistically significant outcomes after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

Table 15
Bivariate Correlations Between Measures Employed in Supplementary Set Two of the Military Sample

Scale        1          2         3         4         5
1. OAR      (.95)a
2. SSE       .12       (.80)
3. GSE      -.04        .47**    (.73)
4. SM        .36*       .24      -.07      (.62)
5. TKIM      .42**      .08       .17       .02      (.73)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to two items that made up the overall rating.
The reader is cautioned that only the correlations between GSE and SSE and between TKIM and OAR manifest as statistically significant outcomes after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).
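The Bonferroni adjustment invoked in these cautions divides the study-wise alpha by the number of tests in the family. The sketch below assumes that the family consists of every pairwise correlation among the measures in a given table; the thesis does not state the number of comparisons, so that choice, and the function name, are assumptions made for illustration.

```python
from math import comb


def bonferroni_threshold(n_measures: int, familywise_alpha: float = 0.05):
    """Return (number of pairwise tests, per-test alpha) for a correlation matrix."""
    n_tests = comb(n_measures, 2)
    return n_tests, familywise_alpha / n_tests


# Four measures (OAR, SSE, GSE, SM) versus five measures (adding the TKIM).
for n_measures in (4, 5):
    n_tests, per_test = bonferroni_threshold(n_measures)
    print(f"{n_measures} measures: {n_tests} tests, per-test alpha = {per_test:.4f}")
```

Under this assumption a correlation flagged at p < .05 in a five-measure table would need to reach roughly p < .005 to remain significant, which is consistent with only the strongest correlations surviving the adjustment.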
Note that the internal consistency coefficients were identical to those displayed in Table 12, except that the revised TKIM scale yielded a Cronbach's alpha of .73, which was within the limits suggested by Nunnally and Bernstein (1994, p. 252).

For the same reasons outlined in Set One, Study One, standard all-in regression was selected as the most appropriate multivariate technique. The summary statistics in Table 16 display two indices of R² adjusted for validity shrinkage (see Set One, Study One for a brief description). The adjusted R² suggested that the set of predictors in Table 16 theoretically accounted for 23% (p < .01) of the variance in OARs in the population. This result was, with respect to overall magnitude, seemingly stronger than the effects found in Set One. Thus, it is likely that the lack of reliability in the TKIM attenuated potential relationships with OARs. The TKIM was the strongest predictor in this model, coupled with SM, which also reached significance as a single predictor. The shrunken R² suggested that in different samples, the set of predictors would account for 2.84% (p < .01) of the variance in OARs. Note that a post-hoc power analysis revealed that, for an effect size of R² = 0.302, a sample size of 33 would be needed to achieve power of .80 with four predictors in a regression model. In the case of this study, power was equal to 0.93, and thus the probability of making a type II error was 0.07 (Rosenthal & Rosnow, 1991).

Table 16
Multiple Regression Analysis for the Prediction of OARs in Supplementary Set Two of the Military Sample

                        Partial Regression Weights
Predictor     Raw (B)     Standardised Beta     95% Confidence Interval
SSE            0.01         .06                 -.04 < B < .06
GSE           -0.02        -.12                 -.07 < B < .03
SM             0.07         .32*                 .01 < B < .13
TKIM           0.03         .42**                .01 < B < .05
Intercept     -8.52

Summary: R = .550**, R² = .302, Adjusted R² = .2304, Shrunken R² = .0284
* p < .05 (2-tailed); ** p < .01 (2-tailed)
A residual analysis revealed no obvious threat to homoscedasticity assumptions. Evidence assuaging multicollinearity concerns was found, with VIF indices being less than 10 (maximum = 1.42, minimum = 1.03) (Chatterjee et al., 2000), and tolerance indices that did not approach zero (minimum = 0.70, maximum = 0.97) (Tabachnick & Fidell, 1983). Additionally, eigenvalues did not differ greatly (maximum = .02, minimum = near zero) (Belsley et al., 1980). The distribution of scores did not fit a perfect normal curve, with a large cluster of scores toward the positive end of the distribution. This may again reflect an overuse of mid to upper gradings. The residual plots also showed evidence of minimal outliers. Reconsideration of these distributional problems did not alter conclusions regarding the non-significant outcomes. Nonparametric significance tests yielded similar outcomes regarding the view of linear relations between all measures (see Table 17), and inspections of scatterplots revealed no reason to suspect curvilinear outcomes.

Table 17
Spearman's Rho Between Measures Employed in Supplementary Set Two of the Military Sample

Scale        1          2         3         4         5
1. OAR      (.95)a
2. SSE       .13       (.80)
3. GSE      -.03        .51**    (.73)
4. SM        .25        .29       .03      (.62)
5. TKIM      .41**     -.00       .16      -.04      (.73)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to two items that made up the overall rating.
The reader is cautioned that only the correlations between GSE and SSE and between TKIM and OAR manifest as statistically significant outcomes after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).
The considerations for correcting bivariate correlations for attenuation due to unreliability (Schmidt & Hunter, 1996, p. 201) between the presumed predictors and OARs were the same as in Set One. The exception to this was the revised version of the TKIM, which yielded the following corrected relationship: TKIM and OAR (r = .50, p < .01).

Organisational Sample

Missing data for the 87 respondents in Study One were imputed using EM. Data were imputed for 17% of assessment ratings. This reflected that 17% of the OARs were not completed by the assessor group. Of the remaining measures, missing data were evident for 1% of self-monitoring ratings, 5% of tacit knowledge ratings and 0% of both specific self-efficacy and general self-efficacy ratings.

Response rates from the individuals who participated in the AC were relatively low. The human resource department of the department store under study reported that 429 individuals participated in the assessment process. Only eighty-seven of these individuals opted to participate in the present study, a fairly low participation rate of approximately 20%. This may have been influenced to some degree by the length of the questionnaires in the study, particularly the TKIM. Caution must therefore be exercised with respect to non-response bias considerations in the present study.

An a priori power analysis was performed for multiple regression analyses with four predictors. This analysis revealed that the number of cases in this study was near the 85 cases necessary for a 2-tailed test at the .05 level of significance, at a power level of .80 for medium effect sizes. The current analysis therefore achieved acceptable power, contingent on obtaining medium level effect sizes.
To investigate the possibility that those who did respond were a self-selected sample, and were not typical of the group as a whole, a z statistic was calculated to determine whether there was any difference between the OARs of the individuals who participated in Study One and those of the entire population from which the sample was drawn. The aggregated mean scores were provided by the company under study to eliminate any issues associated with anonymity. A 2-tailed z test failed to reject the null hypothesis that the sample and population means were equivalent, z(87, 429) = .07, ns. This provides some evidence, with respect to OARs, that non-response bias was not an issue. However, there was no possible control, in this regard, for the measures of self-efficacy, self-monitoring and tacit knowledge that were assessed in Study One, which may have been afflicted by non-response problems.

Table 18 shows the means and standard deviations for the measures used in the study. Particularly with respect to the OARs, small standard deviations could represent range restriction problems, as a small amount of variance in scores may not allow much opportunity for correlations to manifest. Table 19 shows the bivariate correlations and internal consistencies for the same measures. All of the internal consistency coefficients were within the limits suggested by Nunnally and Bernstein (1994, p. 252).

Table 18
Overall Means and Standard Deviations for Measures Employed in the Organisational Sample

Scale       M        SD
OAR        3.08     0.45
SSE        6.20     1.04
GSE        6.18     1.34
SM         4.92     1.16
TKIM       4.06     1.96

Table 19
Bivariate Correlations Between Measures Employed in the Organisational Sample

Scale        1          2         3         4         5
1. OAR      (.79)a
2. SSE      -.02       (.84)
3. GSE       .15        .53**    (.71)
4. SM       -.15        .24*      .39**    (.81)
5. TKIM      .05       -.09       .05      -.11      (.72)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to items that made up the overall rating.
The reader is cautioned that only the correlations between GSE and SSE, and between SM and GSE, manifest as statistically significant outcomes after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

Significant bivariate correlations were found within the variables. In particular, positive correlations were found between general self-efficacy and specific self-efficacy (r = .53, p < .01). This relationship was the strongest with respect to overall magnitude, which could easily be expected of two measures of self-efficacy. Self-monitoring correlated with specific self-efficacy (r = .24, p < .05) and general self-efficacy (r = .39, p < .01). The reader is cautioned that only the correlations between GSE and SSE, and between SM and GSE, manifest as statistically significant outcomes after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05). None of the predictors approached the conventional limits of significance when viewing the correlations between the set of presumed predictors and the DV, OAR. The overall pattern of the presumed predictors, however, suggests the possibility of confounding, and therefore the need for multivariate analysis.

The current research question aimed to investigate the relationship (if any) between the set of presumed predictors and OARs. The directionality of this relationship was ostensibly controlled in a temporal manner, by having participants complete questionnaires prior to the assessment process.
The theory and research question did not suggest nor assume any causal relations outside of this temporal ordering. Neither did the study aim to consider subsets of variables separately. Given this, and the restrictive sample size of the study, standard all-in regression was selected as the most appropriate method by which to investigate these relationships.

The adjusted R² in Table 20 indicated that the predictors in the model theoretically accounted for 4% (ns) of the variance in OARs in the population. The shrunken R² suggested that in different samples, the set of predictors would account for .04% (ns) of the variance in OARs. Note that a post-hoc power analysis revealed that, for R² = 0.085, a sample size of 134 would be needed to achieve power of .80 with four predictors in a regression model. In the case of this study, power was equal to 0.58, and thus the probability of making a type II error was 0.42 (Rosenthal & Rosnow, 1991). The effect found in this study was non-significant, although this study was afflicted with low statistical power associated with small sample sizes.

Table 20
Multiple Regression Analysis for the Prediction of OARs in the Organisational Sample

                        Partial Regression Weights
Predictor     Raw (B)     Standardised Beta     95% Confidence Interval
SSE           -0.01        -.13                 -.04 < B < .01
GSE            0.02         .31*                 .00 < B < .03
SM            -0.01        -.24*                -.03 < B < -.00
TKIM          -0.00        -.02                 -.01 < B < .01
Intercept      3.06

Summary: R = .291 (ns), R² = .085, Adjusted R² = .0404, Shrunken R² = .0004
* p < .05 (2-tailed)

The standardised partial regression weights suggest that GSE was the strongest predictor when applying this combination of measures, followed by SM, contrary to expectations, in a negative direction (although the sign of beta coefficients, of course, can be influenced by the modeller's choice of predictor combinations). None of the other predictors in the model displayed significant relationships with the criterion.
A residual analysis revealed no clear threat to homoscedasticity assumptions. Evidence assuaging multicollinearity concerns was found, with VIF indices being less than 10 (maximum = 1.58, minimum = 1.04) (Chatterjee et al., 2000), and tolerance indices that did not approach zero (minimum = 0.64, maximum = 0.96) (Tabachnick & Fidell, 1983). Additionally, eigenvalues did not differ substantially (maximum = 0.01, minimum = near zero) (Belsley et al., 1980). The scores were not normally distributed, with residual plots displaying a large clustering of scores toward the centre of the distribution. This may have reflected an overuse of central gradings. However, the significance tests used in multiple regression analyses are reasonably robust against violations of the normality assumption (Bobko, 2001). Reconsideration of these distributional problems did not alter conclusions regarding the non-significant outcomes. Nonparametric significance tests yielded similar outcomes regarding the view of linear relations between all measures (see Table 21), and inspections of scatterplots revealed no reason to suspect curvilinear outcomes.

Table 21
Spearman's Rho Between Measures Employed in the Organisational Sample

Scale        1          2         3         4         5
1. OAR      (.79)a
2. SSE       .02       (.84)
3. GSE       .23*       .47**    (.71)
4. SM       -.08        .27*      .28**    (.81)
5. TKIM      .08       -.11       .06      -.05      (.72)

* p < .05; ** p < .01 (2-tailed)
Cronbach's alpha is provided in parentheses.
a While Cronbach's alpha is not a measure of inter-rater reliability, it can be used as an estimate of intra-rater reliability (Schmidt & Hunter, 1996). Information on the specific allocation of raters was not made available. The alpha provided for OAR reflects internal consistency with respect to items that made up the overall rating.
The reader is cautioned that only the correlation between GSE and SSE manifests as a statistically significant outcome after the appropriate Bonferroni adjustments are applied, so as to maintain study-wise type I error risk (at p < .05).

The bivariate correlations between the presumed predictors and OARs, corrected for attenuation due to unreliability (Schmidt & Hunter, 1996, p. 201), were as follows: SSE and OAR (r = -.02, ns); GSE and OAR (r = .02, ns); SM and OAR (r = -.19, ns); TKIM and OAR (r = .07, ns). These were considered to be similar to the uncorrected bivariate correlations, thus no further correctional analyses were conducted. As previously mentioned, the corrected coefficients may have been higher than reported here, had a proper index of interrater reliability been available for OARs (Schmidt & Hunter, 1996, p. 209).

Discussion

Study One sought to investigate the extent to which a specific set of traits that were not formally assessed in the AC context explained meaningful variance in OARs. The set of constructs included domain specific (SSE) and general self-efficacy (GSE), self-monitoring (SM) and managerial tacit knowledge (TKIM), as suggested by Klimoski and Brickner (1987), Chan (1996) and Arthur et al. (2000). Overall, the results of Study One showed differential outcomes across the two samples employed in the study. In Set Two of the military sample, the full combination of these constructs explained slightly more variance in OARs in the population than in Set One of the military sample and the organisational sample. However, the only significant contributor in the model was tacit knowledge. Furthermore, its manifesting measure in this analysis, the TKIM, exhibited poor internal consistency in this sample. Also, when correcting for validity shrinkage when generalising across samples, the combination of the constructs measured in Set Two explained very little variance in OARs. In the organisational sample, the combination of these constructs explained very little variance in OARs.
Military Sample

Due to practical difficulties with respect to the TKIM, the military sample needed to be split into two sub-sets. Set One examined the full sample of 100 participants, and the relationship between the predictors SSE, GSE and SM with the criterion, OARs. At the bivariate level, no correlations of any magnitude or significance were seen between the

Figure 2. Variance Components and Confidence Intervals for Each Effect and Interaction in the Task-Specific AC. (Horizontal axis, Effect: p, x, i:x, px, pi:x,e.)

Figure 3. Variance Components and Confidence Intervals for Each Effect and Interaction in the Dimension-Specific AC. (Horizontal axis, Effect: p, x, d, px, pd, xd, pxd,e.)

compared with one another). An absolute decision is one in which a certain cut-off criterion is employed (e.g., a pass or fail criterion for employment decisions). G theory provides two coefficients for the purposes of relative and absolute decisions that are analogous to reliability coefficients in classical test theory. Tables 35 and 36 provide the equations and calculations, for both types of AC, of \sigma^2_{Rel} (relative error; all of the effects in the G study that contribute variance to relative decisions), \sigma^2_{Abs} (absolute error; all of the effects in the G study that contribute variance to absolute decisions), E\rho^2_{Rel} (the Generalizability or G coefficient; for relative decisions), and \Phi (Phi coefficient; for absolute decisions). Tables 35 and 36 also provide equation ICC(1,1) from Shrout and Fleiss (1979) as an estimate of interrater reliability across the two types of AC.

Table 35
Relative and Absolute Error, Generalizability and Phi Coefficients and Interrater Reliability for the Balanced Task-Specific AC

Index                                                                     Result
\sigma^2_{Rel} = \sigma^2_{px}/n_x + \sigma^2_{pi:x,e}/(n_{i:x} n_x)      0.20
\sigma^2_{Abs}                                                            0.22
E\rho^2_{Rel}                                                             0.72
\Phi = \sigma^2_p / (\sigma^2_p + \sigma^2_{Abs})                         0.70
ICC(1,1) = (BMS - WMS) / (BMS + (k - 1)WMS)                               0.93

Table 36
Relative and Absolute Error, Generalizability and Phi Coefficients and Interrater Reliability for the Dimension-Specific AC

Index                                                                     Result
\sigma^2_{Rel}                                                            0.24
\sigma^2_{Abs}                                                            0.27
E\rho^2_{Rel}                                                             0.71
\Phi = \sigma^2_p / (\sigma^2_p + \sigma^2_{Abs})                         0.68
ICC(1,1) = (BMS - WMS) / (BMS + (k - 1)WMS)                               0.82

Table 37 presents the variance component estimates for the full, unbalanced task-specific approach with item three of exercise one included in the analysis. As expected, the pattern of findings was similar to that in Table 34 for the task-specific approach. Thus, it appears unlikely that item three of exercise one contributed much variance to the scores, and it could safely be regarded as a redundant source of variation.

Table 37
Generalizability Study Showing the Results of the Unbalanced Task-Specific AC for the Organisational Sample

Effect           df      VC        90% Confidence Intervals   Explained Variance (%)
p (persons)      186     0.51073   *                          26.8
x (exercises)    2       0.05125                              2.7
i (items):x      37      0.11758                              6.2
px               372     0.58488                              30.7
pi:x,e           6882    0.63804                              33.5

* Confidence intervals were not provided for the task-specific procedure in this case, because the specification of a confidence interval for an unbalanced design is inappropriate (Brennan, 2001b).

Factor Analysis

In the tradition of several other studies on AC ratings, including the seminal paper by Sackett and Dreher (1982), a factor analysis was employed to evaluate the measurement models presented in the task-specific and the dimension-specific ACs (i.e., to provide what might be viewed as a more traditional perspective on the same data). SPSS ver. 11 was employed to produce communalities and factor loadings for both types of AC. The same raw data as in the G study were used as input for the factor analysis. Note that the full unbalanced data set was used for the task-specific AC.
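The ICC(1,1) index reported in Tables 35 and 36 can be illustrated with a short sketch. The function below computes the between-targets and within-targets mean squares (BMS and WMS) from a one-way layout and then applies the equation shown in the tables. The ratings are illustrative values only and are not data from this study; the function name and the layout of the input (one row per ratee, one column per rating) are assumptions made for the example.

```python
def icc_1_1(ratings):
    """One-way random-effects ICC(1,1) for a table of targets (rows) by ratings (columns)."""
    n = len(ratings)       # number of targets (e.g., candidates)
    k = len(ratings[0])    # number of ratings per target
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-targets mean square (BMS)
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-targets mean square (WMS)
    wms = sum((x - m) ** 2
              for row, m in zip(ratings, row_means)
              for x in row) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)


# Illustrative ratings only (hypothetical, not taken from the thesis data).
illustrative_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

print(round(icc_1_1(illustrative_ratings), 3))
```

Because ICC(1,1) treats every source of disagreement as error, it tends to be a conservative single-rating estimate; that property is a feature of the index itself rather than of this sketch.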
Principal axis factoring was employed as the method of extraction. In principal axis factoring, communality estimates are derived through an iterative procedure, using squared multiple correlations of each variable with all other variables as the starting point. The goal of principal axis factoring is to extract maximum orthogonal variance from the data with the extraction of each successive factor (Tabachnick & Fidell, 1983). Principal axis factoring is widely employed, and was also used in Sackett and Dreher's original study.

The present study used .40 as a criterion for an admissible factor loading, in congruence with Comrey and Lee (1992), who suggest that factor loadings of .45 upwards are fair indicators of the overlap between a variable and a factor. All factor loadings are displayed in the analyses, however, as different researchers set different criteria for acceptable factor loadings (Tabachnick & Fidell, 1983).

Varimax rotation was employed to encourage simple structure in the ratings, to assist comparison with seminal pieces on this topic (e.g., Sackett & Dreher, 1982) and for ease of interpretation. As practitioners ideally strive for simple factor structure as the basis for AC ratings, varimax was seen as most appropriate. Direct oblimin was also employed as a rotational method across both the task-specific and the dimension-specific models to allow for correlation among factors, and for comparison purposes.

Varimax Rotation

Three factors were extracted for the task-specific AC. This is because under the behavioural task-specific paradigm, each of the three exercises in the AC was viewed as a stand-alone work sample of behaviour. Table 38 shows the results for the task-specific AC.

Table 38
Rotated Factor Matrix for the Task-Specific AC Ratings

                       Factor Loadings
Exercise    Item     1       2       3      Communality    M       SD
Approach     1      .24     .01     .80        .72        4.29    1.17
Approach     2      .20     .18     .71        .58        4.14    1.35
Approach     3      .09     .13     .70        .52        3.36    1.46
Approach     4      .13     .17     .76        .63        3.36    1.37
Approach     5      .19     .14     .71        .57        4.28    1.32
Approach     6      .12     .20     .58        .39        4.18    1.12
Approach     7      .18     .10     .68        .50        4.27    1.24
Approach     8      .24     .10     .79        .69        3.71    1.39
Approach     9      .14     .15     .78        .65        4.22    1.26
Approach    10      .14     .12     .74        .57        3.58    1.44
Approach    11      .32     .13     .74        .67        4.07    1.34
Approach    12      .29     .14     .77        .69        3.89    1.31
Approach    13      .12     .11     .71        .53        4.43    1.19
Approach    14      .09     .05     .64        .42        4.56    1.25
Closing     15      .28     .75     .21        .69        3.89    1.30
Closing     16      .31     .78     .25        .76        3.91    1.37
Closing     17      .18     .79     .26        .72        4.01    1.38
Closing     18      .33     .75     .19        .70        3.78    1.35
Closing     19      .16     .79     .17        .68        4.21    1.29
Closing     20      .24     .80     .17        .72        3.72    1.36
Closing     21      .22     .75     .13        .63        4.20    1.31
Closing     22      .22     .78     .19        .68        3.90    1.32
Closing     23      .18     .71     .10        .54        4.39    1.27
Closing     24      .10     .78     .06        .62        4.38    1.16
Closing     25      .17     .68     .03        .49        4.33    1.08
Closing     26      .07     .71     .08        .52        4.67    1.09
Closing     27      .10     .83     .08        .70        4.39    1.17
Returns     28      .85     .23     .17        .80        3.50    1.35
Returns     29      .87     .16     .11        .80        3.35    1.34
Returns     30      .81     .15     .13        .69        3.51    1.40
Returns     31      .82     .22     .22        .77        3.75    1.35
Returns     32      .79     .20     .18        .70        3.65    1.40
Returns     33      .84     .18     .23        .80        3.42    1.36
Returns     34      .79     .19     .23        .72        3.02    1.39
Returns     35      .72     .26     .20        .63        3.90    1.33
Returns     36      .77     .25     .18        .69        3.35    1.48
Returns     37      .79     .23     .24        .74        3.82    1.43
Returns     38      .78     .24     .25        .73        3.62    1.40
Returns     39      .66     .19     .23        .52        4.15    1.28
Returns     40      .67     .12     .24        .53        4.43    1.30
Eigenvalue                  9.09    8.33    8.26
% of variance explained    22.73   20.84   20.64

In congruence with the results of the G study, clear loadings on exercises were found for the task-specific AC. Table 39 shows the results for the dimension-specific AC.
Five factors were extracted for the dimension-specific AC, because five dimensions were included in the assessment. Relatively clear factor loadings on exercises (that is, three exercise factors) were evident, in congruence with the G study perspective on these same data.

Table 39
Rotated Factor Matrix for the Dimension-Specific AC Ratings

                                         Factor Loadings
Dimension          Exercise      1      2      3      4      5     Communality    M       SD
Comprehension      Approach     .28    .15    .63    .58    .06       .83        4.20    1.15
Oral Expression    Approach     .32    .17    .72    .27   -.03       .73        3.95    1.32
Tolerance          Approach     .10    .18    .80    .11    .10       .70        4.24    1.21
Teamwork           Approach     .19    .21    .90   -.17    .00       .91        3.80    1.30
Customer Focus     Approach     .20    .20    .77    .03   -.07       .66        3.91    1.35
Comprehension      Closing      .14    .75    .10    .18    .39       .78        4.38    1.13
Oral Expression    Closing      .21    .78    .17    .01    .05       .70        4.07    1.30
Tolerance          Closing      .17    .79    .15    .01    .01       .68        4.18    1.25
Teamwork           Closing      .29    .86    .18    .04   -.20       .89        3.88    1.34
Customer Focus     Closing      .23    .81    .22    .00   -.05       .75        3.89    1.32
Comprehension      Returns      .77    .20    .20    .10   -.11       .70        3.80    1.31
Oral Expression    Returns      .81    .24    .20    .07   -.08       .76        3.65    1.43
Tolerance          Returns      .78    .17    .18    .02    .20       .71        3.88    1.36
Teamwork           Returns      .89    .20    .21    .03    .06       .87        3.41    1.38
Customer Focus     Returns      .84    .22    .15    .03   -.02       .78        3.40    1.37
Eigenvalue                     3.85   3.56   3.27    .50    .28
% of variance explained       25.69  23.76  21.81   3.36   1.86

Direct Oblimin Rotation

In order to determine the appropriate delta parameter for direct oblimin rotation, PsWin ver. 2.0.1 was utilised (Barrett, 1996). PsWin has the capability to assess the delta parameter that will maximise simple structure in a direct oblimin rotation. In both the task-specific and the dimension-specific data, a delta value of zero was assessed as optimal. For the task-specific AC, three factors were extracted. For the dimension-specific AC, five factors were extracted.
Table 40 shows the direct oblimin results for the task-specific AC. Table 41 shows the direct oblimin results for the dimension-specific AC. Five factors were extracted for the dimension-specific AC as five dimensions were included in the assessment. Relatively clear factor loadings on exercises were obtained for the task-specific approach and, although less clear than under the varimax rotation, factor loadings tended to fall onto exercises in the dimension-specific approach also. The exceptions to clear exercise effects are summarised in the following. Oral expression tended to bleed across factors four and five in the approach exercise, and comprehension tended to remain in factor four for this exercise. Comprehension bled across factors one and two in the closing exercise.

Table 40
Rotated Pattern Matrix for the Task-Specific AC Ratings

                       Factor Loadings
Exercise    Item     1       2       3      Communality    M       SD
Approach     1      .82     .08    -.10        .72        4.29    1.17
Approach     2      .72    -.06    -.04        .58        4.14    1.35
Approach     3      .75    -.03     .08        .52        3.36    1.46
Approach     4      .80    -.06     .05        .63        3.36    1.37
Approach     5      .73    -.01    -.03        .57        4.28    1.32
Approach     6      .59    -.12     .04        .39        4.18    1.12
Approach     7      .70    -.01    -.03        .50        4.27    1.24
Approach     8      .81    -.05    -.08        .69        3.71    1.39
Approach     9      .81    -.03     .04        .65        4.22    1.26
Approach    10      .77    -.00     .02        .57        3.58    1.44
Approach    11      .73     .02    -.17        .67        4.07    1.34
Approach    12      .77     .02    -.13        .69        3.89    1.31
Approach    13      .75    -.00     .04        .53        4.43    1.19
Approach    14      .69     .05     .05        .42        4.56    1.25
Closing     15      .07    -.74    -.12        .69        3.89    1.30
Closing     16      .10    -.76    -.13        .76        3.91    1.37
Closing     17      .14    -.80     .03        .72        4.01    1.38
Closing     18      .03    -.73    -.18        .70        3.78    1.35
Closing     19      .05    -.82     .03        .68        4.21    1.29
Closing     20      .03    -.81    -.06        .72        3.72    1.36
Closing     21     -.00    -.77    -.06        .63        4.20    1.31
Closing     22      .10    -.79    -.04        .68        3.90    1.32
Closing     23     -.03    -.74    -.02        .54        4.39    1.27
Closing     24     -.05    -.83     .07        .62        4.38    1.16
Closing     25     -.09    -.71    -.04        .49        4.33    1.08
Closing     26     -.02    -.76     .09        .52        4.67    1.09
Closing     27     -.04    -.88     .08        .70        4.39    1.17
Returns     28     -.05    -.03    -.90        .80        3.50    1.35
Returns     29     -.11     .03    -.95        .80        3.35    1.34
Returns     30     -.07     .03    -.87        .69        3.51    1.40
Returns     31      .02    -.02    -.85        .77        3.75    1.35
Returns     32     -.01    -.02    -.83        .70        3.65    1.40
Returns     33      .03     .02    -.89        .80        3.42    1.36
Returns     34      .04    -.00    -.83        .72        3.02    1.39
Returns     35      .02    -.10    -.73        .63        3.90    1.33
Returns     36     -.01    -.08    -.80        .69        3.35    1.48
Returns     37      .05    -.04    -.82        .74        3.82    1.43
Returns     38      .06    -.05    -.80        .73        3.62    1.40
Returns     39      .07    -.03    -.67        .52        4.15    1.28
Returns     40      .10     .07    -.71        .53        4.43    1.30
Eigenvalue                 11.38   11.44   13.02
% of variance explained    41.34   12.50   10.35

Table 41
Rotated Pattern Matrix for the Dimension-Specific AC Ratings

                                         Factor Loadings
Dimension          Exercise      1      2      3      4      5     Communality    M       SD
Comprehension      Approach    -.02   -.01   -.08    .80    .11       .83        4.20    1.15
Oral Expression    Approach     .05   -.02   -.14    .42    .44       .73        3.95    1.32
Tolerance          Approach    -.09   -.03    .07    .21    .70       .70        4.24    1.21
Teamwork           Approach    -.00   -.05   -.04   -.13    .99       .91        3.80    1.30
Customer Focus     Approach     .08   -.03   -.04    .12    .70       .66        3.91    1.35
Comprehension      Closing      .40   -.71   -.01    .15   -.10       .78        4.38    1.13
Oral Expression    Closing     -.06   -.80   -.04   -.04    .05       .70        4.07    1.30
Tolerance          Closing     -.02   -.83    .02   -.02    .03       .68        4.18    1.25
Teamwork           Closing      .21   -.92   -.05    .04   -.02       .89        3.88    1.34
Customer Focus     Closing      .05   -.83   -.03   -.03    .09       .75        3.89    1.32
Comprehension      Returns      .14   -.05   -.77    .11   -.04       .70        3.80    1.31
Oral Expression    Returns      .11   -.08   -.81    .05   -.02       .76        3.65    1.43
Tolerance          Returns     -.19    .06   -.86   -.05    .06       .71        3.88    1.36
Teamwork           Returns     -.04   -.02   -.94   -.01    .04       .87        3.41    1.38
Customer Focus     Returns      .04   -.04   -.87   -.01   -.02       .78        3.40    1.37
Eigenvalue                      .40   5.04   5.46   3.47   4.34
% of variance explained       47.46  12.82  11.87   2.77   1.57

Confirmatory Factor Analysis

AMOS (version 4) was employed to evaluate the fit of three models that reflected the varying designs of AC mentioned or implied in this study. Each model is first represented graphically, then the associated factor loadings, together with goodness-of-fit indices, are presented in tabulated form. Raw data from the AC ratings were used as input for AMOS throughout the confirmatory factor analyses (CFA).

Model One: The Abridged Task-Specific Model

The task-specific model was tested first (see Figure 4). The task-specific model tested was an abridged version of the observed model summarised in the previous analyses. Items were removed from the task-specific CFA model in light of cautions surrounding sample size (relative to the number of parameters estimated) when employing structural equation models (SEM) (Bollen, 1989; Klem, 2000). As a rule of thumb, Bentler and Chou (1987) recommended that, at a minimum, five cases should be present per parameter estimated, and that there should preferably be 10 cases per parameter. The full task-specific model (i.e., without items removed) contained 83 parameters, and therefore would have required a sample size of 415 at a bare minimum. Thus, the sample size of 187 in the present study fell well short of this criterion.

Note: Ex1 = approach exercise; Ex2 = closing exercise; Ex3 = returns exercise. The observed variables 'i3' through to 'i39' represent the randomly selected behavioural items associated with each exercise.
Figure 4. Model One: Abridged Task-Specific CFA Model.

It was decided, therefore, that for an abridged version of the task-specific model, five task-specific items should be retained for each exercise to attempt a reasonable comparison with the dimension-specific model.
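The sample-size reasoning above can be checked with simple arithmetic. The sketch below assumes one common way of counting free parameters in a CFA with correlated factors (one loading and one error variance per observed variable, plus one covariance per pair of latent factors, with factor variances fixed for identification). That counting rule is an assumption made for this example rather than a description of the AMOS specification used in the study, although it does reproduce the 83 parameters stated for the full 40-item, three-factor model. The function names are hypothetical.

```python
def cfa_parameter_count(n_items, n_factors):
    """Free parameters under the assumed counting rule:
    loadings + error variances + covariances among latent factors."""
    factor_covariances = n_factors * (n_factors - 1) // 2
    return n_items + n_items + factor_covariances


def minimum_cases(n_parameters, cases_per_parameter=5):
    """Sample size implied by a cases-per-parameter rule of thumb."""
    return n_parameters * cases_per_parameter


sample_size = 187

full_model = cfa_parameter_count(n_items=40, n_factors=3)
abridged_model = cfa_parameter_count(n_items=15, n_factors=3)

for label, params in [("full", full_model), ("abridged", abridged_model)]:
    print(label, params, minimum_cases(params), sample_size >= minimum_cases(params))
```

Under the same rule, raising the ratio to 10 cases per parameter (the preferred level) would put even the abridged model beyond the available sample, which is consistent with the cautions expressed about sample size.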
This number was considered suitable because it allowed direct comparison with the dimension-specific AC, which employed five trait judgements per exercise. Items were retained from the full task-specific AC on a random basis to avoid bias in the selection of items to remove or retain. The random number table in Coolican (1999, p. 448) was employed for this purpose. Figure 4 shows the items that were retained in the abridged model. The resulting model therefore comprised three latent, and fifteen observed, variables. The three latent variables in this case represented behavioural performance on exercises.

Table 42 presents the standardised parameter estimates for the model represented in Figure 4. It was found that the mean values from the abridged data set and the mean values from the total data set were strongly correlated (r = .99, p < .001). The model parameters shown in Table 42 indicate relatively clear factor loadings on exercises for the task-specific model. This result was consistent with those reported in the generalizability study and the factor analysis.

Table 43 shows selected goodness-of-fit indices for the task-specific model. The ratio of case numbers to the number of parameters in the model was within the minimum limits suggested by Bentler and Chou (1987). Model One contained 33 parameters and 187 cases. Note that the samples employed in this entire study were still small when considering Bentler and Chou's suggestions. The reader is therefore cautioned that certain goodness-of-fit indices tend to underestimate fit when the sample size is small (Byrne, 2001; MacCallum, Browne & Sugawara, 1996).

Table 42
Standardised Factor Loadings for Model One: The Abridged Task-Specific CFA Model

                            Exercises
Behavioural Items      Ex1      Ex2      Ex3
i3                     .61
i7                     .63
i10                    .77
i11                    .89
i14                    .64
i16                             .92
i18                             .89
i22                             .76
i26                             .58
i27                             .75
i30                                      .70
i31                                      .86
i33                                      .88
i37                                      .86
i39                                      .80

Note: Ex1 = approach exercise; Ex2 = closing exercise; Ex3 = returns exercise. The observed variables 'i3' through to 'i39' represent the randomly selected behavioural items associated with each exercise.

Overall, the goodness-of-fit indices presented in Table 43 were suggestive of a reasonable fit for the abridged task-specific model. GFI and AGFI indices were reasonable in the present study (Byrne, 2001). The CFI and TLI indices should approach .95 (Byrne, 2001) as a rule of thumb, and in this study, the CFI and TLI were again reasonable. Browne and Cudeck (1993) suggested that RMSEA values as high as .08 indicate a reasonable fit, thus the RMSEA point estimate of .079 was also suggestive of a reasonable fit.

Table 43
Selected Goodness-Of-Fit Indices for Model One: The Abridged Task-Specific CFA Model

Index       Point Estimate
GFI         .880
AGFI        .835
CFI         .940
TLI         .928
RMSEA*      .079

*90% Confidence Interval for RMSEA (.064 < RMSEA < .095)

Model Two: The Dimension-Specific CFA Model

The second model tested was the dimension-specific model (see Figure 5). Note that Models Two (Figure 5) and Three (Figure 6, discussed later) could have been combined to form a saturated model; however, the restrictive sample size disallowed this. Figure 5 shows a graphical representation of Model Two. Latent variables reflected trait-based dimensions, and observed variables reflected trait-based judgements made across the exercises in the AC.

Note: T1 = teamwork; T2 = customer focus; T3 = oral expression; T4 = tolerance; T5 = comprehension. The observed variables 't1' through to 't5' represent trait judgements that correspond to each associated latent trait.
Figure 5. Model Two: Dimension-Specific CFA Model.

Table 44 shows the standardised factor loadings from the dimension-specific model. These look promising on initial inspection; however, the goodness-of-fit indices shown in Table 45 indicate a poor fit for the dimension-specific model (Byrne, 2001).
The number of parameters in Model Two totalled 40, which is near the minimum number, relative to sample size, suggested by Bentler and Chou (1987).

Table 44
Standardised Factor Loadings for Model Two: The Dimension-Specific CFA Model

                                   Dimensions
Trait Judgements      T1       T2       T3       T4       T5
Ex 1, t1              .54
Ex 2, t1              .53
Ex 3, t1              .67
Ex 1, t2                       .61
Ex 2, t2                       .60
Ex 3, t2                       .70
Ex 1, t3                                .46
Ex 2, t3                                .55
Ex 3, t3                                .60
Ex 1, t4                                         .58
Ex 2, t4                                         .66
Ex 3, t4                                         .70
Ex 1, t5                                                  .52
Ex 2, t5                                                  .61
Ex 3, t5                                                  .69

Note: T1 = teamwork; T2 = customer focus; T3 = oral expression; T4 = tolerance; T5 = comprehension. Ex1 = approach exercise; Ex2 = closing exercise; Ex3 = returns exercise. The observed variables 't1' through to 't5' represent trait judgements made in each exercise and corresponding to each associated latent trait.

Table 45
Selected Goodness-Of-Fit Indices for Model Two: The Dimension-Specific CFA Model

Index       Point Estimate
GFI         .524
AGFI        .286
CFI         .601
TLI         .476
RMSEA*      .247

*90% Confidence Interval for RMSEA (.233 < RMSEA < .261)

Model Three: The Exercise Effect CFA Model

The third model reflected the effect of different traits correlating highly within exercises (i.e., the exercise effect). Figure 6 shows a graphical representation of this model. In this case, latent variables represent heterotrait-monomethod correlations, and observed variables reflect trait judgements. Table 46 shows the standardised parameter estimates for the model represented in Figure 6.

Note: Ex1 = approach exercise; Ex2 = closing exercise; Ex3 = returns exercise. The observed variables 't1' through to 't5' represent trait judgements made in each exercise and corresponding to each associated latent exercise, where: t1 = teamwork; t2 = customer focus; t3 = oral expression; t4 = tolerance; t5 = comprehension.
Figure 6. Model Three: The Exercise Effect CFA Model.
Table 46
Standardised Factor Loadings for Model Three: The Exercise Effect CFA Model

                                       Exercises
Within Exercise Judgements      Ex1      Ex2      Ex3
Ex 1, t1                        .74
Ex 1, t2                        .84
Ex 1, t3                        .83
Ex 1, t4                        .87
Ex 1, t5                        .81
Ex 2, t1                                 .73
Ex 2, t2                                 .83
Ex 2, t3                                 .83
Ex 2, t4                                 .92
Ex 2, t5                                 .88
Ex 3, t1                                          .82
Ex 3, t2                                          .86
Ex 3, t3                                          .82
Ex 3, t4                                          .93
Ex 3, t5                                          .89

Note: Ex1 = approach exercise; Ex2 = closing exercise; Ex3 = returns exercise. The observed variables 't1' through to 't5' represent trait judgements made in each exercise and corresponding to each associated latent exercise, where: t1 = teamwork; t2 = customer focus; t3 = oral expression; t4 = tolerance; t5 = comprehension.

Table 46 shows relatively clear factor loadings for traits on exercises, consistent with the generalizability analysis. Table 47 shows selected goodness-of-fit indices for Model Three. Overall, the indices presented here suggested a mediocre fit between the proposed model and the observed data. The number of parameters relative to the number of cases was within the limits suggested by Bentler and Chou (1987).

Table 47
Selected Goodness-Of-Fit Indices for Model Three: The Exercise Effect CFA Model

Index       Point Estimate
GFI         .877
AGFI        .830
CFI         .946
TLI         .934
RMSEA*      .087

*90% Confidence Interval for RMSEA (.072 < RMSEA < .102)

The reader is cautioned about sample characteristics in the study above. Byrne (2001) and Raykov and Marcoulides (2000) note that normality assumptions are integral to SEM procedures, and are often neglected in practice. To gain some idea of sample characteristics, indices of univariate skewness and kurtosis were calculated for the abridged task-specific and dimension-specific data (see Cramer, 1994, for a discussion on how these indices are calculated). For the abridged task-specific data, skewness values ranged from -1.339 to 0.119, with a mean value of -0.679 (standard error = 0.178).
Using the mean value as an estimate of overall skewness, the task-specific data were found to be significantly asymmetrical (z = -3.815, p < .001, 2-tailed), and thus, given the negative sign of the coefficient, negatively skewed. Kurtosis ranged from -1.009 to 2.153, with a mean value of -0.036 (standard error = 0.354). Using the mean value as an estimate of overall kurtosis, a significance test revealed that the data were not significantly platykurtic (where there are too few cases at the centre of a distribution), and therefore suggested evidence in favour of normality (z = -0.102, ns, 2-tailed).

For the dimension-specific data, univariate skewness values ranged from -1.026 to -0.038, with a mean value of -0.059 (standard error = 0.178). Using the mean value as an estimate of overall skewness, the dimension-specific data were not found to be significantly asymmetrical (z = -0.331, ns, 2-tailed). Kurtosis for the dimension-specific data ranged from -0.818 to 1.119, with a mean value of -0.187 (standard error = 0.354). Using the mean value as an estimate of overall kurtosis, a significance test revealed that the data were not significantly platykurtic, and therefore suggested evidence in favour of normality (z = -0.528, ns, 2-tailed).

The deviations from normality in these data were therefore not deemed to be catastrophic. The task-specific data were found, on average, to be significantly negatively skewed; however, there was no salient evidence to suggest that kurtosis was problematic in these data. The reader is cautioned that both the task-specific and the dimension-specific data sets failed to meet the assumptions of multivariate normality. Multivariate normality was assessed using Mardia's (1970) coefficient, which was 22.757 for the task-specific data and 22.657 for the dimension-specific data. Values of 1.96 or less indicate non-significant multivariate kurtosis.
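The univariate significance tests reported above divide each mean skewness or kurtosis value by its standard error and refer the result to the standard normal distribution. A minimal sketch of that calculation follows, using the values reported in the text; the function name is hypothetical, and treating the ratio as a z statistic with a two-tailed normal p value is the assumption that appears to underlie the reported figures.

```python
import math


def z_test(statistic, standard_error):
    """Return (z, two-tailed p) for a statistic divided by its standard error."""
    z = statistic / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


reported = {
    "task-specific skewness": (-0.679, 0.178),
    "task-specific kurtosis": (-0.036, 0.354),
    "dimension-specific skewness": (-0.059, 0.178),
    "dimension-specific kurtosis": (-0.187, 0.354),
}

for label, (value, se) in reported.items():
    z, p = z_test(value, se)
    print(f"{label}: z = {z:.3f}, p = {p:.4f}")
```

Only the task-specific skewness exceeds the conventional critical value of 1.96 in absolute terms, which matches the pattern of significant and non-significant results described in the text.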
Byrne (2001) asserts that in SEM, deviations from multivariate normality can lead to spuriously large \chi^2 values, modest underestimation with respect to fit indices (particularly the TLI and the CFI), and spuriously low standard errors, which may render spuriously significant regression paths in structural models. Note, "in practice, most data fail to meet the assumption of multivariate normality" (Byrne, 2001, p. 268).

Discussion

Generalizability Study

Study Three generally shows evidence in support of Hypothesis Two. Note that the task-specific and dimension-specific approaches in the study were compared in a repeated measures design so as to hold raters, participants and assessment content constant. As can be seen in Table 34, clear exercise effects were found across both types of AC. This was evidenced by the px interaction (Kane, 1982; Kraiger & Teachout, 1990; Lievens, 2001b), which explained 30.7% of the variance in the task-specific model, and 33.6% of the variation in the dimension-specific model. Clearly, under the dimension-specific model, this comparatively high source of variation makes little conceptual sense, and generally reflects a lack of evidence in favour of convergent and discriminant validity. Under a trait paradigm, one expects to measure variables that will endure in a relatively stable fashion across different situations. In the light of the previous discussion, even under a trait paradigm, some variation across exercises might therefore be expected. However, in the dimension-specific approach, px was the greatest contributor to variance in scores.

Note that under a task-specific model, the finding of a large amount of variation being attributable to px does indeed make conceptual sense. A detection of the profound effect of the situation and its influence on behaviour is considered integral and adaptive under the task-specific paradigm (Hartman, Roper & Bradford, 1979).
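The percentages of explained variance quoted in the G study are each variance component expressed as a share of the total. As a worked illustration, the sketch below uses the unbalanced task-specific components from Table 37 (the balanced components in Table 34 are not reproduced here, so the person effect differs slightly from the balanced figure discussed in the text). The function and variable names are hypothetical.

```python
# Variance components from Table 37 (unbalanced task-specific design).
table_37_components = {
    "p": 0.51073,       # persons (object of measurement)
    "x": 0.05125,       # exercises
    "i:x": 0.11758,     # items nested within exercises
    "px": 0.58488,      # person-by-exercise interaction
    "pi:x,e": 0.63804,  # residual
}


def percent_of_total(components):
    """Express each variance component as a percentage of their sum."""
    total = sum(components.values())
    return {effect: 100 * vc / total for effect, vc in components.items()}


percentages = percent_of_total(table_37_components)
for effect, pct in percentages.items():
    print(f"{effect}: {pct:.1f}%")
```

Expressed this way, the px interaction accounts for a larger share than the person effect itself, which is the pattern the discussion treats as the signature of an exercise effect.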
As with most forms of assessment in the selection context, the focus is on person variation across the various facets, because assessment procedures of this nature aim to differentiate among people for decision purposes. Therefore, the principal focus in the present study concerns interactions between persons and facets, and variance component estimates for the object of measurement. As mentioned, of particular interest in the present study is the interaction term px for both types of AC (Kane, 1982; Kraiger & Teachout, 1990; Lievens, 2001b). Lievens (2001a; 2001b) suggests that the interaction term pd, in the dimension-specific AC, reflects the extent to which dimensions (as a set) are useful for discriminating between persons; that is, pd represents the extent to which the procedure holds a form of discriminant utility under a traditional trait paradigm. In Study Three this interaction term explained comparatively little of the variation in scores, at 1.8%. Thus, further evidence was found suggesting that the dimension-specific approach was not measuring trait-based variables, because the small pd interaction implies that dimensions were not comparatively useful for making differentiations among people.

Person variation is an important source of variance that needs to be given attention in ACs. An AC process must be efficacious in discriminating among people for decision purposes. The effect for the object of measurement, p, for the task-specific approach explained 27.4% of the total variance. The object of measurement, p, for the dimension-specific approach was marginally higher, and explained 31.9% of the variance in scores. Note the slightly wider confidence intervals for this effect in Figures 2 and 3, suggesting some level of uncertainty in this variance estimate.
The propensity for distinguishing among people remains at a comparable level across the task-specific and dimension-specific processes, within the bounds of the respective confidence intervals for these person effects. The reasons for these processes being able to discriminate among people in this way remain conceptually challenging for the dimension-specific approach, and conceptually comfortable for the task-specific approach, as evidenced by the high px interactions across the two approaches, and the low pd interaction in the dimension-specific approach.

Of particular interest are the similarities in patterning across the two types of AC. This is best seen in Figures 2 and 3, where the task-specific and dimension-specific approaches can be easily compared, starting with the effects for p, x, i:x and their dimension-specific counterparts d, and px across ACs. The similarities among analogous contributors to variance in the two Figures suggest that perhaps both ACs are isomorphic, or are at least similar, in their measurement outcomes. The major difference between the two models is that the task-specific approach makes conceptual sense, while the dimension-specific approach does not, as detailed earlier. Speaking speculatively, it is possible that the managerial assessors in this study were indeed treating the exercises in both of the AC models as stand-alone work samples of situationally specific behaviour. This would be at odds with any form of trait-based measurement in ACs.

Figures 2 and 3 suggest that credence may be given to most of the variance estimates in Study Three, apart from that for the exercise facet alone. The uncertain estimate of variance for the main effect for exercises suggests that it is difficult to draw conclusions with respect to this effect. It should be noted by the reader that, given the results of the G study, the use of E\rho^2_{Rel} and

was calculated at 0.68. Thus, the task-specific AC was found to be a marginally more dependable form of assessment than the dimension-specific model.

The interpretation of E\rho^2_{Rel} and \Phi necessitates some deliberation at this point. While Lievens (2001a) cites Marcoulides (1989) and states that "Values equal or above .80 are considered to be acceptable" (Lievens, 2001a, p. 260), Marcoulides (1989) actually sets no such strict criterion for the interpretation of these coefficients. Indeed, Marcoulides has commented that he does not necessarily agree with such steadfast criteria for these indices (G. A. Marcoulides, personal communication, November 23rd, 2002). E\rho^2_{Rel} and \Phi are Decision study (D study) values that should ideally be viewed in terms of the extent to which they increase relative to the costs associated with changing aspects of the facets of measurement in a particular model, for example changing the number of items or the number of dimensions. These coefficients can also be examined by a researcher solely to investigate the values associated with a particular G study, rather than exclusively within comprehensive D studies, which look at the effects of changing the number of levels of particular facets so as to determine effects on dependability. As such, the use of E\rho^2_{Rel} and \Phi in this context is acceptable, and has been employed successfully in research on ACs (Arthur et al., 2000; Lievens, 2001a).

Shavelson and Webb (1991) suggest that E\rho^2_{Rel} and \Phi are analogous to reliability coefficients in classical test theory. As a very general idea of the criteria for acceptability in cases such as those in the present study, it is probably more accurate to follow Shavelson and Webb (1991) than to follow Lievens (2001a) in the interpretation of E\rho^2_{Rel} and \Phi. As summarised by Aiken (2003), the acceptability of a reliability coefficient can lie anywhere from .60 or .70 upwards, depending on the use of the data.
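The D study logic described above, in which the coefficients are examined as the number of levels of a facet changes, can be sketched as follows. The error-variance expressions are the standard ones for a persons by (items nested within exercises) design and are stated here as an assumption about the design rather than quoted from the thesis tables. The variance components are the unbalanced estimates from Table 37, and 13 items per exercise is an assumed balanced value, so the output will not exactly reproduce the balanced results in Table 35. All names are hypothetical.

```python
# Variance components from Table 37 (unbalanced task-specific design).
VC = {"p": 0.51073, "x": 0.05125, "i:x": 0.11758, "px": 0.58488, "pi:x,e": 0.63804}


def d_study(vc, n_x, n_i):
    """Return (relative error, absolute error, G coefficient, Phi)
    for n_x exercises with n_i items nested in each exercise."""
    relative_error = vc["px"] / n_x + vc["pi:x,e"] / (n_i * n_x)
    absolute_error = relative_error + vc["x"] / n_x + vc["i:x"] / (n_i * n_x)
    g_coefficient = vc["p"] / (vc["p"] + relative_error)
    phi = vc["p"] / (vc["p"] + absolute_error)
    return relative_error, absolute_error, g_coefficient, phi


# Vary the number of exercises while holding items per exercise at the assumed 13.
for n_exercises in (3, 4, 5, 6):
    rel, abs_err, g, phi = d_study(VC, n_x=n_exercises, n_i=13)
    print(f"n_x={n_exercises}: rel={rel:.3f}, abs={abs_err:.3f}, G={g:.3f}, Phi={phi:.3f}")
```

Because the px component dominates the error terms, adding exercises improves both coefficients far more than adding items within exercises would, which is the kind of cost-benefit comparison a D study is intended to support.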
As a general heuristic, the higher the coefficient, the better.

Table 35 also shows the interrater reliability for the task-specific model, ICC(1,1), calculated as 0.93. Overall inter-rater agreement on the task-specific model was found to be higher than that obtained for the dimension-specific model, for which ICC(1,1) was calculated as 0.82. This finding is congruent with Lowry (1995), who reported that task-specific ACs yielded interrater reliability coefficients exceeding .80.

Factor Analysis

The factor analyses provided further evidence in favour of Hypothesis Two, and reinforced the findings in the G study. Table 38 shows the varimax rotated factor matrix for the task-specific AC. Table 40 shows the direct oblimin rotated factor matrix. Goodness of fit was reasonable for the three-factor solution, which accounted for 64.2% of the variance in the variables. Most of the communalities suggested that the variables were, by and large, well embedded within the factor structure. Item 6 on the Approach exercise could be regarded as an exception to this, with a comparatively low communality at .39. Only 17% of the residuals in the reproduced correlation matrix were greater than .05 in absolute terms. As stated earlier, a cut-off of .4 was selected for noteworthy factor loadings. Given this criterion, relatively clear loadings of variables on exercises were evident. This is in congruence with the theoretical expectations of task-specific ACs, which consider exercises to act as stand-alone work samples of situationally specific performance.

Table 39 shows the varimax rotated five-factor solution for the dimension-specific AC. Table 41 shows the direct oblimin rotation. On initial inspection, the five-factor model accounts for a sizable amount of the variance in the variables at 76.5%. Communalities suggest that all of the variables are reasonably well embedded in the overall factor structure.
Only one of the residuals in the reproduced correlation matrix had a value greater than .05 in absolute terms. Although goodness of fit appears to be promising on the surface, the evidence suggests that the factor structure of the ratings is conceptually problematic. Given a cut-off value of .4 for notable factor loadings, all of the variables load clearly onto three, as opposed to five, factors in Table 39. The exception to this is the variable 'comprehension' measured in the approach exercise, which bleeds across two factors. Aside from this, the factor loadings, clearly interpretable as the three exercises, are relatively clean. Generally, the fourth and fifth factors are redundant. This finding is typical of the heavily deliberated exercise effect seen in ACs. Different traits correlated highly within exercises, and the same traits barely correlated across exercises. Thus, the dimension-specific AC displayed poor discriminant and convergent validity, when viewed from the traditional trait-based paradigm under which these processes operate. This exercise effect is less clear in the direct oblimin rotated pattern matrix shown in Table 41. This said, only three variables bleed across factors, and there is still a tendency for variables to load onto exercises. Indeed, there is no clear evidence for trait-based measurement in Table 41.

These results of the factor analyses were congruent with those in the G study. In a dimension-specific AC, the intention is to measure trait-based variables under the trait paradigm. This is why behaviours are classified under headings such as 'Comprehension' or 'Oral Expression', which are often referred to as 'competencies' or 'dimensions'. The reality is that, no matter how they are termed, there is a trait-based expectation that raters will find some cross-situational patterning in behaviour. In ACs, this translates into a set of identical trait judgements, which should theoretically correlate highly across different exercises.
However, the analysis in this, and other, studies suggests method variance in the dimension-specific AC. This finding does not correspond with the hypothetical expectation of the dimension-specific AC, and thus makes little conceptual sense in that context. Turning to the alternative task-specific paradigm, one treats each exercise as a stand-alone work sample of behaviour. No inference of stable traits is ever made. Thus, one would expect to obtain high correlations between the different behavioural items within an exercise under this paradigm. Whereas high factor loadings on exercises are problematic for the trait paradigm, under the behavioural paradigm they are conceptually expected, adaptive, and admissible. Under the behavioural paradigm, high factor loadings on exercises reflect true variance in terms of situational specificity in behavioural responses.

Confirmatory Factor Analysis

The results of the CFA added emphasis to the results found in the previous analyses. The dimension-specific model (Model Two, Figure 5) emerged as the poorest fitting model in the analysis (see Table 45). Model Three (see Figure 6) specifically investigated the extent to which heterotrait-monomethod correlations fitted the data. Overall, the goodness-of-fit indices indicated a mediocre fit for the exercise effect model (see Table 47). The alternative to the dimension-specific models (Models Two and Three) was the task-specific model, Model One. An abridged version of this model was derived due to the restrictive sample size. Factor loadings for the task-specific model were high, and were consistent with the previous analyses on these data (see Table 42). Overall, the goodness-of-fit indices shown in Table 43 indicated a reasonable fit for the task-specific model, in line with the suggestions of Byrne (2001). Comparatively, the task-specific model was the best fitting of the three models tested.
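The sample-size pressure that led to the abridged model can be illustrated with some simple arithmetic. The Python sketch below applies the five-cases-per-parameter guideline attributed to Bentler and Chou (1987) to the 187 participants reported for this study. The parameter count is an assumption made for illustration only: it supposes a simple-structure CFA with three exercise factors, factor variances fixed to 1, and each item loading on exactly one factor. The item counts per exercise are hypothetical and the function names are not from the thesis.

```python
def max_free_parameters(n_cases, cases_per_parameter=5):
    """Largest number of free parameters permitted by the ratio guideline."""
    return n_cases // cases_per_parameter


def cfa_free_parameters(n_items, n_factors):
    """Free parameters in a simple-structure CFA (assumed identification:
    factor variances fixed to 1, each item loads on one factor only).
    Counts one loading and one error variance per item, plus the
    covariances among the factors."""
    return 2 * n_items + n_factors * (n_factors - 1) // 2


N_CASES = 187      # participants reported in the study
N_EXERCISES = 3    # one factor per exercise

limit = max_free_parameters(N_CASES)
print(f"Guideline allows at most {limit} free parameters")

for items_per_exercise in (4, 5, 8, 15):  # hypothetical item counts
    n_params = cfa_free_parameters(N_EXERCISES * items_per_exercise, N_EXERCISES)
    verdict = "within" if n_params <= limit else "exceeds"
    print(f"{items_per_exercise} items per exercise: "
          f"{n_params} parameters ({verdict} the guideline)")
```

Under these assumptions, a model with a full set of behavioural items per exercise quickly outgrows the guideline, whereas a smaller subset of items per exercise stays within it. This is consistent with the decision to fit an abridged task-specific model.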
Considerations

First and foremost, it could be argued that the present study employed a repeated measures design with no form of matching or counterbalancing the order of conditions (that is, the presentation of a task-specific followed temporally by a dimension-specific approach). Thus, the order in which these conditions were presented may have affected the results obtained to some degree. However, there was only one logical order in which the conditions under study could be directed. Under a process such as an AC, in order to make a trait-based judgement of an individual, one must first witness a behavioural manifestation of that trait. This behavioural manifestation is then followed by a trait-based judgement. The reverse contingency cannot, and does not in practice, logically apply, as an assessor can neither reasonably nor defensibly make a behavioural judgement on the basis of a trait assumed to exist prior to the behavioural evidence. Because the initial step in AC methodology is to document behaviours, a behavioural assessment is a natural consequence of having observed behavioural responses. The following natural progression in AC methodology and practice is to categorise these behaviours into a class of related behaviours. Thus, a behavioural assessment followed by a dimensional assessment is the natural order of events that transpires in an AC.

In addition to the above argument, it should be noted that, overall, more attention was given to the measurement of trait-based variables in this AC than to the measurement of behaviours in exercises. The behavioural checklists in the task-specific component of the AC displayed specific dimensions that were associated with each behavioural item. Thus, these checklists could be viewed as acting to maximise the possibility of trait measurement. The literature on ACs suggests that the presence of behavioural checklists should act to maximise conditions for trait measurement (Lievens, 1998).
Training in the present study focused primarily on behaviour as a manifestation of trait variables. While the present study attempted to facilitate trait measurement in this regard, further credence could, perhaps, be given to the evidence in favour of a behavioural assessment as opposed to a trait-based assessment in this process. In a similar vein, the exercises employed in this study were of a relatively similar format. This design feature was intended to facilitate the manifestation of trait variables. The results would suggest that relatively minor fluctuations across exercises have an effect on behaviour, in line with Michel's theory (Michel, 1984). The FOR procedure employed focused on the manifestation of dimensions only. Future research should look into whether FOR training, targeted at agreement in the ratings of task-specific measurements, could assist in improving the task-specific measurement model. The focus in task-specific AC training should shift away from trait manifestations across exercises, and should concentrate on behavioural performance within an exercise itself. This approach may facilitate the measurement accuracy of the task-specific AC.

Consideration in this study should also be given to the restrictive sample size and the entry-level position under scrutiny, which may limit the level of ecological validity that the study might hold. Nevertheless, it is argued that 187 participants is a large group for an AC process, as ACs tend to use much smaller numbers of people on a given assessment occasion (Ballantyne & Povah, 1995). ACs are often used for the selection of managerial personnel (Woodruffe, 1993). The generality of the above findings to higher-level positions cannot be definitively ascertained from the results of this study. Generality in this regard is suggested as a route for future research.
The set of dimensions and the set of behavioural responses used in this study also require further research on different dimensions and behavioural responses to ensure generality across samples.

In the CFA, consideration also needs to be given to restrictive sample sizes when employing SEM analyses. Of import in SEM is the number of cases relative to the number of parameters estimated in a given model. As previously mentioned, Bentler and Chou (1987) suggested that a minimum of five cases per parameter should be included in a given study. Byrne (2001) suggests that in small samples, goodness-of-fit indices (particularly the RMSEA and the TLI) can underestimate the true fit of a model. Small case numbers relative to the number of parameters estimated afflicted the full task-specific Model One. Therefore, an abridged version of the task-specific model was derived by randomly selecting items to create a smaller subset per exercise. The reader is therefore cautioned about possible limitations in the generality of this structural model to the entire task-specific data set. Sample size restrictions rendered impossible the analysis of a fully saturated model that incorporated both exercise effects and dimension effects (as in Arthur et al., 2000).

Theoretical Implications

The findings of Study Three are suggestive of a redefinition of the paradigm under which ACs currently operate. The suggestion is made that a task-specific paradigm may be more appropriate and theoretically justified than its dimension-specific counterpart. A multitude of past studies on ACs have viewed exercise effects as being indicative of halo effects, method effects, or measurement error (Carrick & Williams, 1999; Hough & Oswald, 2000; Schmidt & Ones, 1992).
The present study viewed such effects as indicative that AC architects may have applied an inappropriate paradigm to a particular measurement instrument, thereby creating expectations that have not been upheld in the data on ACs to date. The exercise effect commonly observed in ACs appears to support this contention (Chan, 1996; Hough & Oswald, 2000; Schmidt & Ones, 1992).

The findings of the present study suggest that not only did the task-specific AC tend to produce ratings that made more sense psychometrically, the task-specific ratings also tended to be somewhat more dependable and reliable than those of the dimension-specific process. Such psychometric advantages imply that AC ratings can potentially become more useful to practitioners. Employment decisions related to development, selection and/or promotion based on AC ratings are more likely to be precise. The fairness with which such decisions are made under a task-specific process is more likely to be reinforced and justified over and above the dimension-specific process. Feedback on the basis of ratings that are anchored to specific tasks, rather than to nebulous dimensions, is more likely to lead to greater behavioural change (Thornton et al., 1995) in task-specific development centres, as opposed to the traditional dimension-specific approach.

Moreover, the task-specific approach may be more justifiable in court cases relating to employment decisions. Specific behavioural anchors specifying job-related behaviours could reasonably be presented as a justification for employment decisions. Such information presents a less nebulous view of a person than does a trait-related assessment. Such suggestions should not be taken lightly, as very few companies internationally investigate the extent to which their dimension-specific ACs are measuring constructs as intended (Spychalski et al., 1997).
A reasonable reading of the literature would suggest that if a company is employing managers as assessors, which most do (Lowry, 1996; Muchinsky, 2000; Spychalski et al., 1997), then the likelihood is that their AC will yield poor evidence of construct validity (Hough & Oswald, 2000; Schmidt & Ones, 1992), and therefore may be difficult to justify in court (Lowry, 1996; Norton, 1977). Not only is this important from a legal perspective, but it appears unethical to provide data for people on the basis of a model that is not psychometrically supported.

Given concerns about the cognitive load upon assessors in ACs (Lievens & Klimoski, 2001), and the limitations of managers as trait-based raters (Sagie & Magnezy, 1997), the task-specific approach to AC design possibly presents a straightforward treatment for problems associated with cognitive load and non-psychologist assessor panels. The very notion of finding classifications for behaviours under trait classes presents a highly complex task to a group of assessors who, primarily in practice, are not trained as psychological experts (Lowry, 1996; Muchinsky, 2000; Spychalski et al., 1997). No such classification is necessary under a task-specific approach. Thus, cognitive load upon assessors is, by design, also likely to be minimised under a task-specific model.

In the CFA, an abridged version of the task-specific model yielded the best fit when compared to the dimension-based and exercise effect models. With respect to AC data and structural models, Arthur et al.'s (2000) study is worthy of note. Arthur et al. tested only one CFA model from their data, which consisted of a saturated model incorporating exercise and dimension effects. Overall, they found an excellent fit for their mixed dimension/exercise model. The reason that mixed models fit well may be that such an array of variables is entered into such models. The practical use of mixed models in AC contexts is, however, questionable.
Utilising the effects of monotrait-heteromethod and heterotrait-monomethod correlations for decision purposes appears overly burdensome. Also, there are currently no guidelines to show which effect (i.e., exercises or dimensions) should be given more or less weighting, other than the literature on exercise effects (heterotrait-monomethod correlations) commonly found in ACs (Hough & Oswald, 2000). Additionally, under the trait paradigm, the notion of relying on heterotrait-monomethod correlations to make decisions about people at all remains uncomfortable, and conceptually difficult to justify. In this study, the results pertaining to Models Two and Three suggest that heterotrait-monomethod correlations should be given more weighting than monotrait-heteromethod correlations.

The abridged task-specific Model One emerged as a reasonable fit, and was the best fitting of the three models tested. This may be considered encouraging in terms of a potentially practical model for AC evaluation methodology. Also, despite the fact that training did not focus on effects within exercises, both of the exercise-centred models (Model One and Model Three) emerged as better fits than the dimension-based model. The dimension-based model on which AC training was focused emerged as a poor fit (Model Two). Thus, it would appear that more investigation is required into ACs that investigate exercise-centred performance. The most conceptually sound of the two models that concentrated on exercise performance would appear to be the task-specific AC. As argued elsewhere in this thesis, future research should look at methods to refine this approach to obtain a practical tool that could be used for reasonable employment decisions.
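The distinction between the two kinds of correlation discussed above can be made concrete with a small Python sketch. It sorts the off-diagonal entries of a correlation matrix into heterotrait-monomethod (different dimensions, same exercise), monotrait-heteromethod (same dimension, different exercises), and heterotrait-heteromethod cells, then averages each group. The function name, the exercise labels, and every correlation value are invented for illustration and are not data from this thesis; only the dimension names 'Comprehension' and 'Oral Expression' are borrowed from the text.

```python
from itertools import combinations
from statistics import mean


def mtmm_summary(labels, corr):
    """Average off-diagonal correlations by multitrait-multimethod category.

    labels : list of (exercise, dimension) tuples, one per rated variable
    corr   : symmetric correlation matrix (list of lists) in the same order
    """
    buckets = {
        "heterotrait_monomethod": [],    # different dimensions, same exercise
        "monotrait_heteromethod": [],    # same dimension, different exercises
        "heterotrait_heteromethod": [],  # different dimensions and exercises
    }
    for i, j in combinations(range(len(labels)), 2):
        same_exercise = labels[i][0] == labels[j][0]
        same_dimension = labels[i][1] == labels[j][1]
        if same_exercise and not same_dimension:
            buckets["heterotrait_monomethod"].append(corr[i][j])
        elif same_dimension and not same_exercise:
            buckets["monotrait_heteromethod"].append(corr[i][j])
        elif not same_exercise and not same_dimension:
            buckets["heterotrait_heteromethod"].append(corr[i][j])
    return {name: mean(values) for name, values in buckets.items() if values}


# Hypothetical layout: two exercises, each rated on two dimensions.
labels = [
    ("Exercise A", "Comprehension"),
    ("Exercise A", "Oral Expression"),
    ("Exercise B", "Comprehension"),
    ("Exercise B", "Oral Expression"),
]
# Invented correlations chosen to mimic an exercise effect.
corr = [
    [1.00, 0.70, 0.10, 0.05],
    [0.70, 1.00, 0.15, 0.20],
    [0.10, 0.15, 1.00, 0.60],
    [0.05, 0.20, 0.60, 1.00],
]

summary = mtmm_summary(labels, corr)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this invented matrix the within-exercise correlations are much larger on average than the same-dimension, cross-exercise correlations. That is the pattern referred to as the exercise effect: poor convergent and discriminant evidence under the trait paradigm, but the expected result under a task-specific reading.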
Chapter Five: General Discussion

A well-documented quandary in the AC literature is the lack of propensity for AC ratings to display the measurement of the trait-based variables they are intended to measure (Bycio et al., 1987; Carrick & Williams, 1999; Chan, 1996; Fleenor, 1996; Jones et al., 1991; Joyce et al., 1994; Russell, 1987; Lievens, 2002; Robertson et al., 1987; Silverman et al., 1986; Spector, 2000; Turnage & Muchinsky, 1982; Turnage & Muchinsky, 1984). The enigmatic nature of this finding is compounded by the notion that ACs tend to predict certain criteria, particularly related to promotion, yet the reasons for this predictive utility remain unidentified (Chan, 1996). The present set of three studies attempted to find new ways of interpreting the AC puzzle by investigating unintended latent trait measurement (Study One), the assessment perceptions held, particularly by assessors, in a preliminary investigation (Study Two), and, in the primary study, an alternative to the prevailing paradigm underlying AC assessment (Study Three).

Study One generally found evidence against the contention that latent traits are unintentionally measured in ACs. Only one of the variables studied, tacit knowledge, in one of the two samples could cogently be argued as a meaningful theoretical predictor of OARs. This variable was the lone significant contributor to variance in a model that explained only 16% of the variance in OARs. The measurement of tacit knowledge in this sample was also found to be unreliable, making it difficult to ascertain the unified nature of the construct. Additionally, 16% of the variance associated primarily with tacit knowledge does not paint a particularly convincing picture as to the notions underlying AC measurement.
As only 16% of the variance in OARs was explained in one sample, it would seem unlikely that the composite model of self-efficacy, self-monitoring and tacit knowledge would constitute a reasonable substitute for OARs. Another sample in Study One did not find any meaningful relationship between the composite model and OARs, showing further evidence that it is unlikely that these variables act as the primary contributors to AC validity. From another perspective, it could be argued that 16% of the variance explained in scores could be construed as fairly sizeable, given that the correlation was with a construct that was not intentionally measured in the AC. However, it is also a sizable inferential leap, and probably incorrect, to suggest on the basis of this finding that managers are tapping into managerial intelligence during the AC process. Given the poor record that ACs hold for measuring trait-based variables, this appears questionable. In any case, no matter how this relationship is construed, 16% of the variance explained in OARs is not a cogent enough explanation to warrant a replacement of the AC with a paper test of managerial tacit knowledge. Moreover, when correcting for validity shrinkage when generalising across different samples, the variance in OARs explained by this composite dropped to around 1%. This could suggest that the composite external measures are not implicated in OAR derivation generally. Note, however, that managerial tacit knowledge has been found to be unrelated to traditional intelligence (IQ) test scores (Wagner & Sternberg, 1991). Cook (1998) reports findings that suggest IQ scores relate to OARs. Thus, the combination of managerial tacit knowledge and IQ might yield a more substantial level of relationship with OARs in certain samples that have strong requirements for managerial tacit knowledge. While these relationships potentially hold interest, nothing in Study One suggested that the relationships between self-efficacy, self-monitoring and tacit knowledge with OARs would definitively explain what it is that ACs actually measure.

Study Three investigated the extent to which an alternative paradigm might assist in making sense of AC ratings. The suggestion that an alternative to the prevailing trait paradigm should be introduced into AC construction has been conveyed by a small faction of researchers (Gorham, 1978; Herriot, 1986; Klimoski & Brickner, 1987; Lowry, 1997; Robertson et al., 1987). These researchers, by and large, have suggested that treating AC exercises as stand-alone work samples of behaviour would be a more adaptive approach to the treatment of AC ratings, as research suggests that perhaps assessors treat AC exercises as behavioural samples anyway. No known research has compared the psychometric properties of a task-specific AC with those of a dimension-specific AC. Study Three sought to find some preliminary solutions to the question of construct validity in ACs by exploring the possibility that the ratings in ACs might reflect groups of situationally specific work samples.

In the AC in Study Three, it was found that exercise effects endured across the repeated measures task-specific and dimension-specific processes, as evidenced by strong px interactions and factor loadings on exercises. Various levels of other facets mirrored each other across the two processes, as can be seen across Figures 2 and 3. Also, the dimension-specific process showed a relatively low pd interaction, indicating that dimensions were not useful criteria for making decisions among candidates. This is of great concern, because ACs are frequently used to make decisions about the varying performances of different people and, in practice, these decisions are most commonly based on dimensions (Lowry, 1996; Sackett & Harris, 1988; Spychalski et al., 1997). Thus, when managers are employed as assessors, as is most commonly the case (Lowry, 1996; Muchinsky, 2000; Spychalski et al.
, 1997), there remains the likelihood that managers will not measure trait-based variables, as evidenced in the comparatively large px interaction term. Thus, trait-based variables become conceptually problematic foundations for decision purposes in ACs.

The results of the G study in Study Three suggested that both the task-specific process and the dimension-specific process were useful for making distinctions among people, as evidenced by the similarly high component of variance for the object of measurement. To reiterate the argument presented above, the problem is that in a dimension-specific AC, decisions about people are likely to be made on the basis of dimensions, which evidently do not contribute a great deal to person variation. Person variation across the exercises themselves contributed a great deal more to variation in ratings across the dimension-specific and task-specific ACs in Study Three. Therefore, performance on exercises possibly constitutes a more meaningful basis for decision purposes in this AC than dimensions do. The dimension-specific approach does not promote such bases for decisions under its trait foundations. Thus it would seem that the task-specific model, which actively encourages person variation as a function of varying exercises, is worthy of future research concerning its practicability and generality across different samples.

A CFA added further evidence that dimensions were not useful criteria for decision making purposes, as the dimension-based model presented in Figure 5 emerged as a poor fit overall (see Table 45). The abridged task-specific model (see Table 43) emerged as a reasonable fit, and was the best fitting of the three models tested. The exercise effect model (see Table 47) emerged as a mediocre fit (according to the guidelines summarised in Byrne, 2001). Future research with larger subject numbers will be necessary to verify these results; however, the CFA provided promising evidence for the task-specific approach.
In New Zealand, the findings of Study Three present just as much concern as they do for the rest of the AC-using world. In a recent newspaper article, top New Zealand consulting companies gave obvious credence to the trait-based nature of the ratings obtained in ACs (McCarthy, 2003). The comments made in this article implied that employment decisions were being made for people on the basis of their scores on competencies treated as trait-based categories. Comments were also made about ACs being useful developmentally in terms of contributing towards the improvement of an individual's skill base. There is a multitude of evidence to suggest that this is misleading, given the lack of support for the measurement of any relatively stable and enduring characteristic in a manager-assessed AC.

To elaborate on the findings in Study Three, there was no evidence to suggest the successful measurement of trait-based variables in the dimension-specific AC, as shown in the high px and low pd interaction terms, the relatively clear factor loadings on exercises, the poor fit of the dimension-based structural model, and the reasonable fit of the task-specific structural model. The results of the dimension-specific AC appear to mimic the patterns expressed in the task-specific approach (see Figures 2 and 3), suggesting that the two forms of assessment are measuring something isomorphic, or at least similar. The difference between the two approaches is that the psychometric patterns found in the dimension-specific AC make no clear conceptual sense, under the notion that the process was not measuring the trait-based categories it was intended to measure. Rather, the results are suggestive of the highly deliberated exercise effect found in ACs.
Efforts were made to maximise the possibility of trait measurement in this regard, with the employment of behavioural checklists displaying appropriate trait categories, the use of fewer dimensions to reduce cognitive load, the use of frame of reference training, and the use of exercises of a similar format. Under the task-specific model, however, the psychometric properties do make conceptual sense, in that situationally specific responses were expected under this approach. Thus, the suggestion drawn from Study Three is that when the task-specific and the dimension-specific processes are used to measure the same behavioural output, the task-specific approach holds a stronger theoretical justification than the dimension-specific approach. The reader is warned, however, that further research is needed for the generality of these conclusions. Particularly, attention should be drawn to model effects that may be specific to this sample, for example the position being assessed, the set of dimensions, the set of behaviours, and the set of exercises employed.

Psychometrically, the task-specific model makes more conceptual sense than the dimension-specific approach in Study Three, and moreover, the task-specific approach yielded slightly greater dependability and inter-rater agreement than its dimension-specific counterpart (see Tables 35 and 36). Given these findings, it appears that the task-specific model of assessment may be more appropriate in the more common situation where managerial assessors are employed. Further research will be needed to confirm this suggestion. As discussed earlier, there may be other gains associated with the task-specific approach in addition to psychometric arguments, including, as detailed in the introduction, developmental feedback advantages (Adams, 1990; Mueller & Dweck, 1998), legal defensibility, ease of training, increased measurement precision, and the related potential for improvements to the AC process.
These features could be aided with an understanding of what AC ratings actually mean. Table 48 details other advantages associated with the task-specific approach to AC design, relative to the traditional dimension-specific approach.

Table 48
Advantages of the Task-Specific Approach Relative to the Dimension-Specific Approach to AC Design

Task-Specific Design | Dimension-Specific Design
--- | ---
Can potentially use psychologist or non-psychologist assessors to yield construct evidence | Should ideally employ psychologist assessors to yield construct evidence; likely to incur greater costs as a result
Lower number of inferences, as behavioural checklists are used as the primary data set for decisions | Higher number of inferences, as one extrapolates trait-based variables from behavioural checklists
Can potentially use very different exercises without undermining the validity of the assessment | Restricted to the use of very similar exercises only
Can assess 8-15 behavioural items per exercise | Should ideally assess 4-5 traits per exercise
Can assess different behaviours in each exercise | Should ideally repeat the measurement of a trait at least three times across exercises
Training is simplified by a focus on behaviours only | Training is complicated by trait extrapolations from behaviours
Fewer cognitive demands on assessors due to less complex inferences | More cognitive demands on assessors due to complex trait inferences
Evidence in this study suggests that construct valid ratings are obtained | A multitude of evidence suggests that construct valid ratings are not obtained
Less time consuming and therefore less costly, because there are fewer steps in the assessment process | More steps in the assessment process, therefore more time consuming and costly
Developmental feedback more likely to lead to adaptive behavioural change | Developmental feedback less likely to lead to adaptive behavioural change
Renders task-based training needs readily identifiable | Renders training needs in more vague, categorical terms
Situationally specific responses, consistent patterns of behaviour, and/or combinations of these are considered conceptually acceptable | Consistent patterns of behaviour under trait categories are considered acceptable
More likely to be justified in court because measurement intentions are more likely to be reflected in ratings | Less likely to be justifiable in court, because dimension-specific ACs have a history of psychometric problems

In relation to ACs that are commonly used in practice, attention should be drawn to the notion that there is some debate and confusion surrounding the intention when using dimensions in dimension-specific ACs. Byham (1980) states that the dimension categories act as nominal classes only, and describes them as "a description under which behaviour can be reliably classified" (p. 29). This definition could lead to confusion. If behaviours were to be classified under some form of nominal category, then surely one would expect reasonable correlations between the behaviours within a category label? The very definition of a category implies that its function is to provide a class or division for a subset of related elements. Even before Sackett and Dreher's (1982) seminal paper, Gorham (1978) was aware of the confusion that such categories might instigate, and suggested that dimensional categories in ACs should be abandoned completely. Indeed, the very notion of a categorical label may well lead assessors to expect a trait-based judgement (Sackett, 1987). This is because category titles probably promote the idea that decision makers should seek to make a judgement of characteristics that are relatively stable and enduring on the basis of behavioural elements that appear to be meaningfully related to one another.
Such is the basis for trait categorisations that form their origins from observable behavioural responses, and by design, ACs have influenced, motivated, and encouraged such a classification. Sackett (1987) deliberates on the intention of construct measurement in ACs, and states that the "ratings of a dimension across exercises aren't intended as merely repeated measures that should correlate perfectly" (p. 19). Instead, the intention, according to Sackett, is to measure partially overlapping behavioural samples across exercises. This said, Sackett argues further that if there are near zero correlations between the measurements of the same construct across different exercises, then the notion of an overall score based on dimensions becomes problematic. This commonly appears to be the case with the widely deliberated exercise effect finding.

A great deal of the literature to date appears to have focused on maximising the possibility that trait-based variables will be measured. Some of these methods have been imaginative, inventive, and even curious. However, as a body of literature, neither definitive nor completely cogent solutions to the AC enigma have been supplied from a trait-based perspective.

The results of Study Two suggested that the non-psychologist assessors in a nationally respected AC employed in Auckland, New Zealand, did not tend to differentiate the paradigm under which they were assessing. This might suggest that they were not aware of the way in which they should approach the assessment, or on which foundation they should base their assessment. These findings add colour to the picture presented by Sagie and Magnezy (1997) and Lievens and Conway (2001), who found that managerial ratings did not tend to reflect trait-based variables. It should be noted that even experienced clinical psychologists display limitations in the reliability of their assessment of individuals (Persons & Bertagnolli, 1999; Persons, Mooney & Padesky, 1995).
These studies found that on some of the particular factors under scrutiny, clinical psychologists displayed moderate and even poor inter-rater agreement. It was also found that whether the psychologist held a Ph.D. was an important determinant of the level of accuracy in assessment. The expectation that managers should be able to perform an assessment of an individual on the basis of complex notions such as traits may be unrealistic, given this comparison. The use of I/O psychologists as assessors in ACs also appears to be unrealistic, given the associated cost, the absence of the psychosocial advantages associated with having managers assess their own staff, and the exclusion of the job-specific/employer-specific subject matter expert knowledge that managers possess (relative to external consultant I/O psychologists).

The focus on traits alone in the AC literature appears to be restrictive when comparisons are made with clinical assessment, which formed the very origins of psychological assessment in Western society (Anastasi & Urbina, 1997). Contemporary clinical approaches to assessment take a much more holistic view of behavioural responses, and acknowledge such factors as behaviour, physiological responses, cognitions, stimuli and situations, rather than merely focussing on trait-based categorisations alone (Bond, 1998). While the clinical approach to assessment rightfully attempts to tap a comprehensively rich source of information about a particular individual, it is probable that such an in-depth analysis is not necessary in the organisational arena. This said, the essence of the clinical form of assessment suggests that the contemporary approach is not to focus specifically on trait-based variables on which to base decisions, and that a more holistic view is necessitated. It is argued that, perhaps specifically for managerial assessors, a paradigm encompassing behavioural responses contingent on situations is appropriate.
The resulting information is likely to provide a rich and useful assessment on which to base decisions. Such an assessment would, by design, acknowledge variation among individuals' behaviour, and the effect of the situation on that behaviour, rather than investigating the extent to which an individual varied on trait-based variables for which there is little empirical evidence in the AC context. The task-specific assessment would thus be founded on information that is more likely to be justifiable and assured.

Study Three demonstrated evidence in favour of a paradigm that rejects the use of category labels, and instead focuses on the operational definitions of behaviour relative to situational contingencies. This study suggests that a task-specific approach may not only make more conceptual sense psychometrically, but also has the potential to increase the quality and decrease the ambiguity and subjectivity associated with developmental feedback given to employees. It also has the potential to increase the quality and precision with which selection decisions are made. The move to the task-specific approach is a radical leap from the existing dimension-specific paradigm. Some may argue that such an alternative is impractical because it will entail overly detailed job analyses and a tailored AC for each organisation. It is acknowledged that it is practically difficult to maintain such bespoke detail in real-world scenarios. However, it is argued that job analyses should always be a defining feature in the development of assessment programs so as to maintain job relevance and defensibility. While such analyses may not always be at the level of detail required for the construction of a task-specific AC, it is possible that taxonomies of tasks that relate to specific positions could be made available through an item bank. These could be applied in relation to a job analysis that would potentially require less task-related detail than the inductive approaches described in Lowry (1997).

Additionally, some practitioners may feel uncomfortable about discarding competency categories. In actual fact, the competency categories could still exist in the background in a task-specific AC, but would be treated as labels for groups of behaviours only. In practice, the very operational definitions of these categories would be applied in the AC. The major difference under the task-specific paradigm, when compared to the dimension-specific paradigm, would be the omission of any inference of stable traits. A person's performance on a given exercise would become the new unit of measurement, rather than the label attached to a set of behaviours said to underlie a given competency.

It will be desirable for future studies to investigate the generality of the findings in Study Three, due to the possibly sample-specific considerations detailed earlier. Also, further research into the predictive validity of the task-specific approach will be vital to ensure its worth as a tool for decision-making. While Study Three found preliminary evidence that the task-specific approach is conceptually sound, it remains silent on whether this approach can explain similar or greater amounts of variation in criterion scores such as work performance or promotability. If a task-specific approach can explain variation in scores for these criteria, then the evidence in this study suggests that the reasons for this relationship will ultimately be less of an enigma.

References

Aamodt, M. G. (1999). Applied industrial/organizational psychology (3rd ed.). CA: Wadsworth.
Adams, K. A. (1997). The effect of the rating process on construct validity: Reexamination of the exercise effect in assessment center ratings. Unpublished master's thesis, University of Houston, Houston, TX.
Adams, S. R. (1990).
Impact of assessment center method and categorization scheme on schema choice and observational, classification, and memory accuracy. Unpublished doctoral dissertation, Colorado State University, Ft. Collins, CO.
Ahmed, Y., Payne, T. & Whiddett, S. (1997). A process for assessment exercise design: A model of best practice. International Journal of Selection and Assessment, 5(1), 62-68.
Aiken, L. R. (2003). Psychological testing and assessment (11th ed.). Boston: Allyn and Bacon.
Anastasi, A. & Urbina, S. (1997). Psychological testing (7th ed.). NJ: Prentice Hall.
Anderson, N., Silvester, J., Cunningham-Snell, N. & Haddleton, E. (1999). Relationships between candidate self-monitoring, perceived personality, and selection interview outcomes. Human Relations, 52(9), 1115-1131.
Arkin, R. M. (1981). Self-presentational styles. In J. T. Tedeschi (Ed.), Impression management theory and social psychological research (pp. 311-330). New York: Academic Press.
Arthur, W., Woehr, D. J. & Maldegen, R. (2000). Convergent and discriminant validity of assessment center dimensions: A conceptual and empirical re-examination of the assessment center construct-related validity paradox. Journal of Management, 26(4), 813-835.
Arvey, R. D. & Murphy, K. R. (1998). Performance evaluation in work settings. Annual Review of Psychology, 49, 141-168.
Asher, J. J. & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.
Ballantyne, I. & Povah, N. (1995). Assessment and development centres. Hampshire: Gower.
Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice-Hall.
Bandura, A. (1982). Self-efficacy mechanism in human agency. American Psychologist, 37, 122-147.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice-Hall.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.
Baron, H. & Janman, K. (1996). Fairness in the assessment centre. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology, Vol. 11 (pp. 61-113). Chichester: John Wiley & Sons.
Barrett, P. (1996). PsWin psychometric software suite for Windows [Computer program]. University of Liverpool, Department of Psychology.
Barrick, M. R. & Mount, M. K. (1991). The big five dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1-26.
Baumeister, R. F. (1982). A self-presentational view of social phenomena. Psychological Bulletin, 91, 3-26.
Belsley, D. A., Kuh, E. & Welsch, R. E. (1980). Regression Diagnostics. New York: John Wiley and Sons.
Bem, D. J. & Funder, D. C. (1978). Predicting more of the people more of the time: Assessing the personality of situations. Psychological Review, 85(6), 485-501.
Bentler, P. M. & Chou, C. P. (1987). Practical issues in structural modeling. Sociological Methods and Research, 16, 78-117.
Bernardin, H. J. & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6, 205-212.
Blanchard, P. N. & Thacker, J. W. (1998). Effective training: Systems, strategies, and practices. NJ: Prentice Hall.
Bobko, P. (1990). Multivariate correlational analysis. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology, Vol. 1 (pp. 637-686). Palo Alto, CA: Consulting Psychologists Press.
Bobko, P. (2001). Correlation and regression: Applications for industrial organizational psychology and management (2nd ed.). Thousand Oaks, CA: Sage.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bond, F. W. (1998). Utilising case formulations in manual-based treatments. In M. Bruch & F. W. Bond, Beyond diagnosis: Case formulation approaches in CBT (pp. 185-206). Chichester: Wiley & Sons Ltd.
Borman, W. C. (1977).
Consistency of rating accuracy and rater errors in the judgement of human performance. Organizational Behavior and Human Performance, 20, 238-252.
Bosscher, R. J. & Smit, J. H. (1998). Confirmatory factor analysis of the general self-efficacy scale. Behaviour Research & Therapy, 36(3), 339-343.
Brannick, M. T., Michaels, C. E. & Baker, D. P. (1989). Construct validity of in-basket scores. Journal of Applied Psychology, 74, 957-963.
Bray, D. W. & Grant, D. L. (1966). The assessment center in the measurement of potential for business development. Psychological Monographs, 80, 1-27.
Brennan, R. L. (2000). (Mis)conceptions about generalizability theory. Educational Measurement: Issues and Practice, 16(4), 14-20.
Brennan, R. L. (2001a). Generalizability theory. New York: Springer Verlag.
Brennan, R. L. (2001b). Manual for urGENOVA. Iowa City, IA: Iowa Testing Programs, University of Iowa.
Briggs, S. R., Cheek, J. M. & Buss, A. H. (1980). An analysis of the self-monitoring scale. Journal of Personality and Social Psychology, 38, 679-686.
Browne, M. W. & Cudeck, (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 445-455). Newbury Park, CA: Sage.
Buckner, M. (1984). An evaluation of the effectiveness and reliability of a videotaped assessment center as compared to a live assessment center. Unpublished doctoral dissertation, Georgia State University.
Bycio, P., Alvares, K. M. & Hahn, J. (1987). Situation specificity in assessment center ratings: A confirmatory factor analysis. Journal of Applied Psychology, 72, 463-474.
Byham, W. C. (1970). Assessment centers for spotting future managers. Harvard Business Review, 48, 150-160.
Byham, W. C. (1980, February). Starting an assessment center the right way. Personnel Administrator, 27-32.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, applications, and programming.
Mahwah, NJ: Lawrence Erlbaum Associates.
Campbell, D. T. & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Campion, J. E. (1972). Work sampling for personnel selection. Journal of Applied Psychology, 56(1), 40-44.
Carrick, P. & Williams, R. (1999). Development centres: A review of assumptions. Human Resource Management Journal, 9(2), 77-92.
Cascio, W. F. & Phillips, N. F. (1979). Performance testing: A rose among the thorns? Personnel Psychology, 32, 751-766.
Chan, D. (1996). Criterion and construct validation of an assessment centre. Journal of Occupational and Organizational Psychology, 69, 167-181.
Chatterjee, S., Hadi, A. S. & Price, B. (2000). Regression analysis by example (3rd ed.). New York: John Wiley & Sons.
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. CA: Sage Publications.
Cohen, B., Moses, J. L. & Byham, W. C. (1974). The validity of assessment centers: A literature review. Pittsburgh, PA: Development Dimensions Press.
Cohen, J. (1988). Statistical power analysis for the behavioural sciences. London: Lawrence Erlbaum Associates.
Colonia-Willner, R. (1998). Practical intelligence at work: Relationship between aging and cognitive efficiency among managers in a bank environment. Psychology and Aging, 13, 45-57.
Comrey, A. L. & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). New Jersey: Lawrence Erlbaum.
Cook, M. (1998). Personnel selection: Adding value through people. Chichester: John Wiley & Sons.
Coolican, H. (1999). Research methods and statistics in psychology (2nd ed.). London: Hodder & Stoughton.
Cramer, D. (1994). Introducing statistics for social research: Step-by-step calculations and computer techniques using SPSS. New York: Routledge.
Cronbach, L. J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity."
Psychological Bulletin, 52, 177-193.
Cronbach, L. J., Gleser, G. C., Nanda, H. & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: John Wiley.
Cronbach, L. J. & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Day, D. V., Schleicher, D. J. & Unckless, A. L. (1996). Self-monitoring and work-related outcomes: A meta-analysis. Paper presented at the 11th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Delprato, D. J. & Midgley, B. D. (1992). Some fundamentals of B. F. Skinner's behaviourism. American Psychologist, 47, 1507-1520.
Donahue, L. M., Truxillo, D. M., Cornwell, J. M. & Gerrity, M. J. (1997). Assessment center construct validity and behavioral checklists: Some additional findings. Journal of Social Behaviour and Personality, 12, 85-108.
Faul, F. & Erdfelder, E. (1992). GPOWER: A priori, post-hoc, and compromise power analyses for MS-DOS [Computer program]. Bonn, FRG: Bonn University, Department of Psychology.
Felson, R. B. (1981). An interactionist approach to aggression. In J. T. Tedeschi (Ed.), Impression management theory and social psychological research (pp. 181-199). New York: Academic Press.
Feltham, R. T. (1989). Assessment centres. In P. Herriot (Ed.), Handbook of assessment in organizations (pp. 401-419). London: John Wiley & Sons.
Fleenor, J. W. (1996). Constructs and developmental assessment centres: Further troubling empirical findings. Journal of Business and Psychology, 3, 319-335.
Fletcher, C. & Anderson, N. (1998). A superficial assessment. People Management, May, 44-46.
Fletcher, C. & Kerslake, C. (1992). The impact of assessment centers and their outcomes on participants' self assessments. Human Relations, 45, 281-289.
Furnham, A., Crump, J. & Whelan, J. (1997).
Validating the NEO Personality Inventory using assessor's ratings. Personality and Individual Differences, 22, 669-675.
Gabrenya, W. K., Jr. & Arkin, R. M. (1980). Self-monitoring scale: Factor structure and correlates. Personality and Social Psychology Bulletin, 6, 13-22.
Gaugler, B., Rosenthal, D., Thornton, G. & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493-511.
Gaugler, B. & Thornton, G. C., III (1989). Number of assessment dimensions as a determinant of assessor accuracy. Journal of Applied Psychology, 74, 611-618.
Goffin, R. D., Rothstein, M. G. & Johnston, N. G. (1996). Personality testing and the assessment center: Incremental validity for managerial selection. Journal of Applied Psychology, 81, 746-756.
Gold, M. S. & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation maximization. Structural Equation Modeling, 7(3), 319-355.
Gorham, W. A. (1978). Federal executive agency guidelines and their impact on the assessment center method. Journal of Assessment Center Technology, 1(1), ? 8.
Green, S. B. & Stutzman, T. (1986). An evaluation of methods to select respondents to structured job-analysis questionnaires. Personnel Psychology, 39, 543-564.
Guilford, J. P. (1959). Personality. New York: McGraw-Hill.
Halman, F. & Fletcher, C. (2000). The impact of development centre participation and the role of individual differences in changing self-assessments. Journal of Occupational and Organizational Psychology, 73, 423-442.
Handyside, J. & Duncan, C. (1954). Four years later on: A follow up of an experiment in selecting supervisors. Occupational Psychology, 28, 9-23.
Harris, M. M., Becker, A. S. & Smith, D. E. (1993). Does the assessment center scoring method affect the cross-situational consistency of ratings? Journal of Applied Psychology, 78(4), 675-678.
Hartmann, D. P., Roper, B. L. & Bradford, D. C. (1979). Some relationships between behavioral and traditional assessment. Journal of Behavioral Assessment, 1(1), 3-21.
Harvey, R. J. (1991). Job analysis. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed.) (pp. 71-163). Palo Alto, CA: Consulting Psychologists Press.
Herriot, P. (1986). Assessment centres revisited. Guidance and Assessment Review, 2(3), 7-8.
Highhouse, S. & Harris, M. M. (1993). The measurement of assessment center situations: Bem's template matching technique for examining exercise similarity. Journal of Applied Social Psychology, 23(2), 140-155.
Hough, L. M. & Oswald, F. L. (2000). Personnel selection: Looking toward the future - remembering the past. Annual Review of Psychology, 51, 631-664.
Howard, A. (1997). A reassessment of assessment centers, challenges for the 21st century. Journal of Social Behavior and Personality, 12, 13-52.
Hunter, J. E. & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96(1), 72-98.
International Task Force on Assessment Center Guidelines. (2000). Guidelines and ethical considerations for assessment center operations. Public Personnel Management, 29, 315-331.
Joiner, D. (2002). Assessment centers: What's new? Public Personnel Management, 31(2), 179-185.
Jones, A., Herriot, P., Long, B., & Drakeley, R. (1991). Attempting to improve the validity of a well-established assessment centre. Journal of Occupational Psychology, 64, 1-21.
Joyce, L. W., Thayer, P. W., & Pond, S. B. (1994). Managerial functions: An alternative to traditional assessment center dimensions. Personnel Psychology, 47, 109-121.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125-160.
Kaplan, R. M. & Saccuzzo, D. P. (2001).
Psychological Testing: Principles, applications, and issues (5th ed.). Belmont, CA: Wadsworth.
Kenrick, D. T. & Funder, D. C. (1991). The person-situation debate: Do personality traits really exist? In N. J. Derlega, B. A. Winstead, & W. H. Jones (Eds.), Personality: Contemporary theory and research (pp. 150-174). Chicago: Nelson-Hall.
Kleinmann, M. (1993). Are rating dimensions in assessment centers transparent for participants? Consequences for criterion and construct validity. Journal of Applied Psychology, 78, 988-993.
Klem, L. (2000). Structural equation modeling. In L. G. Grimm, G. Laurence & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 227-260). Washington, DC: American Psychological Association.
Klimoski, R. J. & Brickner, M. (1987). Why do assessment centers work? The puzzle of assessment center validity. Personnel Psychology, 40, 243-260.
Klimoski, R. J. & Strickland, W. J. (1977). Assessment centers - valid or merely prescient. Personnel Psychology, 30, 353-361.
Kolb, J. A. (1998). The relationship between self-monitoring and leadership in student project groups. Journal of Business Communication, 35(2), 264-282.
Kraiger, K. & Teachout, M. S. (1990). Generalizability theory as construct-related evidence for the validity of job performance ratings. Human Performance, 3, 19-35.
Kudisch, J. D., Ladd, R. T. & Dobbins, G. H. (1997). New evidence on the construct validity of diagnostic assessment centers: The findings may not be so troubling after all. In R. E. Riggio & B. T. Mayes (Eds.), Assessment centers: Research and applications [Special Issue]. Journal of Social Behavior and Personality, 12, 129-144.
Lance, C. E., Newbolt, W. H., Gatewood, R. D., Foster, M. R., French, N. R. & Smith, D. (2000). Assessment center exercise factors represent cross-situational specificity, not method bias. Human Performance, 13(4), 323-353.
Lane, J. (1992). Methods of assessment.
Health Manpower Management, 18(2), 4-6.
Lebreton, J. M., Binning, J. F. & Hesson-McInnis, M. S. (1998). The effects of measurement structure on the validity of assessment center dimensions: The clinical-statistical debate revisited. Paper presented at the Annual Meeting of the Academy of Management, San Diego, CA.
Lennox, R. D. & Wolfe, R. N. (1984). Revision of the self-monitoring scale. Journal of Personality and Social Psychology, 46, 1349-1364.
Licht, M. H. (1995). Multiple regression and correlation. In L. G. Grimm, G. Laurence & P. R. Yarnold (Eds.), Reading and understanding multivariate statistics (pp. 19-64). Washington, DC: American Psychological Association.
Lievens, F. (1998). Factors which improve the construct validity of assessment centers: A review. International Journal of Selection and Assessment, 6, 141-152.
Lievens, F. (2001a). Assessor training strategies and their effects on accuracy, interrater reliability, and discriminant validity. Journal of Applied Psychology, 86(2), 255-264.
Lievens, F. (2001b). Assessors and use of assessment center dimensions: A fresh look at a troubling issue. Journal of Organizational Behavior, 22, 203-221.
Lievens, F. (2002). Trying to understand the different pieces of the construct validity puzzle of assessment centers: An examination of assessor and assessee effects. Journal of Applied Psychology, 87(4), 675-686.
Lievens, F. & Conway, J. M. (2001). Dimension and exercise variance in assessment center scores: A large-scale evaluation of multitrait-multimethod studies. Journal of Applied Psychology, 86(6), 1202-1222.
Lievens, F. & Klimoski, R. J. (2001). Understanding the assessment center process: Where are we now? In C. L. Cooper & I. T. Robertson (Eds.), International Review of Industrial and Organizational Psychology, Vol. 16 (pp. 245-286). Chichester: John Wiley and Sons.
Lievens, F. & Van Keer, E. (2001).
The construct validity of a Belgian assessment centre: A comparison of different models. Journal of Occupational and Organizational Psychology, 74, 373-378.
Lopez, F. M. (1988). Threshold traits analysis system. In S. Gael (Ed.), The job analysis handbook for business, industry, and government (Vol. 2, pp. 880-901). New York: Wiley.
Lopez, F. M., Kesselman, G. A. & Lopez, F. E. (1981). An empirical test of a trait-oriented job analysis technique. Personnel Psychology, 34, 479-502.
Lowry, P. E. (1988). The assessment center: Pooling scores or arithmetic decision rule? Public Personnel Management, 17(1), 63-71.
Lowry, P. E. (1995). The assessment center process: Assessing leadership in the public sector. Public Personnel Management, 24(4), 443-450.
Lowry, P. E. (1996). A survey of the assessment center process in the public sector. Public Personnel Management, 25(3), 307-321.
Lowry, P. E. (1997). The assessment center process: New directions. In R. E. Riggio & B. T. Mayes (Eds.), Assessment centers: Research and applications [Special issue]. Journal of Social Behavior and Personality, 12(5), 53-62.
MacCallum, R. C., Browne, M. W. & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Marcoulides, G. A. (1989). The application of generalizability analysis to observational studies. Quality and Quantity, 23, 115-127.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 519-530.
Matthews, G. & Deary, I. J. (1998). Personality traits. Cambridge: Cambridge University Press.
McCarthy, A. (2003, January 20). Applicants grab chance to shine. The New Zealand Herald, p. E1.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.
Mischel, W. (1968). Personality and assessment.
New York: Wiley.
Mischel, W. (1973). Toward a cognitive social learning conceptualization of personality. Psychological Review, 36, 163-183.
Mischel, W. (1984). Convergences and challenges in the search for consistency. American Psychologist, 39(4), 351-364.
Moser, K., Diemand, A. & Schuler, H. (1996). Inconsistency and social skills as two components of self-monitoring. Diagnostica, 42, 268-283.
Muchinsky, P. M. (2000). Psychology applied to work (6th ed.). Belmont, CA: Wadsworth.
Mueller, C. M. & Dweck, C. S. (1998). Praise for intelligence can undermine children's motivation and performance. Journal of Personality and Social Psychology, 75(1), 33-52.
Murphy, K. R. & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage Publications.
Murphy, K. R. & Davidshofer, (2001). Psychological testing: Principles and applications (4th ed.). NJ: Prentice Hall.
Murphy, K. R. & Myors, B. (1998). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Mahwah, NJ: Lawrence Erlbaum Associates.
Neidig, R. D. & Neidig, P. J. (1984). Multiple assessment center exercises and job relatedness. Journal of Applied Psychology, 69, 182-186.
Norton, S. D. (1977). The empirical and content validity of assessment centers vs. traditional methods for predicting managerial success. Academy of Management Review, 2, 442-445.
Norton, S. D. (1981). The assessment center process and content validity: A reply to Dreher and Sackett. Academy of Management Review, 6, 561-566.
Nunnally, J. C. & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York: McGraw Hill.
O'Cass, A. (2000). A psychometric evaluation of a revised version of the Lennox and Wolfe revised self-monitoring scale. Psychology & Marketing, 17(5), 397-419.
Paton, D. & Jackson, D. J. R. (2002).
Developing disaster management capability: An assessment centre approach. Disaster Prevention and Management, 11(2), 115-122.
Paulhus, D. L. (2002). Socially desirable responding: The evolution of a construct. In Braun, H. I. & Jackson, D. N. (Eds.), The role of constructs in psychological and educational measurement (pp. 49-69). Mahwah, NJ: Lawrence Erlbaum Associates.
Persons, J. B. & Bertagnolli, A. (1999). Inter-rater reliability of cognitive-behavioral case formulations of depression: A replication. Cognitive Therapy and Research, 23(3), 271-283.
Persons, J. B., Mooney, K. A. & Padesky, C. A. (1995). Interrater reliability of cognitive-behavioral case formulations. Cognitive Therapy and Research, 19(1), 21-34.
Peterson, N. G. & Jeanneret, P. R. (1997). Job analysis: Overview and description of deductive methods. In D. L. Whetzel & G. R. Wheaton (Eds.), Applied measurement methods in industrial psychology (pp. 13-50). Palo Alto, CA: Davies-Black Publishing.
Pynes, J. & Bernardin, H. J. (1992). Mechanical vs. consensus-derived assessment center ratings: A comparison of job performance validities. Public Personnel Management, 21, 17-28.
Pynes, J., Bernardin, H. J., Benton, A. L. & McEvoy, G. M. (1988). Should assessment center dimension ratings be mechanically-derived? Journal of Business and Psychology, 2(3), 217-227.
Raykov, T. & Marcoulides, G. A. (2000). A first course in structural equation modeling. Mahwah, NJ: Lawrence Erlbaum Associates.
Reilly, R. R., Henry, S. & Smither, J. W. (1990). An examination of the effects of using behavior checklists on the construct validity of assessment center dimensions. Personnel Psychology, 43, 71-84.
Robertson, I. T., Gratton, L. & Sharpley, D. (1987). The psychometric properties and design of managerial assessment centres: Dimensions into exercises won't go. Journal of Occupational Psychology, 60, 187-195.
Robertson, I. T. & Kandola, R. S. (1982).
Work sample tests: Validity, adverse impact and applicant reaction. Journal of Occupational Psychology, 55, 171-183.
Robie, C., Adams, K. A., Osburn, H. G., Morris, M. A. & Etchegaray, J. M. (2000). Effects of the rating process on the construct validity of assessment center dimension evaluations. Human Performance, 13(4), 355-370.
Rosenthal, R. & Rosnow, R. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). New York: McGraw Hill.
Russell, C. J. (1987). Person characteristic versus role congruency explanations for assessment center ratings. Academy of Management Journal, 30, 817-826.
Russell, C. J., & Domm, D. R. (1995). Two field tests of an explanation of assessment centre validity. Journal of Occupational and Organizational Psychology, 68, 25-47.
Ryan, A., Daum, D., Bauman, T., Grisez, M., Mattimore, K., Nalodka, T. & McCormick, S. (1995). Direct, indirect and controlled observation and rating accuracy. Journal of Applied Psychology, 80(6), 664-670.
Sackett, P. R. (1987). Assessment centers and content validity: Some neglected issues. Personnel Psychology, 40, 13-25.
Sackett, P. R. & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67, 401-410.
Sackett, P. R. & Dreher, G. F. (1984). Situation specificity of behavior and assessment center validation strategies: A rejoinder to Neidig and Neidig. Journal of Applied Psychology, 69, 187-190.
Sackett, P. R. & Hakel, M. D. (1979). Temporal stability and individual differences in using assessment center information from overall ratings. Organizational Behavior and Human Performance, 23, 120-137.
Sackett, P. R. & Harris, M. M. (1988). A further examination of the constructs underlying assessment center ratings. Journal of Business and Psychology, 3(2), 214-229.
Sadri, G. & Robertson, I. T. (1993).
Self efficacy and work-related behaviour: A review and meta-analysis. Applied Psychology: An International Review, 42(2), 139-152.
Sagie, A. & Magnezy, R. (1997). Assessor type, number of distinguishable dimensions categories, and assessment centre construct validity. Journal of Occupational and Organizational Psychology, 70, 103-108.
Sanchez, J. I. & Frazer, S. L. (1992). On the choice of scales for task-analysis. Journal of Applied Psychology, 77, 545-553.
Sanchez, J. I. & Levine, E. L. (1989). Determining important tasks within jobs: A policy capturing approach. Journal of Applied Psychology, 74, 336-342.
Sartre, J. (1964). Huis clos. London: Methuen.
Saxe, R. & Weitz, B. A. (1982). The SOCO scale: A measure of the customer orientation of salespeople. Journal of Marketing Research, 19, 343-351.
Schleicher, D. J. (1999). A new 'frame' for frame of reference training: Enhancing the construct validity of assessment centers. Dissertation Abstracts International: Section B: The Sciences & Engineering, 60(1-B), 0384.
Schleicher, D. J. & Day, D. V. (1998). A cognitive evaluation of frame-of-reference rater training: Content and process issues. Organizational Behavior and Human Decision-making Processes, 73(1), 76-101.
Schleicher, D. J., Day, D. V., Mayes, B. T. & Riggio, R. E. (2002). A new frame for frame-of-reference training: Enhancing the construct validity of assessment centers. Journal of Applied Psychology, 87(4), 735-746.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.
Schmidt, F. L. & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199-223.
Schmidt, F. L. & Hunter, J. E. (1998).
The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmidt, F. L., Ones, D. S. & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43, 627-670.
Schmitt, N., Ford, J. K. & Stults, D. M. (1986). Changes in self-perceived ability as a function of performance in an assessment centre. Journal of Occupational Psychology, 59, 327-335.
Schmitt, N., Gooding, R., Noe, R. & Kirsch, M. (1984). Meta-analysis of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.
Schmitt, N. & Ostroff, C. (1986). Operationalising the "behavioral consistency" approach: Selection test development based on a content-oriented strategy. Personnel Psychology, 39, 91-108.
Schmitt, N., Schneider, J. & Cohen, S. (1990). Factors affecting validity of a regionally administered assessment center. Personnel Psychology, 43, 1-1
Schneider, J. & Schmitt, N. (1992). An exercise design approach to understanding assessment center dimension and exercise constructs. Journal of Applied Psychology, 77, 32-41.
Scholz, G. & Schuler, H. (1993). Das nomologische netzwerk des assessment centers: eine metaanalyse. Zeitschrift für Arbeits- und Organisationspsychologie, 37(2), 73-85.
Seegers, J. (1997). What is an assessment centre? In P. Jansen & F. de Jongh (Eds.), Assessment centres: A practical handbook. Chichester: John Wiley and Sons.
Shavelson, R. J. & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage Publications.
Sherer, M., Maddux, J. E., Mercandante, B., Prentice-Dunn, S., Jacobs, B. & Rogers, R. W. (1982). The self-efficacy scale: Construction and validation. Psychological Reports, 51, 663-671.
Shore, T. H., Thornton, G. C. III & Shore, L. M. (1990). Construct validity of two categories of assessment center ratings.
Personnel Psychology, 43, 101-11
Shrout, P. E. & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Silverman, W. H., Dalessio, A., Woods, S. B. & Johnson, R. L. (1986). Influence of assessment center methods on assessors' ratings. Personnel Psychology, 39, 565-578.
Skinner, B. F. (1974). About behaviorism. New York: Knopf.
Snyder, M. (1974). Self-monitoring of expressive behavior. Journal of Personality and Social Psychology, 30, 526-537.
Snyder, M. (1987). Public appearances/private realities: The psychology of self-monitoring. New York: W. H. Freeman and Company.
Spector, P. E. (2000). Industrial and organizational psychology: Research and practice (2nd ed.). New York: Wiley.
Spencer, L. M. & Spencer, S. M. (1993). Competence at work: Models for superior performance. New York: Wiley.
Spychalski, A. C., Quinones, M. A., Gaugler, B. B. & Pohley, J. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71-90.
Sternberg, R. J., Forsythe, G. B., Hedlund, J., Horvath, J. A., Wagner, R. K., Williams, W. M., Snook, S. A. & Grigorenko, E. L. (2000). Practical intelligence in everyday life. Cambridge: Cambridge University Press.
Sulsky, L. M. & Balzer, W. K. (1988). The meaning and measurement of performance rating accuracy: Some methodological concerns. Journal of Applied Psychology, 73, 501-510.
Sulsky, L. M. & Day, D. V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77(4), 501-510.
Tabachnick, B. G. & Fidell, L. S. (1983). Using multivariate statistics. New York: Harper & Row.
Task Force on Assessment Center Standards. (1989). Guidelines and ethical considerations for assessment center operations. Public Personnel Management, 18, 457-470.
Taylor, P., Keelty, Y. & McDonnell, B. (2002).
Evolving personnel selection practices in New Zealand organisations and recruitment firms. New Zealand Journal of Psychology, 31(1), 8-18.
Tedeschi, J. T. & Riess, M. (1981). Self-presentational styles. In J. T. Tedeschi (Ed.), Impression management theory and social psychological research (pp. 3-20). New York: Academic Press.
Tenopyr, M. (1977). Content-construct confusion. Personnel Psychology, 30, 47-54.
Tett, R. P. & Guterman, H. A. (2000). Situation trait relevance, trait expression, and cross-situational consistency: Testing a principle of trait activation. Journal of Research in Personality, 34, 397-423.
Thornton, G. C. III (1992). Assessment centers in human resource management. New York: Addison-Wesley.
Thornton, G. C. III & Byham, W. C. (1982). Assessment centers and managerial performance. San Diego, CA: Academic Press.
Thornton, G. C. III, Kaman, V., Layer, S. & Larsh, S. (1995, May). Effectiveness of two forms of assessment center feedback: Attribute feedback and task feedback. Paper presented at the 23rd International Congress on the Assessment Center Method, Kansas City, Kansas.
Thornton, G. C. III, Tziner, A., Dahan, M., Clevenger, J. P. & Meir, E. (1997). Construct validity of assessment center judgments. Journal of Social Behavior and Personality, 12, 109-128.
Ting, N., Burdick, R. K., Graybill, F. A., Jeyaratnam, S. & Lu, T. C. (1990). Confidence intervals on linear combinations of variance components that are unrestricted in sign. Journal of Statistical Computational Simulation, 35, 135-143.
Tovey, R. C. (2001). Anxiety and assessment centre performance. Unpublished doctoral dissertation, Goldsmiths College, University of London, New Cross, London.
Turnage, J. J. & Muchinsky, P. M. (1982). Trans-situational variability in human performance with assessment centers. Organizational Behavior and Human Performance, 30, 174-200.
Turnage, J. J. & Muchinsky, P. M. (1984).
A comparison of the predictive validity of assessment center evaluations versus traditional measures in forecasting supervisory job performance: Interpretive implications of criterion distortion for the assessment paradigm. Journal of Applied Psychology, 69, 595-602.
Wagner, R. K. (1985). Tacit knowledge inventory for managers: Test booklet. Unpublished manuscript, Department of Psychology, Florida State University, Tallahassee, Florida 32306-1270.
Wagner, R. K. & Sternberg, R. J. (1985). Practical intelligence in real-world pursuits: The role of tacit knowledge. Journal of Personality and Social Psychology, 49, 436-458.
Wagner, R. K. & Sternberg, R. J. (1986). Tacit knowledge and intelligence in the everyday world. In R. K. Wagner & R. J. Sternberg (Eds.), Practical intelligence: Nature and origins of competence in the everyday world (pp. 51-83). Cambridge: Cambridge University Press.
Wagner, R. K. & Sternberg, R. J. (1990). Street smarts. In K. E. Clark & M. B. Clark (Eds.), Measures of Leadership (pp. 493-504). West Orange, NJ: Leadership Library of America.
Wagner, R. K. & Sternberg, R. J. (1991). Tacit knowledge inventory for managers. San Antonio: Harcourt Brace & Company.
Warech, M. A., Smither, J. W., Reilly, R. R., Millsap, R. E. & Reilly, S. P. (1998). Self-monitoring and 360-degree ratings. Leadership Quarterly, 9(4), 449-473.
Whitmore, M. D. & Klimoski, R. J. (1984). Leader emergence and self-monitoring behavior under conditions of high and low motivation. Paper presented at the annual meetings of the Midwestern Psychological Association, Chicago.
Williams, K. M. & Crafts, J. L. (1997). Inductive job analysis: The job/task inventory method. In D. L. Whetzel & G. R. Wheaton (Eds.), Applied measurement methods in industrial psychology (pp. 51-88). Palo Alto, CA: Davies-Black Publishing.
Woodruff, S. & Cashman, J. (1993). Task, domain, and general efficacy: A reexamination of the self-efficacy scale.
Psychological Reports, 72, 423-432.
Woodruffe, C. (1993). Assessment centres: Identifying and developing competence (2nd ed.). London: Institute of Personnel Development.

Appendix 1: Pilot For Study Three

Method

Participants

Data were collected from a development centre (DC) that was constructed for a large call centre in a government-based organisation in Auckland, New Zealand. The centre was used three times over a two-year period between 2001 and 2002 for the training and development of call centre workers. Fifteen organisational members participated, consisting of 11 females and four males, with ages ranging from 26 to 30. Nationality was not recorded, as it was felt the small numbers used in the present study could lead to the identification of individuals. All respondents reported that they held bursary (high school leaving) qualifications.

Assessors

The assessors were 5 managerial staff members per DC from a government-based organisation (1 male and 4 females), with a mean age of 31.60 (SD 5.24), located in Auckland, New Zealand. The assessors remained the same throughout the 3 runs of the DC, except for the last two runs, in which 2 assessors had to be replaced. All assessors had previous experience in assessing participants in multiple ACs for selection, although none had previously received either FOR or psychological training. All participants had considerable experience (over 2 years), and were regarded as subject matter experts of the position being assessed.

The DC

After developing a policy statement outlining the purpose of this particular DC, and who would be involved in the process, the initial step in the construction of the DC was to execute a competency analysis of the target position.
As the purpose of the current study was to compare a task-specific model with a dimension-specific model, the competency analysis involved a two-tiered process: first producing a detailed task analysis (gathering information on the tasks that organisational members perform), and then classifying and extrapolating these tasks into dimensions.

Inductive job analyses use various methods to find new and specific information about a given job (Peterson & Jeanneret, 1997). This approach was taken in the present study for a number of reasons. Firstly, the intention of the study was to focus on the collection of new, detailed information about a particular job, in order to construct a unique and highly detailed account of the competencies involved in the job. Peterson and Jeanneret (1997) suggest that, in such situations, inductive methods are more appropriate than deductive methods, which yield more general information. Additionally, as there were small numbers of subject matter experts (SMEs) in this sample, job analysis questionnaires may have been problematic: responses could have shown high levels of error and inflated standard deviations, which larger numbers might otherwise have abated. The last reason was that the particular organisation involved in this study had its own agenda as to its required developmental specifications. Thus, as the will and developmental needs of the organisation were of great consequence in the construction of the DC, it was felt that such information should be driven to some degree by the subject matter experts who had knowledge of the areas that required performance development.

Task Analysis

The first stage of the competency analysis involved utilising job analysis methods to identify the key tasks that made up the call centre position.
This involved a review of existing job descriptions, together with interviews and critical incident interviews with incumbents and supervisors (the SMEs). The SME group (which comprised the same sample as the assessors) was interviewed. To give the analysis a strategic outlook, interviewees were also asked to give their views on what tasks they thought might be important for the call centre position in the future, as suggested by Thornton (1992) and Woodruffe (1993).

In accordance with the guidelines set out by Lowry (1997), a questionnaire was developed listing the tasks derived from the information obtained in the task analysis, to determine the relative rank and importance of particular tasks. Three questions were asked of SMEs with respect to each task: (a) the criticality of this task relative to others for successful operations, (b) time spent on this task relative to other tasks, and (c) the difficulty of this task relative to others. The last item differed from Lowry's suggested third item (relative importance of being able to perform this task correctly on entry into the job). This was because the intention of the present DC was for development, and it was reasoned, in agreement with the subject matter experts, that job entry requirements were not involved in developmental aims. Incumbents were already familiar with the task; the importance of tasks at entry level may matter for recruitment and selection, but may not be relevant to development.

From this information, the most critical tasks were selected for inclusion in the DC exercises. In concurrence with the suggestions of SMEs, a checklist of the typical actions that would be required to successfully perform each of the tasks was developed. These typically involved short checklists of around 8-15 actions considered important for the successful completion of a given DC exercise.
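The ranking step described above can be illustrated with a small sketch. The source does not state how the three SME ratings per task were combined, so the aggregation rule here (averaging the three scales within each SME, then averaging across SMEs) is an assumption, and the task names and ratings are hypothetical values chosen only for illustration.

```python
from statistics import mean

# Hypothetical SME ratings, one tuple per SME:
# (criticality, time spent, difficulty), each on an assumed 1-5 scale.
ratings = {
    "Resolve rates enquiry": [(5, 4, 4), (4, 4, 5)],
    "Log recycling bin replacement": [(3, 2, 2), (3, 3, 2)],
    "Locate walkway from vague description": [(4, 3, 5), (5, 3, 4)],
}


def task_score(sme_ratings):
    """Average the three scales within each SME, then average across SMEs.

    This composite is an assumed rule, not the one used in the study.
    """
    return mean(mean(r) for r in sme_ratings)


def rank_tasks(task_ratings):
    """Return task names ordered from highest to lowest composite score."""
    return sorted(task_ratings, key=lambda t: task_score(task_ratings[t]), reverse=True)


ranked = rank_tasks(ratings)
for task in ranked:
    print(f"{task_score(ratings[task]):.2f}  {task}")
```

The highest-ranked tasks would then be the candidates for inclusion in the exercises; a different weighting of the three questions could change the order.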
As Lowry emphasised, no inference of the existence of complex constructs was made at this stage, and it was at this point that the competency analysis for the task-specific DC model concluded.

Classification and Extrapolation of Tasks into Dimensions

In a traditional dimension-specific DC, the pure or raw tasks obtained from the task analysis are then classified into dimension categories. This involves a process of subjectively identifying and then coding the task with a dimension that is thought to underpin the performance of that task (Ballantyne & Povah, 1995). A dimension was assigned for all tasks with guidance from the generic dimensions suggested by Thornton and Byham (1982), and in concurrence with SMEs in the present study. Because some evidence suggests that using small numbers of performance dimensions may act to increase the construct validity of DCs (Lievens, 1998), the current DC followed the growing body of literature suggesting that DC architects should limit the number of dimensions assessed (Gaugler & Thornton, 1989; Lievens & Klimoski, 2001; Sackett & Hakel, 1979). Upon reviewing the literature, Arthur et al. (2000) decided on a manageable set of 9 performance dimensions, in line with human information processing capacity. Note that Arthur et al. cited Miller (1956) on the issue of cognitive capacity. Arthur et al. looked at the number of dimensions assessed over 19 DC studies and found that, on average, 11.01 (SD = 5.24) dimensions were assessed across the processes. The present DC, in the light of these findings, limited the number of performance dimensions to seven. These top seven dimensions were identified by the SME panel as being the most important for the purposes of the current development process. This number was within the optimal limits of between 5 and 7 dimensions suggested by Gaugler and Thornton (1989).

DC Task Ratings and Dimensions

Task checklists with specific behavioural indicators of successful performance on an exercise were also developed and provided for assessors to mark. Assessors marked each specific task on a scale ranging from 1 (Performance was very much below standard) to 5 (Performance was very much above standard). Each task statement had a dimension name written next to it, to give the assessors guidance on which specific behaviours might relate to which dimension or competency trait. Participants were rated on the following 7 dimensions: Process Utilisation; Conflict Resolution; Communication; Technical and Professional Knowledge; Customer Service Orientation; Stress Tolerance; and Innovation. It was the intention to assess all dimensions across all exercises, except for Customer Service Orientation and Conflict Resolution, which were not formally assessed in the Group Analysis Exercise. Spaces for marks for these dimensions were left on the forms for the raters. At the end of the DC, it was evident that, as a group, the raters felt that the Group Analysis Exercise afforded opportunities for participants to manifest behavioural examples of Customer Service Orientation and Conflict Resolution. It was found that 80% of the raters had included ratings for Customer Service Orientation and Conflict Resolution for the Group Analysis Exercise. These data were included in the analysis, and thus produced a fully crossed dimension by exercise design for analysis. Dimensions were assessed on a scale ranging from 1 (unacceptable level of ability) to 5 (very high level of ability). The following definitions were assigned to these dimensions:

Process Utilisation: The extent to which an individual gains as much benefit as possible from their use of existing resources.
Conflict Resolution: The extent to which a CSR can effectively manage a situation so as to defuse the escalation of conflict.
Communication: The extent to which an individual effectively and accurately conveys oral or written information and responds to questions and challenges.
Technical and Professional Knowledge: The level of understanding of relevant technical and professional information.
Customer Service Orientation: The extent to which an individual is willing to provide proactive, efficient and effective fulfillment of customer requests over and above expectations.
Stress Tolerance: The extent to which an individual maintains a consistent level of performance under the stress of confrontation, tight time-frames and/or uncertainty.
Innovation: The extent to which an individual generates new or creative ideas and solutions, and uses available resources in new and more efficient ways.

DC Exercises

Four simulation exercises were employed to assess the tasks and dimensions in the DC. Three of these were high-fidelity call centre simulations that aimed to simulate calls from customers who had specific challenging issues that the CSR had to resolve. All exercises were set up so that the CSRs were positioned at computers which had standard databases installed, mirroring the computers that the CSRs had been trained on in their actual positions. Each computer was linked to a telephone station, where a role player sat. Each role player had been given a script, and was instructed to keep to the script as much as possible during the exercise. At an appointed time, the role players called the assessment stations for each CSR. One assessor per CSR was assigned for the first three exercises. The last simulation was a lower fidelity group exercise, where two assessors were assigned to one CSR. The simulations included the following exercises:

The Walkway Simulation: Portrayed a situation where a dissatisfied customer was calling about a large hedge that was blocking a walkway that the customer frequented.
To add further challenge, the customer could not remember the specific name of the location of the walkway, nor were they aware of the actual definition of the term 'walkway'.
Recycling Bin Simulation: Involved another simulation where a dissatisfied customer gave the CSR unnecessary information, from which the CSR was expected to extract the information necessary to resolve the real issue that the customer had. The central issue involved the replacement of a government-owned recycling bin.
The Rates Simulation: Involved an inquiry into rates. Three specific issues needed to be contended with: answering a customer enquiry relating to how rates were calculated and what rates actually paid for; payment options; and changing addresses.
The Group Analysis Exercise: Attempted to assess an individual's contribution to a group exercise relating to a job-relevant scenario. The scenario involved a customer email enquiry into rates, parks, rubbish collections, out-of-zone areas, and disaster information. The participant was rated on their input into discussions on the issues, utilisation of computer resources, and utilisation of the Internet to solve the issues presented.

Evaluation Approach

Several studies have sought to evaluate the relative efficacy of evaluating performance dimensions after the completion of each exercise (within-exercise rating), or waiting until the completion of the entire DC and then making an evaluation of the dimensions concerned (within-dimension rating). As previously discussed, the evidence for the efficacy of one approach over the other remains unclear (Harris et al., 1993; Silverman et al., 1986). As the two approaches appear to contribute relatively little to the facilitation of the construct validation of the DC process, it was decided that the within-exercise approach would be used. Additionally, this approach was used because the DC used in this study was developmental in nature.
Feedback, therefore, needed to be given to participants as an ongoing process throughout the DC. As previously discussed, there is a large body of evidence to suggest that DCs typically show more method than dimensional variance in ratings. To ensure that participants were given appropriate feedback, the within-exercise approach was favoured. Participants were given feedback on behaviours (tasks pertaining to a particular exercise) as suggested by Lowry (1997). Also, feedback was given on the basis of ability traits that were assessed in particular exercises, rather than giving feedback to candidates on the basis of dimensions assessed across the different exercises, as suggested by Feltham (1989). If ability dimensions in DCs were to be conceptualised as relatively stable, enduring characteristics, then theoretically, they should stand up to being rated in individual exercises (as in Campbell & Fiske, 1959). Note that, in any case, the treatment of ratings with respect to feedback, and for determining OARs, was secondary to the principal aims of the study.

Assessor Training and the Assessment Procedure

Assessors were trained on the DC exercises using behavioural observation training (Ballantyne & Povah, 1995) coupled with guidance on how to use behavioural checklists to assist the process (Lowry, 1997). It has been suggested that frame of reference (FOR) training (Bernardin & Buckley, 1981) enhances the appraisal of human performance (Murphy & Cleveland, 1995). Additionally, FOR training has been suggested as a factor that may act to increase the construct and criterion validity of DC ratings (Arthur et al.; Lievens, 1998; Schleicher, 1999). The present study used frame of reference notions as an integral aspect of the training procedure.

Prior to the DC, assessors were trained in how to assess participants using a frame of reference training procedure that has been suggested by Lievens (1998) for use with DCs.
This involved a training session with assessors that covered some basic principles in assessing behaviour, and familiarised the assessors with the exercises and the rating instruments that would be used. The FOR component of the training involved having assessors rate the performance of a contrived CSR described in a short vignette. Both behavioural and trait ratings were then displayed on a white board together with the mean and standard deviation of the ratings provided by each assessor. The assessors were then invited to discuss the ratings they had given. These discussions focussed on relatively large standard deviations, and why some raters might deviate from others, in the hope that a shared schema could be constructed for what was construed as good versus poor performance on a given exercise. This procedure was an abbreviated version of procedures that have been recommended in the literature (Lievens, 2001a), due to the strict time demands enforced by the organisation under study. The suggested FOR format was followed more closely in Study Three proper.

In sequence, the process involved firstly a general explanation of the DC process, and the benefits to the organisation of utilising this procedure. Next, a general description was given of the two stages that would be involved in the assessment process: the observation and rating of behaviours, and the inference of dimensions or traits that could theoretically be made from those behaviours. In congruence with the guidelines of Ballantyne and Povah (1995), assessors were then shown how to assess behaviours with no construct inference. This process involved observation, and the recording of behaviours on notepaper and then a checklist (Lowry, 1997) to obtain a score relating to behavioural performance. This increased objectivity and guided the scoring process, whilst allowing for a numerical rating to be allocated to the key behaviours in the DC.
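The calibration step in the FOR session (displaying means and standard deviations and discussing items on which assessors diverged) can be sketched as follows. This is a minimal illustration under stated assumptions: the vignette items, the ratings, and the cut-off of one scale point for a "relatively large" standard deviation are all hypothetical, and the standard deviation is taken across assessors for each rated item, which is one reading of the procedure.

```python
from statistics import mean, stdev

# Hypothetical vignette ratings on the 1-5 scale: one rating per assessor.
vignette_ratings = {
    "Greets caller appropriately": [4, 4, 5, 4, 4],
    "Clarifies location of walkway": [2, 5, 1, 4, 3],
    "Stress Tolerance (dimension)": [3, 3, 4, 3, 3],
}

SD_THRESHOLD = 1.0  # assumed cut-off for flagging an item for discussion


def calibration_summary(ratings, threshold=SD_THRESHOLD):
    """Return mean, sample SD and a discussion flag for each rated item."""
    summary = {}
    for item, scores in ratings.items():
        sd = stdev(scores)  # sample SD across assessors
        summary[item] = {
            "mean": mean(scores),
            "sd": sd,
            "discuss": sd > threshold,
        }
    return summary


summary = calibration_summary(vignette_ratings)
for item, stats in summary.items():
    flag = "discuss" if stats["discuss"] else "ok"
    print(f"{item}: M = {stats['mean']:.2f}, SD = {stats['sd']:.2f} ({flag})")
```

Items flagged in this way would be the natural starting points for the group discussion aimed at building a shared frame of reference.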
The inference of ability traits over and above these behavioural ratings was the next stage of classification. Here, assessors were shown examples of the behaviours representing, or theoretically underlying, each dimension. For each exercise, assessors were trained to rate behaviours first using the behavioural rating checklist. Each behaviour was denoted as being a possible underlying factor for a superordinate dimension. Assessors gave an inferred score for these dimensions on a 5-point scale (Ballantyne & Povah, 1995) within each exercise.

Assessors were then trained in the consensus discussion procedure for dimensional ratings, where at the conclusion of the DC, all assessors presented evidence and critically discussed the ratings they had obtained to form OARs for each participant. Assessors looked at each participant individually, and assessed their performance on each dimension individually. Evaluation on each dimension was backed up by reported behavioural observations by each assigned assessor for each exercise and each participant.

Once the procedure was completed, assessors gained mastery experiences (Bandura, 1982; 1986) through rating role players in two simulated DC exercises: the walkway exercise and the recycling bin exercise. This allowed an opportunity for assessors to compile their own behavioural ratings, from which they extrapolated trait ratings. As Arthur et al. suggested, the assessors then discussed their findings to work towards building a common frame of reference for performance on the exercise or dimension.

Procedure

The DC ran between late 2001 and early 2002, in three separate sessions of about half a day each. The centre was run according to a schedule where each group of participants performed each exercise in turn. Each candidate was given behavioural and trait-based ratings on their performance during the DC.
At the conclusion of each exercise, participants were given coaching feedback by their assessors on their performance, and on what they could have done to improve their performance on a given simulation exercise.

Results and Discussion

Although the work with the government-based organisation was originally intended as a full investigation into DC ratings, the organisation opted out of the project after the DC had been constructed, assessor training had been completed, and 15 participants had completed the DC. It was decided that, while the results of 15 participants could not possibly constitute a meaningful investigation into DC ratings, the sample could act as a pilot study, and indeed a great deal of information, in terms of process improvement, was gained from this precursor. The reader is urged not to draw conclusions based on the following analyses. The results of this pilot should be regarded as a learning device and a precursor to the actual Study Three. A briefer version of the analysis presented in Study Three is, therefore, presented in this pilot study.

Missing data in the Pilot to Study Three were imputed using EM (expectation maximisation), an iterative process for estimating missing values. Out of a total of 1380 potential scores across the two DCs, eight behavioural scores and one trait score were missing. Of an additional set of two traits that were included for analysis, eight scores were missing. More detail on this addition is given below. Thus, in total, 15 scores were missing (nearly a 99% response rate).

As previously discussed in the method section, it should be noted that some data were added to the total set, which were not originally intended for inclusion. The two traits 'conflict resolution' and 'customer service orientation' were not originally intended for assessment in the group discussion exercise in the DC.
The subject matter experts who rated the DC argued that they saw manifestations of these traits in the exercise, and 80% of the raters scored these traits anyway. It was decided that these ratings should be included, as DC design commonly relies heavily on the opinions of subject matter experts (Ballantyne & Povah, 1995; Lowry, 1997) and the situation was beneficial for the ANOVA used in a G study. With the inclusion of these additional ratings, exercises and traits could be considered fully crossed, which meant that the variance attributed to exercises and traits could be considered independently.

Table 49 shows the grand means and standard deviations for the task-specific DC presented for each exercise. Under the task-specific approach, performance on particular exercises is considered the most important unit of measurement. All mean scores vacillated around the 2nd and 3rd points on the rating scale. Standard deviations for the task-specific ratings fluctuated around one rating.

Table 49
Grand Means and SDs of the Behavioural Ratings (Within Exercises) in the Task-Specific DC

Exercise          M      SD
Walkway           2.10   1.00
Recycling Bin     2.50   0.97
Rates             2.45   1.01
Group Exercise    2.12   0.97

Table 50 shows the grand means and standard deviations for the dimension-specific DC. Under the dimension-specific approach, performance on particular dimensions is considered the most important unit of measurement. Like the task-specific DC, average dimensional ratings centred around the 2nd and 3rd points on the rating scale. The mean for the last dimension, 'Innovation', was slightly lower than the others at 1.88. Standard deviations for the dimension-specific ratings fluctuated around one rating.

The present study employed Generalizability Theory (G theory; see Brennan, 2001a; Cronbach, Gleser, Nanda & Rajaratnam, 1972) to analyse data. G studies utilise variance components models that are derived from the mean squares calculated in factorial ANOVAs.
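To make the link between mean squares and variance components concrete, the following sketch estimates the components for a fully crossed persons (p) by exercises (x) by dimensions (d) random-effects design with one score per cell, using the standard expected-mean-squares solution. It is an illustration only, not the software used for the analyses reported here; the function name and data layout are assumptions. Negative estimates can arise with this estimator and are conventionally set to zero, which the sketch does not do.

```python
def variance_components(scores):
    """ANOVA-based variance components for a crossed p x x x d design.

    `scores[p][x][d]` is a nested list holding one score per cell.
    Returns a dict keyed by effect; "pxd,e" is the residual.
    """
    n_p, n_x, n_d = len(scores), len(scores[0]), len(scores[0][0])
    P, X, D = range(n_p), range(n_x), range(n_d)

    # Grand mean and marginal means
    g = sum(scores[p][x][d] for p in P for x in X for d in D) / (n_p * n_x * n_d)
    m_p = [sum(scores[p][x][d] for x in X for d in D) / (n_x * n_d) for p in P]
    m_x = [sum(scores[p][x][d] for p in P for d in D) / (n_p * n_d) for x in X]
    m_d = [sum(scores[p][x][d] for p in P for x in X) / (n_p * n_x) for d in D]
    m_px = [[sum(scores[p][x][d] for d in D) / n_d for x in X] for p in P]
    m_pd = [[sum(scores[p][x][d] for x in X) / n_x for d in D] for p in P]
    m_xd = [[sum(scores[p][x][d] for p in P) / n_p for d in D] for x in X]

    # Mean squares (sum of squares divided by degrees of freedom)
    ms_p = n_x * n_d * sum((m_p[p] - g) ** 2 for p in P) / (n_p - 1)
    ms_x = n_p * n_d * sum((m_x[x] - g) ** 2 for x in X) / (n_x - 1)
    ms_d = n_p * n_x * sum((m_d[d] - g) ** 2 for d in D) / (n_d - 1)
    ms_px = n_d * sum((m_px[p][x] - m_p[p] - m_x[x] + g) ** 2
                      for p in P for x in X) / ((n_p - 1) * (n_x - 1))
    ms_pd = n_x * sum((m_pd[p][d] - m_p[p] - m_d[d] + g) ** 2
                      for p in P for d in D) / ((n_p - 1) * (n_d - 1))
    ms_xd = n_p * sum((m_xd[x][d] - m_x[x] - m_d[d] + g) ** 2
                      for x in X for d in D) / ((n_x - 1) * (n_d - 1))
    ms_res = sum((scores[p][x][d] - m_px[p][x] - m_pd[p][d] - m_xd[x][d]
                  + m_p[p] + m_x[x] + m_d[d] - g) ** 2
                 for p in P for x in X for d in D) / ((n_p - 1) * (n_x - 1) * (n_d - 1))

    # Solve the expected-mean-square equations for each component
    return {
        "p": (ms_p - ms_px - ms_pd + ms_res) / (n_x * n_d),
        "x": (ms_x - ms_px - ms_xd + ms_res) / (n_p * n_d),
        "d": (ms_d - ms_pd - ms_xd + ms_res) / (n_p * n_x),
        "px": (ms_px - ms_res) / n_d,
        "pd": (ms_pd - ms_res) / n_x,
        "xd": (ms_xd - ms_res) / n_p,
        "pxd,e": ms_res,
    }
```

With one observation per cell, the three-way interaction cannot be separated from residual error, which is why the last component is labelled as a confounded term.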
Although statistical significance is not generally considered to be of importance in G theory (Brennan, 2000), the precision of variance component estimates can be expressed using confidence intervals designed specifically for such estimates (Brennan, 2001a). Such confidence intervals cannot be theoretically justified for designs that are unbalanced with respect to nesting (Brennan, 2001b). The task-specific DC was unbalanced with respect to nesting to the extent that it was not viable to extract items in order to contrive a balanced design. Therefore, confidence intervals were not calculated for the task-specific design. Confidence intervals were, however, calculated for the fully crossed dimension-specific design.

Table 50
Means and SDs of the Dimension Ratings (Across Exercises) in the Dimension-Specific DC

Dimension                               M      SD
Process Utilisation                     2.50   0.70
Conflict Resolution                     2.53   0.79
Communication                           2.62   0.92
Technical and Professional Knowledge    2.67   0.77
Customer Service Orientation            2.72   1.41
Stress Tolerance                        2.45   0.81
Innovation                              1.88   0.87

The effects of differences between raters both within and between the DCs were not thought to be of great concern in the present study, because the same raters were used for the same participants across the two DCs. This was so that the effect attributable to raters would be held constant over the task-specific and dimension-specific administrations. Also, different raters assessed different participants in a rotation system in accordance with the suggestions of Lievens (1998). It was hoped that such a system would randomise error associated with rater idiosyncrasy to the greatest extent possible. However, during the course of the DC, this allocation was not always systematic, as raters changed their order and some assessors rated more
The complexities of the unsystematic nesting of assessors disallowed their inclusion in the present G study. To gain an estimate of interrater reliability, equation ICC(1,1) from Shrout and Fleiss (1979) was employed for each DC. ICC(1,1) was relevant to the present sample because each participant was rated by a random combination of assessors who were selected from a larger population of judges.

Specific facets were included in the G study that were instrumental in addressing the research issue at hand. The task-specific DC was a partially nested design, in that each exercise had its own specific set of items. The facets included in the task-specific process were exercises (x), items nested within exercises (i:x), and an estimate of the variance attributable to the object of measurement, persons (p). All interaction terms were also analysed. The dimension-specific DC employed a fully crossed design incorporating the facets exercises (x), dimensions (d), and an estimate of the variance attributable to the object of measurement, persons (p). All interaction terms were also analysed.

Table 51 shows the G study for the comparison between the task-specific and dimension-specific DCs. All variance components and confidence intervals were calculated using urGENOVA ver. 2.1 (Brennan, 2001b). Listed for each type of DC are the object of measurement, facets, and interactions (effects), degrees of freedom (df), variance component estimates (VC), 90% confidence intervals, and the percent of explained variance (explained variance %) as a heuristic for identifying the proportional contribution of various facets to variation in scores (Shavelson & Webb, 1991).

Table 51
Pilot Generalizability Study Comparing a Task-Specific with a Dimension-Specific DC in a Repeated Measures Design

Task-Specific DC*

Effect           df     VC       Explained Variance (%)
p (persons)      14     0.0202    2.0
x (exercises)     3     0.0139    1.4
i (items):x      60     0.1879   18.4
px               42     0.2682   26.3
pi:x,e          840     0.5295   51.9

Dimension-Specific DC

Effect           df     VC       90% Confidence Interval    Explained Variance (%)
p (persons)      14     0.0774   0.0008 < VC < 0.2570       10.2
x (exercises)     3     0.0207   0.0000 < VC < 0.3521        2.7
d (dimensions)    6     0.0694   0.0276 < VC < 0.2778        9.1
px               42     0.2801   0.1904 < VC < 0.4386       36.8
pd               84     0.0068   0.0000 < VC < 0.0337        0.9
xd               18     0.0144   0.0011 < VC < 0.0455        1.9
pxd,e           252     0.2928   0.2544 < VC < 0.3412       38.4

Note: Confidence intervals were calculated using the Ting et al. (1990) procedure described in Brennan (2001a). Ting et al.'s procedure is recommended for random, balanced designs so as to avoid the computation of inaccurately wide intervals.
* Confidence intervals were not provided for the task-specific procedure because the specification of a confidence interval for an unbalanced design is inappropriate (Brennan, 2001b). The task-specific design in this case was too unbalanced to viably contrive a balanced design by removing items.

While the effects x and i:x in the task-specific DC, and x, d, and xd in the dimension-specific DC, are presented in Table 51, these facets alone provide little information of interest to the present study. As with any form of assessment in the selection context, the focus is on person variation across the various facets, because assessment procedures aim to discriminate between people for decision purposes. Therefore, the focus in the present study concerns interactions between persons and facets, and variance component estimates for the object of measurement. Of particular interest in the present study is the interaction term px for both types of DC (Kane, 1982; Kraiger & Teachout, 1990; Lievens, 2001b).
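The "explained variance (%)" column of Table 51 can be reproduced by dividing each variance component by the sum of all components in the same model. The following is a minimal sketch of that arithmetic, not the urGENOVA procedure itself; the function name percent_of_total and the dictionary layout are illustrative assumptions, while the variance component values are those reported in Table 51.

```python
def percent_of_total(components):
    """Express each variance component as a percentage of their sum."""
    total = sum(components.values())
    return {effect: 100.0 * vc / total for effect, vc in components.items()}


# Variance component estimates as reported in Table 51.
task_specific = {
    "p": 0.0202, "x": 0.0139, "i:x": 0.1879, "px": 0.2682, "pi:x,e": 0.5295,
}
dimension_specific = {
    "p": 0.0774, "x": 0.0207, "d": 0.0694, "px": 0.2801,
    "pd": 0.0068, "xd": 0.0144, "pxd,e": 0.2928,
}

task_pct = percent_of_total(task_specific)
dim_pct = percent_of_total(dimension_specific)

for label, pct in (("Task-specific", task_pct), ("Dimension-specific", dim_pct)):
    print(label)
    for effect, value in pct.items():
        print(f"  {effect:8s} {value:5.1f}%")
```

Because the percentages are proportions of the modelled total, they are best read as a heuristic for comparing effects within one design rather than across designs.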
In the task-specific approach, the px interaction was a comparatively high contributor, explaining 26.3% of the variance in the model. A proportionately high px interaction in the task-specific approach is defined by variation in the candidate's performance according to the different situations (exercises) presented to them. In the dimension-specific approach, the px interaction was also comparatively high at 36.8%. Again, this interaction reflects the extent to which candidate performance varied across exercises. The interaction term pd, in the dimension-specific DC, reflects the extent to which dimensions are useful for discriminating between persons (Lievens, 2001a; 2001b). This interaction term explained 0.9% of the variance in the dimension-specific model (Table 51).

Additionally, the effect for the object of measurement, p, was estimated for the task-specific approach, and explained 2.0% of the total variance. The object of measurement, p, for the dimension-specific approach was higher, and explained 10.2% of the variance in scores. This was thought to be influenced by training and design issues that needed rectification. Also, the lack of person discriminability in the task-specific approach could have been influenced by poor interrater reliability, discussed later. Indeed, person variation was poorly estimated in the dimension-specific study, as evidenced by the corresponding wide confidence interval in Table 51. The terms pi:x,e and pxd,e in the task-specific and dimension-specific processes, respectively, are difficult to interpret, purely because they contain the interactions between all facets and the object of measurement in the model, plus undifferentiated random error. Confidence intervals are presented in Table 51 for the dimension-specific model.
All confidence intervals were calculated using the method suggested by Ting, Burdick, Graybill, Jeyaratnam, and Lu (1990), which is generally recommended for random, balanced designs so as to avoid the computation of inaccurately wide intervals for variance component estimates (Brennan, 2001a). Table 51 suggests that particular variance component estimates in the dimension-specific model were poorly estimated, as evidenced by wide confidence intervals, including the effects for p, x, and px in particular. In all probability, estimation was similarly poor for the task-specific approach. These are further reasons that the reader should not place a great deal of confidence in the findings from this study.

G theory acknowledges that in practice, relative and absolute decisions are often made about individuals on the basis of a psychological measure. A relative decision is one in which the performance of individuals is compared with that of other individuals (e.g., norm comparisons present relative decisions where people are compared with one another). An absolute decision is one in which a certain cut-off criterion is employed (e.g., a pass or fail criterion for employment decisions). G theory provides two coefficients for the purposes of relative and absolute decisions that are analogous to reliability coefficients in classical test theory. Tables 52 and 53 provide the equations and calculations, for both types of DC, of $\sigma^2_{Rel}$ (relative error; all of the effects in the G study that contribute variance to relative decisions), $\sigma^2_{Abs}$ (absolute error; all of the effects in the G study that contribute variance to absolute decisions), $E\rho^2_{Rel}$ (the Generalizability or G coefficient; for relative decisions), and $\Phi$ (the Phi coefficient; for absolute decisions). Tables 52 and 53 also provide equation ICC(1,1) from Shrout and Fleiss (1979) as an estimate of interrater reliability across the two types of DC.
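To make the relationship between the variance components and these coefficients concrete, the sketch below applies the standard error-variance formulas for a fully crossed, random p x x x d design to the dimension-specific estimates in Table 51. It is offered under stated assumptions rather than as the exact computation used in the pilot: the function name d_study is hypothetical, and the numbers of exercises (4) and dimensions (7) are taken from the exercises and dimensions listed in Tables 49 and 50 (and agree with the degrees of freedom in Table 51). Under those assumptions the results round to 0.08, 0.10, 0.49 and 0.44.

```python
def d_study(vc, n_x, n_d):
    """Relative/absolute error and G/Phi for a crossed random p x x x d design.

    vc  : dict of variance component estimates keyed by effect label
    n_x : number of exercises averaged over
    n_d : number of dimensions averaged over
    """
    # Relative error: every effect that involves persons (other than p itself).
    rel = vc["px"] / n_x + vc["pd"] / n_d + vc["pxd,e"] / (n_x * n_d)
    # Absolute error: relative error plus the facet main effects and xd.
    abs_err = rel + vc["x"] / n_x + vc["d"] / n_d + vc["xd"] / (n_x * n_d)
    g_coef = vc["p"] / (vc["p"] + rel)
    phi = vc["p"] / (vc["p"] + abs_err)
    return rel, abs_err, g_coef, phi


# Dimension-specific variance components as reported in Table 51.
dimension_specific = {
    "p": 0.0774, "x": 0.0207, "d": 0.0694, "px": 0.2801,
    "pd": 0.0068, "xd": 0.0144, "pxd,e": 0.2928,
}

# Assumed sample sizes: 4 exercises and 7 dimensions.
rel, abs_err, g_coef, phi = d_study(dimension_specific, n_x=4, n_d=7)
print(f"relative error = {rel:.2f}, absolute error = {abs_err:.2f}")
print(f"G coefficient  = {g_coef:.2f}, Phi = {phi:.2f}")
```

The task-specific design is not included here because its items are nested within exercises in an unbalanced way, so the simple balanced formulas above would not apply without further assumptions.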
Table 52 shows that for relative decisions, $E\rho^2_{Rel}$ was calculated at 0.23, and for absolute decisions, $\Phi$ was calculated at 0.21 for the task-specific model. Additionally, ICC(1,1) was calculated as 0.42 for the task-specific model. For the dimension-specific model in Table 53, $E\rho^2_{Rel}$ was calculated at 0.49, $\Phi$ was calculated at 0.44, and ICC(1,1) was calculated as 0.45.

Table 52
Relative and Absolute Error, Generalizability and Phi Coefficients and Interrater Reliability for the Pilot Task-Specific DC

Index                                                             Result
$\sigma^2_{Rel}$                                                  0.07
$\sigma^2_{Abs}$                                                  0.08
$E\rho^2_{Rel} = \sigma^2_p / (\sigma^2_p + \sigma^2_{Rel})$      0.23
$\Phi = \sigma^2_p / (\sigma^2_p + \sigma^2_{Abs})$               0.21
$ICC(1,1) = (BMS - WMS) / (BMS + (k - 1)WMS)$                     0.42

Table 53
Relative and Absolute Error, Generalizability and Phi Coefficients and Interrater Reliability for the Pilot Dimension-Specific DC

Index                                                             Result
$\sigma^2_{Rel}$                                                  0.08
$\sigma^2_{Abs}$                                                  0.10
$E\rho^2_{Rel} = \sigma^2_p / (\sigma^2_p + \sigma^2_{Rel})$      0.49
$\Phi = \sigma^2_p / (\sigma^2_p + \sigma^2_{Abs})$               0.44
$ICC(1,1) = (BMS - WMS) / (BMS + (k - 1)WMS)$                     0.45

It should be noted by the reader that, given the results of the G study, the use of $E\rho^2_{Rel}$ and $\Phi$ in this context is somewhat debatable. It is argued in the original monograph on G theory: "While it is not assumed that p [the variance attributable to the object of measurement] is completely stable during the period to which the universe definition applies, it is taken for granted that p's characteristics fluctuate around a typical value" (Cronbach et al., 1972, p. 363). That is to say, there is at least some stability of responding assumed when employing G and Phi. The use of these coefficients is perhaps questionable because the evidence from the G study suggests, in line with previous research, that the DC ratings reflect situationally specific responses, rather than stable characteristics. However, Cronbach et al.
suggest that when the occasions of assessment are considered as samples of behaviour, it is "mathematically sound to define the universe score as the average over the time span [over which behavioural measurements will be made]" (p. 363). This might reflect overall performance on the exercises as samples of behavioural performance, a conception that seems acceptable in the role of task-specific DCs, where it is necessary to pool results at the end of the process to provide a summary rating for selection purposes (Lowry, 1997). Again, the low results for the indices presented in Tables 52 and 53 suggest that the dependability and reliability of measurement in the pilot study were low, and that the ratings therefore should not be used for decision-making purposes. The pilot study did, however, lead to process gains, and aided the researcher in developing the AC in Study Three.

Appendix II: Introduction to Generalizability Theory

Study Three utilised Generalizability Theory (G theory) (Cronbach, Gleser, Nanda & Rajaratnam, 1972) as a paradigm under which to analyse assessment centre data. This paper provides an opportunity to elucidate G theory for those not accustomed to its alternative view on the concept of dependability and reliability. The focus in this short paper is not to provide a comprehensive account of what has become the holistic tapestry that G theory is today. Such an account is, to date, most fully described in Brennan (2001) and Marcoulides (1998). Rather, the focus is on some of the theoretical aspects of G theory that are often not dealt with in depth, and on helping the reader to form a conceptual grounding that will aid an interpretation and understanding of the foundations of G theory.

G Theory

The Theoretical Stance Underlying G Theory

Cronbach et al. (1972) originally conceptualized G theory as a model for understanding the dependability of behavioral measurements.
They remarked, "The decision maker is almost never interested in the response given to the particular stimulus, objects or questions, to the particular tester, at the particular moment of testing. Some, at least, of these conditions of measurement could be altered without making the score any less acceptable to the decision maker" (p. 15). Thus, it is the score that is considered integral in G theory. The means by which the individual came to earn that score are considered exchangeable with some other, just as acceptable, means. To illustrate, consider an item on a given test. G theory suggests that this item might just as easily be replaced with any other item that could reasonably be expected to measure the same construct. The test designer would deem such an alternative item acceptable. Thus, Cronbach et al. maintain, "The ideal datum on which to base the decision would be something like the person's mean score over all acceptable observations" (p. 15).

Under the notions presented above, G theory presents an alternative view of the dependability of psychological measurement. A dependable measure, under this viewpoint, is one that can accurately generalize from a person's observed score on a test to that person's mean score under all possible conditions that would be acceptable to the test user or decision maker. The interest lies in obtaining a dependable score for a person: the means by which the person came to gain that score can be altered and changed. In this sense, the question asked by G theory is 'Can this person's score, that is, the observed score, generalize to an idealistic score that reflects that person's average over all the possible conditions under which this score could be obtained?' The idealistic score mentioned here is a hypothetical construct, called a universe score.
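The idea of a universe score can be written compactly for the simplest case of persons crossed with a single facet, items. The notation below is not taken from the passage above; it follows the convention commonly used in introductions to G theory (e.g., Shavelson & Webb, 1991) and is offered as a sketch, with $X_{pi}$ assumed to denote person p's observed score on item i.

```latex
% Universe score: the expected value of person p's observed score
% over all admissible items i.
\mu_p \equiv \mathop{E}_{i} X_{pi}

% Decomposition of a single observed score into the grand mean,
% a person effect, an item effect, and a residual.
X_{pi} = \mu + (\mu_p - \mu) + (\mu_i - \mu) + (X_{pi} - \mu_p - \mu_i + \mu)
```

The person effect $(\mu_p - \mu)$ carries the universe score, while the item effect and the residual are the candidate sources of error when generalizing from the observed score.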
Note that a person's measured attributes are considered relatively stable and enduring under this paradigm, i.e., as though they were trait-based, and differences in scores across different occasions of measurement, e.g., across items in a test, or across exercises in an AC, are attributable to one or several sources of error. Both items and exercises from the previous example could be considered as potential sources of error variance. It is these sources of error that G theory first attempts to isolate, and then quantify in terms of their relative contribution to the variance in the scores gained by a person.

Trait-based vs. Situationally Specific Forms of Assessment

Because G theory assumes some kind of situational stability in responding, measures that are intended for responses to specific situations, e.g., task-specific ACs (Lowry, 1997) or work sample exercises, become theoretically problematic on first inspection. Such forms of assessment are task based, in that they do not make the inference of any stable underlying characteristics inherent within an individual, and are often employed in the practice of personnel psychology (Schmidt & Hunter, 1998). As will be seen later, this possible limitation is not problematic when one is at the stage of identifying the various sources of error that contribute to scores. That is to say, regardless of any trait-based assumptions, G studies can be performed on practically any personnel data.

The only time when G theory becomes conceptually challenging, in this regard, is when generalizability coefficients are calculated. It is argued in the original monograph on G theory: "While it is not assumed that p [the variance attributable to the object of measurement] is completely stable during the period to which the universe definition applies, it is taken for granted that p's characteristics fluctuate around a typical value" (Cronbach et al., 1972, p. 363).
That is to say, there is at least some stability of responding assumed when employing G theory. This could be regarded as a limitation of the G study approach when it comes to analyzing task-specific ratings, in that the expected score, in G theory, under any condition is assumed, to some degree, to be the same.

Consider the previous example of the task-specific AC, in which assessment exercises are treated as though they are stand-alone work samples of situationally specific behavior. Cronbach et al. suggest that when the occasions of assessment are considered as samples of behavior, it is "mathematically sound to define the universe score as the average over the time span [over which behavioral measurements will be made]" (p. 363). This might reflect overall performance on the exercises as samples of behavioral performance, a conception that seems acceptable in the role of task-specific ACs, where it is necessary to pool results at the end of the process to provide a summary rating for selection purposes (Lowry, 1997).

As Cronbach et al. mention, the concept of a universe score becomes dubious when an individual's performance is changing appreciably in a regular trend. Certainly no regular trend, for instance performance worsening or improving dramatically, is necessarily intended in a task-specific AC. The wider intention of G theory is to identify relatively stable differences between people on the basis of some measure. Because task-specific ACs include an overall score, it could be argued that there is some general level assumed in performance across exercises that contain similar assessment content. This does not imply the existence of a trait; indeed, it is not necessary to make such an inference in the behavioral model under which task-specific forms of assessment operate. Rather, this could be conceptualized as a general response to a set of readily exchangeable situations that contend with similar subject matter.
In effect, the logic presented here suggests that even with situationally based responding, similar situations will tend to elicit responses from individuals that could be seen to fluctuate around a typical value. The fact remains that the subject matter across the exercises in a task-specific AC (i.e., the situations) is likely to hold similarities. Ahmed, Payne, and Whiddett (1997) suggest in their guidelines for AC exercise construction that the exercises should be related to one another. As such, there might be some generality in responses to such similar situations, i.e., a universe of similar responses elicited by similar situations exists for the type of circumstances assessed. Thus, under a behavioral paradigm, it is arguable that a person's behavior will fluctuate to some degree around a central value, in a task-specific AC, if the situations hold similar characteristics. Indeed, the task-specific ACs in the present dissertation hold very similar characteristics across exercises. Thus, in keeping with the assumptions of Cronbach and his colleagues, G theory should be applicable even to task-specific ACs of this type.

As an aside, given the assumptions of G theory, one might ask why multiple exercises are included in an AC, when one exercise might suffice. This argument goes back to a paradox in classical test theory that is made clear through G theory. In classical test theory, one could quite possibly increase the reliability of an AC by reducing the number of exercises, even down to a single exercise. The less variance attributable to different exercises in this model, the higher the reliability of measurement. This would, in all probability, lead test designers to feel insecure with the assessment of an individual, because the assessment would be confined to the idiosyncrasies of a particular exercise. In G theory, the concept of reliability resolves into an argument for the accuracy of generalization.
One exercise will generalize accurately to a very narrow universe: a universe pertaining to a certain type of exercise. The use of multiple exercises will allow generalization to a much more important universe in practice: a universe of the use of multiple exercises for assessment (Shavelson, Webb & Rowley, 1989).

G Studies

G studies utilize factorial ANOVA models to derive a comprehensive breakdown of the facets that contribute to variance in the scores obtained on a measure.

Factorial ANOVA

A fundamental tool in univariate G theory is factorial ANOVA. Factorial ANOVA can be used to partition the variance in scores into various components. The variables that contribute to variance are called 'factors' in ANOVA and 'facets' in G theory. G theory uses the term 'facet' as opposed to 'factor' to avoid evoking associations with factor analysis (Cronbach et al., 1972). The variance components that are calculated can be used to indicate the relative contribution of a particular facet, or the interactions between multiple facets, to scores. Factorial ANOVA looks at the variance components attributable to single facets (main effects) and interactions, as well as all of the facets in the specified model in combination with one another. Some of these components can be isolated as contributors to error variance. The term that is identified for the interaction between all of the facets in a model is usually defined as the error term, and represents the effect for all of the interactions, plus undifferentiated error. Undifferentiated error is defined by contributors to variance that are unsystematic and are unable to be isolated. For example, someone might be distracted during their completion of a personality test by a loud noise. The loud noise thus presents an uncontrolled source of error variance that is unsystematic and therefore cannot be accounted for.
Factorial ANOVA is used as the tool with which G studies separate potential systematic sources of variance. G studies use the information from a factorial ANOVA to partition error variances and to calculate coefficients, including G coefficients.

Facets in Generalizability Theory

In contrast to classical test theory, which is confined to estimating true scores and combines all sources of error variance together, G theory aims to isolate individual contributors (facets) to the error variance in scores in a single analysis. Indeed, it is this simultaneous partitioning of the sources of error variance that distinguishes G theory from classical test theory. The individual sources of error variance found in a G study can then be used to glean information about how to maximize the dependability of a particular test or measure in a Decision study (D study), by calculating various generalizability coefficients. Aspects of D studies are discussed later.

The Universe of Admissible Observations

G theory defines what is labeled a universe of admissible observations. The universe of admissible observations is a set of "observations that a decision maker is willing to treat as interchangeable for the purposes of making a decision" (Shavelson & Webb, 1991, p. 3). Thus, it is the wider set of observations that a test user would find equally acceptable for a given purpose (Cronbach et al., 1972). Any given observation is treated as a sample from the theoretical universe of observations deemed admissible by a test user or test developer. Note that G theory specifically uses the term population to describe a set of subjects or participants, and uses the term universe to describe a set of facets (Cronbach et al., 1972). To exemplify, consider a universe that has one facet, an identified source of error, called items.
The universe of admissible observations in this case would be the potentially endless set of items that could replace the observed set of items, i.e., the items currently in the test, with the caveat that it must be reasonable to assume that all of these items measure the same construct. That is, they would need to be deemed admissible by the test developer, or test user. Other generalizations about facets can be made in similar ways. There might be a universe of possible forms of a test, or a universe of potential test administrators. For instance, if a test measures intelligence, the score attributed to the internal attribute "intelligence" is thought not to be restricted to the results of one test. It is presumed that the aspects of the test should generalize to aspects of tests purporting to measure the same construct. If this generalization is made, then there is evidence that the test is dependable or generalizable (hence the term 'Generalizability Theory').

It is the facets of a test (e.g., items, forms, administrators) which can lead to errors in generalizing from the test to the universe. Take 'items' for example. If all of the items in the universe of admissible observations for items tend to measure the same trait, and a person's scores on those trait items are similar, then one might expect generalization from a sample of those items to a universe of those items. If the items are not measuring the same construct, and a person's scores differ enormously from one item to the next, generalization from the sample to the universe will be hazardous. This will lead to error in generalizations made about an individual's level on a particular measure. Thus, if items do generalize from a sample of test items to a universe of items deemed to be measuring the same construct, then assumptions can be made as to the efficacy of a test in terms of its ability to make generalizations about a person's level of a particular construct.
As a concrete example, consider an AC. When conducting a G study on an AC, one would specify the universe of admissible observations broadly, so as to encompass as many facets as possible. This is so that the chosen model reflects the reality of the measurement device and so that one can identify which facets actually contributed to the variance in scores. Note that the broader the definition, and hence the more sources of variance that are included in the assessment practice, the more difficult it will be to generalize from the sample to G theory's theoretical ideal score, the universe score.

The universe of admissible observations for a given study, whatever its definition, must reflect the set of observations that would be equally acceptable for the test user's purpose. It is an operational definition of the class of procedures considered in the measurement model (Cronbach et al., 1972). A less elegant way of describing this term would be to label it the perpetual set of exchangeable conditions of facets, which likewise implies that there is a larger set of conditions of facets that could theoretically be exchanged with the ones actually observed.

Defining the universe of admissible observations is all about specifying which facets should be included in a study. A universe of admissible observations can be defined by one facet, two facets, or more. The more facets included in the model, the more complex the model becomes. Returning to the AC example: firstly, a researcher might reason that the different traits specified for measurement in this particular AC might produce error variance in scores. Thus, 'traits' can be specified as the first facet. Secondly, it could be argued that different raters might produce error variance in scores. Therefore, 'raters' could convincingly become the second facet. Error variance might also be attributable to the different simulation exercises that are used in an AC. 'Exercises' would be the third facet.
Similarly, the different occasions on which an AC is run might present some form of error variance. 'Occasions' becomes the fourth facet. The facets prescribed or specified in a G study define the universe of admissible observations. Thus, the model described above presents a complex model to prescribe for a G study. The universe of admissible observations in this AC would be defined by all acceptable traits that could be assessed by all acceptable raters across all acceptable exercises at all acceptable points in time. As can be seen, this definition could easily apply to nearly any dimension-specific AC. There could also be other facets that might sensibly be included in the model.

G studies not only consider the main effects of all of the facets incorporated into a model, but also look at all the possible interactions that could occur between them. As an example, one might consider the effect of an interaction between different raters and different occasions. Interaction effects reveal that main effects are modified by the presence of interactions with other facets in the specified model. They suggest that the main effect cannot be interpreted alone, but should be considered also in terms of its relationship with other facets. It might be that one AC was run on Thursday, and another was run on Friday. On Friday, the raters as a group did not concentrate properly due to eager feelings with regard to the potential activities of the coming weekend. Thus, it is likely that in this case, there will be an interaction between the effects of raters and occasions because rater behavior altered across different days.

The Object of Measurement

Another important aspect that has not yet been considered is that pertaining to the effect of the object of measurement. Indeed, the effect obtained for the variance attributed to the object of measurement is an integral component in G theory.
The object of measurement is the person, animal, or object that is actually being observed and rated. In studies of I/O psychology, this is usually the variance component or effect attributable to persons, or participants. The object of measurement in a G study is initially treated in the same way as the facets are treated: as a source of variance. G studies also look at the interaction between the object of measurement and the other sources of variance in the model. Fundamentally, under the G theory paradigm, the variance attributed to the main effect of the object of measurement is not considered as a source of measurement error. The whole aim, intention and meaning behind the study of individual differences is to evaluate diversity across individuals on the basis of certain measured characteristics. Psychological tests and ACs constitute popular methods by which to assess individual differences in I/O psychology. Thus, the variance arising from differences between the objects of measurement will define a crucial element of G theory. This will be detailed in the section dealing with G coefficients.

Crossed and Nested Designs for G Studies

Two kinds of research designs are generally considered by G theory: crossed and nested designs. A crossed design occurs when every condition of one facet is observed with every condition of another facet. For example, if, in the AC mentioned earlier, every trait were assessed in every exercise, this would mean that traits and exercises were crossed. This is because every condition of one facet (traits) was observed with every condition of another facet (exercises). Crossed designs are more desirable because they ensure that the individual effects of the facets can be separated from one another. From the above example, one would be able to differentiate the individual influence that traits and exercises had on the ratings in the AC.
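The definition of a crossed design can be checked mechanically: two facets are crossed when every combination of their conditions has actually been observed. The short sketch below illustrates this with hypothetical trait and exercise labels (they are placeholders, not the traits or exercises used in the present studies), and the function name is_fully_crossed is likewise an illustrative assumption.

```python
from itertools import product


def is_fully_crossed(observed_pairs, traits, exercises):
    """Return True when every trait was observed in every exercise."""
    required = set(product(traits, exercises))
    return required <= set(observed_pairs)


# Hypothetical conditions of the two facets.
traits = ["trait_a", "trait_b"]
exercises = ["exercise_1", "exercise_2", "exercise_3"]

# Every trait rated in every exercise: a crossed design.
crossed = list(product(traits, exercises))

# trait_b was never rated in exercise_3: no longer fully crossed.
incomplete = [pair for pair in crossed if pair != ("trait_b", "exercise_3")]

print(is_fully_crossed(crossed, traits, exercises))
print(is_fully_crossed(incomplete, traits, exercises))
```

A check of this kind is useful before analysis, because a missing combination means the separate effects of the two facets can no longer be cleanly distinguished.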
The second type of design, a nested design, occurs when two or more conditions of one facet occur with only one specific condition of another facet. To illustrate, if it was decided that three ACs would be run over the course of three days, different participants could be evaluated on each day that the AC was run. Thus, each day will have its own specific set of participants. In such a scenario, participants are said to be nested within days. In effect, nesting produces independent groups that could each contribute to variation in scores. Nested designs are less desirable than crossed designs in G theory, because if one facet is nested within another, it becomes difficult to disentangle the individual effects of the nested facet. The effect of the nested facet becomes inextricably linked with the facet within which it is nested. As such, one cannot obtain a clear idea of the individual influence of the nested facet. However, nested designs are often chosen out of practicality. Crossed designs are often impractical; however, they yield a richer analysis. There is a trade-off when choosing either form of research design.

In the specification of designs for analysis, crossed and nested facets utilize certain symbols to indicate their status. When a facet is crossed with another, the symbolization for persons crossed with test items (i.e., every person completed every test item) would look like: p × i (in that p = persons and i = items). If items were nested within people (i.e., particular groups of people completed particular groups of items) the symbolization would look like: i:p. If p and i were the only facets to be included in the model, then the error term would look like pi,e, where e indicates undifferentiated error. Each facet has a variance component attached to it in a G study. For the calculation of variance components, the interested reader is directed to Shavelson and Webb (1991).
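For the simplest crossed design, p × i with one observation per cell, the variance components can be estimated directly from the ANOVA mean squares using the standard random-effects expected mean square equations. The sketch below is a minimal illustration under that assumption; the function name variance_components_p_by_i and the small score matrix are hypothetical, and setting negative estimates to zero is one common convention rather than a requirement of the theory.

```python
from statistics import mean


def variance_components_p_by_i(scores):
    """Estimate variance components for a crossed p x i random design.

    scores: list of rows, one row per person and one column per item,
    with exactly one observation per cell and no missing data.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[j] for row in scores) for j in range(n_i)]

    # Sums of squares for persons, items, and the residual (pi,e).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected mean square equations; truncate negatives at zero.
    return {
        "p": max((ms_p - ms_res) / n_i, 0.0),
        "i": max((ms_i - ms_res) / n_p, 0.0),
        "pi,e": ms_res,
    }


# Hypothetical ratings: 4 persons (rows) by 3 items (columns).
ratings = [
    [2, 3, 3],
    [1, 2, 2],
    [3, 3, 4],
    [2, 2, 3],
]
print(variance_components_p_by_i(ratings))
```

With more facets, or with nesting, the same logic applies but the expected mean square equations become more involved, which is why dedicated software is normally used.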
The Windows-based statistical programs SPSS and SAS can also compute variance components for G studies. GENOVA, a DOS-based program devoted to research using G theory, is also available for these calculations.

Random and Fixed Facets Under G Theory

G theory takes a distinctive perspective on what it considers to be a random and a fixed sample. It is important to note that in G theory generally, most facets are assumed to have been sampled at random, and thus G theory is, essentially, a random effects model. If a facet is considered to have been sampled at random, then the sample is smaller than the universe of that facet. Take, for example, an AC that has three different simulation exercises. G theory will ordinarily treat these exercises as though they have been sampled from a possibly endless universe of simulation exercises that could potentially have been used in the AC. Thus, exercises would ordinarily be considered a random facet, contingent on the nature of the exercises and on how cogent the justification for defining them as random is. Shavelson, Webb, and Rowley (1989) warn that any inference made from the sample should be directed only at the population from which that sample was drawn. An argument that is often employed to justify G theory's assumption of random variables comes from Bayes' Theorem, from a notion labeled exchangeability. This concept suggests that although the facets have not been sampled in a purely random manner, they may be considered as being sampled at random if the facets that are not included in a given G study could be exchanged with, or are equally acceptable in comparison to, the facets that are included in the G study (Shavelson & Webb, 1981; Shavelson, Webb & Rowley, 1989; Shavelson & Webb, 1991).
Thus, if the designer of an AC would be content with exchanging the exercises in the AC with some other exercises that might perform the same function (at the same level of acceptability), the exercise facet can be considered as being sampled at random. This is an assumption that is made by the theory from the outset, and could present a possible limitation of the theory. The extent to which facets are in fact exchangeable cannot be established without further research into this notion.

A facet is considered fixed in G theory when the conditions relating to it exhaust all of the conditions in the universe of generalization. Thus, generalization from the sample to the universe is not relevant because the entire universe has already been captured by the conditions of the facet. For example, consider research on the effect of the day of the week on AC ratings. If every day of the week were included in the facet, days of the week would need to be considered a fixed variable because there would not be any other conditions (i.e., days) to which to generalize.

D Studies

D studies utilize the information gleaned from G studies to make decisions about the dependability of a given measure. While the purpose of a G study is to estimate variance components, the purpose of a D study is to estimate quantities specific to a particular measurement procedure, and its relationship to a universe of generalization.

Relative and Absolute Decisions

D studies use two different kinds of coefficient that pertain to two different kinds of decision that a test user may wish to make. Both of these decisions have wide applications in employment. The first is referred to as a relative decision. This involves situations where the decision maker is interested in how the individual performed relative to other people. This is analogous to the concept of using norms, where one might claim that an individual scored higher than 60% of his or her peer group.
The second kind of decision that D studies acknowledge is the absolute decision. In G theory, absolute decisions are ones in which no comparison to any peer group is required. These are decisions where a person either passes or fails, or is awarded some score on the basis of a criterion that has nothing to do with the relative standing of individuals. An example of this might be a driving test, where the criterion is set for a person to pass if they answer more than 90% of the test items correctly. This score has nothing to do with how others have performed on the test. It is an absolute decision as opposed to a relative one. Brennan and Kane (1977) are credited with some aspects of applying G theory to absolute decisions.

Universes of Generalization

One of the most important considerations in a D study concerns the universe to which a researcher wishes to generalize, on the basis of the results derived from a particular measure (Brennan, 2001). The universe of generalization is defined as the specific universe to which the researcher wishes to generalize. This consideration relates to whether a given measurement model is considered random or fixed. In sum, considerations given to the universe of generalization ask whether the researcher wishes to generalize to a much larger group. To illustrate, for a development center, the researcher might be interested in generalizing from the scores obtained on the basis of exercises and dimensions used in the process, to those same scores obtained on a greater population of exercises and dimensions. This model, as mentioned earlier, is considered random. The universe of generalization is a direct consideration when calculating and interpreting G Coefficients.

The Generalizability Coefficient

Closely related to the notion of the universe of generalization is the Generalizability Coefficient (G Coefficient).
On a 0-1 scale, a G Coefficient reflects the likelihood that the measure will be able to locate individuals relative to other members in the population. Thus, the G Coefficient focuses on the object of measurement, which usually constitutes individuals. The G Coefficient represents how generalizable the score for an individual would be over exhaustive measurement in a measurement model. Universe score variance is the variance attributable to the ideal score that one wishes to obtain. This ideal score is the average score that an individual would obtain across all the possible measurement conditions in the universe of admissible observations in a particular measurement model.

In the original monograph written on G theory, Cronbach and his colleagues (1972) stated "the tester is interested chiefly in the person tested and only secondarily in the conditions of observation" (p. 2). As stated earlier, the variance in scores that is attributable to the object of measurement, usually the person tested, is fundamental to G theory. In fact, G theory uses the variance attributable to the object of measurement to estimate universe score variance when calculating a G Coefficient. The variance component for the object of measurement is considered as a representative sample of that object of measurement in the universe. This variance component forms the numerator in the calculation of the G Coefficient. As previously mentioned, it could be argued that it is desirable to explain variance through certain facets, commonly through trait-based dimensions in an AC. However, the ultimate aim in the study of individual differences is to locate disparity between individuals in order to characterize their various areas of strength and weakness. The means by which the tester came to conclusions about the differences between individuals are considered secondary to the point that disparity was actually found.
The denominator in the G Coefficient reflects those secondary sources of variance. This is labeled 'expected observed-score variance', and is estimated by combining the variance component for the object of measurement with the other sources of measurement variance included in the definition of the universe of admissible observations. The choice of effects included in the denominator of this equation depends on the type of decision to be made.

There are two kinds of G Coefficient, the choice of which depends on the type of decision that will be made with a particular assessment process. As mentioned previously, G theory recognizes two such decisions: relative and absolute. Error variance is different for the two kinds of decision, and therefore the G Coefficient is calculated differently for each.

Measurement Error

To calculate a G Coefficient, two indices of measurement error are initially calculated for inclusion in the generalizability coefficient formulae, one for each respective decision. The sources contributing to error variance for relative decisions include all of the interactions between the object of measurement and the facets, plus undifferentiated error. This does not include the variance component for the object of measurement, which, as mentioned previously, is an estimate of universe score variance. Relative error includes all of the interactions showing how people differed from one another across the various facets. These features will affect the relative standing of individuals. For a random model and absolute decisions, all of the variance components in the model except the variance component for the object of measurement are included as error in the reliability formula. Figure 1 shows sources of error for relative and absolute decisions in a random design where persons (p) are crossed with items (i), taken from an example in Shavelson and Webb (1991, p. 86).
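The rules just described for a random p X i design can be sketched in code. This is a minimal illustration rather than a definitive implementation. It assumes the standard D-study formulation in which each error component is divided by the number of items averaged over (the parameter `n_items`, which is an assumption not spelled out in the passage above; `n_items=1` gives the single-item case). The function name and the numeric inputs are hypothetical.

```python
def d_study_p_x_i(var_p, var_i, var_pie, n_items=1):
    """Error variances and G Coefficients for a random p X i design.

    var_p   : variance component for persons (universe score variance)
    var_i   : variance component for items
    var_pie : variance component for the residual pi,e
    n_items : number of items averaged over in the D study (assumed)
    """
    # Relative error: interactions with the object of measurement,
    # plus undifferentiated error.
    relative_error = var_pie / n_items
    # Absolute error: every component except the one for persons.
    absolute_error = (var_i + var_pie) / n_items

    return {
        "relative_error": relative_error,
        "absolute_error": absolute_error,
        # Numerator: universe score variance.
        # Denominator: universe score variance plus the relevant error.
        "g_relative": var_p / (var_p + relative_error),
        "g_absolute": var_p / (var_p + absolute_error),
    }


# Hypothetical variance components, for illustration only.
result = d_study_p_x_i(var_p=4.0, var_i=1.0, var_pie=2.0, n_items=2)
print(result)
```

Because absolute error includes the item main effect as well as the residual, the coefficient for absolute decisions can never exceed the one for relative decisions in this design.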
The term pi,e refers to the interaction between persons and items, together with undifferentiated error (e). The shaded parts indicate which components should be included in the calculation of measurement error for each respective decision. The concepts presented in Figure 1 can be taken as a rule of thumb, and although the design in Figure 1 is reasonably simple, the rules are applicable to other, more complex designs.

Figure 1. Sources of error for Relative and Absolute Decisions for a Random p X i Design. (Panels: Relative Error; Absolute Error.)

Note: From Generalizability Theory (p. 86) by R. J. Shavelson & N. M. Webb, 1991, CA: Sage Publications. Copyright 1991, Sage Publications. Reprinted with permission.

For a relative decision ($\sigma^2_{Rel}$) with the same design as in Figure 1, the equation for the estimated relative error variance would be:

$$\sigma^2_{Rel} = \sigma^2_{pi,e}$$