Browsing by Author "Smith HL"
- Lost in the Forest (Cold Spring Harbor Laboratory, 2022)
  Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC
  To date, there remains no satisfactory solution for absent levels in random forest models. Absent levels are levels of a predictor variable encountered during prediction for which no explicit rule exists. Imposing an order on nominal predictors allows absent levels to be integrated and used for prediction. The ordering of predictors has traditionally been via class probabilities with absent levels designated the lowest order. Using a combination of simulated data and pathogen source-attribution models using whole-genome sequencing data, we examine how the method of ordering predictors with absent levels can (i) systematically bias a model, and (ii) affect the out-of-bag error rate. We show that the traditional approach is systematically biased and underestimates out-of-bag error rates, and that this bias is resolved by ordering absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordering predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are designated an order according to their similarity to each of the other levels in the training data. We show that the PCO method performs at least as well as the traditional approach of ordering and is not biased.
- Lost in the Forest: Encoding categorical variables and the absent levels problem (Springer Nature, 2024-04-10)
  Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC; Gama J
  Levels of a predictor variable that are absent when a classification tree is grown can not be subject to an explicit splitting rule. This is an issue if these absent levels are present in a new observation for prediction. To date, there remains no satisfactory solution for absent levels in random forest models. Unlike missing data, absent levels are fully observed and known. Ordinal encoding of predictors allows absent levels to be integrated and used for prediction. Using a case study on source attribution of Campylobacter species using whole genome sequencing (WGS) data as predictors, we examine how target-agnostic versus target-based encoding of predictor variables with absent levels affects the accuracy of random forest models. We show that a target-based encoding approach using class probabilities, with absent levels designated the highest rank, is systematically biased, and that this bias is resolved by encoding absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordinal encoding predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are encoded according to their similarity to each of the other levels in the training data. We show that the PCO-encoding method performs at least as well as the target-based approach and is not biased.
- Out of (the) bag—encoding categorical predictors impacts out-of-bag samples (PeerJ Inc., 2024-01-01)
  Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC; Aleem M
  Performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations which are OOB when a tree is trained may serve as a test set for that tree and predictions from the OOB observations used to calculate OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method, and when the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate, and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree based methods that utilise a target-based encoding method.
- Transmission pathways of Campylobacter jejuni between humans and livestock in rural Ethiopia are highly complex and interdependent (BioMed Central Limited, London, United Kingdom, 2025-12-01)
  Singh N; Thystrup CAN; Hassen BM; Bhandari M; Rajashekara G; Hald TM; Manary MJ; McKune SL; Hassen JY; Smith HL; Marshall JC; French NP; Havelaar AH; Mekuria ZH; Weldesenbet YD; Yang Y; Li X; Gebreyes W; Shaikh N; Bhrane M; Dawid MM; Usmail MM; Deblais L; Mechlowitz K; Umer KA; Roba KT; Hassen KA; Amin JK; Usmane IA; Ahmed IA; Yimer G; Yusuf EA; Chen D; Saleem C; Ahmedo BU; Ojeda AE; Ibrahim AM; Seran AJ
  Background: Campylobacter jejuni and C. coli are the most common causes of bacterial enteritis worldwide whereas symptomatic and asymptomatic infections are associated with stunting in children in low- and middle-income countries. Little is known about their sources and transmission pathways in low- and middle-income countries, and particularly for infants and young children. We assessed the genomic diversity of C. jejuni in Eastern Ethiopia to determine the attribution of infections in infants under 1 year of age to livestock (chickens, cattle, goats and sheep) and other humans (siblings, mothers).
  Results: Among 287 C. jejuni isolates, 48 seven-gene sequence types (STs), including 11 previously unreported STs were identified. Within an ST, the core genome STs of multiple isolates differed in fewer than five alleles. Many of these isolates do not belong to the most common STs reported in high-resource settings, and of the six most common global STs, only ST50 was found in our study area. Isolates from the same infant sample were closely related, while those from consecutive infant samples often displayed different STs, suggesting rapid clearance and new infection. Four different attribution models using different genomic profiling methods, assumptions and estimation methods predicted that chickens are the primary reservoir for infant infections. Infections from chickens are transmitted with or without other humans (mothers, siblings) as intermediate sources. Model predictions differed in terms of the relative importance of cattle versus small ruminants as additional sources.
  Conclusions: The transmission pathways of C. jejuni in our study area are highly complex and interdependent. While chickens are the most important reservoir of C. jejuni, ruminant reservoirs also contribute to the infections. The currently nonculturable species Candidatus C. infans is also highly prevalent in infants and is likely anthroponotic. Efforts to reduce the colonization of infants with Campylobacter and ultimately stunting in low-resource settings are best aimed at protecting proximate sources such as caretakers’ hands, food and indoor soil through tight integration of the currently siloed domains of nutrition, food safety and water, sanitation and hygiene.