Out of (the) bag—encoding categorical predictors impacts out-of-bag samples

dc.citation.volume: 10
dc.contributor.author: Smith HL
dc.contributor.author: Biggs PJ
dc.contributor.author: French NP
dc.contributor.author: Smith ANH
dc.contributor.author: Marshall JC
dc.contributor.editor: Aleem M
dc.date.accessioned: 2024-12-09T21:00:47Z
dc.date.available: 2024-12-09T21:00:47Z
dc.date.issued: 2024-01-01
dc.description.abstract: Performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations which are OOB when a tree is trained may serve as a test set for that tree and predictions from the OOB observations used to calculate OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method, and when the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate, and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree based methods that utilise a target-based encoding method.
dc.description.confidential: false
dc.edition.edition: 2024
dc.format.pagination: 1-18
dc.identifier.citation: Smith HL, Biggs PJ, French NP, Smith ANH, Marshall JC. (2024). Out of (the) bag—encoding categorical predictors impacts out-of-bag samples. PeerJ Computer Science. 10. (pp. 1-18).
dc.identifier.doi: 10.7717/peerj-cs.2445
dc.identifier.eissn: 2376-5992
dc.identifier.elements-type: journal-article
dc.identifier.issn: 2376-5992
dc.identifier.number: e2445
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/72242
dc.language: English
dc.publisher: PeerJ Inc.
dc.publisher.uri: https://peerj.com/articles/cs-2445/#
dc.relation.isPartOf: PeerJ Computer Science
dc.subject: Absent levels
dc.subject: Categorical predictors
dc.subject: Label encoding
dc.subject: Out-of-bag error
dc.subject: Random forest
dc.subject: Variable importance
dc.title: Out of (the) bag—encoding categorical predictors impacts out-of-bag samples
dc.type: Journal article
pubs.elements-id: 492364
pubs.organisational-group: Other
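Illustrative note: the mechanism described in the abstract can be sketched in a few lines of Python. This is not the authors' simulation code (that is in the attached OOB_simulation_code.txt) and it does not fit a random forest. It is a minimal sketch under stated assumptions: a single categorical predictor that is pure noise, a binary target, one bootstrap sample, and a one-split "stump" that predicts class 1 when the target-encoded value exceeds 0.5. The function names `target_encode` and `oob_accuracy` and all parameter values are hypothetical.

```python
import numpy as np


def target_encode(levels, y, n_levels):
    """Encode each level as the mean of the binary target among rows with that level.

    Levels with no rows fall back to the overall mean of y.
    """
    sums = np.bincount(levels, weights=y, minlength=n_levels)
    counts = np.bincount(levels, minlength=n_levels)
    return np.where(counts > 0, sums / np.maximum(counts, 1), y.mean())


def oob_accuracy(seed=0, n=2000, n_levels=500):
    """Return OOB accuracy for encoding done before vs. after bagging.

    The predictor carries no information about the target, so an honest
    accuracy estimate should sit near 0.5.
    """
    rng = np.random.default_rng(seed)
    levels = rng.integers(0, n_levels, size=n)      # noise-only categorical predictor
    y = rng.integers(0, 2, size=n).astype(float)    # target independent of the predictor

    in_bag = rng.integers(0, n, size=n)             # one bootstrap sample (row indices)
    oob = np.setdiff1d(np.arange(n), in_bag)        # rows never drawn are out-of-bag

    # (a) Encoding before bagging: every row, including OOB rows, informs the encoding.
    enc_full = target_encode(levels, y, n_levels)
    # (b) Encoding after bagging: only in-bag rows inform the encoding.
    enc_bag = target_encode(levels[in_bag], y[in_bag], n_levels)

    # Stand-in for a tree: predict class 1 when the encoded value exceeds 0.5.
    truth = y[oob] == 1
    acc_full = np.mean((enc_full[levels[oob]] > 0.5) == truth)
    acc_bag = np.mean((enc_bag[levels[oob]] > 0.5) == truth)
    return acc_full, acc_bag


if __name__ == "__main__":
    acc_full, acc_bag = oob_accuracy()
    print(f"OOB accuracy, encoding before bagging: {acc_full:.3f}")
    print(f"OOB accuracy, encoding after bagging:  {acc_bag:.3f}")
```

In this sketch the encoding computed before bagging already contains each OOB row's own label, so OOB accuracy looks better than chance even though the predictor is noise. Computing the encoding from in-bag rows only removes that leak, which is consistent with the abstract's recommendation to evaluate on data the encoding has not seen.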
Files
Original bundle
Name:
492364 PDF.pdf
Size:
2.34 MB
Format:
Adobe Portable Document Format
Description:
Published version.pdf
Name:
supp1.pdf
Size:
17.86 KB
Format:
Adobe Portable Document Format
Description:
Evidence 3.pdf
Name:
OOB_simulation_code.txt
Size:
12.39 KB
Format:
Plain Text
Description:
Evidence 1.txt
Name:
supp2.pdf
Size:
51.08 KB
Format:
Adobe Portable Document Format
Description:
Evidence 2.pdf
License bundle
Name:
license.txt
Size:
9.22 KB
Format:
Plain Text