An empirical study on the effectiveness of data resampling approaches for cross-project software defect prediction

Bennin KE; Tahir A; MacDonell SG; Börstler J

doi:10.1049/sfw2.12052

dc.citation.issue	2
dc.citation.volume	16
dc.contributor.author	Bennin KE
dc.contributor.author	Tahir A
dc.contributor.author	MacDonell SG
dc.contributor.author	Börstler J
dc.date.accessioned	2023-11-23T01:28:26Z
dc.date.accessioned	2024-07-25T06:45:20Z
dc.date.available	2021-11-28
dc.date.available	2023-11-23T01:28:26Z
dc.date.available	2024-07-25T06:45:20Z
dc.date.issued	2022-04
dc.description.abstract	Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect-prediction datasets is class imbalance, that is, highly skewed datasets where non-buggy modules greatly outnumber buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE, Borderline-SMOTE, Random Oversampling and ADASYN) and three undersampling approaches (Random Undersampling, Tomek Links and One-sided selection) is investigated, and results are compared to approaches without data resampling. The authors examined six defect prediction models on 34 datasets extracted from the PROMISE repository. The authors' results show that there is a significant positive effect of data resampling on CPDP performance, suggesting that software quality teams and researchers should consider applying data resampling approaches for improved recall (pd) and g-measure prediction performance. However, if the goal is to improve precision and reduce the false alarm rate (pf), then data resampling approaches should be avoided.
dc.description.confidential	false
dc.edition.edition	April 2022
dc.format.pagination	185-199
dc.identifier.author-url	http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000723085500001&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=c5bb3b2499afac691c2e3c1a83ef6fef
dc.identifier.citation	Bennin KE, Tahir A, MacDonell SG, Börstler J. (2022). An empirical study on the effectiveness of data resampling approaches for cross-project software defect prediction. IET Software. 16. 2. (pp. 185-199).
dc.identifier.doi	10.1049/sfw2.12052
dc.identifier.eissn	1751-8814
dc.identifier.elements-type	journal-article
dc.identifier.issn	1751-8806
dc.identifier.uri	https://mro.massey.ac.nz/handle/10179/70798
dc.language	English
dc.publisher	John Wiley and Sons Ltd on behalf of The Institution of Engineering and Technology
dc.publisher.uri	https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/sfw2.12052
dc.relation.isPartOf	IET Software
dc.rights	(c) 2021 The Author/s
dc.rights	CC BY-NC-ND 4.0
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/deed.en
dc.subject	class imbalance
dc.subject	defect prediction
dc.subject	software metrics
dc.subject	software quality
dc.title	An empirical study on the effectiveness of data resampling approaches for cross-project software defect prediction
dc.type	Journal article
pubs.elements-id	451776
pubs.organisational-group	Other
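
The abstract above names Random Oversampling among the resampling approaches and reports results in terms of recall (pd), false alarm rate (pf) and g-measure. The record itself does not define these, so the following is only a minimal sketch under stated assumptions: labels are 1 for buggy and 0 for non-buggy, g-measure is taken to be the harmonic mean of pd and (1 - pf) as is common in defect-prediction studies, and the function names `random_oversample` and `pd_pf_g` are hypothetical. It is not the authors' implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of rows (assumed form of Random Oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority_size = max(counts.values())
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    # Number of extra copies needed to reach the majority-class size.
    needed = majority_size - counts[minority]
    extra = [rng.choice(minority_idx) for _ in range(needed)]
    new_rows = list(rows) + [rows[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_rows, new_labels


def pd_pf_g(actual, predicted):
    """Return (pd, pf, g-measure) for binary labels, 1 = buggy.
    g-measure is assumed to be 2 * pd * (1 - pf) / (pd + (1 - pf))."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    pf = fp / (fp + tn) if (fp + tn) else 0.0   # false alarm rate
    denom = pd + (1 - pf)
    g = 2 * pd * (1 - pf) / denom if denom else 0.0
    return pd, pf, g


if __name__ == "__main__":
    # Toy data: 8 non-buggy modules and 2 buggy modules.
    rows = [[i] for i in range(10)]
    labels = [0] * 8 + [1] * 2
    _, balanced = random_oversample(rows, labels)
    print(Counter(balanced))
    print(pd_pf_g([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because oversampling adds buggy examples to the training data, a model trained on it tends to flag more modules as buggy, which is consistent with the trade-off the abstract reports: higher pd and g-measure, at the cost of precision and pf.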

Files

Original bundle

Name: Published version.pdf
Size: 1.31 MB
Format: Adobe Portable Document Format

Collections

Journal Articles