Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review


Abstract

Data imbalance is common in production data: controlled production settings require data to fall within a narrow range of variation, and data are collected with quality assessment in mind rather than data-analytic insight. This imbalance negatively impacts the predictive performance of models on underrepresented observations. We propose sampling to adjust for this imbalance, with the goal of improving the performance of models trained on historical production data. We investigate three sampling approaches, each of which downsamples the covariates in the training data before a regression model is fitted, and compare the predictive power of models trained on the sampled data with that of models trained on the original data. We apply our methods to a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production and find that fitting a model on the sampled data gives a small reduction in overall predictive performance but systematically better performance on underrepresented observations. In addition, the results emphasize the need for alternative, fair, and balanced model evaluations.
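
The abstract does not name the three sampling approaches, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a single covariate, a simple capped-per-bin downsampling rule, and an ordinary least-squares line as the regression model; the function names (`capped_bin_downsample`, `fit_linear`, `rmse`), the synthetic data, and the "rare" threshold of 2.0 are all hypothetical choices made for illustration.

```python
import numpy as np

def capped_bin_downsample(x, n_bins=10, cap=50, seed=0):
    """Keep at most `cap` observations per equal-width bin of covariate x.

    Hypothetical stand-in for a sampling approach; returns sorted row indices.
    """
    rng = np.random.default_rng(seed)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Interior edges give bin labels 0 .. n_bins - 1.
    bins = np.digitize(x, edges[1:-1])
    keep = []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        if idx.size > cap:  # thin out only the overrepresented bins
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

def fit_linear(x, y):
    """Ordinary least-squares fit of y = a + b * x; returns [a, b]."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(coef, x, y):
    """Root-mean-square error of the fitted line on (x, y)."""
    pred = coef[0] + coef[1] * x
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Synthetic imbalanced data (an assumption): most points lie near x = 0.
rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=5000)
y = np.exp(0.5 * x) + rng.normal(0.0, 0.1, size=x.size)

keep = capped_bin_downsample(x, n_bins=10, cap=50, seed=0)
coef_full = fit_linear(x, y)              # trained on the original data
coef_samp = fit_linear(x[keep], y[keep])  # trained on the sampled data

# Evaluate overall and on the underrepresented region separately.
rare = np.abs(x) > 2.0
print("overall RMSE (original, sampled):",
      rmse(coef_full, x, y), rmse(coef_samp, x, y))
print("rare    RMSE (original, sampled):",
      rmse(coef_full, x[rare], y[rare]), rmse(coef_samp, x[rare], y[rare]))
```

Reporting the error on the rare region separately from the overall error is the point of the sketch: an aggregate metric is dominated by the densely populated region and can hide poor performance on underrepresented observations.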
Original language: English
Title of host publication: Proceedings of Workshop on Data-Centric AI
Number of pages: 5
Publication date: 2021
Publication status: Published - 2021
Event: Data-Centric AI Virtual Workshop
Duration: 17 Nov 2021 to 18 Nov 2021

Conference

Conference: Data-Centric AI Virtual Workshop
Period: 17/11/2021 to 18/11/2021
