Variable selection wrapper in presence of correlated input variables for random forest models

Marta Rotari*, Murat Kulahci

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

In most data analytic applications in manufacturing, understanding the data-driven models plays a crucial role in complementing the engineering knowledge about the production process. Identifying relevant input variables, rather than only predicting the response through some “black-box” model, is of great interest in many applications. There is, therefore, a growing focus on describing the contributions of the input variables to the model in the form of “variable importance”, which is readily available in certain machine learning methods such as random forest (RF). Once the variables have been ranked by their importance measure, the question arises of how many of them are truly relevant for predicting the output variable. In this study, we focus on the Boruta algorithm, a wrapper around the RF model that performs variable selection based on the RF variable importance measure. It has previously been shown in the literature that correlation among the input variables, which is common in high-dimensional data, distorts and overestimates the importance of variables. The Boruta algorithm is also affected by this, which results in a larger set of input variables being deemed important. To overcome this issue, we propose an extension of the Boruta algorithm for correlated data that exploits the conditional importance measure. This extension greatly improves the Boruta algorithm when the variables are highly correlated and provides a more precise ranking of the variables that significantly contribute to the response. We believe this approach can be used in many industrial applications by providing more transparency and understanding of the process.
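
To make the wrapper idea concrete, the following is a minimal sketch of the shadow-variable comparison that Boruta is built on: each input is permuted to create a "shadow" copy with no real relation to the response, and a real variable scores a hit when its importance exceeds the best shadow importance. This is not the authors' implementation, and it does not implement the proposed conditional extension. Several simplifications are assumptions made here for illustration: the importance score is the absolute Pearson correlation with the response rather than an RF importance, a plain hit-rate cut-off replaces Boruta's statistical test, and the names importance, shadow_selection and min_hit_rate are hypothetical.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def importance(column, y):
    # Stand-in score (assumption): |correlation| with the response.
    # Boruta itself uses random forest importance here.
    return abs(pearson(column, y))


def shadow_selection(columns, y, n_iter=50, min_hit_rate=0.9, seed=0):
    """Count how often each variable beats the best shadow variable."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Shadow variables: permuted copies of every input column.
        shadow_scores = []
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(importance(shadow, y))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if importance(values, y) > threshold:
                hits[name] += 1
    # Simplified decision rule (assumption): fixed hit-rate cut-off.
    selected = [k for k, h in hits.items() if h / n_iter >= min_hit_rate]
    return selected, hits


# Toy data: only x1 drives the response; x2 is merely correlated with x1.
data_rng = random.Random(1)
n = 200
x1 = [data_rng.gauss(0, 1) for _ in range(n)]
x2 = [v + 0.3 * data_rng.gauss(0, 1) for v in x1]
x3 = [data_rng.gauss(0, 1) for _ in range(n)]
y = [2 * v + data_rng.gauss(0, 1) for v in x1]

columns = {"x1": x1, "x2_correlated": x2, "x3_noise": x3}
selected, hits = shadow_selection(columns, y)
print(selected, hits)
```

In this toy setup both x1 and x2_correlated clear the shadow threshold, even though x2_correlated does not enter the response. A marginal score cannot tell the two apart, which mirrors the inflation of the selected set under correlation that the abstract describes and motivates a conditional importance measure.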

Original language: English
Journal: Quality and Reliability Engineering International
Volume: 40
Issue number: 1
Pages (from-to): 297-312
ISSN: 0748-8017
Publication status: Published - 2024

Keywords

  • Additive manufacturing
  • Boruta algorithm
  • Conditional importance
  • Random forest
  • Variable selection algorithm
