Forward variable selection for random forest models

Jasper Velthoen*, Juan Juan Cai, Geurt Jongbloed

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Random forest is a popular prediction approach for handling high dimensional covariates. However, it often becomes infeasible to interpret the obtained high dimensional and non-parametric model. Aiming for an interpretable predictive model, we develop a forward variable selection method using the continuous ranked probability score (CRPS) as the loss function. Our stepwise procedure selects at each step a variable that minimizes the CRPS risk, and a stopping criterion for selection is designed based on an estimation of the CRPS risk difference of two consecutive steps. We provide mathematical motivation for our method by proving that, in a population sense, the method attains the optimal set. In a simulation study, we compare the performance of our method with an existing variable selection method, for different sample sizes and correlation strengths of covariates. Our method is observed to have a much lower false positive rate. We also demonstrate an application of our method to statistical post-processing of daily maximum temperature forecasts in the Netherlands. Our method selects about 10% of the covariates while retaining the same predictive power.
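
The stepwise idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (crps_ensemble, forward_select), the fixed threshold min_gain used as a stopping rule, and the toy risk function are all assumptions made for illustration. In the paper the CRPS risk is estimated from a random forest and the stopping criterion is based on an estimate of the risk difference between two consecutive steps. The CRPS formula used here is the standard empirical form, E|X - y| - 0.5 E|X - X'|.

```python
from typing import Callable, FrozenSet, List, Sequence


def crps_ensemble(members: Sequence[float], y: float) -> float:
    """Empirical CRPS of an ensemble forecast for the observation y.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are
    independent draws from the empirical distribution of the ensemble.
    """
    m = len(members)
    term_obs = sum(abs(x - y) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return term_obs - 0.5 * term_spread


def forward_select(
    candidates: Sequence[str],
    risk: Callable[[FrozenSet[str]], float],
    min_gain: float,
) -> List[str]:
    """Greedy forward selection driven by a user-supplied risk estimate.

    `risk` is a hypothetical callable returning an estimated CRPS risk for
    a set of variables. `min_gain` is an assumed, simplified stopping rule:
    stop once the best candidate lowers the risk by no more than min_gain.
    """
    selected: List[str] = []
    remaining = list(candidates)
    current = risk(frozenset())
    while remaining:
        # Pick the variable whose addition gives the lowest estimated risk.
        best = min(remaining, key=lambda v: risk(frozenset(selected + [v])))
        best_risk = risk(frozenset(selected + [best]))
        if current - best_risk <= min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = best_risk
    return selected


def toy_risk(subset: FrozenSet[str]) -> float:
    """Made-up risk: 'a' and 'b' are informative, 'c' is pure noise."""
    return 1.0 - 0.5 * ("a" in subset) - 0.3 * ("b" in subset)


chosen = forward_select(["a", "b", "c"], toy_risk, min_gain=0.05)
```

With the toy risk above, the loop adds a, then b, and stops before adding c because c does not reduce the risk. In practice the risk callable would wrap a fitted forest and average the CRPS over held-out or out-of-bag observations.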

Original language: English
Pages (from-to): 2836-2856
Number of pages: 21
Journal: Journal of Applied Statistics
Volume: 50 (2023)
Issue number: 13
Publication status: Published - 2022

Keywords

  • correlated covariates
  • CRPS
  • forward selection
  • Random forests
  • variable selection
