Description
Hello!
Thank you for the amazing work.
I noticed that the Subsampling for class imbalances article recommends subsampling to address class imbalances. I would like to suggest adding a note of caution about the potentially deleterious effects of under- and oversampling methods. The most recent evidence points to severe harm to calibration and little to no benefit for discrimination. This literature does seem largely limited to binary classification, though. Happy to make a PR if agreed. Disclaimer: my personal bias comes from the clinical prediction world.
Recent references on the harms of class imbalance corrections
- Carriero et al. (2025). The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study. https://doi.org/10.1002/sim.10320
- Piccininni et al. (2024). Understanding random resampling techniques for class imbalance correction and their consequences on calibration and discrimination of clinical risk prediction models. https://doi.org/10.1016/j.jbi.2024.104666
Reference on the equivalence between oversampling and decision threshold selection
Text to be removed
In the Subsampling the data section, it is said that:
"However, subsampling almost always produces models that are better calibrated, meaning that the distributions of the class probabilities are more well behaved. As a result, the default 50% cutoff is much more likely to produce better sensitivity and specificity values than they would otherwise."
I am not sure how this could be the case. For instance, a logistic regression model fit to sub-/over-sampled data will be poorly calibrated, even with infinite data, because its intercept reflects the resampled class proportions rather than the true prevalence. Following Assunção et al. (2024), setting the probability cutoff to the outcome prevalence seems to suffice, without harming calibration. It may also be beneficial to warn about the difference between a class imbalance problem and a sample size problem.
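To make the intercept argument concrete, here is a minimal sketch in plain Python. It assumes an idealized setting that is not taken from the article: every event is kept, non-events are kept at random with probability `keep_rate` independently of the predictors, and the model is correctly specified. Under those assumptions the odds learned from the undersampled data are the true odds divided by `keep_rate`, which for logistic regression is a shift of the intercept only. The function names and the 5% prevalence are hypothetical and chosen for illustration.

```python
import math

def risk_after_undersampling(p_true, keep_rate):
    """Large-sample risk a model would report after random undersampling.

    Assumes all events are kept and non-events are kept with probability
    keep_rate, independently of the predictors (illustrative assumption).
    """
    odds_true = p_true / (1.0 - p_true)
    odds_sampled = odds_true / keep_rate  # non-events became rarer
    return odds_sampled / (1.0 + odds_sampled)

def recalibrate(p_sampled, keep_rate):
    """Undo the shift: map a risk from the undersampled model to the true scale."""
    odds_sampled = p_sampled / (1.0 - p_sampled)
    odds_true = odds_sampled * keep_rate
    return odds_true / (1.0 + odds_true)

prevalence = 0.05  # hypothetical outcome prevalence
# Keep rate that yields a 50/50 class balance in expectation.
keep_rate = prevalence / (1.0 - prevalence)

for p_true in (0.01, 0.05, 0.20):
    p_sampled = risk_after_undersampling(p_true, keep_rate)
    print(f"true risk {p_true:.2f} -> reported risk {p_sampled:.3f}")
```

In this setting a true risk equal to the prevalence maps exactly to 0.5, so applying the default 50% cutoff to the balanced model classifies the same cases as applying a cutoff at the prevalence to a model fit on the original data. The reported risks are inflated for every case, though, which is the calibration harm described above.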
Thanks again,
Giuliano