library(rtemis)
14 Handling Imbalanced Data
In classification problems, it is common for outcome classes to appear with different frequencies. This is called imbalanced data. Consider, for example, a binary classification problem where the positive class (the ‘events’) appears with 5% probability. Applying a learning algorithm naively, without accounting for this class imbalance, may lead to a model that always predicts the majority class, which automatically results in 95% accuracy.
To handle imbalanced data, we take specific steps both during model training and during model assessment.
14.1 Model Training
There are a few different ways to address the problem of imbalanced data during training. We’ll consider the three main ones:
Inverse Frequency Weighting
We weigh each case based on the frequency of its class, such that less frequent classes are up-weighed. This is called Inverse Frequency Weighting (IFW), and is enabled by default in rtemis for all classification learning algorithms that support case weights. The logical argument ifw controls whether IFW is used. It is TRUE by default in all learners.
Upsampling the minority class
We randomly sample from the minority class to reach the size of the majority class. The effect is not very different from upweighing using IFW. The logical argument upsample, in all rtemis learners that support classification, controls whether upsampling of the minority class should be performed. (If it is set to TRUE, it makes the ifw argument irrelevant, since the sample becomes balanced.)
Downsampling the majority class
Conversely, we randomly subsample the majority class to reach the size of the minority class. The logical argument downsample controls this behavior.
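The three strategies can be sketched in a few lines of base R (a toy illustration with a hypothetical factor outcome y; rtemis performs the equivalent steps internally):

```r
set.seed(2019)
y <- factor(c(rep(1, 95), rep(0, 5)))  # imbalanced binary outcome

# 1. Inverse Frequency Weighting: weigh each case by 1 / its class frequency
class_freqs <- table(y) / length(y)
case_weights <- as.numeric(1 / class_freqs[as.character(y)])

# 2. Upsampling: sample the minority class with replacement up to the majority size
idx_maj <- which(y == "1")
idx_min <- which(y == "0")
idx_upsampled <- c(idx_maj, sample(idx_min, length(idx_maj), replace = TRUE))

# 3. Downsampling: subsample the majority class down to the minority size
idx_downsampled <- c(sample(idx_maj, length(idx_min)), idx_min)
```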
14.2 Classification model performance metrics
During model selection as well as model assessment, it is crucial to use metrics that take into consideration imbalanced outcomes.
The following metrics address the issue in different ways and are reported by the modError
function in all classification problems:
Balanced Accuracy (the mean per-class Sensitivity) \[\frac{1}{K}\sum_{i=1}^K Sensitivity_i\] In the binary case, this is equal to the mean of Sensitivity and Specificity.
F1 Harmonic mean of Sensitivity (aka Recall) and Positive Predictive Value (aka Precision) \[F_1 = 2\frac{precision * recall}{precision + recall}\]
AUROC (Area under the ROC curve), i.e. the area under the True Positive Rate vs. False Positive Rate curve, or Sensitivity vs. 1 − Specificity
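As a minimal illustration (not the modError() implementation), these metrics can be computed from a 2x2 confusion matrix; the counts below are taken from the uncorrected GLM test summary in section 14.4.1:

```r
# Rows: reference class; columns: estimated class; positive class ("1") first
cm <- matrix(c(728, 9,
               24, 12),
             nrow = 2, byrow = TRUE,
             dimnames = list(Reference = c("1", "0"), Estimated = c("1", "0")))

sensitivity <- cm["1", "1"] / sum(cm["1", ])  # 0.9878 (aka Recall)
specificity <- cm["0", "0"] / sum(cm["0", ])  # 0.3333
precision   <- cm["1", "1"] / sum(cm[, "1"])  # 0.9681 (aka PPV)

(sensitivity + specificity) / 2                          # Balanced Accuracy: 0.6606
2 * precision * sensitivity / (precision + sensitivity)  # F1: 0.9778
```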
14.3 Example dataset
Let’s look at a very imbalanced dataset from the Penn ML Benchmarks repository
dat <- read("https://github.com/EpistasisLab/pmlb/raw/master/datasets/hypothyroid/hypothyroid.tsv.gz")
05-20-25 07:18:37 ▶ Reading hypothyroid.tsv.gz using data.table... :read
05-20-25 07:18:38 Read in 3,163 x 26 :read
05-20-25 07:18:38 Removed 77 duplicate rows. :read
05-20-25 07:18:38 New dimensions: 3,086 x 26 :read
05-20-25 07:18:38 Completed in 0.02 minutes (Real: 1.11; User: 0.06; System: 0.01) :read
dat$target <- factor(dat$target, levels = c(1, 0))
check_data(dat)
dat: A data.table with 3086 rows and 26 columns
Data types
* 0 numeric features
* 25 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 0 missing values
Recommendations
* Everything looks good
Get the frequency of the target classes:
table(dat$target)
1 0
2945 141
14.3.1 Class Imbalance
We can compute the Class Imbalance index using the class_imbalance() function:
\[I = K\cdot\sum_{i=1}^K (n_i/N - 1/K)^2\]
class_imbalance(dat$target)
[1] 0.8255895
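As a sanity check, the formula can be evaluated directly from the class counts (a minimal sketch of what class_imbalance() computes):

```r
# I = K * sum_i (n_i/N - 1/K)^2: 0 for perfectly balanced, 1 for a single class
class_counts <- as.numeric(table(dat$target))  # 2945 and 141
K <- length(class_counts)
N <- sum(class_counts)
K * sum((class_counts / N - 1 / K)^2)  # 0.8255895, matching class_imbalance()
```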
Let’s create some resamples to train and test models:
res <- resample(dat, seed = 2019)
05-20-25 07:18:38 Input contains more than one columns; will stratify on last :resample
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:18:38 Using max n bins possible = 2 :strat.sub
05-20-25 07:18:38 Created 10 stratified subsamples :resample
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
14.4 GLM
14.4.1 No imbalance correction
Let’s train a GLM without inverse frequency weighting or upsampling. Since ifw is TRUE by default in all rtemis supervised learning functions that support it, we have to set it to FALSE:
mod.glm.imb <- s_GLM(dat.train, dat.test,
                     ifw = FALSE)
05-20-25 07:18:38 Hello, egenn :s_GLM
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 2195 13
0 72 33
Overall
Sensitivity 0.9941
Specificity 0.3143
Balanced Accuracy 0.6542
PPV 0.9682
NPV 0.7174
F1 0.9810
Accuracy 0.9633
AUC 0.9431
Brier Score 0.0284
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 728 9
0 24 12
Overall
Sensitivity 0.9878
Specificity 0.3333
Balanced Accuracy 0.6606
PPV 0.9681
NPV 0.5714
F1 0.9778
Accuracy 0.9573
AUC 0.9177
Brier Score 0.0340
Positive Class: 1
05-20-25 07:18:38 Completed in 1.7e-03 minutes (Real: 0.10; User: 0.28; System: 0.02) :s_GLM
We get almost perfect Sensitivity, but very low Specificity.
14.4.2 IFW
Let’s enable IFW:
mod.glm.ifw <- s_GLM(dat.train, dat.test,
                     ifw = TRUE)
05-20-25 07:18:38 Hello, egenn :s_GLM
05-20-25 07:18:38 Imbalanced classes: using Inverse Frequency Weighting :prepare_data
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 1912 296
0 8 97
Overall
Sensitivity 0.8659
Specificity 0.9238
Balanced Accuracy 0.8949
PPV 0.9958
NPV 0.2468
F1 0.9264
Accuracy 0.8686
AUC 0.9469
Brier Score 0.0968
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 624 113
0 6 30
Overall
Sensitivity 0.8467
Specificity 0.8333
Balanced Accuracy 0.8400
PPV 0.9905
NPV 0.2098
F1 0.9129
Accuracy 0.8461
AUC 0.9085
Brier Score 0.1104
Positive Class: 1
05-20-25 07:18:38 Completed in 9.8e-04 minutes (Real: 0.06; User: 0.15; System: 0.01) :s_GLM
Sensitivity dropped a little, but Specificity improved a lot and they are now very close.
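Under the hood, IFW amounts to computing inverse-frequency case weights and passing them to the learner. A minimal sketch of the equivalent with base R’s glm() (an illustration, not the s_GLM() internals):

```r
# Weight each case by the inverse of its class frequency
class_freqs <- table(dat.train$target) / nrow(dat.train)
w <- as.numeric(1 / class_freqs[as.character(dat.train$target)])

# Weighted logistic regression; glm() may warn about non-integer weights,
# which is expected here and does not invalidate the weighting
fit <- glm(target ~ ., data = dat.train, family = binomial, weights = w)
```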
14.4.3 Upsampling
Let’s try upsampling instead of IFW:
mod.glm.ups <- s_GLM(dat.train, dat.test,
                     ifw = FALSE,
                     upsample = TRUE)
05-20-25 07:18:38 Hello, egenn :s_GLM
05-20-25 07:18:38 Upsampling to create balanced set... :prepare_data
05-20-25 07:18:38 1 is majority outcome with length = 2208 :prepare_data
.:Classification Input Summary
Training features: 4416 x 25
Training outcome: 4416 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 1913 295
0 124 2084
Overall
Sensitivity 0.8664
Specificity 0.9438
Balanced Accuracy 0.9051
PPV 0.9391
NPV 0.8760
F1 0.9013
Accuracy 0.9051
AUC 0.9476
Brier Score 0.0808
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 630 107
0 6 30
Overall
Sensitivity 0.8548
Specificity 0.8333
Balanced Accuracy 0.8441
PPV 0.9906
NPV 0.2190
F1 0.9177
Accuracy 0.8538
AUC 0.9086
Brier Score 0.1100
Positive Class: 1
05-20-25 07:18:38 Completed in 2e-03 minutes (Real: 0.12; User: 0.14; System: 0.01) :s_GLM
In this example, upsampling the minority class gave results very similar to IFW: Specificity improved substantially at the cost of some Sensitivity.
14.4.4 Downsampling
mod.glm.downs <- s_GLM(dat.train, dat.test,
                       ifw = FALSE,
                       downsample = TRUE)
05-20-25 07:18:38 Hello, egenn :s_GLM
05-20-25 07:18:38 Downsampling to balance outcome classes... :prepare_data
05-20-25 07:18:38 0 is the minority outcome with 105 cases :prepare_data
.:Classification Input Summary
Training features: 210 x 25
Training outcome: 210 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 96 9
0 6 99
Overall
Sensitivity 0.9143
Specificity 0.9429
Balanced Accuracy 0.9286
PPV 0.9412
NPV 0.9167
F1 0.9275
Accuracy 0.9286
AUC 0.9640
Brier Score 0.0675
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 608 129
0 3 33
Overall
Sensitivity 0.8250
Specificity 0.9167
Balanced Accuracy 0.8708
PPV 0.9951
NPV 0.2037
F1 0.9021
Accuracy 0.8292
AUC 0.9129
Brier Score 0.1243
Positive Class: 1
05-20-25 07:18:38 Completed in 4e-04 minutes (Real: 0.02; User: 0.02; System: 2e-03) :s_GLM
Similar results to upsampling, in this case.
14.5 Random forest
Some algorithms allow multiple ways to handle imbalanced data. See this Tech Report for techniques to handle imbalanced classes with Random Forest. The report describes the “Balanced Random Forest” and “Weighted Random Forest” approaches.
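These approaches map onto arguments of the underlying ranger package, which s_Ranger() wraps. A minimal sketch calling ranger directly (using the dat.train prepared above; s_Ranger() computes these weights for you when ifw = TRUE):

```r
library(ranger)

# Inverse-frequency weights for the two outcome classes
class_freqs <- table(dat.train$target) / nrow(dat.train)

# Weighted RF via per-case weights (cf. ifw.case.weights = TRUE)
fit_wrf <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                  case.weights = as.numeric(1 / class_freqs[as.character(dat.train$target)]))

# Weighted RF via class weights used in splitting (cf. ifw.class.weights = TRUE);
# class.weights are given in the order of the factor levels
fit_cw <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                 class.weights = as.numeric(1 / class_freqs))
```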
14.5.1 No imbalance correction
Again, let’s begin by training a model with no correction for imbalanced data:
mod.rf.imb <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE)
05-20-25 07:18:38 Hello, egenn :s_Ranger
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:38 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2208 0
0 1 104
Overall
Sensitivity 1.0000
Specificity 0.9905
Balanced Accuracy 0.9952
PPV 0.9995
NPV 1.0000
F1 0.9998
Accuracy 0.9996
AUC 1.0000
Brier Score 3.1e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 732 5
0 14 22
Overall
Sensitivity 0.9932
Specificity 0.6111
Balanced Accuracy 0.8022
PPV 0.9812
NPV 0.8148
F1 0.9872
Accuracy 0.9754
AUC 0.9785
Brier Score 0.0193
Positive Class: 1
05-20-25 07:18:39 Completed in 5e-03 minutes (Real: 0.30; User: 0.83; System: 0.03) :s_Ranger
14.5.2 IFW: Case weights
Now, with IFW. By default, s_Ranger() uses IFW to define case weights (i.e. ifw.case.weights = TRUE):
mod.rf.ifw <- s_Ranger(dat.train, dat.test,
                       ifw = TRUE)
05-20-25 07:18:39 Hello, egenn :s_Ranger
05-20-25 07:18:39 Imbalanced classes: using Inverse Frequency Weighting :prepare_data
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:39 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2194 14
0 0 105
Overall
Sensitivity 0.9937
Specificity 1.0000
Balanced Accuracy 0.9968
PPV 1.0000
NPV 0.8824
F1 0.9968
Accuracy 0.9939
AUC 1.0000
Brier Score 4.8e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 728 9
0 10 26
Overall
Sensitivity 0.9878
Specificity 0.7222
Balanced Accuracy 0.8550
PPV 0.9864
NPV 0.7429
F1 0.9871
Accuracy 0.9754
AUC 0.9840
Brier Score 0.0187
Positive Class: 1
05-20-25 07:18:39 Completed in 0.01 minutes (Real: 0.33; User: 1.00; System: 0.03) :s_Ranger
Again, IFW increases the Specificity.
14.5.3 IFW: Class weights
Alternatively, we can use IFW to define class weights:
mod.rf.cw <- s_Ranger(dat.train, dat.test,
                      ifw = TRUE,
                      ifw.case.weights = FALSE,
                      ifw.class.weights = TRUE)
05-20-25 07:18:39 Hello, egenn :s_Ranger
05-20-25 07:18:39 Imbalanced classes: using Inverse Frequency Weighting :prepare_data
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:39 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2208 0
0 1 104
Overall
Sensitivity 1.0000
Specificity 0.9905
Balanced Accuracy 0.9952
PPV 0.9995
NPV 1.0000
F1 0.9998
Accuracy 0.9996
AUC 1.0000
Brier Score 3.1e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 732 5
0 15 21
Overall
Sensitivity 0.9932
Specificity 0.5833
Balanced Accuracy 0.7883
PPV 0.9799
NPV 0.8077
F1 0.9865
Accuracy 0.9741
AUC 0.9813
Brier Score 0.0191
Positive Class: 1
05-20-25 07:18:39 Completed in 4.9e-03 minutes (Real: 0.30; User: 0.83; System: 0.03) :s_Ranger
14.5.4 Upsampling
Now try upsampling:
mod.rf.ups <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE,
                       upsample = TRUE)
05-20-25 07:18:39 Hello, egenn :s_Ranger
05-20-25 07:18:39 Upsampling to create balanced set... :prepare_data
05-20-25 07:18:39 1 is majority outcome with length = 2208 :prepare_data
.:Classification Input Summary
Training features: 4416 x 25
Training outcome: 4416 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:39 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2205 3
0 0 2208
Overall
Sensitivity 0.9986
Specificity 1.0000
Balanced Accuracy 0.9993
PPV 1.0000
NPV 0.9986
F1 0.9993
Accuracy 0.9993
AUC 1.0000
Brier Score 1.4e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 729 8
0 12 24
Overall
Sensitivity 0.9891
Specificity 0.6667
Balanced Accuracy 0.8279
PPV 0.9838
NPV 0.7500
F1 0.9865
Accuracy 0.9741
AUC 0.9817
Brier Score 0.0181
Positive Class: 1
05-20-25 07:18:40 Completed in 0.01 minutes (Real: 0.69; User: 1.98; System: 0.05) :s_Ranger