library(rtemis)
14 Handling Imbalanced Data
In classification problems, it is common for outcome classes to appear with different frequencies. This is called imbalanced data. Consider, for example, a binary classification problem where the positive class (the ‘events’) appears with 5% probability. Applying a learning algorithm naively, without accounting for this class imbalance, may lead to a model that always predicts the majority class, which automatically results in 95% accuracy.
To handle imbalanced data, we take specific steps both during model training and during model assessment.
14.1 Model Training
There are a few different ways to address the problem of imbalanced data during training. We’ll consider the three main ones:
Inverse Frequency Weighting
We weigh each case based on the frequency of its class, such that less frequent classes are up-weighed. This is called Inverse Frequency Weighting (IFW), and is enabled by default in rtemis for all classification learning algorithms that support case weights. The logical argument ifw controls whether IFW is used. It is TRUE by default in all learners.
Upsampling the minority class
We randomly sample from the minority class to reach the size of the majority class. The effect is not very different from upweighing using IFW. The logical argument upsample, in all rtemis learners that support classification, controls whether upsampling of the minority class should be performed. (If it is set to TRUE, it makes the ifw argument irrelevant, since the sample becomes balanced.)
Downsampling the majority class
Conversely, we randomly subsample the majority class to reach the size of the minority class. The logical argument downsample controls this behavior.
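The three strategies can be sketched in a few lines of base R (a toy illustration with a hypothetical factor outcome y; rtemis performs the equivalent steps internally):

```r
set.seed(2019)
y <- factor(c(rep(1, 95), rep(0, 5)))  # imbalanced binary outcome

# 1. Inverse Frequency Weighting: weigh each case by 1 / its class frequency
class_freqs <- table(y) / length(y)
case_weights <- as.numeric(1 / class_freqs[as.character(y)])

# 2. Upsampling: sample the minority class with replacement up to the majority size
idx_maj <- which(y == "1")
idx_min <- which(y == "0")
idx_upsampled <- c(idx_maj, sample(idx_min, length(idx_maj), replace = TRUE))

# 3. Downsampling: subsample the majority class down to the minority size
idx_downsampled <- c(sample(idx_maj, length(idx_min)), idx_min)
```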
14.2 Classification model performance metrics
During model selection as well as model assessment, it is crucial to use metrics that take into consideration imbalanced outcomes.
The following metrics address the issue in different ways and are reported by the modError
function in all classification problems:
Balanced Accuracy (the mean per-class Sensitivity) \[\frac{1}{K}\sum_{i=1}^K Sensitivity_i\] In the binary case, this is equal to the mean of Sensitivity and Specificity.
F1 Harmonic mean of Sensitivity (aka Recall) and Positive Predictive Value (aka Precision) \[F_1 = 2\frac{precision * recall}{precision + recall}\]
AUROC (Area under the ROC curve), i.e. the area under the True Positive Rate vs. False Positive Rate curve, or Sensitivity vs. 1 − Specificity
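As a minimal illustration (not the modError() implementation), these metrics can be computed from a 2x2 confusion matrix; the counts below are taken from the uncorrected GLM test summary in section 14.4.1:

```r
# Rows: reference class; columns: estimated class; positive class ("1") first
cm <- matrix(c(728, 9,
               24, 12),
             nrow = 2, byrow = TRUE,
             dimnames = list(Reference = c("1", "0"), Estimated = c("1", "0")))

sensitivity <- cm["1", "1"] / sum(cm["1", ])  # 0.9878 (aka Recall)
specificity <- cm["0", "0"] / sum(cm["0", ])  # 0.3333
precision   <- cm["1", "1"] / sum(cm[, "1"])  # 0.9681 (aka PPV)

(sensitivity + specificity) / 2                          # Balanced Accuracy: 0.6606
2 * precision * sensitivity / (precision + sensitivity)  # F1: 0.9778
```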
14.3 Example dataset
Let’s look at a very imbalanced dataset from the Penn ML Benchmarks repository
dat <- read("https://github.com/EpistasisLab/pmlb/raw/master/datasets/hypothyroid/hypothyroid.tsv.gz")
05-20-25 07:18:37 ▶ Reading hypothyroid.tsv.gz using data.table... :read
05-20-25 07:18:38 Read in 3,163 x 26 :read
05-20-25 07:18:38 Removed 77 duplicate rows. :read
05-20-25 07:18:38 New dimensions: 3,086 x 26 :read
05-20-25 07:18:38 Completed in 0.02 minutes (Real: 1.11; User: 0.06; System: 0.01) :read
dat$target <- factor(dat$target, levels = c(1, 0))
check_data(dat)
dat: A data.table with 3086 rows and 26 columns
Data types
* 0 numeric features
* 25 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 0 missing values
Recommendations
* Everything looks good
Get the frequency of the target classes:
table(dat$target)
1 0
2945 141
14.3.1 Class Imbalance
We can compute the Class Imbalance index using the class_imbalance() function:
\[I = K\cdot\sum_{i=1}^K (n_i/N - 1/K)^2\]
class_imbalance(dat$target)
[1] 0.8255895
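As a sanity check, the formula can be evaluated directly from the class counts (a minimal sketch of what class_imbalance() computes):

```r
# I = K * sum_i (n_i/N - 1/K)^2: 0 for perfectly balanced, 1 for a single class
class_counts <- as.numeric(table(dat$target))  # 2945 and 141
K <- length(class_counts)
N <- sum(class_counts)
K * sum((class_counts / N - 1 / K)^2)  # 0.8255895, matching class_imbalance()
```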
Let’s create some resamples to train and test models:
res <- resample(dat, seed = 2019)
05-20-25 07:18:38 Input contains more than one columns; will stratify on last :resample
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:18:38 Using max n bins possible = 2 :strat.sub
05-20-25 07:18:38 Created 10 stratified subsamples :resample
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
14.4 GLM
14.4.1 No imbalance correction
Let’s train a GLM without inverse frequency weighting or upsampling. Since ifw is TRUE by default in all rtemis supervised learning functions that support it, we have to set it to FALSE:
mod.glm.imb <- s_GLM(dat.train, dat.test,
                     ifw = FALSE)
05-20-25 07:18:38 Hello, egenn :s_GLM
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 2195 13
0 72 33
Overall
Sensitivity 0.9941
Specificity 0.3143
Balanced Accuracy 0.6542
PPV 0.9682
NPV 0.7174
F1 0.9810
Accuracy 0.9633
AUC 0.9431
Brier Score 0.0284
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 728 9
0 24 12
Overall
Sensitivity 0.9878
Specificity 0.3333
Balanced Accuracy 0.6606
PPV 0.9681
NPV 0.5714
F1 0.9778
Accuracy 0.9573
AUC 0.9177
Brier Score 0.0340
Positive Class: 1
05-20-25 07:18:38 Completed in 1.7e-03 minutes (Real: 0.10; User: 0.28; System: 0.02) :s_GLM
We get almost perfect Sensitivity, but very low Specificity.
14.4.2 IFW
Let’s enable IFW:
mod.glm.ifw <- s_GLM(dat.train, dat.test,
                     ifw = TRUE)
05-20-25 07:18:38 Hello, egenn :s_GLM
05-20-25 07:18:38 Imbalanced classes: using Inverse Frequency Weighting :prepare_data
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 1912 296
0 8 97
Overall
Sensitivity 0.8659
Specificity 0.9238
Balanced Accuracy 0.8949
PPV 0.9958
NPV 0.2468
F1 0.9264
Accuracy 0.8686
AUC 0.9469
Brier Score 0.0968
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 624 113
0 6 30
Overall
Sensitivity 0.8467
Specificity 0.8333
Balanced Accuracy 0.8400
PPV 0.9905
NPV 0.2098
F1 0.9129
Accuracy 0.8461
AUC 0.9085
Brier Score 0.1104
Positive Class: 1
05-20-25 07:18:38 Completed in 9.8e-04 minutes (Real: 0.06; User: 0.15; System: 0.01) :s_GLM
Sensitivity dropped a little, but Specificity improved a lot and they are now very close.
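Under the hood, IFW amounts to computing inverse-frequency case weights and passing them to the learner. A minimal sketch of the equivalent with base R’s glm() (an illustration, not the s_GLM() internals):

```r
# Weight each case by the inverse of its class frequency
class_freqs <- table(dat.train$target) / nrow(dat.train)
w <- as.numeric(1 / class_freqs[as.character(dat.train$target)])

# Weighted logistic regression; glm() may warn about non-integer weights,
# which is expected here and does not invalidate the weighting
fit <- glm(target ~ ., data = dat.train, family = binomial, weights = w)
```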
14.4.3 Upsampling
Let’s try upsampling instead of IFW:
mod.glm.ups <- s_GLM(dat.train, dat.test,
                     ifw = FALSE,
                     upsample = TRUE)
05-20-25 07:18:38 Hello, egenn :s_GLM
05-20-25 07:18:38 Upsampling to create balanced set... :prepare_data
05-20-25 07:18:38 1 is majority outcome with length = 2208 :prepare_data
.:Classification Input Summary
Training features: 4416 x 25
Training outcome: 4416 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 1913 295
0 124 2084
Overall
Sensitivity 0.8664
Specificity 0.9438
Balanced Accuracy 0.9051
PPV 0.9391
NPV 0.8760
F1 0.9013
Accuracy 0.9051
AUC 0.9476
Brier Score 0.0808
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 630 107
0 6 30
Overall
Sensitivity 0.8548
Specificity 0.8333
Balanced Accuracy 0.8441
PPV 0.9906
NPV 0.2190
F1 0.9177
Accuracy 0.8538
AUC 0.9086
Brier Score 0.1100
Positive Class: 1
05-20-25 07:18:38 Completed in 2e-03 minutes (Real: 0.12; User: 0.14; System: 0.01) :s_GLM
In this example, upsampling the minority class gave results very similar to IFW: Specificity improved substantially at the cost of some Sensitivity.
14.4.4 Downsampling
mod.glm.downs <- s_GLM(dat.train, dat.test,
                       ifw = FALSE,
                       downsample = TRUE)
05-20-25 07:18:38 Hello, egenn :s_GLM
05-20-25 07:18:38 Downsampling to balance outcome classes... :prepare_data
05-20-25 07:18:38 0 is the minority outcome with 105 cases :prepare_data
.:Classification Input Summary
Training features: 210 x 25
Training outcome: 210 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
05-20-25 07:18:38 Training GLM... :s_GLM
.:Logistic Classification Training Summary
Estimated
Reference 1 0
1 96 9
0 6 99
Overall
Sensitivity 0.9143
Specificity 0.9429
Balanced Accuracy 0.9286
PPV 0.9412
NPV 0.9167
F1 0.9275
Accuracy 0.9286
AUC 0.9640
Brier Score 0.0675
Positive Class: 1
.:Logistic Classification Testing Summary
Estimated
Reference 1 0
1 608 129
0 3 33
Overall
Sensitivity 0.8250
Specificity 0.9167
Balanced Accuracy 0.8708
PPV 0.9951
NPV 0.2037
F1 0.9021
Accuracy 0.8292
AUC 0.9129
Brier Score 0.1243
Positive Class: 1
05-20-25 07:18:38 Completed in 4e-04 minutes (Real: 0.02; User: 0.02; System: 2e-03) :s_GLM
Similar results to upsampling, in this case.
14.5 Random forest
Some algorithms allow multiple ways to handle imbalanced data. See this Tech Report for techniques to handle imbalanced classes with Random Forest. The report describes the “Balanced Random Forest” and “Weighted Random Forest” approaches.
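These approaches map onto arguments of the underlying ranger package, which s_Ranger() wraps. A minimal sketch calling ranger directly (using the dat.train prepared above; s_Ranger() computes these weights for you when ifw = TRUE):

```r
library(ranger)

# Inverse-frequency weights for the two outcome classes
class_freqs <- table(dat.train$target) / nrow(dat.train)

# Weighted RF via per-case weights (cf. ifw.case.weights = TRUE)
fit_wrf <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                  case.weights = as.numeric(1 / class_freqs[as.character(dat.train$target)]))

# Weighted RF via class weights used in splitting (cf. ifw.class.weights = TRUE);
# class.weights are given in the order of the factor levels
fit_cw <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                 class.weights = as.numeric(1 / class_freqs))
```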
14.5.1 No imbalance correction
Again, let’s begin by training a model with no correction for imbalanced data:
mod.rf.imb <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE)
05-20-25 07:18:38 Hello, egenn :s_Ranger
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:38 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2208 0
0 1 104
Overall
Sensitivity 1.0000
Specificity 0.9905
Balanced Accuracy 0.9952
PPV 0.9995
NPV 1.0000
F1 0.9998
Accuracy 0.9996
AUC 1.0000
Brier Score 3.1e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 732 5
0 14 22
Overall
Sensitivity 0.9932
Specificity 0.6111
Balanced Accuracy 0.8022
PPV 0.9812
NPV 0.8148
F1 0.9872
Accuracy 0.9754
AUC 0.9785
Brier Score 0.0193
Positive Class: 1
05-20-25 07:18:39 Completed in 5e-03 minutes (Real: 0.30; User: 0.83; System: 0.03) :s_Ranger
14.5.2 IFW: Case weights
Now, with IFW. By default, s_Ranger() uses IFW to define case weights (i.e. ifw.case.weights = TRUE):
mod.rf.ifw <- s_Ranger(dat.train, dat.test,
                       ifw = TRUE)
05-20-25 07:18:39 Hello, egenn :s_Ranger
05-20-25 07:18:39 Imbalanced classes: using Inverse Frequency Weighting :prepare_data
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:39 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2194 14
0 0 105
Overall
Sensitivity 0.9937
Specificity 1.0000
Balanced Accuracy 0.9968
PPV 1.0000
NPV 0.8824
F1 0.9968
Accuracy 0.9939
AUC 1.0000
Brier Score 4.8e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 728 9
0 10 26
Overall
Sensitivity 0.9878
Specificity 0.7222
Balanced Accuracy 0.8550
PPV 0.9864
NPV 0.7429
F1 0.9871
Accuracy 0.9754
AUC 0.9840
Brier Score 0.0187
Positive Class: 1
05-20-25 07:18:39 Completed in 0.01 minutes (Real: 0.33; User: 1.00; System: 0.03) :s_Ranger
Again, IFW increases the Specificity.
14.5.3 IFW: Class weights
Alternatively, we can use IFW to define class weights:
mod.rf.cw <- s_Ranger(dat.train, dat.test,
                      ifw = TRUE,
                      ifw.case.weights = FALSE,
                      ifw.class.weights = TRUE)
05-20-25 07:18:39 Hello, egenn :s_Ranger
05-20-25 07:18:39 Imbalanced classes: using Inverse Frequency Weighting :prepare_data
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:39 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2208 0
0 1 104
Overall
Sensitivity 1.0000
Specificity 0.9905
Balanced Accuracy 0.9952
PPV 0.9995
NPV 1.0000
F1 0.9998
Accuracy 0.9996
AUC 1.0000
Brier Score 3.1e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 732 5
0 15 21
Overall
Sensitivity 0.9932
Specificity 0.5833
Balanced Accuracy 0.7883
PPV 0.9799
NPV 0.8077
F1 0.9865
Accuracy 0.9741
AUC 0.9813
Brier Score 0.0191
Positive Class: 1
05-20-25 07:18:39 Completed in 4.9e-03 minutes (Real: 0.30; User: 0.83; System: 0.03) :s_Ranger
14.5.4 Upsampling
Now try upsampling:
mod.rf.ups <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE,
                       upsample = TRUE)
05-20-25 07:18:39 Hello, egenn :s_Ranger
05-20-25 07:18:39 Upsampling to create balanced set... :prepare_data
05-20-25 07:18:39 1 is majority outcome with length = 2208 :prepare_data
.:Classification Input Summary
Training features: 4416 x 25
Training outcome: 4416 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:18:39 Training Random Forest (ranger) Classification with 1000 trees... :s_Ranger
.:Ranger Classification Training Summary
Estimated
Reference 1 0
1 2205 3
0 0 2208
Overall
Sensitivity 0.9986
Specificity 1.0000
Balanced Accuracy 0.9993
PPV 1.0000
NPV 0.9986
F1 0.9993
Accuracy 0.9993
AUC 1.0000
Brier Score 1.4e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Estimated
Reference 1 0
1 729 8
0 12 24
Overall
Sensitivity 0.9891
Specificity 0.6667
Balanced Accuracy 0.8279
PPV 0.9838
NPV 0.7500
F1 0.9865
Accuracy 0.9741
AUC 0.9817
Brier Score 0.0181
Positive Class: 1
05-20-25 07:18:40 Completed in 0.01 minutes (Real: 0.69; User: 1.98; System: 0.05) :s_Ranger