library(rtemis)
10 Regression & Classification
All rtemis learners train a model, after optional hyperparameter tuning by grid search where applicable, and validate it if a test set is provided. Use select_learn() to get a list of all available algorithms:
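For example, calling it with no arguments lists the available algorithms (the printed list is omitted here):
select_learn()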
10.1 Data Input for Supervised Learning
All rtemis supervised learning functions begin with “s_” for “supervised”. They accept the same first four arguments: x, y, x.test, y.test, but are flexible, allowing you to also provide combined (x, y) and (x.test, y.test) data frames.
Regression is performed for continuous outcomes of class “numeric”, and classification is performed when the outcome is categorical and of class “factor”. For binary classification, the first level of the factor will be defined as the “positive” class.
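As a minimal, self-contained sketch (hypothetical data; the remainder of this chapter uses regression examples), a factor outcome switches the task to classification:
x.c <- rnormmat(200, 5, seed = 2021)
y.c <- factor(ifelse(x.c[, 1] + rnorm(200) > 0, "pos", "neg"), levels = c("pos", "neg"))
mod.class <- s_CART(x.c, y.c)   # factor outcome => classification; first level "pos" is the positive class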
10.1.1 Scenario 1: (x.train, y.train, x.test, y.test)
In the most straightforward case, provide each individually:
x: Training set features
y: Training set outcome
x.test: Testing set features (Optional)
y.test: Testing set outcome (Optional)
x <- rnormmat(200, 10, seed = 2019)
w <- rnorm(10)
y <- x %*% w + rnorm(200)
res <- resample(y, seed = 2020)
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:06:39 Created 10 stratified subsamples :resample
x.train <- x[res$Subsample_1, ]
x.test <- x[-res$Subsample_1, ]
y.train <- y[res$Subsample_1]
y.test <- y[-res$Subsample_1]
mod.glm <- s_GLM(x.train, y.train, x.test, y.test)
05-20-25 07:06:39 Hello, egenn :s_GLM
.:Regression Input Summary
Training features: 147 x 10
Training outcome: 147 x 1
Testing features: 53 x 10
Testing outcome: 53 x 1
05-20-25 07:06:39 Training GLM... :s_GLM
.:GLM Regression Training Summary
MSE = 0.84
RMSE = 0.92
MAE = 0.75
r = 0.96 (p = 5.9e-81)
R sq = 0.92
.:GLM Regression Testing Summary
MSE = 1.22
RMSE = 1.10
MAE = 0.90
r = 0.94 (p = 2.5e-26)
R sq = 0.89
05-20-25 07:06:39 Completed in 0.01 minutes (Real: 0.35; User: 0.33; System: 0.02) :s_GLM
10.1.2 Scenario 2: (x.train, x.test)
You can provide training and testing sets as a single data.frame each, where the last column is the outcome:
x: data.frame(x.train, y.train)
y: data.frame(x.test, y.test)
x <- rnormmat(200, 10, seed = 2019)
w <- rnorm(10)
y <- x %*% w + rnorm(200)
dat <- data.frame(x, y)
res <- resample(dat, seed = 2020)
05-20-25 07:06:39 Input contains more than one columns; will stratify on last :resample
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:06:39 Created 10 stratified subsamples :resample
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
The dataPrepare function checks data dimensions, determines whether data was input as separate feature and outcome sets or as combined data frames, and ensures the correct number of cases and features was provided.
In either scenario, Regression will be performed if the outcome is numeric and Classification if the outcome is a factor.
10.2 Generalized Linear Model (GLM)
mod.glm <- s_GLM(dat.train, dat.test)
05-20-25 07:06:39 Hello, egenn :s_GLM
.:Regression Input Summary
Training features: 147 x 10
Training outcome: 147 x 1
Testing features: 53 x 10
Testing outcome: 53 x 1
05-20-25 07:06:39 Training GLM... :s_GLM
.:GLM Regression Training Summary
MSE = 0.84
RMSE = 0.92
MAE = 0.75
r = 0.96 (p = 5.9e-81)
R sq = 0.92
.:GLM Regression Testing Summary
MSE = 1.22
RMSE = 1.10
MAE = 0.90
r = 0.94 (p = 2.5e-26)
R sq = 0.89
05-20-25 07:06:39 Completed in 1.3e-04 minutes (Real: 0.01; User: 0.01; System: 1e-03) :s_GLM
Note: If there are factor features, s_GLM checks that no factor levels are present in the test set that are absent from the training set, since this would cause predict to fail. This situation can arise when running multiple cross-validated experiments.
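A brief base-R illustration of the underlying issue (not rtemis-specific; data and names are hypothetical):
df.train <- data.frame(grp = factor(c("a", "b", "a", "b")), y = c(1.2, 2.1, 1.4, 2.3))
df.test <- data.frame(grp = factor(c("a", "c")))   # level "c" was never seen in training
fit <- glm(y ~ grp, data = df.train)
# predict(fit, newdata = df.test)                  # errors: factor 'grp' has new level "c"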
10.3 Elastic Net (Regularized GLM)
Regularization helps prevent overfitting and allows training a linear model on a dataset with more features than cases (p >> n).
x <- rnormmat(500, 1000, seed = 2019)
w <- rnorm(1000)
y <- x %*% w + rnorm(500)
dat <- data.frame(x, y)
res <- resample(dat)
05-20-25 07:06:39 Input contains more than one columns; will stratify on last :resample
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:06:39 Created 10 stratified subsamples :resample
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
mod.ridge <- s_GLMNET(dat.train, dat.test, alpha = 0)
05-20-25 07:06:39 Hello, egenn :s_GLMNET
.:Regression Input Summary
Training features: 374 x 1000
Training outcome: 374 x 1
Testing features: 126 x 1000
Testing outcome: 126 x 1
05-20-25 07:06:40 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:40 Created 5 independent folds :resample
.:Search parameters
grid.params:
alpha: 0
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
05-20-25 07:06:40 Tuning Elastic Net by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:40 5 inner resamples; 5 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:41 Extracting best lambda from GLMNET models... :gridSearchLearn
.:Best parameters to minimize MSE
best.tune:
lambda: 191.979466499985
alpha: 0
05-20-25 07:06:41 Completed in 0.03 minutes (Real: 1.60; User: 0.25; System: 0.16) :gridSearchLearn
.:Parameters
alpha: 0
lambda: 191.979466499985
05-20-25 07:06:41 Training elastic net model... :s_GLMNET
.:GLMNET Regression Training Summary
MSE = 431.62
RMSE = 20.78
MAE = 16.60
r = 0.96 (p = 3.1e-203)
R sq = 0.59
.:GLMNET Regression Testing Summary
MSE = 950.69
RMSE = 30.83
MAE = 24.94
r = 0.52 (p = 6.2e-10)
R sq = 0.17
05-20-25 07:06:41 Completed in 0.03 minutes (Real: 1.83; User: 0.44; System: 0.18) :s_GLMNET
mod.lasso <- s_GLMNET(dat.train, dat.test, alpha = 1)
05-20-25 07:06:41 Hello, egenn :s_GLMNET
.:Regression Input Summary
Training features: 374 x 1000
Training outcome: 374 x 1
Testing features: 126 x 1000
Testing outcome: 126 x 1
05-20-25 07:06:42 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:42 Created 5 independent folds :resample
.:Search parameters
grid.params:
alpha: 1
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
05-20-25 07:06:42 Tuning Elastic Net by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:42 5 inner resamples; 5 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:42 Extracting best lambda from GLMNET models... :gridSearchLearn
.:Best parameters to minimize MSE
best.tune:
lambda: 5.27695386992611
alpha: 1
05-20-25 07:06:42 Completed in 0.01 minutes (Real: 0.73; User: 0.11; System: 0.08) :gridSearchLearn
.:Parameters
alpha: 1
lambda: 5.27695386992611
05-20-25 07:06:42 Training elastic net model... :s_GLMNET
.:GLMNET Regression Training Summary
MSE = 995.43
RMSE = 31.55
MAE = 25.17
r = 0.40 (p = 5.8e-16)
R sq = 0.06
.:GLMNET Regression Testing Summary
MSE = 1131.44
RMSE = 33.64
MAE = 27.29
r = 0.11 (p = 0.22)
R sq = 0.01
05-20-25 07:06:42 Completed in 0.02 minutes (Real: 0.94; User: 0.31; System: 0.10) :s_GLMNET
If you do not define alpha, it defaults to seq(0, 1, 0.2), which means that grid search will be used for tuning.
mod.elnet <- s_GLMNET(dat.train, dat.test)
05-20-25 07:06:42 Hello, egenn :s_GLMNET
.:Regression Input Summary
Training features: 374 x 1000
Training outcome: 374 x 1
Testing features: 126 x 1000
Testing outcome: 126 x 1
05-20-25 07:06:42 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:42 Created 5 independent folds :resample
.:Search parameters
grid.params:
alpha: 0, 0.2, 0.4, 0.6, 0.8, 1...
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
05-20-25 07:06:42 Tuning Elastic Net by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:42 5 inner resamples; 30 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:47 Extracting best lambda from GLMNET models... :gridSearchLearn
.:Best parameters to minimize MSE
best.tune:
lambda: 174.727833490782
alpha: 0
05-20-25 07:06:47 Completed in 0.07 minutes (Real: 4.37; User: 0.36; System: 0.33) :gridSearchLearn
.:Parameters
alpha: 0
lambda: 174.727833490782
05-20-25 07:06:47 Training elastic net model... :s_GLMNET
.:GLMNET Regression Training Summary
MSE = 404.86
RMSE = 20.12
MAE = 16.07
r = 0.96 (p = 3.8e-207)
R sq = 0.62
.:GLMNET Regression Testing Summary
MSE = 941.32
RMSE = 30.68
MAE = 24.85
r = 0.52 (p = 5.2e-10)
R sq = 0.18
05-20-25 07:06:47 Completed in 0.08 minutes (Real: 4.53; User: 0.50; System: 0.35) :s_GLMNET
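Since the default alpha is itself a vector, you can also pass your own vector of alpha values to tune over; a minimal sketch using the same interface shown above (output omitted):
mod.elnet.custom <- s_GLMNET(dat.train, dat.test, alpha = c(0, 0.5, 1))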
Many real-world relationships are nonlinear. A large number of regression approaches exist to model such relationships.
Let’s create some new synthetic data:
x <- rnormmat(400, 10)
y <- x[, 3]^2 + x[, 5] + rnorm(400)
In this example, y depends on the square of the 3rd column of x and linearly on the 5th column.
10.4 Generalized Additive Model (GAM)
Generalized Additive Models provide a very efficient way of fitting penalized splines. GAMs in rtemis can be fit with s_GAM (which uses mgcv::gam):
mod.gam <- s_GAM(x, y)
05-20-25 07:06:47 Hello, egenn :s_GAM
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:47 Training GAM... :s_GAM
.:GAM Regression Training Summary
MSE = 0.91
RMSE = 0.96
MAE = 0.77
r = 0.88 (p = 3.8e-128)
R sq = 0.77
05-20-25 07:06:47 Completed in 0.01 minutes (Real: 0.42; User: 0.40; System: 0.02) :s_GAM
10.5 Projection Pursuit Regression (PPR)
Projection Pursuit Regression is an extension of (generalized) additive models. Where a linear model is a linear combination of a set of predictors, and an additive model is a linear combination of nonlinear transformations of the predictors, a projection pursuit model is a linear combination of nonlinear transformations of linear combinations of predictors.
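In symbols, using the standard textbook formulation (not taken from the rtemis documentation), a PPR model with M terms is
$$ \hat{y}(\mathbf{x}) = \beta_0 + \sum_{m=1}^{M} f_m\!\left(\mathbf{w}_m^\top \mathbf{x}\right), $$
where each \mathbf{w}_m is a learned projection direction and each f_m is a smooth ridge function estimated from the data.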
mod.ppr <- s_PPR(x, y)
05-20-25 07:06:47 Hello, egenn :s_PPR
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
.:Parameters
nterms: 4
optlevel: 3
sm.method: spline
bass: 0
span: 0
df: 5
gcvpen: 1
05-20-25 07:06:47 Running Projection Pursuit Regression... :s_PPR
.:PPR Regression Training Summary
MSE = 0.82
RMSE = 0.90
MAE = 0.72
r = 0.89 (p = 7.1e-138)
R sq = 0.79
05-20-25 07:06:47 Completed in 3.2e-04 minutes (Real: 0.02; User: 0.02; System: 2e-03) :s_PPR
10.6 Support Vector Machine (SVM)
mod.svm <- s_SVM(x, y)
05-20-25 07:06:47 Hello, egenn :s_SVM
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:48 Training SVM Regression with radial kernel... :s_SVM
.:SVM Regression Training Summary
MSE = 0.79
RMSE = 0.89
MAE = 0.60
r = 0.91 (p = 4.7e-154)
R sq = 0.80
05-20-25 07:06:48 Completed in 0.01 minutes (Real: 0.33; User: 0.06; System: 0.01) :s_SVM
10.7 Classification and Regression Trees (CART)
mod.cart <- s_CART(x, y)
05-20-25 07:06:48 Hello, egenn :s_CART
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:48 Training CART... :s_CART
.:CART Regression Training Summary
MSE = 1.09
RMSE = 1.04
MAE = 0.83
r = 0.85 (p = 5.9e-113)
R sq = 0.72
05-20-25 07:06:48 Completed in 2.8e-04 minutes (Real: 0.02; User: 0.01; System: 2e-03) :s_CART
10.8 Random Forest
Multiple Random Forest implementations are included in rtemis. Ranger provides an efficient implementation well-suited for general use.
mod.rf <- s_Ranger(x, y)
05-20-25 07:06:48 Hello, egenn :s_Ranger
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:06:48 Training Random Forest (ranger) Regression with 1000 trees... :s_Ranger
.:Ranger Regression Training Summary
MSE = 0.30
RMSE = 0.54
MAE = 0.41
r = 0.98 (p = 1.1e-287)
R sq = 0.92
05-20-25 07:06:48 Completed in 1.6e-03 minutes (Real: 0.10; User: 0.40; System: 0.02) :s_Ranger
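The parameter printout above lists n.trees and mtry; as a hedged sketch, assuming these can be passed directly as arguments to s_Ranger:
mod.rf.500 <- s_Ranger(x, y, n.trees = 500)   # assumption: n.trees is accepted as an argument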
10.9 Gradient Boosting
Gradient Boosting is, on average, one of the best-performing learning algorithms for structured data. rtemis includes multiple implementations of boosting, along with support for boosting any learner; see the chapter on Boosting.
mod.gbm <- s_GBM(x, y)
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
05-20-25 07:06:48 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:48 Created 5 independent folds :resample
.:Search parameters
grid.params:
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
fixed.params:
n.trees: 2000
max.trees: 5000
gbm.select.smooth: FALSE
n.new.trees: 500
min.trees: 50
failsafe.trees: 500
ifw: TRUE
ifw.type: 2
upsample: FALSE
downsample: FALSE
resample.seed: NULL
relInf: FALSE
plot.tune.error: FALSE
.gs: TRUE
05-20-25 07:06:48 Tuning Gradient Boosting Machine by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:48 5 inner resamples; 5 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:48 Running grid line #1 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 319 x 10
Training outcome: 319 x 1
Testing features: 81 x 10
Testing outcome: 81 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.91
RMSE = 0.95
MAE = 0.73
r = 0.89 (p = 1.3e-110)
R sq = 0.77
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.17
RMSE = 1.08
MAE = 0.89
r = 0.84 (p = 7.9e-23)
R sq = 0.70
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #2 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 320 x 10
Training outcome: 320 x 1
Testing features: 80 x 10
Testing outcome: 80 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.69
RMSE = 0.83
MAE = 0.65
r = 0.91 (p = 1.3e-125)
R sq = 0.83
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.30
RMSE = 1.14
MAE = 0.92
r = 0.81 (p = 1.3e-19)
R sq = 0.64
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #3 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 322 x 10
Training outcome: 322 x 1
Testing features: 78 x 10
Testing outcome: 78 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.61
RMSE = 0.78
MAE = 0.62
r = 0.92 (p = 8.6e-134)
R sq = 0.85
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.19
RMSE = 1.09
MAE = 0.84
r = 0.83 (p = 1.1e-20)
R sq = 0.67
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.13; User: 0.12; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #4 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 320 x 10
Training outcome: 320 x 1
Testing features: 80 x 10
Testing outcome: 80 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.75
RMSE = 0.87
MAE = 0.70
r = 0.90 (p = 2.5e-118)
R sq = 0.81
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.38
RMSE = 1.17
MAE = 0.82
r = 0.82 (p = 1.7e-20)
R sq = 0.66
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #5 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 319 x 10
Training outcome: 319 x 1
Testing features: 81 x 10
Testing outcome: 81 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.76
RMSE = 0.87
MAE = 0.69
r = 0.90 (p = 5.4e-116)
R sq = 0.80
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.19
RMSE = 1.09
MAE = 0.90
r = 0.86 (p = 1.4e-24)
R sq = 0.73
05-20-25 07:06:48 Completed in 2e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
.:Best parameters to minimize MSE
best.tune:
n.trees: 842
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
05-20-25 07:06:48 Completed in 0.01 minutes (Real: 0.33; User: 0.11; System: 0.07) :gridSearchLearn
.:Parameters
n.trees: 842
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
05-20-25 07:06:48 Training GBM on full training set... :s_GBM
.:GBM Regression Training Summary
MSE = 0.77
RMSE = 0.88
MAE = 0.70
r = 0.90 (p = 8.9e-146)
R sq = 0.81
05-20-25 07:06:48 Calculating relative influence of variables... :s_GBM
05-20-25 07:06:48 Completed in 0.01 minutes (Real: 0.56; User: 0.19; System: 0.08) :s_GBM
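A trained model can then be applied to new data; a minimal sketch, assuming the returned rtemis model object supports R's generic predict() with new feature data as its second argument:
x.new <- rnormmat(20, 10)             # hypothetical new cases
yhat.new <- predict(mod.gbm, x.new)   # assumption: a predict() method is defined for rtemis models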