library(rtemis)
10 Regression & Classification
All rtemis learners train a model, after optional hyperparameter tuning by grid search where applicable, and validate it if a test set is provided. Use select_learn() to get a list of all available algorithms:
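For example, calling it with no arguments lists the available algorithms (the printed list is omitted here):
select_learn()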
10.1 Data Input for Supervised Learning
All rtemis supervised learning functions begin with “s_” for “supervised”. They accept the same first four arguments: x, y, x.test, y.test, but are flexible, allowing you to also provide combined (x, y) and (x.test, y.test) data frames.
Regression is performed for continuous outcomes of class “numeric”, and classification is performed when the outcome is categorical and of class “factor”. For binary classification, the first level of the factor will be defined as the “positive” class.
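As a minimal, self-contained sketch (hypothetical data; the remainder of this chapter uses regression examples), a factor outcome switches the task to classification:
x.c <- rnormmat(200, 5, seed = 2021)
y.c <- factor(ifelse(x.c[, 1] + rnorm(200) > 0, "pos", "neg"), levels = c("pos", "neg"))
mod.class <- s_CART(x.c, y.c)   # factor outcome => classification; first level "pos" is the positive class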
10.1.1 Scenario 1: (x.train, y.train, x.test, y.test)
In the most straightforward case, provide each individually:
x: Training set features
y: Training set outcome
x.test: Testing set features (Optional)
y.test: Testing set outcome (Optional)
x <- rnormmat(200, 10, seed = 2019)
w <- rnorm(10)
y <- x %*% w + rnorm(200)
res <- resample(y, seed = 2020)
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:06:39 Created 10 stratified subsamples :resample
x.train <- x[res$Subsample_1, ]
x.test <- x[-res$Subsample_1, ]
y.train <- y[res$Subsample_1]
y.test <- y[-res$Subsample_1]
mod.glm <- s_GLM(x.train, y.train, x.test, y.test)
05-20-25 07:06:39 Hello, egenn :s_GLM
.:Regression Input Summary
Training features: 147 x 10
Training outcome: 147 x 1
Testing features: 53 x 10
Testing outcome: 53 x 1
05-20-25 07:06:39 Training GLM... :s_GLM
.:GLM Regression Training Summary
MSE = 0.84
RMSE = 0.92
MAE = 0.75
r = 0.96 (p = 5.9e-81)
R sq = 0.92
.:GLM Regression Testing Summary
MSE = 1.22
RMSE = 1.10
MAE = 0.90
r = 0.94 (p = 2.5e-26)
R sq = 0.89
05-20-25 07:06:39 Completed in 0.01 minutes (Real: 0.35; User: 0.33; System: 0.02) :s_GLM
10.1.2 Scenario 2: (x.train, x.test)
You can provide training and testing sets as a single data.frame each, where the last column is the outcome:
x: data.frame(x.train, y.train)
y: data.frame(x.test, y.test)
x <- rnormmat(200, 10, seed = 2019)
w <- rnorm(10)
y <- x %*% w + rnorm(200)
dat <- data.frame(x, y)
res <- resample(dat, seed = 2020)
05-20-25 07:06:39 Input contains more than one columns; will stratify on last :resample
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:06:39 Created 10 stratified subsamples :resample
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
The dataPrepare function checks data dimensions, determines whether data was input as separate feature and outcome sets or as combined data frames, and ensures the correct number of cases and features was provided.
In either scenario, Regression will be performed if the outcome is numeric and Classification if the outcome is a factor.
10.2 Generalized Linear Model (GLM)
mod.glm <- s_GLM(dat.train, dat.test)
05-20-25 07:06:39 Hello, egenn :s_GLM
.:Regression Input Summary
Training features: 147 x 10
Training outcome: 147 x 1
Testing features: 53 x 10
Testing outcome: 53 x 1
05-20-25 07:06:39 Training GLM... :s_GLM
.:GLM Regression Training Summary
MSE = 0.84
RMSE = 0.92
MAE = 0.75
r = 0.96 (p = 5.9e-81)
R sq = 0.92
.:GLM Regression Testing Summary
MSE = 1.22
RMSE = 1.10
MAE = 0.90
r = 0.94 (p = 2.5e-26)
R sq = 0.89
05-20-25 07:06:39 Completed in 1.3e-04 minutes (Real: 0.01; User: 0.01; System: 1e-03) :s_GLM
Note: If there are factor features, s_GLM checks that no factor levels are present in the test set that are absent from the training set, since this would cause predict to fail. This situation can arise when running multiple cross-validated experiments.
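A brief base-R illustration of the underlying issue (not rtemis-specific; data and names are hypothetical):
df.train <- data.frame(grp = factor(c("a", "b", "a", "b")), y = c(1.2, 2.1, 1.4, 2.3))
df.test <- data.frame(grp = factor(c("a", "c")))   # level "c" was never seen in training
fit <- glm(y ~ grp, data = df.train)
# predict(fit, newdata = df.test)                  # errors: factor 'grp' has new level "c"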
10.3 Elastic Net (Regularized GLM)
Regularization helps prevent overfitting and allows training a linear model on a dataset with more features than cases (p >> n).
x <- rnormmat(500, 1000, seed = 2019)
w <- rnorm(1000)
y <- x %*% w + rnorm(500)
dat <- data.frame(x, y)
res <- resample(dat)
05-20-25 07:06:39 Input contains more than one columns; will stratify on last :resample
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
05-20-25 07:06:39 Created 10 stratified subsamples :resample
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
mod.ridge <- s_GLMNET(dat.train, dat.test, alpha = 0)
05-20-25 07:06:39 Hello, egenn :s_GLMNET
.:Regression Input Summary
Training features: 374 x 1000
Training outcome: 374 x 1
Testing features: 126 x 1000
Testing outcome: 126 x 1
05-20-25 07:06:40 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:40 Created 5 independent folds :resample
.:Search parameters
grid.params:
alpha: 0
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
05-20-25 07:06:40 Tuning Elastic Net by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:40 5 inner resamples; 5 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:41 Extracting best lambda from GLMNET models... :gridSearchLearn
.:Best parameters to minimize MSE
best.tune:
lambda: 191.979466499985
alpha: 0
05-20-25 07:06:41 Completed in 0.03 minutes (Real: 1.60; User: 0.25; System: 0.16) :gridSearchLearn
.:Parameters
alpha: 0
lambda: 191.979466499985
05-20-25 07:06:41 Training elastic net model... :s_GLMNET
.:GLMNET Regression Training Summary
MSE = 431.62
RMSE = 20.78
MAE = 16.60
r = 0.96 (p = 3.1e-203)
R sq = 0.59
.:GLMNET Regression Testing Summary
MSE = 950.69
RMSE = 30.83
MAE = 24.94
r = 0.52 (p = 6.2e-10)
R sq = 0.17
05-20-25 07:06:41 Completed in 0.03 minutes (Real: 1.83; User: 0.44; System: 0.18) :s_GLMNET
mod.lasso <- s_GLMNET(dat.train, dat.test, alpha = 1)
05-20-25 07:06:41 Hello, egenn :s_GLMNET
.:Regression Input Summary
Training features: 374 x 1000
Training outcome: 374 x 1
Testing features: 126 x 1000
Testing outcome: 126 x 1
05-20-25 07:06:42 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:42 Created 5 independent folds :resample
.:Search parameters
grid.params:
alpha: 1
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
05-20-25 07:06:42 Tuning Elastic Net by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:42 5 inner resamples; 5 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:42 Extracting best lambda from GLMNET models... :gridSearchLearn
.:Best parameters to minimize MSE
best.tune:
lambda: 5.27695386992611
alpha: 1
05-20-25 07:06:42 Completed in 0.01 minutes (Real: 0.73; User: 0.11; System: 0.08) :gridSearchLearn
.:Parameters
alpha: 1
lambda: 5.27695386992611
05-20-25 07:06:42 Training elastic net model... :s_GLMNET
.:GLMNET Regression Training Summary
MSE = 995.43
RMSE = 31.55
MAE = 25.17
r = 0.40 (p = 5.8e-16)
R sq = 0.06
.:GLMNET Regression Testing Summary
MSE = 1131.44
RMSE = 33.64
MAE = 27.29
r = 0.11 (p = 0.22)
R sq = 0.01
05-20-25 07:06:42 Completed in 0.02 minutes (Real: 0.94; User: 0.31; System: 0.10) :s_GLMNET
If you do not define alpha, it defaults to seq(0, 1, 0.2), which means that grid search will be used for tuning.
mod.elnet <- s_GLMNET(dat.train, dat.test)
05-20-25 07:06:42 Hello, egenn :s_GLMNET
.:Regression Input Summary
Training features: 374 x 1000
Training outcome: 374 x 1
Testing features: 126 x 1000
Testing outcome: 126 x 1
05-20-25 07:06:42 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:42 Created 5 independent folds :resample
.:Search parameters
grid.params:
alpha: 0, 0.2, 0.4, 0.6, 0.8, 1...
fixed.params:
.gs: TRUE
which.cv.lambda: lambda.1se
05-20-25 07:06:42 Tuning Elastic Net by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:42 5 inner resamples; 30 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:47 Extracting best lambda from GLMNET models... :gridSearchLearn
.:Best parameters to minimize MSE
best.tune:
lambda: 174.727833490782
alpha: 0
05-20-25 07:06:47 Completed in 0.07 minutes (Real: 4.37; User: 0.36; System: 0.33) :gridSearchLearn
.:Parameters
alpha: 0
lambda: 174.727833490782
05-20-25 07:06:47 Training elastic net model... :s_GLMNET
.:GLMNET Regression Training Summary
MSE = 404.86
RMSE = 20.12
MAE = 16.07
r = 0.96 (p = 3.8e-207)
R sq = 0.62
.:GLMNET Regression Testing Summary
MSE = 941.32
RMSE = 30.68
MAE = 24.85
r = 0.52 (p = 5.2e-10)
R sq = 0.18
05-20-25 07:06:47 Completed in 0.08 minutes (Real: 4.53; User: 0.50; System: 0.35) :s_GLMNET
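Since the default alpha is itself a vector, you can also pass your own vector of alpha values to tune over; a minimal sketch using the same interface shown above (output omitted):
mod.elnet.custom <- s_GLMNET(dat.train, dat.test, alpha = c(0, 0.5, 1))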
Many real-world relationships are nonlinear. A large number of regression approaches exist to model such relationships.
Let’s create some new synthetic data:
x <- rnormmat(400, 10)
y <- x[, 3]^2 + x[, 5] + rnorm(400)
In this example, y depends on the square of the 3rd column of x and linearly on the 5th column.
10.4 Generalized Additive Model (GAM)
Generalized Additive Models provide a very efficient way of fitting penalized splines. GAMs in rtemis can be fit with s_GAM (which uses mgcv::gam):
mod.gam <- s_GAM(x, y)
05-20-25 07:06:47 Hello, egenn :s_GAM
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:47 Training GAM... :s_GAM
.:GAM Regression Training Summary
MSE = 0.91
RMSE = 0.96
MAE = 0.77
r = 0.88 (p = 3.8e-128)
R sq = 0.77
05-20-25 07:06:47 Completed in 0.01 minutes (Real: 0.42; User: 0.40; System: 0.02) :s_GAM
10.5 Projection Pursuit Regression (PPR)
Projection Pursuit Regression is an extension of (generalized) additive models. Where a linear model is a linear combination of a set of predictors, and an additive model is a linear combination of nonlinear transformations of the predictors, a projection pursuit model is a linear combination of nonlinear transformations of linear combinations of predictors.
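In symbols, using the standard textbook formulation (not taken from the rtemis documentation), a PPR model with M terms is
$$ \hat{y}(\mathbf{x}) = \beta_0 + \sum_{m=1}^{M} f_m\!\left(\mathbf{w}_m^\top \mathbf{x}\right), $$
where each \mathbf{w}_m is a learned projection direction and each f_m is a smooth ridge function estimated from the data.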
mod.ppr <- s_PPR(x, y)
05-20-25 07:06:47 Hello, egenn :s_PPR
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
.:Parameters
nterms: 4
optlevel: 3
sm.method: spline
bass: 0
span: 0
df: 5
gcvpen: 1
05-20-25 07:06:47 Running Projection Pursuit Regression... :s_PPR
.:PPR Regression Training Summary
MSE = 0.82
RMSE = 0.90
MAE = 0.72
r = 0.89 (p = 7.1e-138)
R sq = 0.79
05-20-25 07:06:47 Completed in 3.2e-04 minutes (Real: 0.02; User: 0.02; System: 2e-03) :s_PPR
10.6 Support Vector Machine (SVM)
mod.svm <- s_SVM(x, y)
05-20-25 07:06:47 Hello, egenn :s_SVM
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:48 Training SVM Regression with radial kernel... :s_SVM
.:SVM Regression Training Summary
MSE = 0.79
RMSE = 0.89
MAE = 0.60
r = 0.91 (p = 4.7e-154)
R sq = 0.80
05-20-25 07:06:48 Completed in 0.01 minutes (Real: 0.33; User: 0.06; System: 0.01) :s_SVM
10.7 Classification and Regression Trees (CART)
mod.cart <- s_CART(x, y)
05-20-25 07:06:48 Hello, egenn :s_CART
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:48 Training CART... :s_CART
.:CART Regression Training Summary
MSE = 1.09
RMSE = 1.04
MAE = 0.83
r = 0.85 (p = 5.9e-113)
R sq = 0.72
05-20-25 07:06:48 Completed in 2.8e-04 minutes (Real: 0.02; User: 0.01; System: 2e-03) :s_CART
10.8 Random Forest
Multiple Random Forest implementations are included in rtemis. Ranger provides an efficient implementation well-suited for general use.
mod.rf <- s_Ranger(x, y)
05-20-25 07:06:48 Hello, egenn :s_Ranger
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
.:Parameters
n.trees: 1000
mtry: NULL
05-20-25 07:06:48 Training Random Forest (ranger) Regression with 1000 trees... :s_Ranger
.:Ranger Regression Training Summary
MSE = 0.30
RMSE = 0.54
MAE = 0.41
r = 0.98 (p = 1.1e-287)
R sq = 0.92
05-20-25 07:06:48 Completed in 1.6e-03 minutes (Real: 0.10; User: 0.40; System: 0.02) :s_Ranger
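The parameter printout above lists n.trees and mtry; as a hedged sketch, assuming these can be passed directly as arguments to s_Ranger:
mod.rf.500 <- s_Ranger(x, y, n.trees = 500)   # assumption: n.trees is accepted as an argument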
10.9 Gradient Boosting
Gradient Boosting is, on average, one of the best-performing learning algorithms for structured data. rtemis includes multiple implementations of boosting, along with support for boosting any learner; see the chapter on Boosting.
mod.gbm <- s_GBM(x, y)
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 400 x 10
Training outcome: 400 x 1
Testing features: Not available
Testing outcome: Not available
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
05-20-25 07:06:48 Running grid search... :gridSearchLearn
.:Resampling Parameters
n.resamples: 5
resampler: kfold
stratify.var: y
strat.n.bins: 4
05-20-25 07:06:48 Created 5 independent folds :resample
.:Search parameters
grid.params:
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
fixed.params:
n.trees: 2000
max.trees: 5000
gbm.select.smooth: FALSE
n.new.trees: 500
min.trees: 50
failsafe.trees: 500
ifw: TRUE
ifw.type: 2
upsample: FALSE
downsample: FALSE
resample.seed: NULL
relInf: FALSE
plot.tune.error: FALSE
.gs: TRUE
05-20-25 07:06:48 Tuning Gradient Boosting Machine by exhaustive grid search. :gridSearchLearn
05-20-25 07:06:48 5 inner resamples; 5 models total; running on 9 workers (aarch64-apple-darwin20) :gridSearchLearn
05-20-25 07:06:48 Running grid line #1 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 319 x 10
Training outcome: 319 x 1
Testing features: 81 x 10
Testing outcome: 81 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.91
RMSE = 0.95
MAE = 0.73
r = 0.89 (p = 1.3e-110)
R sq = 0.77
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.17
RMSE = 1.08
MAE = 0.89
r = 0.84 (p = 7.9e-23)
R sq = 0.70
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #2 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 320 x 10
Training outcome: 320 x 1
Testing features: 80 x 10
Testing outcome: 80 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.69
RMSE = 0.83
MAE = 0.65
r = 0.91 (p = 1.3e-125)
R sq = 0.83
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.30
RMSE = 1.14
MAE = 0.92
r = 0.81 (p = 1.3e-19)
R sq = 0.64
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #3 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 322 x 10
Training outcome: 322 x 1
Testing features: 78 x 10
Testing outcome: 78 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.61
RMSE = 0.78
MAE = 0.62
r = 0.92 (p = 8.6e-134)
R sq = 0.85
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.19
RMSE = 1.09
MAE = 0.84
r = 0.83 (p = 1.1e-20)
R sq = 0.67
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.13; User: 0.12; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #4 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 320 x 10
Training outcome: 320 x 1
Testing features: 80 x 10
Testing outcome: 80 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.75
RMSE = 0.87
MAE = 0.70
r = 0.90 (p = 2.5e-118)
R sq = 0.81
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.38
RMSE = 1.17
MAE = 0.82
r = 0.82 (p = 1.7e-20)
R sq = 0.66
05-20-25 07:06:48 Completed in 2.1e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
05-20-25 07:06:48 Running grid line #5 of 5... :...future.FUN
05-20-25 07:06:48 Hello, egenn :s_GBM
.:Regression Input Summary
Training features: 319 x 10
Training outcome: 319 x 1
Testing features: 81 x 10
Testing outcome: 81 x 1
05-20-25 07:06:48 Distribution set to gaussian :s_GBM
05-20-25 07:06:48 Running Gradient Boosting Regression with a gaussian loss function :s_GBM
.:Parameters
n.trees: 2000
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
.:GBM Regression Training Summary
MSE = 0.76
RMSE = 0.87
MAE = 0.69
r = 0.90 (p = 5.4e-116)
R sq = 0.80
05-20-25 07:06:48 Using predict for Regression with type = link :s_GBM
.:GBM Regression Testing Summary
MSE = 1.19
RMSE = 1.09
MAE = 0.90
r = 0.86 (p = 1.4e-24)
R sq = 0.73
05-20-25 07:06:48 Completed in 2e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) :s_GBM
.:Best parameters to minimize MSE
best.tune:
n.trees: 842
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
05-20-25 07:06:48 Completed in 0.01 minutes (Real: 0.33; User: 0.11; System: 0.07) :gridSearchLearn
.:Parameters
n.trees: 842
interaction.depth: 2
shrinkage: 0.01
bag.fraction: 0.9
n.minobsinnode: 5
weights: NULL
05-20-25 07:06:48 Training GBM on full training set... :s_GBM
.:GBM Regression Training Summary
MSE = 0.77
RMSE = 0.88
MAE = 0.70
r = 0.90 (p = 8.9e-146)
R sq = 0.81
05-20-25 07:06:48 Calculating relative influence of variables... :s_GBM
05-20-25 07:06:48 Completed in 0.01 minutes (Real: 0.56; User: 0.19; System: 0.08) :s_GBM
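A trained model can then be applied to new data; a minimal sketch, assuming the returned rtemis model object supports R's generic predict() with new feature data as its second argument:
x.new <- rnormmat(20, 10)             # hypothetical new cases
yhat.new <- predict(mod.gbm, x.new)   # assumption: a predict() method is defined for rtemis models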