DATA 622 Meetup 9: Trees Part 2: Boosting

George I. Hagstrom

2026-03-23

Week Summary

  • Lab 5 Available, due next Sunday
  • Reading:
    • Chapter 8 in ISLP: Boosting and BART
    • Chapter 7 Boosting in ISLP
  • Vignette on Boosting
  • Vignette on Production Basics

Meetup Wednesday!!!

Lab 3 Feedback

  • Be very wary of default or LLM-chosen model parameters!!!
    • The penalty is on by default (e.g. ‘sklearn’’s ‘LogisticRegression’ defaults to an L2 penalty)

Lab 3 Feedback

  • Be very cognizant of model options!!!
    • Without standard scaler, model is not what you expect

Lab 3 Feedback

  • Be very cognizant of model options!!!
    • ‘class_weight=balanced’ ruins calibration

Lab 3 Feedback

  • Read the questions
  • Read your report
  • Message me before turning in a really long report!

Bagging Review

  • Use bootstrapping to generate a lot of datasets
  • Fit a deep tree to each dataset
  • Average to predict
  • Subsample features to make it a Random Forest
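The recipe above can be written out directly. This is a minimal sketch on synthetic data (the dataset and all parameter choices here are illustrative, not from the lab): bootstrap the rows, fit a deep tree to each resample, and average the predictions.

```python
# Minimal bagging sketch: bootstrap resamples, one deep tree per resample,
# average the trees' predictions. Synthetic 1-D regression data (illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

n_trees = 50
preds = np.zeros((n_trees, len(X)))
for b in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample of rows
    tree = DecisionTreeRegressor(max_depth=None, random_state=b)  # deep tree
    tree.fit(X[idx], y[idx])
    preds[b] = tree.predict(X)

bagged = preds.mean(axis=0)                      # average to predict
print(np.mean((bagged - y) ** 2))                # MSE of the bagged ensemble
```

To make this a Random Forest, each tree would additionally consider only a random subset of features at each split (`max_features` in sklearn's tree/forest estimators).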

Strong versus Weak Learners

  • Many methods in ML are based on the idea that, given enough data, the algorithm can reach arbitrarily high accuracy; such an algorithm is called a strong learner
  • Consider an algorithm for identifying horses:

Thanks nanobanana

Strong versus Weak Learners

  • As we increase the number of images, accuracy will improve

Thanks nanobanana

Strong versus Weak Learners

  • And as the number of horse images goes to infinity, the strong learner will become close to perfect

Thanks nanobanana

Strong versus Weak Learners

  • The weak learner, on the other hand, is any learner able to exceed 50% accuracy (i.e. do better than random guessing) given enough data

Thanks nanobanana

Boosting Leverages Weak Learners

  • Boosting stands for Hypothesis Boosting and was inspired by the idea of combining weak learners to create a strong learner
  • A weak learner might be a decision tree of depth 1 (a “stump”)

Adaptive Boosting

  • ‘AdaBoost’:
    • Fit a simple tree \(\hat{f}_0\)
    • Initialize the ensemble \(\hat{f} = \hat{f}_0\)
    • Identify the points misclassified by \(\hat{f}\)
    • Reweight points based on error, scaled by the learning rate \(\lambda\)
    • Fit \(\hat{f}_1\) with the new weights, add it to the ensemble, and repeat

Boosting Regression

  • ‘Boosting’:
    • Fit a simple tree \(\hat{f}_0\); initialize \(\hat{f} = \lambda\hat{f}_0\)
    • Calculate the residuals: \(r_i = y_i - \hat{f}(\mathbf{x}_i)\)
    • Fit a new tree \(\hat{f}_1\) to the \(r_i\)
    • Update the model: \(\hat{f} = \lambda\hat{f}_0 + \lambda\hat{f}_1\)
    • Repeat
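Those steps can be sketched in a few lines. This is a from-scratch illustration on synthetic data (data and hyperparameter values are assumptions, not from the lecture example): each round fits a small tree to the current residuals and adds a shrunken copy to the ensemble.

```python
# The regression-boosting loop written out: fit to residuals, shrink, repeat.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.2, size=400)

lam, n_rounds = 0.1, 200
f_hat = np.zeros(len(y))                 # ensemble prediction, starts at 0
trees = []
for _ in range(n_rounds):
    r = y - f_hat                        # residuals of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    f_hat += lam * tree.predict(X)       # shrink each new tree by lambda
    trees.append(tree)

print(np.mean((y - f_hat) ** 2))         # training MSE drops as rounds grow
```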

Boosting Regression Example

  • Start with target function:

Boosting Regression Example

  • Fit a decision tree of depth 2

Boosting Regression Example

  • Look at the residuals:

Boosting Regression Example

  • Now fit the residuals

10 Rounds of Boosting

100 Rounds of Boosting

1000 Rounds of Boosting

Boosted Trees Can Overfit

  • Learning Rate \(\lambda\) \[ \hat{f} = \sum_{i=1}^{n_e} \lambda \hat{f}_i \]

  • Small \(\lambda\) slows learning and reduces variance

Learning Rate Effects

  • Small \(\lambda\) regularizes

Max Depth

  • Deeper Trees lead to Overfitting

Other Hyperparameters

  • ‘subsample’: fit each tree on part of the data
    • Helps with large, noisy data
    • Regularizes
    • Increases speed
  • ‘min_samples_leaf’: minimum points per leaf
  • Lasso and Ridge penalties (‘reg_alpha’ and ‘reg_lambda’ in ‘XGBoost’)
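A sketch of where these knobs live in sklearn's `GradientBoostingRegressor` (the data and the specific values are illustrative; the lasso/ridge leaf penalties are XGBoost options, `reg_alpha`/`reg_lambda`, and are not exposed by this sklearn estimator):

```python
# Illustrative hyperparameter settings for a gradient-boosted regressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
gbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,     # small lambda: slower learning, more regularized
    max_depth=2,            # shallow trees limit overfitting
    subsample=0.7,          # each tree sees 70% of rows: regularizes, faster
    min_samples_leaf=10,    # require enough points in every leaf
    random_state=0,
).fit(X, y)
print(gbr.score(X, y))      # training R^2
```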

Classification and Gradient Boosting

  • Boosted trees use regression trees even for classification problems
  • The standard loss is called ‘cross-entropy’, equivalent to the ‘log’ score
  • For each class there are trees that predict a score, converted to probabilities by softmax: \[ p(y=k|\mathbf{x}) = \frac{\exp(\hat{f}_k(\mathbf{x}))}{\sum_{k'=1}^K \exp(\hat{f}_{k'}(\mathbf{x}))} \]

Classification and Gradient Boosting

  • Weights come from sum of trees for each class:

\[ \hat{f}_k(\mathbf{x}) = \sum_{m=1}^M \eta \hat{f}_{mk}(\mathbf{x}) \]

Where does ‘Gradient’ Come in?

  • Gradient of the log likelihood with respect to the weights: \[ \frac{\partial L}{\partial \hat{f}_k(\mathbf{x}_i)} = I(y_i = k) - p(y=k|\mathbf{x}_i) \]
  • This is the direction in which the likelihood increases most
  • Treat the gradient like a residual
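A quick numeric check of that formula (the scores and label here are made up for illustration): the gradient for each class is the gap between the one-hot label and the softmax probability, which is exactly the "residual" the next round of trees is fit to.

```python
# Gradient of the log likelihood w.r.t. the class scores:
# I(y = k) - p(y = k | x), a residual between label and prediction.
import numpy as np

f = np.array([2.0, 0.5, -1.0])           # raw scores f_k(x), K = 3 classes
p = np.exp(f) / np.exp(f).sum()          # softmax probabilities
y_onehot = np.array([1.0, 0.0, 0.0])     # true class is k = 0

grad = y_onehot - p                      # the "residual" for the next trees
print(grad, grad.sum())                  # components sum to zero
```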

State-Of-The-Art Gradient Boosting

  • There are many highly optimized gradient boosting libraries and models
  • Most also use approximate curvature (second-order) information about the loss

Gradient Boosting Libraries

  • ‘XGBoost’: Oldest, slowest, most solid, needs one-hot
  • ‘CatBoost’: Best for categorical, some tricks for low bias
  • ‘LightGBM’: From Microsoft, built for speed, more overfitting potential
  • ‘HistGradientBoostingClassifier’: ‘sklearn’ implementation

Example: UCI Credit Card Defaults

  • Dataset of 30,000 observations
  • Features: Payment History
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
17041 190000.0 2 2 2 25 0 0 0 0 0 ... 23130.0 28126.0 26104.0 18840.0 1615.0 1200.0 26703.0 2104.0 7000.0 11747.0
8451 50000.0 1 2 2 40 0 0 0 0 0 ... 48600.0 7514.0 9336.0 -177.0 10000.0 21019.0 1500.0 3000.0 1210.0 7900.0
5764 80000.0 2 3 2 22 0 0 -1 -1 -2 ... 15674.0 -1.0 -1.0 -1.0 2400.0 15674.0 0.0 0.0 0.0 0.0
1745 50000.0 2 1 2 22 0 0 0 0 0 ... 50071.0 10104.0 9208.0 10075.0 2300.0 2000.0 1000.0 500.0 1000.0 500.0
29645 310000.0 1 2 1 34 0 0 0 0 0 ... 80533.0 70343.0 58365.0 51454.0 3100.0 3604.0 2366.0 2018.0 2000.0 1700.0

5 rows × 23 columns

Example: UCI Credit Card Default Dataset

  • Dataset of 30,000 observations
  • Target: Default in the next month
  • 78% don’t default
default.payment.next.month
0    18691
1     5309
Name: count, dtype: int64

Example: UCI Credit Card Default Dataset

  • Dataset of 30,000 observations
  • Target: Default in the next month
correlation
PAY_0 0.327770
PAY_2 0.263590
PAY_3 0.235160
PAY_4 0.218304
PAY_5 0.206920
PAY_6 0.188902
LIMIT_BAL -0.155912
PAY_AMT1 -0.070510
PAY_AMT5 -0.056307
PAY_AMT2 -0.055253
PAY_AMT3 -0.055114
PAY_AMT4 -0.052717
PAY_AMT6 -0.047702
SEX -0.042882
EDUCATION 0.025775

CV with ‘XGBoost’

  • Have a lot of hyperparameters:
    • ‘n_estimators’
    • ‘learning_rate’
    • ‘max_depth’
    • even more…

Early Stopping

  • DO NOT use CV to pick ‘n_estimators’
  • Instead, monitor validation error on each fold
  • If validation error does not decrease after a certain number of iterations, stop
  • Pick the other hyperparameters using CV

Regularization Path

  • Select the learning rate with the lowest log loss

‘n_estimators’ vs ‘learning_rate’ trade-off

Eval Set

  • Refit the model with the selected hyperparameters on the training set
  • But hold out 15% to determine early stopping
  • Just edges out the RF and LR models
Model                       Log Loss   Accuracy
-----------------------------------------------
XGBoost                       0.4337     0.8200
Random Forest                 0.4423     0.8158
Logistic Regression           0.4679     0.8113

ROC Curves

Calibration Curves

Thanks