DATA 622 Meetup 9: Trees Part 2: Boosting

George I. Hagstrom

2026-03-23

Week Summary

  • Lab 5 Available, due next Sunday
  • Reading:
    • Chapter 8 in ISLP: Boosting and BART
    • Chapter 7 Boosting in ISLP
  • Vignette on Boosting
  • Vignette on Production Basics

Meetup Wednesday!!!

Lab 3 Feedback

  • Be very wary of default or LLM-chosen model parameters!!!
    • The penalty is on by default (e.g. ‘sklearn’’s ‘LogisticRegression’ defaults to an L2 penalty)

Lab 3 Feedback

  • Be very cognizant of model options!!!
    • Without standard scaler, model is not what you expect

Lab 3 Feedback

  • Be very cognizant of model options!!!
    • ‘class_weight=balanced’ ruins calibration

Lab 3 Feedback

  • Read the questions
  • Read your report
  • Message me before turning in a really long report!

Bagging Review

  • Use bootstrapping to generate a lot of datasets
  • Fit a deep tree to each dataset
  • Average to predict
  • Subsample features to make it a Random Forest
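The recipe above can be written out directly. This is a minimal sketch on synthetic data (the dataset and all parameter choices here are illustrative, not from the lab): bootstrap the rows, fit a deep tree to each resample, and average the predictions.

```python
# Minimal bagging sketch: bootstrap resamples, one deep tree per resample,
# average the trees' predictions. Synthetic 1-D regression data (illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

n_trees = 50
preds = np.zeros((n_trees, len(X)))
for b in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample of rows
    tree = DecisionTreeRegressor(max_depth=None, random_state=b)  # deep tree
    tree.fit(X[idx], y[idx])
    preds[b] = tree.predict(X)

bagged = preds.mean(axis=0)                      # average to predict
print(np.mean((bagged - y) ** 2))                # MSE of the bagged ensemble
```

To make this a Random Forest, each tree would additionally consider only a random subset of features at each split (`max_features` in sklearn's tree/forest estimators).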

Strong versus Weak Learners

  • Many methods in ML are based on the idea that, given enough data, the algorithm can reach arbitrarily high accuracy; such an algorithm is called a strong learner
  • Consider an algorithm for identifying horses:

Thanks nanobanana

Strong versus Weak Learners

  • As we increase the number of images, accuracy will improve

Thanks nanobanana

Strong versus Weak Learners

  • And as the number of horse images goes to infinity, the strong learner will become close to perfect

Thanks nanobanana

Strong versus Weak Learners

  • The weak learner, on the other hand, is any learner able to exceed 50% accuracy (i.e. do better than random guessing) given enough data

Thanks nanobanana

Boosting Leverages Weak Learners

  • Boosting stands for Hypothesis Boosting and was inspired by the idea of combining weak learners to create a strong learner
  • A weak learner might be a decision tree of depth 1 (a “stump”)

Adaptive Boosting

  • ‘AdaBoost’:
    • Fit a simple tree \(\hat{f}_0\)
    • Initialize the ensemble \(\hat{f} = \hat{f}_0\)
    • Identify the points misclassified by \(\hat{f}\)
    • Reweight points based on error, scaled by the learning rate \(\lambda\)
    • Fit \(\hat{f}_1\) with the new weights, add it to the ensemble, and repeat

Boosting Regression

  • ‘Boosting’:
    • Fit a simple tree \(\hat{f}_0\); initialize \(\hat{f} = \lambda\hat{f}_0\)
    • Calculate the residuals: \(r_i = y_i - \hat{f}(\mathbf{x}_i)\)
    • Fit a new tree \(\hat{f}_1\) to the \(r_i\)
    • Update the model: \(\hat{f} = \lambda\hat{f}_0 + \lambda\hat{f}_1\)
    • Repeat
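Those steps can be sketched in a few lines. This is a from-scratch illustration on synthetic data (data and hyperparameter values are assumptions, not from the lecture example): each round fits a small tree to the current residuals and adds a shrunken copy to the ensemble.

```python
# The regression-boosting loop written out: fit to residuals, shrink, repeat.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.2, size=400)

lam, n_rounds = 0.1, 200
f_hat = np.zeros(len(y))                 # ensemble prediction, starts at 0
trees = []
for _ in range(n_rounds):
    r = y - f_hat                        # residuals of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    f_hat += lam * tree.predict(X)       # shrink each new tree by lambda
    trees.append(tree)

print(np.mean((y - f_hat) ** 2))         # training MSE drops as rounds grow
```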

Boosting Regression Example

  • Start with target function:

Boosting Regression Example

  • Fit a decision tree of depth 2

Boosting Regression Example

  • Look at the residuals:

Boosting Regression Example

  • Now fit the residuals

10 Rounds of Boosting

100 Rounds of Boosting

1000 Rounds of Boosting

Boosted Trees Can Overfit

  • Learning Rate \(\lambda\) \[ \hat{f} = \sum_{i=1}^{n_e} \lambda \hat{f}_i \]

  • Small \(\lambda\) slows learning and reduces variance

Learning Rate Effects

  • Small \(\lambda\) regularizes

Max Depth

  • Deeper Trees lead to Overfitting

Other Hyperparameters

  • ‘subsample’: fit each tree on part of the data
    • Helps with large, noisy data
    • Regularizes
    • Increases speed
  • ‘min_samples_leaf’: minimum points per leaf
  • Lasso and Ridge penalties (‘reg_alpha’ and ‘reg_lambda’ in ‘XGBoost’)
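A sketch of where these knobs live in sklearn's `GradientBoostingRegressor` (the data and the specific values are illustrative; the lasso/ridge leaf penalties are XGBoost options, `reg_alpha`/`reg_lambda`, and are not exposed by this sklearn estimator):

```python
# Illustrative hyperparameter settings for a gradient-boosted regressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
gbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,     # small lambda: slower learning, more regularized
    max_depth=2,            # shallow trees limit overfitting
    subsample=0.7,          # each tree sees 70% of rows: regularizes, faster
    min_samples_leaf=10,    # require enough points in every leaf
    random_state=0,
).fit(X, y)
print(gbr.score(X, y))      # training R^2
```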

Classification and Gradient Boosting

  • Boosted trees use regression trees even for classification problems
  • The standard loss is called ‘cross-entropy’, equivalent to the ‘log’ score
  • For each class there are trees that predict a score, converted to probabilities by softmax: \[ p(y=k|\mathbf{x}) = \frac{\exp(\hat{f}_k(\mathbf{x}))}{\sum_{k'=1}^K \exp(\hat{f}_{k'}(\mathbf{x}))} \]

Classification and Gradient Boosting

  • Weights come from sum of trees for each class:

\[ \hat{f}_k(\mathbf{x}) = \sum_{m=1}^M \eta \hat{f}_{mk}(\mathbf{x}) \]

Where does ‘Gradient’ Come in?

  • Gradient of the log likelihood with respect to the weights: \[ \frac{\partial L}{\partial \hat{f}_k(\mathbf{x}_i)} = I(y_i = k) - p(y=k|\mathbf{x}_i) \]
  • This is the direction in which the likelihood increases most
  • Treat the gradient like a residual
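A quick numeric check of that formula (the scores and label here are made up for illustration): the gradient for each class is the gap between the one-hot label and the softmax probability, which is exactly the "residual" the next round of trees is fit to.

```python
# Gradient of the log likelihood w.r.t. the class scores:
# I(y = k) - p(y = k | x), a residual between label and prediction.
import numpy as np

f = np.array([2.0, 0.5, -1.0])           # raw scores f_k(x), K = 3 classes
p = np.exp(f) / np.exp(f).sum()          # softmax probabilities
y_onehot = np.array([1.0, 0.0, 0.0])     # true class is k = 0

grad = y_onehot - p                      # the "residual" for the next trees
print(grad, grad.sum())                  # components sum to zero
```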

State-Of-The-Art Gradient Boosting

  • There are many highly optimized gradient boosting libraries and models
  • Most also use approximate curvature (second-order) information about the loss

Gradient Boosting Libraries

  • ‘XGBoost’: Oldest, slowest, most solid, needs one-hot
  • ‘CatBoost’: Best for categorical, some tricks for low bias
  • ‘LightGBM’: From Microsoft, built for speed, more overfitting potential
  • ‘HistGradientBoostingClassifier’: ‘sklearn’ implementation

Example: UCI Credit Card Defaults

  • Dataset of 30,000 observations
  • Features: Payment History
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
17041 190000.0 2 2 2 25 0 0 0 0 0 ... 23130.0 28126.0 26104.0 18840.0 1615.0 1200.0 26703.0 2104.0 7000.0 11747.0
8451 50000.0 1 2 2 40 0 0 0 0 0 ... 48600.0 7514.0 9336.0 -177.0 10000.0 21019.0 1500.0 3000.0 1210.0 7900.0
5764 80000.0 2 3 2 22 0 0 -1 -1 -2 ... 15674.0 -1.0 -1.0 -1.0 2400.0 15674.0 0.0 0.0 0.0 0.0
1745 50000.0 2 1 2 22 0 0 0 0 0 ... 50071.0 10104.0 9208.0 10075.0 2300.0 2000.0 1000.0 500.0 1000.0 500.0
29645 310000.0 1 2 1 34 0 0 0 0 0 ... 80533.0 70343.0 58365.0 51454.0 3100.0 3604.0 2366.0 2018.0 2000.0 1700.0

5 rows × 23 columns

Example: UCI Credit Card Default Dataset

  • Dataset of 30,000 observations
  • Target: Default in the next month
  • 78% don’t default
default.payment.next.month
0    18691
1     5309
Name: count, dtype: int64

Example: UCI Credit Card Default Dataset

  • Dataset of 30,000 observations
  • Target: Default in the next month
correlation
PAY_0 0.327770
PAY_2 0.263590
PAY_3 0.235160
PAY_4 0.218304
PAY_5 0.206920
PAY_6 0.188902
LIMIT_BAL -0.155912
PAY_AMT1 -0.070510
PAY_AMT5 -0.056307
PAY_AMT2 -0.055253
PAY_AMT3 -0.055114
PAY_AMT4 -0.052717
PAY_AMT6 -0.047702
SEX -0.042882
EDUCATION 0.025775

CV with ‘XGBoost’

  • Have a lot of hyperparameters:
    • ‘n_estimators’
    • ‘learning_rate’
    • ‘max_depth’
    • even more…

Early Stopping

  • DO NOT use CV to pick ‘n_estimators’
  • Instead, monitor validation error on each fold
  • If validation error does not decrease after a certain number of iterations, stop
  • Pick the other hyperparameters using CV

Regularization Path

  • Select the learning rate with the lowest log loss

‘n_estimators’ vs ‘learning_rate’ trade-off

Eval Set

  • Refit the model with the selected hyperparameters on the training set
  • But hold out 15% to determine early stopping
  • Just edges out the RF and LR models
Model                       Log Loss   Accuracy
-----------------------------------------------
XGBoost                       0.4337     0.8200
Random Forest                 0.4423     0.8158
Logistic Regression           0.4679     0.8113

ROC Curves

Calibration Curves

Thanks