DATA 622 Meetup 8: Trees and Ensemble Methods
2026-03-16
Week Summary
- Pushed MVP Demo to Week After Spring Break
- Lab 5 Available, due in two weeks
- Reading:
- Chapter 8 in ISLP (skip BART until Causal Inference week)
- Chapter 6 and 7 (up to boosting) in HOML
- One Vignette on Random Forests and Decision Trees
Save the Date March 25th 6:30PM!
- Sign up here: Sign up Link
- Tell your classmates just in case
Spring Pitchfest!
Decision Trees
- Decision trees divide the predictor space into regions
- Predictions are made using the training data that falls into each region
(Figure: ISLP 8.6)
Previous Models Easy to Solve
- Previous Models:
- Linear Regression
- Logistic Regression
- GLMs
- Support Vector Machines
Learning algorithms for these models guarantee finding the “best” solution
Models for Rest of the Class Hard to Solve
- Rest of Class:
- Decision Trees
- Neural Networks
It will NOT be possible to find the one best model for your data
- It becomes much more about finding good solutions
Goal of Decision Trees
- Idea is to implement a method called “piecewise constant regression”
- Every combination of predictor values falls into one region
- Use the training values in that region to predict:
  - Mean or median for regression
  - Majority vote for classification
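A minimal sketch of piecewise constant regression, assuming sklearn's `DecisionTreeRegressor` and a hypothetical step-shaped toy dataset: a depth-2 tree defines at most four regions, so it can emit at most four distinct predicted values, each the mean of its region.

```python
# Piecewise-constant regression sketch (toy data, not the course dataset)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
# Step function plus noise: low on the left, high on the right
y = np.where(X.ravel() < 5, 2.0, 7.0) + rng.normal(0, 0.3, 200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
preds = tree.predict(X)

# Depth 2 => at most 4 leaves => at most 4 distinct predictions,
# each the mean of the training y-values in that region
print(len(np.unique(preds)))
```

The constant-per-region behavior is exactly why a single tree looks like a staircase when plotted against one predictor.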
Binary Recursive Split
- Decision tree algorithms use binary trees to define the regions
- Each node splits its “parent” set into two “child” sets based on a condition
Greedy Splits
- At each step: consider every candidate split across every predictor
- Choose the best split
- Regression: reduction in MSE \[
\Delta_{\mathrm{split}} = n\,\mathrm{MSE}_{\mathrm{parent}} - \left(n_L \cdot \mathrm{MSE}_L + n_R \cdot \mathrm{MSE}_R\right)
\]
- Stop when the tree has too many leaves, is too deep, or a split would leave too few points
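The greedy step above can be sketched by hand. This is a toy example (hypothetical data, not the Hitters dataset): compute \(\Delta_{\mathrm{split}}\) for every midpoint cut of one predictor and keep the best.

```python
# Greedy split search on one predictor, using the Delta_split formula
import numpy as np

y = np.array([1.0, 1.2, 0.9, 4.8, 5.1, 5.0])      # response (two clusters)
x = np.array([1, 2, 3, 10, 11, 12], dtype=float)  # one predictor

def mse(v):
    return np.mean((v - v.mean()) ** 2) if len(v) else 0.0

def delta_split(x, y, cut):
    left, right = y[x <= cut], y[x > cut]
    return len(y) * mse(y) - (len(left) * mse(left) + len(right) * mse(right))

# Candidate cuts: midpoints between consecutive x values
cuts = (x[:-1] + x[1:]) / 2
best = max(cuts, key=lambda c: delta_split(x, y, c))
print(best)  # 6.5, the midpoint separating the two clusters
```

A real implementation repeats this scan over every predictor, then recurses into the two child sets.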
Greedy Splits
- Start with “Hitters” and log salary versus RBIs and Hits
Classification
- Instead of accuracy, pick the variable and split point using measures of node impurity:
- Gini Impurity: \(\sum_{k}\hat{p}_k(1-\hat{p}_k)\)
- Entropy: \(-\sum_{k}\hat{p}_k\log \hat{p}_k\)
- Pick the variable and split that give the maximum reduction in weighted Gini or entropy
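A quick numeric check of the two impurity formulas above, for a toy node with class proportions 0.8 and 0.2 (hypothetical counts):

```python
# Gini impurity and entropy for one node's class proportions
import numpy as np

counts = np.array([40, 10])        # e.g. 40 of class A, 10 of class B
p = counts / counts.sum()          # [0.8, 0.2]

gini = np.sum(p * (1 - p))         # sum_k p_k (1 - p_k)
entropy = -np.sum(p * np.log(p))   # -sum_k p_k log p_k  (natural log)

print(round(gini, 3))              # 0.32
```

Both measures are 0 for a pure node and largest when classes are evenly mixed, which is why a split is chosen to drive the weighted child impurities down.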
Trees are Low Bias and High Variance
Pruning Regularizes Trees
- Common strategy is to build a large tree and then use regularization to pick a sub-tree
- Cost-Complexity Pruning Penalty \(\alpha\)
- Pick tree with lowest penalized error: \[
\mathrm{argmin}_{T\subset T_0} \sum_{i=1}^n (y_i - T(\mathbf{x}_i))^2 + \alpha |T|
\]
- \(|T|\) is number of terminal nodes
Best \(\alpha\) with CV
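A sketch of the full workflow, assuming sklearn's `ccp_alpha` API and toy data: grow a large tree, extract the candidate \(\alpha\) values from the cost-complexity pruning path, then pick \(\alpha\) by cross-validation.

```python
# Cost-complexity pruning: candidate alphas + CV to pick the best subtree
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0, None)  # guard tiny negative fp values

cv = GridSearchCV(DecisionTreeRegressor(random_state=0),
                  {"ccp_alpha": alphas}, cv=5)
cv.fit(X, y)
pruned = cv.best_estimator_

# The CV-chosen tree is a subtree of (or equal to) the fully grown one
print(pruned.get_n_leaves(), "<=", big_tree.get_n_leaves())
```

Larger \(\alpha\) penalizes \(|T|\) more heavily and yields smaller subtrees; \(\alpha = 0\) keeps the full tree.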
Wisdom of the Crowds
- In 1906, the statistician Francis Galton visited a livestock fair in Plymouth:
- Collected nearly 800 guesses of the weight of an ox
(Figure: oil painting of the Craven Heifer, 1811)
Wisdom of the Crowds
- Crowd median: 1207 pounds
- Actual weight: 1198 pounds
Ensemble Learning
- Combine predictions of several methods
(Figure: HOML 7.2)
Ensemble Learning
- Different models make errors in different ways
- Combining them leads to complementary insights
- Stacking: Train a final model to combine outputs of different models
- The Netflix Prize winners did this, blending matrix factorization, kNN, and neural network models
- Mixture of Experts for LLMs
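A minimal stacking sketch, assuming sklearn's `StackingRegressor` and toy data: a final linear model learns how to blend a tree and a kNN model from their out-of-fold predictions.

```python
# Stacking: a meta-model combines two different base learners
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (400, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.2, 400)

stack = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor(max_depth=4)),
                ("knn", KNeighborsRegressor(n_neighbors=5))],
    final_estimator=LinearRegression(),  # learns the blending weights
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))  # R^2 of the blended model
```

The base models make errors in different places (axis-aligned steps vs. local averaging), which is exactly the situation where blending helps.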
When does it help?
- Consider errors from two models, \(\delta_1\) and \(\delta_2\)
- Average them together and compute the variance: \[
\mathrm{Var}\left(\frac{1}{2}\delta_1 + \frac{1}{2}\delta_2\right) = \frac{1}{4}\mathrm{Var}(\delta_1) +
\frac{1}{4}\mathrm{Var}(\delta_2) + \frac{1}{2}\mathrm{Cov}(\delta_1,\delta_2)
\]
- If the two models are identical, averaging makes no difference
- If the models make different (weakly correlated) errors with similar accuracy, averaging reduces error
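A numeric check of the variance formula above, with simulated error series: perfectly correlated errors gain nothing from averaging, while independent errors of equal variance have their variance halved.

```python
# Variance of averaged errors: correlated vs independent
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Case 1: identical (perfectly correlated) errors -- no benefit
common = rng.normal(0, 1, n)
same = np.var((common + common) / 2)          # ~ Var(delta) = 1

# Case 2: independent errors with equal variance -- variance halved
d1, d2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
indep = np.var((d1 + d2) / 2)                 # ~ 1/4 + 1/4 + 0 = 0.5

print(round(same, 2), round(indep, 2))
```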
Bagging
- You can also combine models from the same class trained with different features or data:
(Figure: HOML 7.4)
Bagging
- Generate a lot of trees, each tree generalizes poorly
- Average or poll their predictions to reduce variance \[
\hat{g}_{\mathrm{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{j=1}^B \hat{g}_{j}(\mathbf{x})
\]
Bagging Classification
- Same idea: generate many trees, each generalizing poorly
- Take a majority vote of their predictions to reduce variance
\[
\hat{g}_{\mathrm{bag}}(\mathbf{x}) = \mathrm{argmax}_{y\in Y} \sum_{j=1}^B I(\hat{g}_j(\mathbf{x}) = y)
\]
Where do the \(g\)s come from?
- Goal: generate trees that have learned, but are different from each other
- Bootstrap Aggregating
- Fit each tree to Bootstrap Sample of Data
Bagging can be applied to any statistical model, but it is most common for decision trees
How many trees to bag?
- Model improves as \(B\) increases, up until a point
- Adding more trees doesn’t increase test error
Decreasing the Variance by Decorrelating Trees
- The improvement from bagging alone is often modest
- Trees tend to be similar to each other, even when grown on different bootstrap samples
Random Forest: at each split, randomly select a subset of the features, fewer than the total.
- Forces trees to be less correlated
Random Forests
- Can try any value of `max_features` from 1 to \(p\)
- Common defaults: \(p/3\) for regression, \(\sqrt{p}\) for classification
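A sketch with sklearn's `RandomForestRegressor` on hypothetical data with \(p = 9\) predictors, applying the \(p/3\) rule of thumb for regression:

```python
# Random forest: max_features limits predictors considered per split
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 9))                  # p = 9 predictors
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 300)

# p/3 rule of thumb for regression: 9 / 3 = 3 features per split
rf = RandomForestRegressor(n_estimators=100, max_features=3,
                           random_state=0).fit(X, y)
print(round(rf.score(X, y), 3))
```

Smaller `max_features` makes the individual trees weaker but less correlated; `max_features` is worth tuning like any other hyperparameter.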
Random Forest Classification
- 15 Class Gene Expression data with 500 features
(Figure: ISLP 8.10)
Out Of Bag Error
- About one third of the observations are left out of each bootstrap sample
- Those “out-of-bag” observations can be used to estimate test error
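Two quick checks on toy data: the fraction of rows a bootstrap sample misses is roughly \(1/e \approx 0.37\) (the "about one third"), and sklearn's `oob_score=True` turns those rows into a free validation estimate.

```python
# Out-of-bag error: free validation from the rows each tree never saw
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
n = 400
X = rng.normal(size=(n, 5))
y = X[:, 0] - X[:, 1] + rng.normal(0, 0.3, n)

# Fraction of rows absent from one bootstrap sample: about 1 - 1/e
idx = rng.integers(0, n, n)
left_out = 1 - len(np.unique(idx)) / n
print(round(left_out, 2))  # roughly 0.37

rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           random_state=0).fit(X, y)
print(round(rf.oob_score_, 3))  # OOB R^2, no held-out set needed
```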
Feature Importance
- We are building up to something like this:
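A feature-importance readout like the following is the kind of output we are building toward. This is a sketch on hypothetical data using sklearn's impurity-based `feature_importances_`: the importances sum to 1, and the informative predictors dominate.

```python
# Impurity-based feature importance from a random forest
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
# Only features 0 and 1 carry signal; the other four are noise
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.3, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
imp = rf.feature_importances_

print(np.argsort(imp)[::-1][:2])  # the two informative features rank first
```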