Module 6 - Resampling and Cross-Validation

Overview

In traditional statistical modeling, model fits are evaluated using test statistics, hypothesis tests, or examination of the posterior distribution. These statistical tools are often unavailable for machine learning models because of their complexity. Instead, computational methods based on resampling have been developed that allow estimation of uncertainty, out-of-sample accuracy (generalization), and model comparison. This week we begin our exploration of these tools by studying resampling, the bootstrap, and cross-validation, the last of which is one of the most crucial techniques for evaluating machine learning models.
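To make the idea concrete before the readings, here is a minimal sketch of k-fold cross-validation using only the Python standard library. The function and variable names (`k_fold_cv`, `fit_mean`, etc.) are illustrative and not taken from the course notebooks; a deliberately trivial model (predicting the training-set mean) keeps the focus on the resampling logic itself.

```python
import random
import statistics

def k_fold_cv(x, y, k, fit, predict):
    """Estimate out-of-sample error with k-fold cross-validation.

    fit(xs, ys) -> model; predict(model, xi) -> prediction.
    Returns the mean squared error averaged over the k held-out folds.
    """
    idx = list(range(len(x)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        errs = [(predict(model, x[i]) - y[i]) ** 2 for i in held_out]
        fold_mses.append(statistics.mean(errs))
    return statistics.mean(fold_mses)

# Toy "model": predict the training-set mean of y, ignoring x entirely.
def fit_mean(xs, ys):
    return statistics.mean(ys)

def predict_mean(model, xi):
    return model

random.seed(0)
x = list(range(20))
y = [xi + random.gauss(0, 1) for xi in x]
cv_mse = k_fold_cv(x, y, k=5, fit=fit_mean, predict=predict_mean)
print(f"5-fold CV estimate of test MSE: {cv_mse:.2f}")
```

Each observation is held out exactly once, so the averaged fold error is an estimate of how the model would perform on data it was not trained on; libraries like scikit-learn wrap this same pattern in utilities such as `cross_val_score`.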

Learning Objectives

  • Understand how to apply cross-validation to assess out-of-sample accuracy
  • Understand the trade-offs involved in different train/test data splits
  • Apply the bootstrap to estimate uncertainty in predictions and parameters
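The bootstrap objective above can also be sketched in a few lines: resample the data with replacement many times, recompute the statistic on each resample, and use the spread of those replicates as an uncertainty estimate. This is a stdlib-only illustration (the name `bootstrap_se` and the choice of the sample mean as the statistic are my own, not from the course materials).

```python
import random
import statistics

def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Estimate the standard error of stat(data) by resampling
    the data with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot)]
    return statistics.stdev(replicates)

random.seed(1)
sample = [random.gauss(10, 2) for _ in range(50)]
boot_se = bootstrap_se(sample, statistics.mean)
# For the mean there is a closed-form check: s / sqrt(n)
analytic_se = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {boot_se:.3f}, analytic SE: {analytic_se:.3f}")
```

For the sample mean the two numbers should roughly agree; the payoff of the bootstrap is that the same recipe works for statistics and model parameters that have no closed-form standard error.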

Readings

  • ISLP (Introduction to Statistical Learning): Chapter 5

Extra Reading. If you want to be a cross-validation pro, this paper represents, in my opinion, the state of the art; it extends well beyond the ecology context in which it was published.

Why no HOML recommendation? There isn't a specific section on cross-validation in that book. But if you enjoy the book, I highly recommend searching it for "cross-validation" (include the hyphen) and reading the relevant snippets that appear throughout.

Course Meetup Video

Vignette Videos

The code for these videos can be found in the following Jupyter notebooks: cross-validation-bootstrap-vignette.ipynb or cv-nfl-live.ipynb

Videos

ISLP Coding Videos