DATA 622 Meetup 13: Neural Networks

George I. Hagstrom

2026-04-27

Week Summary

Focus this week is convolutional neural networks (CNNs)
Secondary focus:
- Heuristics for training Neural Networks
- Deep Neural Network Issues
Lab 7 Due this week

Optimization Basics

Simplified loss landscape:
Imagine more dimensions

Gradient Descent

Start at \(\mathbf{x}_0\)
Find gradient \(\nabla \mathrm{loss}(\mathbf{x})\)
\(\mathbf{x}_n = \mathbf{x}_{n-1} - \kappa \nabla\mathrm{loss}(\mathbf{x}_{n-1})\)
\(\kappa\) is learning rate
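A minimal sketch of the update rule above, in plain Python. The toy loss, the starting point, and the value of `kappa` are assumptions chosen for illustration, not part of the lecture.

```python
def gradient_descent(grad, x0, kappa, n_steps):
    """Iterate x_n = x_{n-1} - kappa * grad(x_{n-1}) and return every iterate."""
    x = list(x0)
    path = [x]
    for _ in range(n_steps):
        g = grad(x)
        x = [xi - kappa * gi for xi, gi in zip(x, g)]
        path.append(x)
    return path


# Hypothetical toy loss: loss(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)
def toy_grad(x):
    return [x[0], 10.0 * x[1]]


path = gradient_descent(toy_grad, x0=[1.0, 1.0], kappa=0.1, n_steps=50)
final = path[-1]
print(final)
```

Both coordinates shrink toward the minimum at the origin, but at different speeds because the two directions have different curvature.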

What Learning Rate to Pick?

A higher learning rate trains faster
Too high and training does not converge at all
The limit is set by the algorithm and by how elongated (elliptical) the well is

What Learning Rate to Pick?

A discrepancy in steepness between directions limits the learning rate
Most of each step points in the wrong direction
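The effect of a steepness discrepancy can be sketched on a quadratic well whose two directions have very different curvature. The curvatures (1 and 100) and the two learning rates are assumed values. For plain gradient descent on a quadratic, a direction with curvature c is stable only when kappa < 2 / c.

```python
def run(kappa, curvatures=(1.0, 100.0), n_steps=100):
    """Gradient descent on loss(x) = 0.5 * sum(c_j * x_j**2), starting from all ones."""
    x = [1.0 for _ in curvatures]
    for _ in range(n_steps):
        x = [xi - kappa * c * xi for xi, c in zip(x, curvatures)]
    return x


safe = run(kappa=0.019)    # just below 2 / 100: the steep direction still converges
unsafe = run(kappa=0.021)  # just above 2 / 100: the steep direction blows up

print("safe:  ", safe)
print("unsafe:", unsafe)
```

The steep direction caps the learning rate, so the shallow direction is still far from the minimum after 100 steps even in the stable run.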

Momentum

Advanced algorithms use momentum, which gives a “memory” of past gradients
Reduces effect of valley oscillation
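One common form of momentum (the "heavy ball" update) can be sketched as follows. The slides do not specify a particular variant, so the update rule, the coefficient `beta`, and the test problem are assumptions for illustration.

```python
def momentum_descent(grad, x0, kappa, beta, n_steps):
    """Heavy-ball update: v accumulates past gradients, x moves along v."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(n_steps):
        g = grad(x)
        v = [beta * vi + gi for vi, gi in zip(v, g)]   # memory of past gradients
        x = [xi - kappa * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical narrow valley: curvature 1 in one direction, 100 in the other
def valley_grad(x):
    return [1.0 * x[0], 100.0 * x[1]]


plain = momentum_descent(valley_grad, [1.0, 1.0], kappa=0.019, beta=0.0, n_steps=100)
heavy = momentum_descent(valley_grad, [1.0, 1.0], kappa=0.019, beta=0.9, n_steps=100)

print("no momentum:  ", plain)
print("with momentum:", heavy)
```

Setting `beta=0.0` recovers plain gradient descent, which makes the comparison direct: with the same learning rate, the momentum run gets much closer to the minimum along the shallow direction.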

Hyperparameter Problems

We don’t know the shape of the loss landscape to pick \(\kappa\)
We handle this several ways:
- Experimentation
- Heuristics
- Adaptive Methods

Stochastic Gradient Descent

Common for optimization algorithms to “get stuck”

Stochastic Gradient Descent

In low dimensions, local minima are problematic

Stochastic Gradient Descent

Trajectories get Stuck

Stochastic Gradient Descent

The barriers in high dimensions are likely more complex
Stochastic Gradient Descent modifies gradient-descent-style methods
- Compute the gradient of the loss using only part of the data
- That part is called a “mini-batch”
- The smaller the mini-batch, the noisier the gradient estimate
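The mini-batch idea can be sketched on a one-parameter regression problem. The synthetic data, the batch sizes, and the learning rate are all assumptions for illustration. The sketch measures how much the gradient estimate varies from batch to batch, then runs SGD.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical data set: y = 3 * x plus a little noise
xs = [rng.uniform(-1.0, 1.0) for _ in range(512)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]


def batch_gradient(w, batch):
    """Gradient of 0.5 * mean((w * x - y)**2) with respect to w on one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def gradient_spread(w, batch_size, n_batches=200):
    """Standard deviation of the gradient estimate across random mini-batches."""
    grads = [batch_gradient(w, rng.sample(data, batch_size)) for _ in range(n_batches)]
    return statistics.pstdev(grads)


def sgd(w, kappa, batch_size, n_steps):
    """Gradient descent where every step uses a fresh random mini-batch."""
    for _ in range(n_steps):
        w -= kappa * batch_gradient(w, rng.sample(data, batch_size))
    return w


spread_small = gradient_spread(0.0, batch_size=4)
spread_large = gradient_spread(0.0, batch_size=64)
w_fit = sgd(0.0, kappa=0.5, batch_size=16, n_steps=300)

print(spread_small, spread_large, w_fit)
```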

Stochastic Gradient Descent

In 2D, we can just add noise

Annealing

Noise can deflect you from the minimum when you are close
Common practice is to start with noise high, reduce it later
- Simulated annealing (from metallurgy)
Even more common to start with high learning rate and gradually decrease it
- Increasing the batch size also reduces noise, but in some ML applications it can lead to overfitting
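A small sketch of learning rate decay on a noisy one-dimensional problem. The exponential schedule, its constants, and the noise model are assumptions for illustration. The comparison is between a constant learning rate and one that starts at the same value and shrinks over time.

```python
import random


def noisy_descent(kappa_schedule, n_steps=200, x0=5.0, noise=1.0, seed=0):
    """Minimize 0.5 * x**2 when every gradient evaluation is corrupted by noise."""
    rng = random.Random(seed)
    x = x0
    for step in range(n_steps):
        noisy_grad = x + rng.gauss(0.0, noise)
        x -= kappa_schedule(step) * noisy_grad
    return x


def constant(step):
    return 0.5


def decayed(step):
    # Hypothetical schedule: start at 0.5, shrink by 3% per step
    return 0.5 * 0.97 ** step


def mean_squared_error(schedule, n_runs=200):
    """Average squared distance from the minimum (x = 0) over independent runs."""
    return sum(noisy_descent(schedule, seed=s) ** 2 for s in range(n_runs)) / n_runs


mse_constant = mean_squared_error(constant)
mse_decayed = mean_squared_error(decayed)
print(mse_constant, mse_decayed)
```

With a constant rate the iterate keeps getting knocked away from the minimum, while the decayed schedule settles much closer to it.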

Adaptive Methods

We saw that hyper-parameters are hard to pick at the outset
Ideal learning rate depends on condition number and also magnitude of gradients
Concept: Change Learning Rate as we go
- Keep memory of gradient in each variable
- Step in the direction of the average gradient
- Decrease learning rate if variance is high
- Increase learning rate if variance is low

ADAM

Adaptive Moment Estimation
Keep a running estimate of the average gradient \(\hat{\mathbf{v}}_i\)
Keep a running estimate of the squared gradient \(\hat{G}_{s,i}\)
Adjust the step using both: \[ \mathbf{x}_{i+1} = \mathbf{x}_i -\kappa \frac{\hat{\mathbf{v}}_i}{\sqrt{\hat{G}_{s,i}+\epsilon}} \]
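A sketch of the update above in plain Python. The decay rates `beta1` and `beta2`, the bias correction, and the test problem are assumptions based on the commonly used formulation. The small constant `eps` is placed inside the square root to match the formula on this slide, while many implementations add it outside.

```python
import math


def adam(grad, x0, kappa=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    """Per-coordinate adaptive step: x <- x - kappa * v_hat / sqrt(G_hat + eps)."""
    x = list(x0)
    v = [0.0] * len(x)  # running average of the gradient
    G = [0.0] * len(x)  # running average of the squared gradient
    for i in range(1, n_steps + 1):
        g = grad(x)
        v = [beta1 * vj + (1 - beta1) * gj for vj, gj in zip(v, g)]
        G = [beta2 * Gj + (1 - beta2) * gj * gj for Gj, gj in zip(G, g)]
        # Bias correction: both averages start at zero and are too small early on
        v_hat = [vj / (1 - beta1 ** i) for vj in v]
        G_hat = [Gj / (1 - beta2 ** i) for Gj in G]
        x = [xj - kappa * vh / math.sqrt(Gh + eps)
             for xj, vh, Gh in zip(x, v_hat, G_hat)]
    return x


# Hypothetical badly scaled problem: gradients differ by a factor of 100
def scaled_grad(x):
    return [1.0 * x[0], 100.0 * x[1]]


x_final = adam(scaled_grad, [1.0, 1.0])
print(x_final)
```

Because each coordinate is divided by its own gradient scale, the two directions make progress at nearly the same pace despite the factor of 100 between them.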

When and Why is ADAM good?

ADAM is much more robust to learning rate choices
ADAM is excellent when the gradient is sparse
ADAM is often the best in initial training stages

When is ADAM bad?

ADAM is perhaps the most widely used optimizer, the default for deep learning
However, it is not guaranteed to converge, even on easy problems
It can generalize worse (needs more regularization)
It is less memory efficient than plain SGD: it stores two running estimates per parameter
It is very heuristic in nature and can be improved

Inductive Bias

Machine Learning works best when structure of hypothesis set matches structure of target function:
- Linear regression and linear relationships
- Naive Bayes and feature independence
- Decision Trees bias towards specific features
- Nearest Neighbors: Features contribute equally

Network Architecture and Inductive Bias

Fully Connected Networks don’t have much inductive bias

Image Recognition Bias and Symmetry

Neural Network for image recognition
This is a bicycle

Image Recognition Bias and Symmetry

Neural Network for image recognition
If we shift it, it is still a bicycle

Image Recognition Bias and Symmetry

Neural Network for image recognition
We can even rotate it

Accuracy of Fully Connected Networks

Fully Connected Networks are not suited to image recognition

Small, well-designed neural networks exceed 90% accuracy on CIFAR-10
The best networks exceed 99%

Convolution

1960s Q: How do we detect edges in images?
1960s A: Create an “archetype” image of an edge, and calculate the correlation of it with each part of the original image

Applying the Convolution

The convolution ‘kernel’ is like a pattern that is applied all over the target image
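A sketch of sliding a kernel over an image, in plain Python. The tiny image and the 2x2 vertical-edge kernel are made-up examples. As in most deep learning libraries, the operation shown is the correlation described above (the kernel is not flipped).

```python
def correlate2d(image, kernel):
    """Slide the kernel over every position where it fits and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4x6 image: dark on the left, bright on the right
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# "Archetype" of a vertical edge: dark column next to a bright column
kernel = [[-1, 1],
          [-1, 1]]

response = correlate2d(image, kernel)
for row in response:
    print(row)
```

The response is zero in the flat regions and peaks in the column where the image changes from dark to bright.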

Convolution in Practice

Convolutional Neural Networks

CNN Architecture:
- Layers are small, learned filters
- Feature map per layer
- Apply ReLU to feature maps
- Stack Conv layers

Pooling

Convolutional layers dramatically increase the size of outputs
Max pooling layers reduce the spatial size
Look at each 2x2 or 3x3 window and keep the maximum value
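Max pooling over non-overlapping windows can be sketched as follows. The feature map values are made up, and using a stride equal to the window size is an assumption (it is the common default).

```python
def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, w - size + 1, size)
        ]
        for i in range(0, h - size + 1, size)
    ]


# Hypothetical 4x4 feature map
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]

pooled = max_pool2d(fmap)
print(pooled)
```

A 2x2 window halves each spatial dimension, so the 4x4 map becomes 2x2.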

Full Architecture

CNNs mix 1-3 Conv layers between max pooling layers
Have some fully connected layers at the end for classification
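To see how such a stack changes tensor shapes, here is a small shape tracer in plain Python. The block layout (four blocks of two 3x3 convolutions with 64, 128, 256, and 512 channels) and the 32x32 RGB input are assumptions for illustration. It assumes 'same' padding for convolutions and 2x2 max pooling after each block.

```python
def trace_cnn(input_size, in_channels, blocks):
    """Follow (channels, spatial size, parameter count) through conv blocks.

    blocks is a list of (n_conv_layers, out_channels); every block ends
    with a 2x2 max pooling layer that halves the spatial size.
    """
    size, channels, n_params = input_size, in_channels, 0
    for n_conv, out_channels in blocks:
        for _ in range(n_conv):
            # 3x3 filters: one weight per (input channel, output channel, pixel) plus biases
            n_params += channels * out_channels * 3 * 3 + out_channels
            channels = out_channels  # 'same' padding keeps the spatial size
        size //= 2                   # 2x2 max pooling
    return channels, size, n_params


# Hypothetical layout
blocks = [(2, 64), (2, 128), (2, 256), (2, 512)]
channels, size, n_params = trace_cnn(input_size=32, in_channels=3, blocks=blocks)
flat_features = channels * size * size

print(channels, size, flat_features, n_params)
```

Under these assumptions the convolutional part ends with 512 feature maps of size 2x2, so the first fully connected layer receives 2 * 2 * 512 inputs.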

CNNs Outperform Fully Connected for Images

Fully Connected:

More Inductive Biases

Translation and Rotation
- Convolution effectively reuses weights to handle translations
Slight change in colors, lightness/darkness
Image being partially obscured

Data Augmentation

Instead of modifying models, we modify training set with transformations!

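from torchvision import transforms

# Random, label-preserving transformations applied to each training image
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(5),
    transforms.ToTensor(),
    # Per-channel mean and standard deviation used for normalization
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
    # RandomErasing operates on tensors, so it comes after ToTensor
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2))
])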

Data augmentation is one of the most powerful “tricks” out there

Regularization: Dropout

To prevent memorization, randomly zero out a fraction of neuron activations during training:

    self.output = nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(2*2*512, 512),
        nn.ReLU(),
        nn.Linear(512, 100)
    )
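What a dropout layer does to a vector of activations can be sketched in plain Python. This is the "inverted dropout" convention, where kept values are rescaled by 1 / (1 - p) during training so that nothing needs to change at evaluation time. The helper name and the example values are hypothetical.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale the survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)  # dropout is switched off for evaluation
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
ones = [1.0] * 10000

train_out = dropout(ones, p=0.5, rng=rng, training=True)
eval_out = dropout(ones, p=0.5, rng=rng, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(sum(eval_out) / len(eval_out))
```

The rescaling keeps the expected value of each activation the same in training and evaluation.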

Regularization: Early Stopping

Overfitting: Training loss keeps decreasing while validation loss stays flat or rises
Solution: Stop training if validation loss has stopped improving

early_stopper = EarlyStopping(
    monitor='valid_loss',
    patience=20,
    mode='min'
)

cifar_trainer = Trainer(
    max_epochs=300,
    callbacks=[early_stopper, ...]
)
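The patience rule behind a callback like this can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return the epoch at which training stops under a simple patience rule."""
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch  # no improvement for `patience` epochs in a row
    return len(valid_losses) - 1  # ran out of epochs without triggering


# Hypothetical validation losses, one per epoch
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.5]

stop_short = early_stopping_epoch(losses, patience=3)
stop_long = early_stopping_epoch(losses, patience=4)
print(stop_short, stop_long)
```

With a patience of 3 the run stops before the late improvement at the last epoch, which is why patience is usually set generously.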

Deep Learning

Over time, networks have become deeper
Empirically, deeper networks reach higher accuracy

Goodfellow et al

ImageNet Challenge

Depth Is Different from More Neurons

Scaling up neurons often doesn’t help in shallow nets

Goodfellow

Vanishing Gradients

Landscape of deep network varies by layer:

Nielsen

Why Vanishing Gradients?

Consider a deep network with one neuron per layer:
If the activation function is \(\phi\), we can write it as a composition: \[ f(x_1) = \bigl(\phi \circ \ell_n \circ \phi \circ \ell_{n-1} \circ \cdots \circ \phi \circ \ell_1\bigr)(x_1), \qquad \ell_k(x) = w_k x + b_k \]

Taking Derivatives

The closer to the output, the fewer “terms” in the derivative:
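The chain rule for the one-neuron-per-layer network can be sketched numerically. The sigmoid activation, the depth of 10, and the weights and biases are assumptions for illustration, since the slides do not fix them. The derivative with respect to an early weight is a product of one factor per later layer, and each sigmoid factor is at most 0.25.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weight_gradients(x, weights, biases):
    """d(output)/d(w_k) for a chain of one-neuron sigmoid layers."""
    # Forward pass: remember each layer's input and pre-activation
    a = x
    inputs, zs = [], []
    for w, b in zip(weights, biases):
        inputs.append(a)
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
    # Backward pass: one extra factor phi'(z) * w per layer we pass through
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(activation of the current layer)
    for k in reversed(range(len(weights))):
        s = sigmoid(zs[k])
        d_z = upstream * s * (1.0 - s)  # d(output)/d(z_k)
        grads[k] = d_z * inputs[k]
        upstream = d_z * weights[k]
    return grads


# Hypothetical 10-layer chain with all weights 1 and all biases 0
grads = weight_gradients(0.5, weights=[1.0] * 10, biases=[0.0] * 10)
for k, g in enumerate(grads, start=1):
    print(f"layer {k:2d}: {g:.3e}")
```

The gradient for the first layer is several orders of magnitude smaller than the gradient for the last layer.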

Solution: Activations

Rectified Linear Unit (ReLU)

\[ \phi(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ x & \text{if } x > 0 \end{cases} \]
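A short sketch of why this helps. The derivative of ReLU is exactly 1 for positive inputs, so a product of many such factors does not shrink, while a sigmoid derivative is at most 0.25. The depth and the pre-activation value below are assumed for illustration.

```python
import math


def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


depth = 10
z = 0.5  # hypothetical positive pre-activation, the same at every layer

relu_factor = relu_grad(z) ** depth
sigmoid_factor = sigmoid_grad(z) ** depth
print(relu_factor, sigmoid_factor)
```

The trade-off is that a neuron whose input is negative passes no gradient at all.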

Spiky Loss Landscape

Ideally, as you move in the direction of the gradient the loss smoothly decreases

Batch Norm

Idea: Rescale the outputs of the previous layer so that roughly 50% of neurons are active in the subsequent layer
For each batch, at each forward step, rescale the activations so that the mean is 0 and the variance is 1
Learn an additional scale and shift parameter per channel
For evaluation, use running averages of the activation statistics at each layer for normalization
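The steps above can be sketched for a single feature in plain Python. The class name, the momentum value, and the batch are hypothetical, and details such as which variance estimate feeds the running average differ between libraries, so treat this as a simplified illustration.

```python
import math


class BatchNorm1Feature:
    """Batch normalization for one scalar feature (simplified sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0  # learned scale and shift
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def forward(self, batch, training=True):
        if training:
            # Normalize with the statistics of the current batch
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At evaluation time, use the running statistics instead
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in batch]


bn = BatchNorm1Feature()
out = bn.forward([2.0, 4.0, 6.0, 8.0], training=True)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var, bn.running_mean)
```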

Batch Norm is a Type of Layer

    self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding='same')
    self.bn1 = nn.BatchNorm2d(out_channels)
    self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding='same')
    self.bn2 = nn.BatchNorm2d(out_channels)
    self.conv3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding='same')
    self.bn3 = nn.BatchNorm2d(out_channels)

Batch Norm Enables Training Deep Networks

Batch Norm makes the loss landscape much smoother

Thanks