
2026-04-27





ADAM is much more robust to learning rate choices
ADAM is excellent when the gradient is sparse
ADAM is often the best in initial training stages




Fully Connected:

Fully Connected:

Fully Connected:

Fully Connected:

transform_train = transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.RandomCrop(32, padding=4),
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
transforms.RandomRotation(5),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2470, 0.2435, 0.2616)),
transforms.RandomErasing(p=0.5, scale=(0.02, 0.2))
])Goodfellow et al
ImageNet Challenge
Goodfellow
Nielson

\[ \phi(x) = \cases{0& \text{if} \quad x\leq 0 \\ x& \text{if} \quad x>0} \]
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding='same')
self.bn1 = nn.BatchNorm2d(out_channels)
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding='same')
self.bn2 = nn.BatchNorm2d(out_channels)
self.conv3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding='same')
self.bn3 = nn.BatchNorm2d(out_channels)
DATA 622