DATA 622 Meetup 14: RNNs and Text

George I. Hagstrom

2026-05-04

Week Summary

CIFAR100 problem: Lab 7 Due 5/10
- Lab 8 shortened, available soon
Schedule Final Presentations Next Week
RNNs and Text this week
Pretrained Models next week

nyhackr Thursday May 14th!

Christian Martinez
R package for interacting with NYCOpenData

End of Semester Gathering?

Post in the Slack if you are interested!

Sequential Data

Data has a one-dimensional order:
- Sound/Music
- Text
- DNA/Proteins/RNA
- Position of a moving object
- User Behavior

ISLP

Recurrent Neural Networks

Take advantage of the inductive bias:
- What comes next depends on what came before

Apply the network to the data sequentially

Apply it to the old activations and the next input

Repeat

Use the final output for your prediction

RNN Architecture

Network has \(K\) hidden units
Activations \(A_{lk}\) at sequence point \(l\)
Need weights for the inputs \(X_{lj}\):

\[ A_{lk} = g\left(\sum_{j=1}^p w_{jk} X_{lj} + \cdots \right) \]

RNN Architecture

Network has \(K\) hidden units
Activations \(A_{lk}\) at sequence point \(l\)
Need weights for the old activations \(A_{(l-1),s}\):

\[ A_{lk} = g\left(\sum_{j=1}^p w_{jk} X_{lj} + \sum_{s=1}^K u_{sk} A_{(l-1),s} + \cdots \right) \]

RNN Architecture

Network has \(K\) hidden units
Activations \(A_{lk}\) at sequence point \(l\)
Need a bias:

\[ A_{lk} = g\left(w_{0k} +\sum_{j=1}^p w_{jk} X_{lj} + \sum_{s=1}^K u_{sk} A_{(l-1),s} \right) \]
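
The recurrence above can be sketched in plain Python. This is a minimal illustration, not the course's implementation: it assumes \(g = \tanh\), assumes the activations start at zero before the first step, and uses made-up toy weights. The function name `rnn_forward` is hypothetical.

```python
import math

def rnn_forward(X, W, U, w0, g=math.tanh):
    """Run a single-layer RNN over a sequence and return all activations.

    X  : list of L input vectors, each of length p
    W  : p x K input weights,     W[j][k] = w_jk
    U  : K x K recurrent weights, U[s][k] = u_sk
    w0 : length-K bias vector,    w0[k]   = w_0k
    """
    K = len(w0)
    A_prev = [0.0] * K            # assumption: activations start at zero
    history = []
    for x in X:                   # one step per sequence position l
        A = []
        for k in range(K):
            z = w0[k]
            z += sum(W[j][k] * x[j] for j in range(len(x)))
            z += sum(U[s][k] * A_prev[s] for s in range(K))
            A.append(g(z))
        history.append(A)
        A_prev = A                # the new activations feed the next step
    return history

# Toy sizes: p = 2 inputs, K = 3 hidden units, L = 4 steps (all values made up)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
W = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
U = [[0.2, 0.0, 0.1], [0.0, 0.3, 0.0], [-0.1, 0.0, 0.2]]
w0 = [0.0, 0.1, -0.1]
A = rnn_forward(X, W, U, w0)
print(len(A), len(A[0]))          # number of steps, number of hidden units
```

The same weights `W`, `U`, and `w0` are reused at every sequence position; only the inputs and the previous activations change.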

RNN Architecture

Need to Calculate Output \(O_l\):
Regression:

\[ O_l = \beta_0 + \sum_{k=1}^K \beta_k A_{lk} \]

RNN Architecture

Need to Calculate Output \(O_l\):
Classification:

\[ O_l = g\left(\beta_0 + \sum_{k=1}^K \beta_k A_{lk}\right) \]

RNN Architecture

Need to Calculate Output \(O_l\):
Can also calculate \(O_l\) using some additional layers
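
A small sketch of the two output formulas, applied to the activations at one sequence point. It assumes that \(g\) in the classification case is the sigmoid (binary classification); the slides do not name it. The activations and the \(\beta\) values are made up, and `rnn_output` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_output(A_l, beta, beta0, classify=False):
    """O_l = beta0 + sum_k beta_k * A_lk, squashed by a sigmoid when classifying."""
    z = beta0 + sum(b * a for b, a in zip(beta, A_l))
    return sigmoid(z) if classify else z

A_last = [0.5, -0.2, 0.1]     # made-up activations at the final step
beta = [1.0, 2.0, -1.0]       # made-up output weights
beta0 = 0.1

y_reg = rnn_output(A_last, beta, beta0)                  # regression output
p_pos = rnn_output(A_last, beta, beta0, classify=True)   # probability in (0, 1)
print(y_reg, p_pos)
```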

RNN Use Cases

Time Series (624…..)
- But check out Reservoir Computing
All sorts of device use cases
Audio processing
NLP (but RNNs have recently been superseded by Transformers)

RNN Example: IMDB Sentiment

imdb.com

IMDB contains millions of movie reviews
Can we predict sentiment from text?

How Can a Neural Network Read?

NLP always involves feature engineering:
- Tokenize into words or n-grams
- Use Sentiment Lexicon
- One-Hot Encode and use ML?
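
The tokenize-then-one-hot pipeline can be sketched as follows. The tokenizer is deliberately crude (lowercase, split on non-letters), the two reviews are invented, and the helper names are hypothetical. A real pipeline would also cap the vocabulary, for example at the 10,000 most frequent words.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(texts):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

reviews = ["A great movie", "Not a great movie"]    # made-up examples
vocab = build_vocab(reviews)
ids = [vocab[t] for t in tokenize(reviews[1])]      # review as a sequence of ids
encoded = [one_hot(i, len(vocab)) for i in ids]     # one vector per token
print(vocab)
print(ids)
```

Each review becomes a sequence of vectors whose length equals the vocabulary size, which is why one-hot inputs get large quickly.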

One-Hot and Meaning?

Consider words “angry” and “mad”
- Almost the same word
- Training treats them as independent
Word meaning is a type of “inductive bias” that we want

Solution: Low Dimensional Embedding

Goal: Find a mapping from the 10,000-dimensional word space to a lower-dimensional “meaning” space
Consider Word “King” in One-Hot \[ \mathrm{King}_{OH} = (0\, 0\, \cdots\, 0\, 1\, 0\, \cdots\, 0) \]
In Embedded Representation: \[ \mathrm{King}_{e} = (-1.1\, 0.3\, \cdots\, 2.2) \]
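
One way to see the mapping: multiplying a one-hot vector by a \(V \times m\) embedding matrix just selects one row of that matrix. The sketch below checks this with toy sizes. The matrix is random here (it would be learned in practice), and the index used for “King” is hypothetical.

```python
import random

random.seed(0)
V, m = 6, 3   # toy sizes; the slides use V = 10,000
# Embedding matrix E: one row of m weights per word (random here, learned in practice)
E = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(V)]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(index):
    """Multiply the one-hot row vector by E."""
    oh = one_hot(index, V)
    return [sum(oh[i] * E[i][d] for i in range(V)) for d in range(m)]

def embed_by_lookup(index):
    """Read off row `index` of E directly."""
    return list(E[index])

king = 4   # hypothetical index of "King" in the vocabulary
same = all(abs(a - b) < 1e-12
           for a, b in zip(embed_by_matmul(king), embed_by_lookup(king)))
print(same)
```

This is why embedding layers are implemented as table lookups on integer token ids rather than as matrix products on one-hot vectors.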

Similar Words Cluster Together

“Queen” and “Princess” are near each other

Features May Be Interpretable

One axis is Gender, other is Status
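
A toy illustration of clustering, using cosine similarity. The 2-D vectors below are invented for this sketch, with the first coordinate standing in for a gender axis and the second for a status axis. They are not taken from any trained embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 for the same direction, 0 for orthogonal, -1 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D embeddings: (gender axis, status axis). Values are invented.
emb = {
    "king":     (1.0, 1.0),
    "queen":    (-1.0, 1.0),
    "princess": (-1.0, 0.7),
    "man":      (1.0, -0.5),
}
sim_queen_princess = cosine(emb["queen"], emb["princess"])
sim_queen_man = cosine(emb["queen"], emb["man"])

# Contrast: any two distinct one-hot vectors are orthogonal
sim_one_hot = cosine((1, 0, 0), (0, 1, 0))
print(sim_queen_princess, sim_queen_man, sim_one_hot)
```

In one-hot space every pair of distinct words has similarity zero, so “angry” and “mad” look unrelated. An embedding can place them close together.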

Dimension Reduction

Each word is now a vector in \(m\)-dimensional space

Jay Alammar

How to Embed?

DIY Option:
- Pick a dimension \(m\)
- Learn \(10000 \times m\) weights

from torch import nn

class LSTMModel(nn.Module):
    def __init__(self, input_size):
        super(LSTMModel, self).__init__()
        # One learned 64-dimensional vector per vocabulary entry (m = 64)
        self.embedding = nn.Embedding(input_size, 64)

How to Embed?

DIY Option:
- Pick a dimension \(m\)
- Learn \(10000 \times m\) weights
- Weights are trained from scratch each time
- Embedding is unsophisticated, but optimized to your problem

How to Embed?

Pretrained Options
- Global Vectors for word representation (GloVe)
- Trained on Wikipedia text, up to 300 dimensions

How to Embed?

Pretrained Options
- GloVe
- Word2Vec
- BERT (though it is more complicated)

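class LSTMModel(nn.Module):
    def __init__(self, input_size):
        super(LSTMModel, self).__init__()
        # glove_embeddings: float tensor of shape (vocab size, embedding dim),
        # assumed to be loaded beforehand with rows in vocabulary order
        self.embedding = nn.Embedding.from_pretrained(
            glove_embeddings,
            freeze=True  # keep the pretrained vectors fixed during training
            )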

Fine Tuning

Pre-trained embeddings have been trained on huge corpora
Similar pre-trained networks exist in other domains, e.g. ResNet for images
Setting freeze=False lets training update (fine-tune) those weights
Fine-tuning lets you adapt part or all of a sophisticated model to your particular task

Vanishing Gradients in RNNs

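RNNs are prone to vanishing gradients
ReLU is not a safe fix: its unbounded outputs can blow up when the same weights are reapplied at every step
Longer sequences compound the problem
RNNs lose key information from early in the sequence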
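
A scalar sketch of why the gradient shrinks with sequence length. For the one-unit RNN \(a_l = \tanh(u\,a_{l-1} + w\,x_l)\), the chain rule gives \(\partial a_L / \partial a_0 = \prod_{l=1}^{L} u\,(1 - a_l^2)\). With \(|u| < 1\), every factor is smaller than 1 in magnitude, so the product decays geometrically. The weights and inputs below are made up.

```python
import math

def gradient_through_time(u, w, xs, a0=0.0):
    """Return d a_L / d a_0 for the scalar RNN a_l = tanh(u * a_{l-1} + w * x_l)."""
    a, grad = a0, 1.0
    for x in xs:
        a = math.tanh(u * a + w * x)
        grad *= u * (1.0 - a * a)   # chain rule factor: d a_l / d a_{l-1}
    return grad

u, w = 0.9, 0.5                     # made-up weights
grads = {}
for L in (5, 20, 50):
    grads[L] = abs(gradient_through_time(u, w, [1.0] * L))
    print(L, grads[L])
```

The printed magnitudes fall as \(L\) grows, which is the sense in which early inputs stop influencing the loss.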

Long Short Term Memory

LSTM was invented to handle this
RNN just has activations \(A\)
- Renamed \(h\) for LSTM
LSTM has a “memory” called \(C\) (Cell State)

Cell State Updates

Each new token, update \(C_l\)
Forget part of \(C_{l-1}\)
Add new info from \(X_{l}\) and \(h_{l-1}\)

\[ C_l = f_l C_{l-1} + i_l \tilde{C}_l \]

Forgetting

Forget gate \(f_l\) is a number between 0 and 1
Depends on \(h_{l-1}\) and \(X_l\)

Christopher Olah

Input

Input gate \(i_l\) is a number between 0 and 1
Candidate state \(\tilde{C}_l\) is the new info to add

Christopher Olah

Cell State Update Diagram

Ready for the update

Christopher Olah

Update Hidden State

The hidden state is calculated from the cell state

Christopher Olah
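
The forget, input, cell-update, and hidden-state steps can be put together in a scalar sketch of one LSTM step. It follows the standard LSTM formulation, including an output gate that the slides do not name explicitly. All weights are made up, and the `lstm_step` helper and its parameter layout are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state, and cell state.

    `p` maps each gate name to (weight on x, weight on h_prev, bias).
    """
    def pre(name):
        wx, wh, b = p[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))         # how much of C_{l-1} to keep, in (0, 1)
    i = sigmoid(pre("input"))          # how much new info to admit, in (0, 1)
    c_tilde = math.tanh(pre("cand"))   # candidate state, in (-1, 1)
    c = f * c_prev + i * c_tilde       # C_l = f_l C_{l-1} + i_l C~_l
    o = sigmoid(pre("output"))         # output gate (standard LSTM)
    h = o * math.tanh(c)               # hidden state computed from the cell state
    return h, c

# Made-up weights, identical for every gate
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "cand", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated additively, a forget gate near 1 and an input gate near 0 carry \(C_{l-1}\) forward almost unchanged, which is how the LSTM preserves information over long spans.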

PyTorch Implementation

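import torch
from torch import nn

class LSTMModel(nn.Module):
    def __init__(self, input_size):
        super(LSTMModel, self).__init__()
        # input_size is the vocabulary size; each token id maps to a 64-d vector
        self.embedding = nn.Embedding(input_size, 64)
        self.lstm = nn.LSTM(input_size=64,
                            hidden_size=64,
                            batch_first=True)
        self.dense = nn.Linear(64, 1)

    def forward(self, x):
        # x: (batch, seq_len) tensor of integer token ids
        val, (h_n, c_n) = self.lstm(self.embedding(x))
        # val[:, -1] is the hidden state at the last sequence position;
        # the result is one raw score per sequence
        return torch.flatten(self.dense(val[:, -1]))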

LSTM Architecture Tips

Extra stacked layers become harmful fast
- LSTM doesn’t totally eliminate vanishing gradients
Width depends on your training data
Regularize: dropout, weight decay, layer norm, early stopping, data augmentation
- Weight freezing for the embedding

Optuna

How do we systematically search to find the best hyperparameters when there are many?
Grid Search is inefficient when there are many parameters
Better: Sample parameters randomly, bias to where you have been getting better results
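
The “sample randomly, bias toward what has worked” idea can be sketched with the standard library alone. This is a crude stand-in and not Optuna's algorithm (Optuna's default sampler is the Tree-structured Parzen Estimator). The objective `toy_score` is an invented function with a known peak, used here in place of a real training run, and all helper names are hypothetical.

```python
import math
import random

def toy_score(lr, dropout):
    """Stand-in for validation accuracy; peaks at lr = 1e-3, dropout = 0.2 (invented)."""
    return 1.0 - (math.log10(lr) + 3.0) ** 2 - (dropout - 0.2) ** 2

def sample_uniform(rng):
    """Draw lr log-uniformly from [1e-4, 1e-2] and dropout uniformly from [0, 0.5]."""
    return {"lr": 10 ** rng.uniform(-4, -2), "dropout": rng.uniform(0.0, 0.5)}

def sample_near(best, rng):
    """Perturb the best trial so far, clipped to the search ranges."""
    log_lr = min(-2.0, max(-4.0, math.log10(best["lr"]) + rng.gauss(0.0, 0.3)))
    dropout = min(0.5, max(0.0, best["dropout"] + rng.gauss(0.0, 0.05)))
    return {"lr": 10 ** log_lr, "dropout": dropout}

def biased_random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best, best_score = None, -math.inf
    for t in range(n_trials):
        # First few trials explore; afterwards about half exploit the best region
        if best is None or t < 5 or rng.random() < 0.5:
            params = sample_uniform(rng)
        else:
            params = sample_near(best, rng)
        score = toy_score(**params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

best, best_score = biased_random_search(n_trials=40)
print(best, best_score)
```

Unlike a grid, the number of trials is chosen directly and does not grow exponentially with the number of hyperparameters.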

Optuna Code

Choose your hyperparameters and their ranges:

def lstm_objective(trial):
    # Hyperparameters to optimize over
    embedding_dim = trial.suggest_categorical("embedding_dim", [16, 32, 64, 128])
    hidden_size = trial.suggest_categorical("hidden_size", [32, 64, 128])
    num_layers = trial.suggest_int("num_layers", 1, 3)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [128, 256, 512])

Optuna Code

Make your NN class depend on the values

    # Defined inside lstm_objective, so the sampled hyperparameters are in scope
    class TrialLSTMModel(nn.Module):
        def __init__(self, input_size):
            super(TrialLSTMModel, self).__init__()
            self.embedding = nn.Embedding(input_size, embedding_dim)
            self.drop1 = nn.Dropout(dropout)
            # nn.LSTM applies its dropout only between stacked layers
            self.lstm = nn.LSTM(input_size=embedding_dim,
                                hidden_size=hidden_size,
                                num_layers=num_layers,
                                dropout=dropout if num_layers > 1 else 0,
                                batch_first=True)
            self.drop2 = nn.Dropout(dropout)
            self.dense = nn.Linear(hidden_size, 1)

Optuna Code

Run the trial!

study = optuna.create_study(direction="maximize")
study.optimize(lstm_objective, n_trials=20)

You will need a bit more boilerplate, but it's not bad
There are a lot of sophisticated features

Get a small boost

Before tuning 86%
After tuning 88%

Thanks