DATA 622 Meetup 15: Pretrained Models

George I. Hagstrom

2026-05-11

Week Summary

Optional Lab 8
Presentations!
Pretrained Models: Different Reading
Coding Vignette Posted (without video)

nyhackr Thursday May 14th!

Lab Comments

No ridge regression without using the standard scaler!!!
Pick CausalML Hyperparameters to Make things Smooth

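from econml.dml import CausalForestDML
from sklearn.linear_model import LassoCV, LogisticRegressionCV

est = CausalForestDML(
    model_y=LassoCV(cv=3),
    model_t=LogisticRegressionCV(cv=3),
    n_estimators=8000,
    min_samples_leaf=50,    # large leaves average over more samples
    #max_depth=5,
    max_samples=0.02,       # each tree sees only a small subsample
    discrete_treatment=True,
    random_state=123
)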
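The scaling warning can be made concrete with a small synthetic example. This is a minimal sketch, not the lab's data: the two features, their scales, and `alpha=200` are assumptions chosen for illustration. It fits `Ridge` with and without `StandardScaler` and compares each feature's effect per standard deviation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): two equally important features
# measured on very different scales.
rng = np.random.default_rng(0)
n = 200
x_small = rng.normal(size=n)            # unit scale
x_large = 1000.0 * rng.normal(size=n)   # same importance, much larger units
X = np.column_stack([x_small, x_large])
y = 2.0 * x_small + 0.002 * x_large + rng.normal(scale=0.1, size=n)

# Ridge on raw features: the penalty hits the small-scale feature far harder.
raw = Ridge(alpha=200.0).fit(X, y)

# Ridge after standardizing: both features are penalized on equal footing.
scaled = make_pipeline(StandardScaler(), Ridge(alpha=200.0)).fit(X, y)

# Express both fits as "effect per one standard deviation of the feature".
raw_effect = raw.coef_ * X.std(axis=0)
scaled_effect = scaled.named_steps["ridge"].coef_

print("raw   :", np.round(raw_effect, 2))
print("scaled:", np.round(scaled_effect, 2))
```

In this setup the raw fit shrinks the small-scale feature much more than the large-scale one, even though both matter equally. After scaling, the shrinkage is comparable across features.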

Lab Comments

No ridge regression without using the standard scaler!!!
Aggressive Choices Lead to Aggressive Fits

est = CausalForestDML(
    model_y=LassoCV(cv=3),
    model_t=LogisticRegressionCV(cv=3),
    n_estimators=8000,
    min_samples_leaf=10,
    max_depth=40,
    max_samples=0.45,
    discrete_treatment=True,
    random_state=123
)

Contextual Meaning

Consider an LSTM Processing this Text

Contextual Meaning

Pronouns embedded as vectors, but mean specific things

Contextual Meaning

LSTM incorporates their meaning via long-range states

Contextual Meaning

Later, same pronouns may occur

Contextual Meaning

The LSTM’s state is too compressed to retain the earlier context

Context

Distant words hold key context

Attention

Attention is a mechanism to “update” the meaning of a word based on the surrounding context

Attention

Consider simple sentence

Attention

Each token has an embedding

Attention

Nearby tokens give context, change embedding in next layer

Attention

Accumulate changes to embedding in next layer
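A toy numerical version of this update may help. It is a sketch under simplifying assumptions: the three-token sentence and its 4-dimensional embeddings are made up, and there are no learned projections. Each token's next-layer embedding is its old embedding plus a similarity-weighted mix of all embeddings in the sentence.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax that turns similarity scores into weights summing to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up 4-dimensional embeddings for a three-token sentence (illustrative only).
tokens = ["the", "river", "bank"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), 4))

# Similarity of every token with every other token (scaled dot products).
scores = embeddings @ embeddings.T / np.sqrt(embeddings.shape[1])
weights = softmax(scores)            # weights[i, j]: how much token i looks at token j

# Each token accumulates a context-dependent change on top of its old embedding.
updates = weights @ embeddings
next_layer = embeddings + updates    # residual update passed on to the next layer
```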

Transformer Architecture

A transformer is an architecture containing layers of transformer blocks.

Transformer Layer

A transformer layer has two parts:
- Attention mechanism
- MLP layer between successive attention mechanisms

Allocating Attention

Each attention head within a layer has several projections:
- ‘q_proj’ and ‘k_proj’ determine which other tokens are relevant
- ‘v_proj’ determines what information gets added if relevant
- ‘o_proj’ combines the added info for each token
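The roles of the four projections can be sketched for a single head in NumPy. This is an illustration, not a real model: the sizes and random weights are hypothetical, and layer normalization, masking, and multiple heads are left out.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax so each token's attention weights sum to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(x, w_q, w_k, w_v, w_o):
    """Single-head self-attention over token embeddings x of shape (tokens, d_model)."""
    q = x @ w_q                                        # q_proj: what each token is looking for
    k = x @ w_k                                        # k_proj: what each token offers
    v = x @ w_v                                        # v_proj: information to add if relevant
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # relevance of every token to every other
    mixed = weights @ v                                # gather relevant information per token
    return x + mixed @ w_o, weights                    # o_proj maps back to d_model; residual add

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4                   # hypothetical sizes
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
w_o = rng.normal(size=(d_head, d_model)) * 0.1

out, weights = attention_head(x, w_q, w_k, w_v, w_o)
```

Row `i` of `weights` shows where token `i` allocates its attention, and `out` has the same shape as `x`, so layers can be stacked.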

Encoder Models: BERT

22 Transformer Layers
150M parameters
Head is trained to predict masked words
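How masked-word training examples are built can be sketched in plain Python. The sentence, the masking rate, and the `mask_tokens` helper are assumptions for illustration. Real pipelines work on token ids and mark unscored positions with -100 rather than `None`.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original word only at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)      # the head must recover this word
        else:
            masked.append(token)
            labels.append(None)       # no loss is computed at unmasked positions
    return masked, labels

sentence = "the model learns to fill in missing words from context".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```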

Encoder Models

Excel at all non-generative tasks
- Classification
- Named Entity Recognition
- Similarity
Fast
Cannot Generate Text
Less Efficient Training

Decoder Models

ChatGPT, Claude, Gemini, Qwen, etc
Head is configured to predict next token probability
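A small numerical sketch of what "next token probability" means. The five-word vocabulary and the logits are made up for illustration. The head emits one score per vocabulary entry, a softmax turns the scores into probabilities, and generation picks or samples from them.

```python
import numpy as np

def next_token_probs(logits):
    """Softmax over the vocabulary: one probability per candidate next token."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical vocabulary and head output for the prompt "the cat sat on the".
vocab = ["mat", "dog", "moon", "sat", "the"]
logits = np.array([3.0, 1.0, 0.5, -1.0, -2.0])

probs = next_token_probs(logits)
greedy = vocab[int(np.argmax(probs))]               # always take the most likely token

rng = np.random.default_rng(0)
sampled = vocab[rng.choice(len(vocab), p=probs)]    # or sample in proportion to probability
```

Generating text repeats this step: append the chosen token to the prompt and run the model again.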

Decoder Models

Can sort of do anything
Easier to train
- Can get bigger
- People think they show more ‘emergence’

Fine Tuning Large Models

LLMs start life as “Base Models”
They just produce text:

Fine Tuning Large Models

To make LLMs useful, they must be specialized to a task
Train them on examples of what you are looking for!

{"messages": [
  {"role": "user", "content": "What are the main risks of investing in bonds?"},
  {"role": "assistant", "content": "The main risks include interest rate risk, credit risk, and inflation risk..."},
  {"role": "user", "content": "Can you explain interest rate risk more?"},
  {"role": "assistant", "content": "When interest rates rise, existing bond prices fall because..."},
  {"role": "user", "content": "So should I avoid bonds when rates are rising?"},
  {"role": "assistant", "content": "Not necessarily. Short-duration bonds are less affected..."}
]}
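Before training, each conversation is flattened into one text sequence with role markers. Hugging Face tokenizers do this with the model's own chat template via `apply_chat_template`. The hand-written ChatML-style markers and the `to_training_text` helper below are assumptions for illustration, not any specific model's exact template.

```python
import json

# One training example in the "messages" format shown above (shortened).
record = json.loads("""
{"messages": [
  {"role": "user", "content": "What are the main risks of investing in bonds?"},
  {"role": "assistant", "content": "The main risks include interest rate risk, credit risk, and inflation risk..."}
]}
""")

def to_training_text(messages):
    """Flatten a conversation into one string using a ChatML-style layout (illustrative)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>")
    return "\n".join(parts)

text = to_training_text(record["messages"])
print(text)
```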

Fine Tuning

Fine tuning works like regular training but:
- Smaller Dataset
- Fewer epochs
- Sometimes only modify part of the network

Fine Tuning BERT

The entire thing:

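from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer as HFTrainer,  # assumed alias; the slide does not show the import
)

# Load the pretrained BERT model with a fresh 3-class classification head
bert_model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=3)

# Set up the training; this is the analog of the PyTorch Lightning trainer.
# bert_training_args, the tokenized datasets, bert_tokenizer, and
# compute_metrics are defined elsewhere.
bert_trainer = HFTrainer(
    model=bert_model,
    args=bert_training_args,
    train_dataset=hf_train_tok,
    eval_dataset=hf_val_tok,
    data_collator=DataCollatorWithPadding(bert_tokenizer), # Pads per batch to save computation time
    compute_metrics=compute_metrics,
)

# Perform training
bert_trainer.train()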
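The `compute_metrics` argument passed above is not shown on the slide. A minimal sketch of what such a function might look like, assuming accuracy is the metric of interest (the vignette's version may differ): the trainer hands it the model's logits and the true labels, and it returns a dictionary of named scores.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) from the evaluation loop into a dictionary of metrics."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)    # predicted class per example
    return {"accuracy": float((predictions == labels).mean())}

# Quick check with made-up logits for four examples and three classes.
fake_logits = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 0.9],
                        [1.2, 0.4, 0.8]])
fake_labels = np.array([0, 1, 2, 1])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```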

Fine Tuning BERT

Loop through and freeze/unfreeze!

    # Freeze all but the last N transformer layers via a loop
    
    for layer in bert_model.model.layers[:-N]:
        for param in layer.parameters():
            param.requires_grad = False
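A quick way to confirm that freezing did what was intended is to count trainable parameters before training. The sketch below uses a toy stack of linear layers as a stand-in for the transformer layers. The sizes, `N = 2`, and the `count_parameters` helper are hypothetical.

```python
import torch.nn as nn

def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total

# Toy stand-in for a stack of transformer layers (hypothetical sizes).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
N = 2  # number of top layers left trainable

# Same loop as above: everything except the last N layers is frozen.
for layer in layers[:-N]:
    for param in layer.parameters():
        param.requires_grad = False

trainable, total = count_parameters(layers)
print(f"trainable: {trainable} / {total}")
```

The same `count_parameters` call should work on the real model, since it only relies on `parameters()` and `requires_grad`.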

Fine Tuning BERT

Typical to fine-tune layers near the top

Fine Tune Big Models?

BERT has 150M parameters
Qwen: 0.5B parameters
Others have much more, some approach 1T parameters
Will NOT fit in your GPU RAM
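A back-of-the-envelope calculation shows why. The sketch assumes full fine-tuning with Adam in 32-bit precision, about 16 bytes per parameter, and ignores activations. This is a rough rule of thumb, not a measurement, and `full_finetune_gib` is a hypothetical helper.

```python
def full_finetune_gib(n_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam state, ignoring activations."""
    return n_params * bytes_per_param / 2**30

# 16 bytes/parameter assumes 32-bit weights (4) + gradients (4) + two Adam moments (8).
models = {"BERT (150M)": 150e6, "Qwen (0.5B)": 0.5e9, "1T-parameter model": 1e12}
estimates = {name: full_finetune_gib(n) for name, n in models.items()}

for name, gib in estimates.items():
    print(f"{name}: about {gib:,.1f} GiB")
```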

LoRA

LoRA stands for Low Rank Adaptation
For the Linear Algebra Knowers:

\[ L = M_1 M_2^{\top} \]

\(M_1\) and \(M_2\) are \(N\times r\) matrices, with \(r \ll N\)
\(L\) is a full-size \(N\times N\) matrix built from only \(2Nr\) learnable parameters

LoRA

\[ Q_{FT} = Q + \alpha L \]
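The two formulas can be checked numerically. This sketch uses random matrices and hypothetical sizes (`N = 64`, `r = 4`, `alpha = 0.5`) to show that the product of two thin factors is a full-size update with rank `r` and far fewer learnable numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, alpha = 64, 4, 0.5          # hypothetical sizes and scale

Q = rng.normal(size=(N, N))       # frozen pretrained weight matrix
M1 = rng.normal(size=(N, r))      # learnable factor
M2 = rng.normal(size=(N, r))      # learnable factor

L = M1 @ M2.T                     # full N x N update built from 2*N*r numbers
Q_ft = Q + alpha * L              # fine-tuned weights; Q itself never changes

full_params = N * N
lora_params = M1.size + M2.size
print(f"full matrix: {full_params} parameters, LoRA factors: {lora_params}")
print("rank of L:", np.linalg.matrix_rank(L))
```

In practice one factor starts at zero so that training begins from the unmodified model, and the Hugging Face `peft` library scales the update by `lora_alpha / r` rather than by a bare \(\alpha\).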

LoRA

LoRA is the standard technique for fine-tuning large transformers
Key hyperparameters:
- Rank \(r\): higher \(r\) is more expressive but costlier
- Scale \(\alpha\): the “size” of the perturbation
- Components to adapt: ‘q_proj’, ‘v_proj’, ‘MLP’ layer
Apply LoRA to every layer

LoRA Implementation

Hugging Face has great tools for this

import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM

# Get the model (qwen_model_id and bnb_config are defined elsewhere)
qwen_model = AutoModelForCausalLM.from_pretrained(
    qwen_model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,  # force all non-quantized weights to float16
    device_map="auto"
)

# Set up LoRA
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM", # Tell PEFT this model generates text autoregressively
)

LoRA Implementation

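from trl import SFTTrainer

# Train / fine-tune (qwen_training_args and the formatted datasets are defined elsewhere)
qwen_trainer = SFTTrainer(
    model=qwen_model,
    args=qwen_training_args,
    train_dataset=hf_train_fmt,
    eval_dataset=hf_val_fmt,
    peft_config=lora_config,  # only the LoRA adapters are trained
)

qwen_trainer.train()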

LoRA Impact

Fine-Tuning leads to stunning performance improvements with relatively little effort
Original Model:

Fine Tuned Model:

Thanks!