
# Memorization Capabilities of Neural Networks: Understanding Scaling Laws in Autoregressive and Sequence-to-Sequence Learning

Investigating Scaling Laws in Transformers and MLPs on Random Sequence Memorization Tasks

By Niranjan Vijaya Krishnan, Christine Guo, Diya Hundiwala

📄 Read the full paper here


πŸ” Overview

This project explores memorization thresholds in neural networksβ€”how well models like MLPs and decoder-only Transformers can memorize random sequence mappings, under a fixed parameter budget. Our experiments probe how different architectural features (depth, width, attention heads) impact memorization on synthetic tasks.


## ✨ Key Contributions

- 🔍 **Comparative Analysis** of MLPs vs. Transformers on random bijective sequence-mapping tasks.
- 📈 **Scaling Law Insights**: MLPs benefit from width, while Transformers benefit from multi-head attention.
- 🧪 **Controlled Experiments**: all models are constrained to identical parameter budgets (10k–128k) for fair comparison.
- 📊 **Memorization Threshold**: defined as the maximum number of mappings a model can learn perfectly.

## 🧪 Methodology Summary

### Task

Models are trained to memorize random bijective mappings between fixed-length token sequences (length = 10, vocabulary size = 10). No generalization is expected: the goal is pure memorization.
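A minimal sketch of one way to build such a dataset, assuming a simple scheme (not necessarily the repo's exact procedure): sample distinct random sequences and pair them off so that distinct inputs map to distinct outputs.

```python
import random

SEQ_LEN, VOCAB = 10, 10  # settings from the task description above

def random_bijective_mapping(n_pairs, seed=0):
    """Map n_pairs distinct random input sequences to distinct random targets."""
    rng = random.Random(seed)
    seqs = set()
    # Sample until we have 2 * n_pairs distinct sequences (inputs + targets).
    while len(seqs) < 2 * n_pairs:
        seqs.add(tuple(rng.randrange(VOCAB) for _ in range(SEQ_LEN)))
    seqs = list(seqs)
    rng.shuffle(seqs)
    inputs, targets = seqs[:n_pairs], seqs[n_pairs:]
    return dict(zip(inputs, targets))  # distinct keys and values -> bijective

mapping = random_bijective_mapping(100)
```

Because the targets are random, a model can only succeed by memorizing each pair, which is exactly the regime the experiments probe.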

### Model Architectures

- **MLPs**: fully connected feedforward networks used for sequence-to-sequence learning.
- **Decoder-Only Transformers**: used for autoregressive token prediction.

Each model configuration is evaluated while varying:

- depth vs. width (MLPs)
- attention heads vs. feedforward size (Transformers)
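Holding the parameter budget fixed while sweeping depth means width must be solved for per configuration. A hypothetical helper illustrating this for the MLP case (the input/output encoding and exact layer sizing are assumptions; the repo's sizing may differ):

```python
SEQ_LEN, VOCAB = 10, 10
IN_DIM = OUT_DIM = SEQ_LEN * VOCAB  # flattened one-hot tokens (assumption)

def mlp_param_count(width, depth):
    """Weights + biases for an MLP with `depth` hidden layers of size `width`."""
    dims = [IN_DIM] + [width] * depth + [OUT_DIM]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

def width_for_budget(budget, depth):
    """Largest hidden width whose parameter count still fits the budget."""
    w = 1
    while mlp_param_count(w + 1, depth) <= budget:
        w += 1
    return w
```

For example, a 1-hidden-layer MLP on a 30k budget gets width `width_for_budget(30000, 1)`, while a 3-layer MLP on the same budget gets a much smaller width, so the comparison isolates depth vs. width rather than raw capacity.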

### Memorization Threshold Search

- We use binary search to find the largest dataset size at which the model still achieves 100% accuracy.
- Training stops when:
  - the model reaches 100% accuracy, or
  - accuracy plateaus for 3 epochs without reaching 100%
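The search above can be sketched as a standard find-the-last-success binary search, assuming a caller-supplied `trains_to_100(n)` that trains a fresh model on `n` mappings and reports whether it reached 100% accuracy (the real training loop lives in the repo, and the sketch assumes the success predicate is monotone in `n`):

```python
def memorization_threshold(trains_to_100, lo=1, hi=8192):
    """Largest n in [lo, hi] for which trains_to_100(n) is True; 0 if none."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if trains_to_100(mid):
            best, lo = mid, mid + 1  # success: try a larger dataset
        else:
            hi = mid - 1             # failure: try a smaller dataset
    return best
```

In practice training is noisy, so repeated trials per `n` (or a tolerance band) may be needed for the monotonicity assumption to hold approximately.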

## 📊 Key Results

### MLP Insights

| Parameter Count | Best Depth | Max Threshold |
| --- | --- | --- |
| 10,000 | 1 layer | 467 sequences |
| 30,000 | 1 layer | 1728 sequences |
| 90,000 | 1 layer | 3648 sequences |

✅ Wider, shallower MLPs memorize better
⚠️ Deep MLPs collapse even with more parameters


### Transformer Insights

| Param Count | Max Heads | Max Threshold |
| --- | --- | --- |
| 64,000 | 32 | 77 sequences |
| 128,000 | 64 | 42 sequences |

✅ More attention heads improve memorization
📉 Depth mildly reduces performance
❌ Transformers are less efficient than MLPs at pure memorization


## 📚 Citation

```bibtex
@article{krishnan2024memorization,
  title={The Memorization Capabilities of Neural Networks: Understanding Scaling Laws in Autoregressive and Seq2Seq Learning},
  author={Krishnan, Niranjan Vijaya and Guo, Christine and Hundiwala, Diya},
  journal={GitHub},
  year={2024},
  url={https://github.com/niruvk/Transformers_Research}
}
