Memorization Capabilities of Neural Networks: Understanding Scaling Laws in Autoregressive and Sequence-To-Sequence Learning
Investigating Scaling Laws in Transformers and MLPs on Random Sequence Memorization Tasks
By Niranjan Vijaya Krishnan, Christine Guo, Diya Hundiwala
Read the full paper here

This project explores memorization thresholds in neural networks: how well models such as MLPs and decoder-only Transformers can memorize random sequence mappings under a fixed parameter budget. Our experiments probe how different architectural features (depth, width, number of attention heads) affect memorization on synthetic tasks.
- **Comparative Analysis:** MLPs vs. Transformers on random bijective sequence-mapping tasks.
- **Scaling Law Insights:** MLPs benefit from width, while Transformers benefit from multi-head attention.
- **Controlled Experiments:** all models are constrained to identical parameter budgets (10k–128k) for fair comparison.
- **Memorization Threshold:** defined as the maximum number of mappings a model can learn perfectly.
Models are trained to memorize random bijective mappings between fixed-length token sequences (length = 10, vocabulary size = 10). No generalization is expected; the goal is pure memorization.
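A minimal sketch of how such a dataset could be built (the function name and sampling details are our own illustration, not the paper's code): sample two sets of distinct length-10 sequences over a 10-token vocabulary and pair them one-to-one, giving a random bijection with no learnable structure.

```python
import random

def make_bijective_dataset(n_pairs, seq_len=10, vocab_size=10, seed=0):
    """Sample n_pairs distinct input sequences and n_pairs distinct
    output sequences, then pair them one-to-one (a random bijection)."""
    rng = random.Random(seed)

    def sample_unique(n):
        # Rejection-sample until we have n distinct sequences.
        seen = set()
        while len(seen) < n:
            seen.add(tuple(rng.randrange(vocab_size) for _ in range(seq_len)))
        return list(seen)

    inputs = sample_unique(n_pairs)
    outputs = sample_unique(n_pairs)
    rng.shuffle(outputs)  # random pairing -> nothing to generalize from
    return list(zip(inputs, outputs))

pairs = make_bijective_dataset(100)
```

Because inputs and outputs are each distinct and paired one-to-one, the mapping is bijective on the sampled set, so perfect accuracy requires the model to store every pair.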
- MLPs: Fully connected feedforward networks used for sequence-to-sequence learning.
- Decoder-Only Transformers: Used for autoregressive token prediction.
Each model configuration is evaluated on:
- Varying depth vs. width (MLPs)
- Attention heads vs. feedforward size (Transformers)
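To trade depth against width (or heads against feedforward size) at a fixed budget, each configuration's parameter count must be computed. A sketch for the MLP case, assuming a flattened one-hot input/output layout (that layout is our assumption, not stated in the paper):

```python
def mlp_param_count(width, depth, seq_len=10, vocab=10):
    """Weights + biases of a fully connected net mapping a flattened
    one-hot sequence (seq_len * vocab inputs) through `depth` hidden
    layers of `width` units to seq_len * vocab output logits.
    The one-hot layout is an illustrative assumption."""
    dims = [seq_len * vocab] + [width] * depth + [seq_len * vocab]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# e.g. at depth 1, width 148 lands near the 30k-parameter budget
print(mlp_param_count(width=148, depth=1))  # prints 29848
```

Sweeping `width` (or `depth`) while holding the count near a target budget is what makes the depth-vs-width comparisons fair.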
- We use binary search to find the largest dataset size at which the model achieves 100% accuracy.
- Training stops when:
  - accuracy falls below 100%, or
  - training plateaus for 3 epochs.
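The threshold search above can be sketched as a standard binary search, assuming memorization success is monotone in dataset size (larger sets are never easier). The `trains_perfectly` callable stands in for a full train-and-evaluate run:

```python
def find_threshold(trains_perfectly, lo=1, hi=4096):
    """Largest n in [lo, hi] for which trains_perfectly(n) is True,
    assuming success is monotone (True for all n below the threshold)."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if trains_perfectly(mid):
            best = mid        # mid works; try larger datasets
            lo = mid + 1
        else:
            hi = mid - 1      # mid fails; try smaller datasets
    return best

# toy stand-in for a model that memorizes up to 467 sequences
print(find_threshold(lambda n: n <= 467))  # prints 467
```

In the real pipeline, `trains_perfectly(n)` would train a fresh model on an n-pair dataset and return whether it reached 100% accuracy before plateauing.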
| Parameter Count | Best Depth | Max Threshold |
|---|---|---|
| 10,000 | 1 layer | 467 sequences |
| 30,000 | 1 layer | 1,728 sequences |
| 90,000 | 1 layer | 3,648 sequences |
Wider, shallower MLPs memorize better.
| Parameter Count | Max Heads | Max Threshold |
|---|---|---|
| 64,000 | 32 | 77 sequences |
| 128,000 | 64 | 42 sequences |
- More attention heads improve memorization.
- Depth mildly reduces performance.
- Transformers are less efficient than MLPs at pure memorization.
```bibtex
@article{krishnan2024memorization,
  title={The Memorization Capabilities of Neural Networks: Understanding Scaling Laws in Autoregressive and Seq2Seq Learning},
  author={Krishnan, Niranjan Vijaya and Guo, Christine and Hundiwala, Diya},
  journal={GitHub},
  year={2024},
  url={https://github.com/niruvk/Transformers_Research}
}
```