This repository contains the official implementation of the paper "Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning", accepted at the International Conference on Learning Representations (ICLR) 2025.
Recent advancements in large language models (LLMs) based on transformer architectures have sparked significant interest in understanding their inner workings. In this paper, we introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs). Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Through spectral analysis of the model's dynamics, we uncover an increase in eigenvalue magnitude that challenges the weight-sharing assumption prevalent in existing theoretical studies. We also leverage the Lyapunov exponent to examine token-level sensitivity, enhancing model interpretability. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints.
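For intuition about the Lyapunov-based, token-level sensitivity analysis mentioned above, the sketch below estimates how a small perturbation of a token's hidden state grows as it flows through an ODE-style stack of blocks. This is a minimal, self-contained JAX illustration rather than the repository's analysis code: `block`, the Euler integrator, and all shapes are hypothetical placeholders.

```python
import jax
import jax.numpy as jnp

def block(h, t):
    # Hypothetical stand-in for a time-dependent transformer block f(h, t);
    # a fixed nonlinear map keeps the sketch self-contained and runnable.
    W = jnp.eye(h.shape[-1]) * (1.0 + 0.1 * t)
    return jnp.tanh(h @ W)

def euler_flow(h0, n_steps=12, dt=1.0 / 12):
    # Integrate dh/dt = f(h, t) with explicit Euler over the continuous layer index.
    h = h0
    for i in range(n_steps):
        h = h + dt * block(h, i * dt)
    return h

def lyapunov_estimate(h0, eps=1e-4):
    # Crude finite-difference proxy for the leading Lyapunov exponent:
    # log growth of a small random perturbation over the depth of the flow.
    delta = eps * jax.random.normal(jax.random.PRNGKey(0), h0.shape)
    d0 = jnp.linalg.norm(delta)
    d1 = jnp.linalg.norm(euler_flow(h0 + delta) - euler_flow(h0))
    return jnp.log(d1 / d0)  # larger values indicate higher sensitivity

tokens = jax.random.normal(jax.random.PRNGKey(1), (8, 64))  # (sequence, hidden)
print(jax.vmap(lyapunov_estimate)(tokens))                  # one score per token
```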
Our model formulates transformers as neural ODEs with highly flexible non-autonomous vector fields. Instead of sharing weights across layers, we parameterize all weights through neural networks that express them as functions of a continuous layer index (time). The model includes:
- Time-dependent weights for attention components (Q, K, V)
- Time-dependent weights for feed-forward networks
- Representation of weights using a time-dependent unit that embeds the layer index in the Fourier domain (a minimal sketch of this idea follows the list)
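As a rough illustration of this parameterization, the following Equinox/JAX sketch generates the weights of a single linear projection from a Fourier embedding of the layer index t. It is a minimal sketch under assumed names and sizes (`FourierTimeEmbedding`, `TimeDependentLinear`, the hypernetwork shape), not the repository's actual modules.

```python
import jax
import jax.numpy as jnp
import equinox as eqx

class FourierTimeEmbedding(eqx.Module):
    freqs: jnp.ndarray

    def __init__(self, dim: int):
        # Log-spaced frequencies; the paper's exact embedding may differ.
        self.freqs = jnp.exp(jnp.linspace(0.0, 4.0, dim // 2))

    def __call__(self, t):
        angles = t * self.freqs
        return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)])

class TimeDependentLinear(eqx.Module):
    """Applies W(t) @ x, with W(t) produced by a small hypernetwork."""
    embed: FourierTimeEmbedding
    hyper: eqx.nn.Linear
    d_in: int = eqx.field(static=True)
    d_out: int = eqx.field(static=True)

    def __init__(self, d_in, d_out, t_dim=32, *, key):
        self.embed = FourierTimeEmbedding(t_dim)
        # Hypernetwork: Fourier time embedding -> flattened weight matrix.
        self.hyper = eqx.nn.Linear(t_dim, d_in * d_out, key=key)
        self.d_in = d_in
        self.d_out = d_out

    def __call__(self, x, t):
        W = self.hyper(self.embed(t)).reshape(self.d_out, self.d_in)
        return W @ x

# Usage: one module yields different weights at different layer indices t.
proj_q = TimeDependentLinear(64, 64, key=jax.random.PRNGKey(0))  # e.g. a Q projection
x = jax.random.normal(jax.random.PRNGKey(1), (64,))
print(proj_q(x, t=0.25).shape)  # (64,)
```

In the full model, analogous time-dependent units would supply the Q/K/V and feed-forward weights, and the block is integrated along t as an ODE; see the paper and source code for the actual design.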
Key results:
- Comparable or better performance than vanilla transformers across various configurations
- Significant improvements in downstream tasks, particularly in reading comprehension
- Flexible fine-tuning capabilities that can adapt to different architectural constraints
The implementation is built on JAX and uses Equinox, Haliax, and the Levanter training framework.
If you find this work useful, please consider citing:
@inproceedings{tong2025neural,
title={Neural {ODE} Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning},
author={Anh Tong and Thanh Nguyen-Tang and Dongeun Lee and Duc Nguyen and Toan Tran and David Leo Wright Hall and Cheongwoong Kang and Jaesik Choi},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=XnDyddPcBT}
}