Uncontaminated Sample Packing #3525

djsaunde · 2025-10-29T17:01:05Z

This PR adds sample packing support. It uses TRL's SFTConfig packing=True and padding_free=True args to pack the sequences. We compute packed_seq_lengths metadata and thread it through the model forward pass. We need to do this ourselves because of our custom forward pass, whereas TRL handles the sequence length metadata internally in its trainer. For SDPA and xformers attention, this metadata is used to create block causal masks. For flash attention, it is passed to the varlen API, which handles the block causal masking itself under the hood.
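To illustrate what the packed_seq_lengths metadata encodes, here is a minimal, dependency-free sketch (not the PR's implementation, which operates on tensors). The helper names `cu_seqlens` and `block_causal_mask` are made up for this example. The first produces cumulative sequence boundaries, the form that varlen-style attention interfaces typically consume; the second builds the block causal mask that a mask-based backend would need.

```python
from itertools import accumulate


def cu_seqlens(packed_seq_lengths):
    """Cumulative boundaries [0, l0, l0 + l1, ...] of the packed samples."""
    return [0, *accumulate(packed_seq_lengths)]


def block_causal_mask(packed_seq_lengths):
    """mask[q][k] is True when query token q may attend to key token k."""
    # Label every token in the packed row with the index of its source sample.
    sample_ids = [i for i, n in enumerate(packed_seq_lengths) for _ in range(n)]
    total = len(sample_ids)
    # Allowed only within the same sample (block) and not into the future (causal).
    return [
        [sample_ids[q] == sample_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


lengths = [3, 2]  # two samples packed into one row of 5 tokens
bounds = cu_seqlens(lengths)
mask = block_causal_mask(lengths)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid shows two lower-triangular blocks on the diagonal: tokens of the second sample never attend to tokens of the first, which is what keeps the packing "uncontaminated".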

I added a few unit tests. I also wrote a quick bash script (gist) for smoke testing some common model architectures, which runs.

Below is a comparison of short unsloth/qwen2.5-0.5b training runs. The losses don't match because we're seeing more / different samples on each step. But the scale and trend match, which is the important bit.

Commands:

No sample packing:

python unsloth-cli.py --model_name unsloth/qwen2.5-0.5b --dataset yahma/alpaca-cleaned --per_device_train_batch_size 8 --max_steps 50 --max_seq_length 2048

Sample packing:

python unsloth-cli.py --model_name unsloth/qwen2.5-0.5b --dataset yahma/alpaca-cleaned --per_device_train_batch_size 1 --max_steps 50 --max_seq_length 2048 --sample_packing

Note that we use --per_device_train_batch_size 1 in the latter case since we are packing multiple examples into a single [1, max_seq_length] tensor.

The benefit of this approach is that we're able to discard a lot of zero padding, and therefore get higher tokens/s training throughput. The below plot shows that we're able to get through our dataset ~20% faster. These gains depend on the dataset and the configured --max_seq_length; increasing it generally improves packing efficiency and therefore throughput.
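The padding argument can be made concrete with a small sketch. It assumes a first-fit-decreasing packing heuristic (chosen only for illustration, not necessarily the strategy TRL uses), assumes unpacked batches are padded to their longest sample, and uses made-up sample lengths. The function names are hypothetical.

```python
def pack_first_fit_decreasing(lengths, max_seq_length):
    """Group sample lengths into bins whose totals do not exceed max_seq_length."""
    bins = []  # each bin is a list of sample lengths sharing one packed row
    for n in sorted(lengths, reverse=True):
        if n > max_seq_length:
            raise ValueError(f"sample of length {n} exceeds max_seq_length")
        for b in bins:
            if sum(b) + n <= max_seq_length:
                b.append(n)
                break
        else:
            # No existing bin had room, so open a new packed row.
            bins.append([n])
    return bins


def padded_slots(lengths, batch_size):
    """Token slots consumed when each batch is padded to its longest sample."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [900, 300, 1200, 150, 700, 60, 2000, 400]  # made-up token counts
max_seq_length = 2048
real_tokens = sum(lengths)

bins = pack_first_fit_decreasing(lengths, max_seq_length)
unpacked = padded_slots(lengths, batch_size=8)
packed = len(bins) * max_seq_length

print(f"padded batches: {real_tokens / unpacked:.0%} of slots hold real tokens")
print(f"packed rows:    {real_tokens / packed:.0%} of slots hold real tokens")
```

With a larger max_seq_length each bin has more room for short samples to fill the leftover space, which is one way to see why packing efficiency tends to improve as it grows.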

I manually tested on SDPA and flash attention, but I still need to test xformers attention since I couldn't get it to build for Blackwell.

TODO

test xformers attention

djsaunde · 2025-10-30T16:42:20Z

Follow up: DRY up attention code. We re-implement a big if / else block for selecting / running the attention per modeling file. We can factor this out into a separate module and call a helper function. CC @Datta0
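One possible shape for that refactor is a small registry plus a single helper that every modeling file calls. This is a sketch only: the names `register_attention`, `run_attention`, and the backend keys are hypothetical, and the backend bodies are placeholders rather than real attention calls.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a backend name to its attention callable.
_ATTENTION_BACKENDS: Dict[str, Callable] = {}


def register_attention(name):
    """Decorator that records an attention implementation under `name`."""
    def decorator(fn):
        _ATTENTION_BACKENDS[name] = fn
        return fn
    return decorator


@register_attention("sdpa")
def _sdpa_attention(q, k, v, packed_seq_lengths=None):
    # Placeholder: a real backend would build a block causal mask here.
    return ("sdpa", packed_seq_lengths)


@register_attention("flash")
def _flash_attention(q, k, v, packed_seq_lengths=None):
    # Placeholder: a real backend would pass cumulative lengths to a varlen call.
    return ("flash", packed_seq_lengths)


def run_attention(backend, q, k, v, packed_seq_lengths=None):
    """Single entry point replacing the per-file if / else block."""
    try:
        fn = _ATTENTION_BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown attention backend: {backend!r}") from None
    return fn(q, k, v, packed_seq_lengths=packed_seq_lengths)


result = run_attention("sdpa", None, None, None, packed_seq_lengths=[3, 2])
print(result)
```

Keeping the packing metadata as a keyword argument on the shared entry point means a new backend only needs to be registered once instead of being added to every modeling file.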

djsaunde · 2025-10-30T18:22:15Z

I added support for passing position IDs to RoPE (needed for correctness, just like attention), and a (fused QK) triton kernel for the RoPE embedding (similar to what exists currently for the non-packing case).
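To show why position IDs matter for RoPE under packing, here is a plain-Python sketch (not the Triton kernel from this PR). It assumes position IDs restart at 0 for each packed sample, and it uses the interleaved-pair RoPE convention with base 10000; implementations may instead pair the first and second halves of the vector. The function names are made up for the example.

```python
import math


def packed_position_ids(packed_seq_lengths):
    """Positions restart at 0 at the beginning of every packed sample."""
    return [pos for n in packed_seq_lengths for pos in range(n)]


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, dim, 2):
        # Pair i // 2 uses frequency base ** (-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


position_ids = packed_position_ids([3, 2])
print(position_ids)

query = [1.0, 0.0, 1.0, 0.0]
# Token 3 of the packed row is the first token of the second sample.
with_reset = rope_rotate(query, position_ids[3])
without_reset = rope_rotate(query, 3)  # what a naive 0..N-1 range would give
print(with_reset, without_reset)
```

With the reset, the first token of the second sample is rotated exactly as it would be if the sample were processed on its own; using the running index 3 instead would apply a different rotation than the unpacked case.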

Benchmarks show the new kernel is competitive with the triton kernel for the non-packing case, approximately matches the old path numerically, and significantly beats the torch slow path:

RoPE kernel benchmark sweep (microseconds per call)

seqlen	varlen	dense	old	new	speedup	max abs Δ	mean abs Δ
256	False	198.501	–	–	–	–	–
256	True	–	429.066	223.670	1.918	4.768e-07	1.136e-08
512	False	413.377	–	–	–	–	–
512	True	–	1149.956	566.851	2.029	4.768e-07	1.170e-08
1024	False	1113.990	–	–	–	–	–
1024	True	–	2784.808	1140.053	2.443	4.768e-07	1.187e-08
2048	False	2341.204	–	–	–	–	–
2048	True	–	5525.063	2372.505	2.329	4.768e-07	1.214e-08
4096	False	4675.885	–	–	–	–	–
4096	True	–	11354.554	4681.061	2.426	4.768e-07	1.239e-08
8192	False	9285.158	–	–	–	–	–
8192	True	–	21901.080	9323.563	2.349	4.768e-07	1.256e-08

djsaunde requested review from danielhanchen and mmathew23 October 29, 2025 17:01

djsaunde self-assigned this Oct 29, 2025

djsaunde force-pushed the packing branch from 6e45dad to fdebcef Compare October 29, 2025 17:04

djsaunde changed the title ~~Packing~~ sample packing Oct 29, 2025

djsaunde force-pushed the packing branch from 738a0b3 to c07d6bd Compare October 30, 2025 18:06

implement (sdpa, xformers, fa2) sample packing

c23f676

djsaunde force-pushed the packing branch from c07d6bd to c23f676 Compare October 30, 2025 18:22

shimmyshimmer changed the title ~~sample packing~~ Uncontaminated packing Oct 30, 2025

shimmyshimmer changed the title ~~Uncontaminated packing~~ Uncontaminated Sample Packing Oct 30, 2025
