Perf regressions are listed below. The data was collected from https://github.com/pytorch-labs/tritonbench/actions/runs/16035767306 (and, for Meta employees, aggregated using https://fburl.com/scuba/pytorch_user_benchmarks/2mp2d6a9).
Note: just look at the "speedup difference ((new-old)/old)" column; apologies for the weird column names / numbers, which are an artifact of the data visualization / comparison tool used to generate the csv. For example, the worst real regression is tritonbench_layer_norm_bwd[x_6656-liger_layer_norm]_speedup, whose value of 4.34 means the speedup regressed by 434% relative to Triton 3.3, or, assuming the torch baseline is stable, that the latency regressed by 434%. For the examples below, I've double-checked that there are actual regressions in the measured kernels (not just improvements in the baseline).
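To make the arithmetic concrete, here is a tiny sketch with made-up latencies (none of these numbers come from the csv) showing how I read a value like 4.34:

```python
# Illustrative numbers only: how a reported 4.34 can be read as a 434%
# latency regression when the torch baseline is assumed stable.
baseline = 1.0        # torch baseline latency, assumed unchanged across versions
old_latency = 0.100   # hypothetical kernel latency on Triton 3.3
new_latency = 0.534   # hypothetical kernel latency on Triton 3.4

old_speedup = baseline / old_latency   # 10.0x vs. torch
new_speedup = baseline / new_latency   # ~1.87x vs. torch

latency_regression = (new_latency - old_latency) / old_latency
print(f"{latency_regression:.0%}")     # 434% -> latency is 5.34x the old value
```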
These are notes about anything with a >10% regression.
Biggest regressions (note: ignore the first 13 results with infinite regression; those kernels are failing/not running on both versions): https://gist.github.com/davidberard98/af67f9d7be019c353bb4821a07f28bfa
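If you'd rather skip those rows programmatically, a rough sketch follows; the filename and the column header are assumptions based on the note above, so check them against the actual csv:

```python
import numpy as np
import pandas as pd

# Load the comparison csv (filename assumed) and drop the rows whose
# "speedup difference" is infinite, i.e. kernels failing on both versions.
df = pd.read_csv("regressions.csv")
col = "speedup difference ((new-old)/old)"  # assumed header; verify in the csv

df[col] = pd.to_numeric(df[col], errors="coerce")
finite = df[np.isfinite(df[col])]
# Largest reported differences first.
print(finite.sort_values(col, ascending=False).head(20))
```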
- liger layernorm bwd: compared to torch, the liger kernels appear to do well on small x, and then fall off a cliff. On Triton 3.3, the cliff occurs between 8192 and 8704; on Triton 3.4, it occurs between 4096 and 4608. Therefore, all of the shapes from 4608 to 8192 have significant regressions (like 400%+) compared to Triton 3.3 (a sketch of sweeping input ids to locate the cliff follows this list). Repro:

  ```
  python run.py --op layer_norm --baseline torch_layer_norm --metrics latency --only liger_layer_norm,torch_layer_norm --bwd --input-id 7 --num-inputs 1
  ```

  Bisect points to "[Backend] Bump to llvm/llvm-project@8957e64a20fc" (triton-lang/triton#7138), an LLVM pin update, so this is a regression in LLVM. Bisection on LLVM points to "[NVPTX] Remove Float register classes" (llvm/llvm-project#140487) as the cause of the regression.
- liger embedding bwd: 20-53% regression across a range of shapes. Repro:

  ```
  python run.py --op embedding --baseline torch_embedding --metrics latency --bwd --only liger_embedding,torch_embedding
  ```

  Bisects to "Update ptxas to version 12.8.93" (triton-lang/triton#6149).
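As referenced in the layernorm item above, here is a minimal sketch of sweeping input ids to find where the cliff moves. Only the flags already shown in the repro commands are assumed to exist; the 0-15 id range is a guess chosen to bracket both cliff locations.

```python
import subprocess

# Sweep tritonbench input ids around the reported cliff. Only the flags
# used in the repro command above are assumed; the 0-15 range is a guess
# meant to bracket both the Triton 3.3 and 3.4 cliff locations.
for input_id in range(16):
    print(f"--- input-id {input_id} ---")
    subprocess.run(
        [
            "python", "run.py",
            "--op", "layer_norm",
            "--baseline", "torch_layer_norm",
            "--metrics", "latency",
            "--only", "liger_layer_norm,torch_layer_norm",
            "--bwd",
            "--input-id", str(input_id),
            "--num-inputs", "1",
        ],
        check=True,
    )
```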
Can't repro:
- rope bwd: this appears in the dataset but looks fine when I run locally. Possibly the original data was collected on an old version of Liger. Repro:

  ```
  python run.py --op rope --baseline apply_rotary_pos_emb --metrics latency --only liger_rotary_pos_emb,apply_rotary_pos_emb
  ```
- low_mem_dropout: these appear to have high variability; the reported regression is likely just noise.
- flex_attention_fwd: 19.9% regression on x_(8, 16, 512, 16, 512, 128) - likely noise (the baseline appears noisy in some runs; see the repeated-run sketch after this list). Repro:

  ```
  python run.py --op flex_attention --baseline eager --metrics latency,speedup --only compiled,eager
  ```
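For the noise calls above, a minimal repeat-run sketch: rerun the same repro several times and check whether the reported 19.9% falls within the run-to-run spread. Nothing beyond the flags already shown is assumed about tritonbench's CLI.

```python
import subprocess

# Rerun the flex_attention repro a few times; if the 19.9% difference is
# within the spread of these trials, it is likely noise, not a regression.
cmd = [
    "python", "run.py",
    "--op", "flex_attention",
    "--baseline", "eager",
    "--metrics", "latency,speedup",
    "--only", "compiled,eager",
]
for trial in range(5):
    print(f"--- trial {trial} ---")
    subprocess.run(cmd, check=True)
```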
Biggest improvements: https://gist.github.com/davidberard98/ec81ec7a5c035db225053074e7e280d6
- flex attention!
- fp8_gemm_blockwise
- int4_gemm