
3.3 -> 3.4 TritonBench H100 perf regressions (& improvements) #264

@davidberard98

Description

Perf regressions are listed below. The data was collected from https://github.com/pytorch-labs/tritonbench/actions/runs/16035767306 (and, for Meta employees, aggregated using https://fburl.com/scuba/pytorch_user_benchmarks/2mp2d6a9).

Note: look only at the "speedup difference ((new-old)/old)" column; apologies for the odd column names and numbers, which are an artifact of the data visualization / comparison tool used to generate the CSV. For example, the worst real regression is tritonbench_layer_norm_bwd[x_6656-liger_layer_norm]_speedup, where the value of 4.34 means the speedup regressed by 434% compared to 3.3, or, assuming the torch baseline is stable, that the latency regressed by 434%. For the examples below, I've double-checked that there are actual regressions in the measured kernels (not just improvements in the baseline).
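
For concreteness, here is a minimal sketch of the relative-difference arithmetic the column name describes. The example numbers are made up, and the actual CSV's sign convention may differ, so treat this as illustrative only:

```python
def relative_difference(new: float, old: float) -> float:
    """Relative change of a metric between two releases: (new - old) / old."""
    return (new - old) / old

# Hypothetical numbers: if a kernel's latency grows from 10 us to 53.4 us,
# the relative difference is 4.34, i.e. a 434% regression in latency.
print(relative_difference(53.4, 10.0))  # 4.34
```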

Below are notes on everything with a >10% regression.

Biggest regressions (note: ignore the first 13 results with infinite regression; they are failing or not running on both versions): https://gist.github.com/davidberard98/af67f9d7be019c353bb4821a07f28bfa

  • liger layernorm bwd: compared to torch, the liger kernels appear to do well on small x and then fall off a cliff. On Triton 3.3 the cliff occurs between 8192 and 8704; on Triton 3.4 it occurs between 4096 and 4608. As a result, all of the shapes from 4608 to 8192 show significant regressions (400%+) relative to Triton 3.3 (see the sweep sketch after this list). Repro: python run.py --op layer_norm --baseline torch_layer_norm --metrics latency --only liger_layer_norm,torch_layer_norm --bwd --input-id 7 --num-inputs 1. Bisect points to [Backend] Bump to llvm/llvm-project@8957e64a20fc triton-lang/triton#7138, an LLVM pin update (so this is a regression in LLVM). Bisecting LLVM itself points to [NVPTX] Remove Float register classes llvm/llvm-project#140487 as the cause of the regression.
  • liger embedding bwd: 20-53% regression across a range of shapes. Repro: python run.py --op embedding --baseline torch_embedding --metrics latency --bwd --only liger_embedding,torch_embedding. Bisects to Update ptxas to version 12.8.93 triton-lang/triton#6149.
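
To locate the layernorm cliff, here is a minimal sketch that sweeps tritonbench input ids one at a time using the repro command above. The run.py flags are taken verbatim from that repro; the input-id range is a hypothetical placeholder, and the script just dumps tritonbench's printed output rather than assuming a particular table format:

```python
import subprocess

# Sweep layer_norm bwd input ids one at a time to find the size where the
# liger kernel falls off a cliff. Flags mirror the repro command above.
for input_id in range(12):  # hypothetical range; adjust to the op's input count
    result = subprocess.run(
        [
            "python", "run.py",
            "--op", "layer_norm",
            "--baseline", "torch_layer_norm",
            "--metrics", "latency",
            "--only", "liger_layer_norm,torch_layer_norm",
            "--bwd",
            "--input-id", str(input_id),
            "--num-inputs", "1",
        ],
        capture_output=True,
        text=True,
    )
    print(f"--- input-id {input_id} ---")
    print(result.stdout)
```

Running this once against Triton 3.3 and once against 3.4 should make the shift in the cliff (8192-8704 vs. 4096-4608) visible directly in the latency columns.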

Can't repro:

  • rope bwd: This appears in the dataset but looks fine when I run it locally. This may be because the original data was collected with an older version of Liger. Repro: python run.py --op rope --baseline apply_rotary_pos_emb --metrics latency --only liger_rotary_pos_emb,apply_rotary_pos_emb
  • low_mem_dropout: these measurements show high variability; the reported regression is likely just noise.
  • flex_attention_fwd: 19.9% regression on x_(8, 16, 512, 16, 512, 128); likely noise (the baseline appears noisy in some runs). Repro: python run.py --op flex_attention --baseline eager --metrics latency,speedup --only compiled,eager

Biggest improvements: https://gist.github.com/davidberard98/ec81ec7a5c035db225053074e7e280d6

  • flex attention!
  • fp8_gemm_blockwise
  • int4_gemm
