
3.3 -> 3.4 TritonBench H100 perf regressions (& improvements) #264

@davidberard98

Description

Perf regressions are listed below. The data was collected from https://github.com/pytorch-labs/tritonbench/actions/runs/16035767306 (and, for Meta employees, aggregated using https://fburl.com/scuba/pytorch_user_benchmarks/2mp2d6a9).

Note: look only at the "speedup difference ((new-old)/old)" column; apologies for the odd column names and numbers, which are an artifact of the data visualization / comparison tool used to generate the CSV. For example, the worst real regression is tritonbench_layer_norm_bwd[x_6656-liger_layer_norm]_speedup, where the value of 4.34 means the speedup regressed by 434% compared to 3.3, or, assuming the torch baseline is stable, that the latency regressed by 434%. For the examples below, I've double-checked that there are actual regressions in the measured kernels (not just improvements in the baseline).
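
For concreteness, here is a minimal sketch of the relative-difference arithmetic the column name describes. The example numbers are made up, and the actual CSV's sign convention may differ, so treat this as illustrative only:

```python
def relative_difference(new: float, old: float) -> float:
    """Relative change of a metric between two releases: (new - old) / old."""
    return (new - old) / old

# Hypothetical numbers: if a kernel's latency grows from 10 us to 53.4 us,
# the relative difference is 4.34, i.e. a 434% regression in latency.
print(relative_difference(53.4, 10.0))  # 4.34
```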

Below are notes on everything with a >10% regression.

Biggest regressions (note: ignore the first 13 results with infinite regression; they are failing or not running on both versions): https://gist.github.com/davidberard98/af67f9d7be019c353bb4821a07f28bfa

  • liger layernorm bwd: compared to torch, the liger kernels appear to do well on small x and then fall off a cliff. On Triton 3.3 the cliff occurs between 8192 and 8704; on Triton 3.4 it occurs between 4096 and 4608. As a result, all of the shapes from 4608 to 8192 show significant regressions (400%+) relative to Triton 3.3 (see the sweep sketch after this list). Repro: python run.py --op layer_norm --baseline torch_layer_norm --metrics latency --only liger_layer_norm,torch_layer_norm --bwd --input-id 7 --num-inputs 1. Bisect points to [Backend] Bump to llvm/llvm-project@8957e64a20fc triton-lang/triton#7138, an LLVM pin update (so this is a regression in LLVM). Bisecting LLVM itself points to [NVPTX] Remove Float register classes llvm/llvm-project#140487 as the cause of the regression.
  • liger embedding bwd: 20-53% regression across a range of shapes. Repro: python run.py --op embedding --baseline torch_embedding --metrics latency --bwd --only liger_embedding,torch_embedding. Bisects to Update ptxas to version 12.8.93 triton-lang/triton#6149.
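
To locate the layernorm cliff, here is a minimal sketch that sweeps tritonbench input ids one at a time using the repro command above. The run.py flags are taken verbatim from that repro; the input-id range is a hypothetical placeholder, and the script just dumps tritonbench's printed output rather than assuming a particular table format:

```python
import subprocess

# Sweep layer_norm bwd input ids one at a time to find the size where the
# liger kernel falls off a cliff. Flags mirror the repro command above.
for input_id in range(12):  # hypothetical range; adjust to the op's input count
    result = subprocess.run(
        [
            "python", "run.py",
            "--op", "layer_norm",
            "--baseline", "torch_layer_norm",
            "--metrics", "latency",
            "--only", "liger_layer_norm,torch_layer_norm",
            "--bwd",
            "--input-id", str(input_id),
            "--num-inputs", "1",
        ],
        capture_output=True,
        text=True,
    )
    print(f"--- input-id {input_id} ---")
    print(result.stdout)
```

Running this once against Triton 3.3 and once against 3.4 should make the shift in the cliff (8192-8704 vs. 4096-4608) visible directly in the latency columns.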

Can't repro:

  • rope bwd: This appears in the dataset but looks fine when I run it locally. This may be because the original data was collected with an older version of Liger. Repro: python run.py --op rope --baseline apply_rotary_pos_emb --metrics latency --only liger_rotary_pos_emb,apply_rotary_pos_emb
  • low_mem_dropout: these measurements show high variability; the reported regression is likely just noise.
  • flex_attention_fwd: 19.9% regression on x_(8, 16, 512, 16, 512, 128); likely noise (the baseline appears noisy in some runs). Repro: python run.py --op flex_attention --baseline eager --metrics latency,speedup --only compiled,eager

Biggest improvements: https://gist.github.com/davidberard98/ec81ec7a5c035db225053074e7e280d6

  • flex attention!
  • fp8_gemm_blockwise
  • int4_gemm
