
Commit 838b15b

Merge branch 'main' into fix_build_win
2 parents a3d79ab + 3577306

36 files changed: +1338 -384 lines changed


.github/workflows/regression_test_aarch64.yml

Lines changed: 2 additions & 2 deletions
@@ -37,15 +37,15 @@ jobs:
           # Install executorch first because it installs its own version
           # of torch and torchao, which we do not want to use
           pip install executorch
-          pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu --force-reinstall
+          pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cpu --force-reinstall
           pip install -r dev-requirements.txt
           USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install . --no-build-isolation
       - name: Install requirements linux
         if: runner.os == 'Linux'
         run: |
           conda activate venv
           pip install coremltools
-          pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu --force-reinstall
+          pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cpu --force-reinstall
           pip install -r dev-requirements.txt
           BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP pip install . --no-build-isolation
       - name: Run python tests

README.md

Lines changed: 17 additions & 17 deletions
@@ -24,7 +24,8 @@
 
 ## 📣 Latest News
 
-- [Oct 20] MXFP8 MoE training prototype achieved **~1.45x speedup** for MoE layer in Llama4 Scout, and **~1.25x** speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the [docs](./torchao/prototype/moe_training/) to try it out.
+- [Oct 25] QAT is now integrated into [Unsloth](https://docs.unsloth.ai/new/quantization-aware-training-qat) for both full and LoRA fine-tuning! Try it out using [this notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_%284B%29_Instruct-QAT.ipynb).
+- [Oct 25] MXFP8 MoE training prototype achieved **~1.45x speedup** for MoE layer in Llama4 Scout, and **~1.25x** speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the [docs](./torchao/prototype/moe_training/) to try it out.
 - [Sept 25] MXFP8 training achieved [1.28x speedup on Crusoe B200 cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/) with virtually identical loss curve to bfloat16!
 - [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
 - [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!
@@ -103,22 +104,6 @@ pip install torchao
 
 Please see the [torchao compability table](https://github.com/pytorch/ao/issues/2919) for version requirements for dependencies.
 
-## 🔗 Integrations
-
-TorchAO is integrated into some of the leading open-source libraries including:
-
-* Unsloth for QAT, blog post coming soon!
-* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
-* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
-* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
-* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
-* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
-* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
-* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
-* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
-* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
-* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)
-
 ## 🔎 Inference
 
 TorchAO delivers substantial performance gains with minimal code changes:
@@ -265,6 +250,21 @@ We've added support for authoring and releasing [custom ops](./torchao/csrc/) th
 If you believe there's other CUDA kernels we should be taking a closer look at please leave a comment on [this issue](https://github.com/pytorch/ao/issues/697) or feel free to contribute directly to the repo.
 -->
 
+## 🔗 Integrations
+
+TorchAO is integrated into some of the leading open-source libraries including:
+
+* Unsloth for QAT, blog post coming soon!
+* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
+* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
+* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
+* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
+* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
+* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
+* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
+* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
+* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
+* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)
 
 ## 🎥 Videos
 * [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009)
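
The "minimal code changes" line in the README context above refers to TorchAO's one-call quantization API. A minimal sketch of that workflow follows; the model and the `Int8WeightOnlyConfig` choice are illustrative placeholders, not part of this commit:

```python
# Illustrative sketch of the README's "minimal code changes" inference flow;
# model and quantization config here are placeholders, not taken from this commit.
import torch
from torchao.quantization import Int8WeightOnlyConfig, quantize_

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().bfloat16()
quantize_(model, Int8WeightOnlyConfig())  # quantizes eligible Linear weights in place
```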

benchmarks/float8/bench_matmul.py

Lines changed: 7 additions & 0 deletions
@@ -17,6 +17,7 @@
 
 from torchao.ops import mx_fp4_bf16
 from torchao.prototype.mx_formats.mx_tensor import to_mx
+from torchao.prototype.mx_formats.utils import to_blocked
 from torchao.testing.training.roofline_utils import get_specs
 from torchao.utils import is_MI300
 
@@ -125,10 +126,16 @@ def run(
     elif recipe in ("mxfp8_cublas", "mxfp4_cutlass"):
         scale_a = torch.ones(M, K // 32, device=device, dtype=torch.float8_e8m0fnu)
         scale_b = torch.ones(N, K // 32, device=device, dtype=torch.float8_e8m0fnu)
+        # pad if needed
+        scale_a = to_blocked(scale_a)
+        scale_b = to_blocked(scale_b)
     elif recipe == "nvfp4":
         # Use the blockwise scales from nvfp4_quantize
         scale_a = A_scales.view(torch.float8_e4m3fn)
         scale_b = B_scales.view(torch.float8_e4m3fn)
+        # pad if needed
+        scale_a = to_blocked(scale_a)
+        scale_b = to_blocked(scale_b)
     else:
         assert False, f"unknown recipe {recipe}"
 
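
The new `to_blocked` calls above convert the per-block scales into the padded, blocked layout that the mx/nvfp4 scaled-mm kernels expect. Below is a standalone sketch of what the benchmark now does for the mxfp8 recipe; the shape values are placeholders and the padding behaviour is inferred from the "pad if needed" comments in this diff:

```python
# Standalone sketch mirroring the mxfp8_cublas scale preparation in bench_matmul.py;
# M, K, N are placeholder shapes and the exact blocked layout is an internal detail.
import torch
from torchao.prototype.mx_formats.utils import to_blocked

M, K, N = 1024, 2048, 4096
device = "cuda"

# one e8m0 scale per 32-element block along K, as in the mxfp8_cublas recipe above
scale_a = torch.ones(M, K // 32, device=device, dtype=torch.float8_e8m0fnu)
scale_b = torch.ones(N, K // 32, device=device, dtype=torch.float8_e8m0fnu)

# pad/rearrange the scales into the blocked layout the scaled-mm kernels require
scale_a = to_blocked(scale_a)
scale_b = to_blocked(scale_b)
```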

benchmarks/float8/float8_inference_roofline.py

Lines changed: 26 additions & 5 deletions
@@ -46,6 +46,7 @@
     NVFP4InferenceConfig,
     NVFP4MMConfig,
 )
+from torchao.prototype.mx_formats.utils import to_blocked
 from torchao.quantization.quant_api import (
     Float8DynamicActivationFloat8WeightConfig,
     PerRow,
@@ -134,12 +135,18 @@ def get_gemm_times(
     elif recipe_name == "mxfp8_cublas":
         scale_a = torch.ones(M, K // 32, device=device, dtype=torch.float8_e8m0fnu)
         scale_b = torch.ones(N, K // 32, device=device, dtype=torch.float8_e8m0fnu)
+        scale_a = to_blocked(scale_a)
+        scale_b = to_blocked(scale_b)
     elif recipe_name == "mxfp4_cutlass":
         scale_a = torch.ones(M, K // 32, device=device, dtype=torch.float8_e8m0fnu)
         scale_b = torch.ones(N, K // 32, device=device, dtype=torch.float8_e8m0fnu)
+        scale_a = to_blocked(scale_a)
+        scale_b = to_blocked(scale_b)
     elif recipe_name == "nvfp4":
         scale_a = torch.ones(M, K // 16, device=device, dtype=torch.float8_e4m3fn)
         scale_b = torch.ones(N, K // 16, device=device, dtype=torch.float8_e4m3fn)
+        scale_a = to_blocked(scale_a)
+        scale_b = to_blocked(scale_b)
 
     else:
         assert False, "unsupported"
@@ -166,16 +173,22 @@ def run(
     recipe_name: str,
     do_benchmarks: bool = True,
     shape_gen_name: str = "pow2",
+    M: Optional[int] = None,
+    K: Optional[int] = None,
+    N: Optional[int] = None,
     n_limit: Optional[int] = None,
     save_profile_traces: bool = False,
+    enable_fusion_modeling: bool = False,
 ):
     """
     Args:
     * `recipe_name`: quantization recipe (tensorwise, rowwise, mxfp8*, mxfp4*, nvfp4*)
     * `do_benchmarks`: if True, gemm and e2e fwd+bwd of LNLinearSigmoid are benchmarked
-    * `shape_gen_name`: `llama`, `pow2`, `pow2_extended`, or `sweep`
+    * `shape_gen_name`: `llama`, `pow2`, `pow2_extended`, `sweep`, or `custom`
+    * `M|K|N`: if shape_gen_name is `custom`, then these values are used for MKN
     * `n_limit (optional)`: if specified, only runs `n_limit` iterations
     # `save_profile_traces (optional)`: if True, saves profiling traces
+    # `enable_fusion_modeling`: if True, models activation -> gemm instead of just gemm
     """
     config_table = [
         ["GPU", torch.cuda.get_device_name(0)],
@@ -184,16 +197,22 @@
         ["recipe_name", recipe_name],
         ["do_benchmarks", do_benchmarks],
         ["shape_gen_name", shape_gen_name],
+        ["enable_fusion_modeling", enable_fusion_modeling],
+        ["MKN", f"{M} {K} {N}"],
     ]
     print(tabulate(config_table, headers=["Parameter", "Value"], tablefmt="simple"))
 
+    # reassign user specified MKN, so we can use them for sympy
+    user_M, user_K, user_N = M, K, N
+
     M, K, N = sympy.symbols("M K N")
 
     fp8_ovhd_time_sympy = get_inference_float8_mem_sympy(
         M,
         K,
         N,
         recipe_name,
+        # TODO(future): also enable fusion modeling here
     )
     bf16_gemm_time_sympy = get_inference_gemm_time_sympy(M, K, N, torch.bfloat16, None)
 
@@ -241,7 +260,7 @@
     ]
     results = []
 
-    name_to_shapes = get_name_to_shapes_iter(shape_gen_name, None, None, None)
+    name_to_shapes = get_name_to_shapes_iter(shape_gen_name, user_M, user_K, user_N)
 
     for idx, (name, (M_val, K_val, N_val)) in enumerate(tqdm.tqdm(name_to_shapes)):
         if n_limit is not None and idx >= n_limit:
@@ -287,9 +306,11 @@
         b_bf16_e2e_time_s, b_fp8_e2e_time_s = 0, 0
         if do_benchmarks:
             # create the model
-            m_orig = (
-                nn.Sequential(nn.Linear(K_val, N_val, bias=False)).cuda().bfloat16()
-            )
+            if not enable_fusion_modeling:
+                m_orig = nn.Sequential(nn.Linear(K_val, N_val, bias=False))
+            else:
+                m_orig = nn.Sequential(nn.ReLU(), nn.Linear(K_val, N_val, bias=False))
+            m_orig = m_orig.cuda().bfloat16()
             x = torch.randn(
                 M_val, K_val, dtype=torch.bfloat16, device="cuda"
             ).requires_grad_()
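
The new `custom` shape mode and `enable_fusion_modeling` flag added above can be exercised roughly as follows. This is an illustrative direct call into `run()` based on the signature shown in this diff; the script is normally driven from the command line, and the import path and shape values are placeholders:

```python
# Hypothetical invocation of the roofline benchmark's run() function, mirroring the
# updated signature in this diff; import path and argument values are illustrative.
from float8_inference_roofline import run

run(
    recipe_name="mxfp8_cublas",    # one of the recipes handled in get_gemm_times()
    shape_gen_name="custom",       # new mode: use explicit M/K/N instead of a preset sweep
    M=4096, K=4096, N=4096,        # forwarded to get_name_to_shapes_iter via user_M/K/N
    enable_fusion_modeling=True,   # benchmark ReLU -> Linear instead of a bare Linear
)
```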

benchmarks/prototype/moe_training/bench_moe_layer.py

Lines changed: 1 addition & 1 deletion
@@ -205,7 +205,7 @@ def warmup(model, input, labels):
     parser.add_argument(
         "--local_batch_size",
         type=int,
-        default=8,
+        default=12,
     )
     parser.add_argument(
         "--hidden_dim",

benchmarks/prototype/moe_training/benchmark_scaled_grouped_mm_dq.py

Lines changed: 4 additions & 4 deletions
@@ -19,7 +19,7 @@
     bench_fwd_microseconds,
     profile_fwd_bwd,
 )
-from torchao.prototype.moe_training import _scaled_grouped_mm
+from torchao.prototype.moe_training import _quantize_then_scaled_grouped_mm
 from torchao.prototype.moe_training.conversion_utils import MoEScalingType
 from torchao.prototype.moe_training.utils import generate_jagged_offs
 
@@ -158,7 +158,7 @@ def run_experiment(
 
     # fwd_bwd scaled benchmark + profiling
     scaled_fwd_bwd_us = bench_fwd_bwd_microseconds(
-        _scaled_grouped_mm,
+        _quantize_then_scaled_grouped_mm,
         A,
         B_t,
         offs,
@@ -169,7 +169,7 @@
     )
     if args.profile:
         profile_fwd_bwd(
-            _scaled_grouped_mm,
+            _quantize_then_scaled_grouped_mm,
            A,
            B_t,
            offs,
@@ -190,7 +190,7 @@
         fullgraph=True,
     )
     scaled_fwd_us = bench_fwd_microseconds(
-        _scaled_grouped_mm,
+        _quantize_then_scaled_grouped_mm,
         A,
         B_t,
         offs,
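
The only change in this file is the rename of the private helper; call sites keep their arguments and just swap the imported name, as in this sketch:

```python
# Sketch of the rename applied in this diff; the function is a private torchao
# prototype API, so only the imported name changes at the call sites.
# Before:
#   from torchao.prototype.moe_training import _scaled_grouped_mm
# After:
from torchao.prototype.moe_training import _quantize_then_scaled_grouped_mm

# existing call sites keep passing the same arguments, e.g. (A, B_t, offs, ...)
```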

0 commit comments
