From e7b20cc6098db5ba31b18a98cd3af5137dae02f0 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Thu, 3 Jul 2025 13:06:49 -0700 Subject: [PATCH 01/11] A dummy tutorial structure --- docs/source/microbenchmarking.rst | 493 ++++++++++++++++++++++++++++++ 1 file changed, 493 insertions(+) create mode 100644 docs/source/microbenchmarking.rst diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst new file mode 100644 index 0000000000..3f5702abdb --- /dev/null +++ b/docs/source/microbenchmarking.rst @@ -0,0 +1,493 @@ +Microbenchmarking Tutorial +========================== + +This tutorial will guide you through using the TorchAO microbenchmarking framework. The tutorial contains different use cases for benchmarking your API and integrating with the dashboard. + +1. Add an API to benchmarking recipes +2. Add a model to benchmarking recipes +3. Benchmark your API locally +4. Add an API to benchmarking CI dashboard + +1. Add an API to Benchmarking Recipes +-------------------------------------- + +To add a new quantization API to the benchmarking system, you need to ensure your quantization method is available in the TorchAO quantization recipes. + +1.1 Supported Quantization Methods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The framework currently supports these quantization types: + +- ``baseline``: No quantization (bfloat16 reference) +- ``int8wo``: 8-bit weight-only quantization +- ``int8dq``: 8-bit dynamic quantization +- ``int4wo-{group_size}``: 4-bit weight-only quantization with specified group size +- ``int4wo-{group_size}-hqq``: 4-bit weight-only quantization with HQQ +- ``float8wo``: Float8 weight-only quantization +- ``float8dq-tensor``: Float8 dynamic quantization (tensor-wise) +- ``float8dq-row``: Float8 dynamic quantization (row-wise) +- ``gemlitewo-{bit_width}-{group_size}``: 4 or 8 bit integer quantization with gemlite triton kernel + +1.2 Adding a New Quantization Recipe +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To add a new quantization method: + +1. **Implement your quantization function** in the appropriate TorchAO module (e.g., ``torchao/quantization/``) + +2. **Add the recipe to the quantization system** by ensuring it can be called with the same interface as existing methods + +3. **Test your quantization method** with a simple benchmark configuration: + +.. code-block:: yaml + + # test_my_quantization.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "baseline" + - "my_new_quantization" # Your new method + + output_dir: "test_results" + + model_params: + - name: "test_linear" + matrix_shapes: + - name: "custom" + shapes: [[1024, 1024, 1024]] + high_precision_dtype: "torch.bfloat16" + use_torch_compile: false + device: "cuda" + model_type: "linear" + +4. **Verify the integration** by running: + +.. code-block:: bash + + python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_quantization.yml + +2. Add a Model to Benchmarking Recipes +--------------------------------------- + +To add a new model architecture to the benchmarking system, you need to modify ``torchao/testing/model_architectures.py``. 
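Before moving on to model architectures, it may help to see what "the same interface as existing methods" in step 2 of section 1.2 means in practice: every recipe ultimately resolves to a config object that is applied to a module with ``quantize_()``. The snippet below is only an illustrative sketch; it assumes a recent TorchAO release that exposes ``quantize_`` and ``Int8WeightOnlyConfig``, and for a new recipe you would substitute your own config class.

.. code-block:: python

    import torch
    from torchao.quantization import quantize_, Int8WeightOnlyConfig

    # A toy bfloat16 linear module, similar to the "linear" model type used above
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16)

    # The benchmarking framework does the equivalent of this once it has mapped a
    # recipe string (e.g. "int8wo") to its config object
    quantize_(model, Int8WeightOnlyConfig())

    # The quantized module can then be timed like any other nn.Module
    x = torch.randn(16, 1024, dtype=torch.bfloat16)
    y = model(x)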
+ +2.1 Current Model Types +~~~~~~~~~~~~~~~~~~~~~~~ + +The framework supports these model types: + +- ``linear``: Simple linear layer (``ToyLinearModel``) +- ``ln_linear_``: LayerNorm + Linear + Activation (``LNLinearActivationModel``) + + - ``ln_linear_sigmoid``: LayerNorm + Linear + Sigmoid + - ``ln_linear_relu``: LayerNorm + Linear + ReLU + - ``ln_linear_gelu``: LayerNorm + Linear + GELU + - ``ln_linear_silu``: LayerNorm + Linear + SiLU + - ``ln_linear_leakyrelu``: LayerNorm + Linear + LeakyReLU + - ``ln_linear_relu6``: LayerNorm + Linear + ReLU6 + - ``ln_linear_hardswish``: LayerNorm + Linear + Hardswish + +- ``transformer_block``: Transformer block with self-attention and MLP (``TransformerBlock``) + +2.2 Adding a New Model Architecture +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To add a new model type: + +1. **Define your model class** in ``torchao/testing/model_architectures.py``: + +.. code-block:: python + + class MyCustomModel(torch.nn.Module): + def __init__(self, input_dim, output_dim, dtype=torch.bfloat16): + super().__init__() + # Define your model architecture + self.layer1 = torch.nn.Linear(input_dim, 512, bias=False).to(dtype) + self.activation = torch.nn.ReLU() + self.layer2 = torch.nn.Linear(512, output_dim, bias=False).to(dtype) + + def forward(self, x): + x = self.layer1(x) + x = self.activation(x) + x = self.layer2(x) + return x + +2. **Update the** ``create_model_and_input_data`` **function** to handle your new model type: + +.. code-block:: python + + def create_model_and_input_data( + model_type: str, + m: int, + k: int, + n: int, + high_precision_dtype: torch.dtype = torch.bfloat16, + device: str = "cuda", + activation: str = "relu", + ): + # ... existing code ... + + elif model_type == "my_custom_model": + model = MyCustomModel(k, n, high_precision_dtype).to(device) + input_data = torch.randn(m, k, device=device, dtype=high_precision_dtype) + + # ... rest of existing code ... + +3. **Test your new model** with a benchmark configuration: + +.. code-block:: yaml + + # test_my_model.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "baseline" + - "int8wo" + + output_dir: "test_results" + + model_params: + - name: "test_my_custom_model" + matrix_shapes: + - name: "custom" + shapes: [[1024, 1024, 1024]] + high_precision_dtype: "torch.bfloat16" + use_torch_compile: false + device: "cuda" + model_type: "my_custom_model" # Your new model type + +4. **Verify the integration**: + +.. code-block:: bash + + python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_model.yml + +2.3 Model Design Considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When adding new models: + +- **Input/Output Dimensions**: Ensure your model handles the (m, k, n) dimension convention where: + + - ``m``: Batch size or sequence length + - ``k``: Input feature dimension + - ``n``: Output feature dimension + +- **Data Types**: Support the ``high_precision_dtype`` parameter (typically ``torch.bfloat16``) + +- **Device Compatibility**: Ensure your model works on CUDA, CPU, and other target devices + +- **Quantization Compatibility**: Design your model to work with TorchAO quantization methods + +3. Benchmark Your API Locally +------------------------------ + +For local development and testing: + +3.1 Quick Start +~~~~~~~~~~~~~~~ + +Create a minimal configuration for local testing: + +.. 
code-block:: yaml + + # local_test.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "baseline" + - "int8wo" + + output_dir: "local_results" + + model_params: + - name: "quick_test" + matrix_shapes: + - name: "custom" + shapes: [[1024, 1024, 1024]] + high_precision_dtype: "torch.bfloat16" + use_torch_compile: false # Disable for faster iteration + device: "cuda" + model_type: "linear" + +3.2 Run Local Benchmark +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml + +3.3 Shape Generation Options +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can use different shape generation strategies: + +**Custom Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "custom" + shapes: [ + [1024, 1024, 1024], # [m, k, n] + [2048, 4096, 1024] + ] + +**LLaMa Model Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "llama" # Uses LLaMa 2 70B single-node weight shapes + +**Power of 2 Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "pow2" + min_power: 10 # 2^10 = 1024 + max_power: 12 # 2^12 = 4096 + +**Extended Power of 2 Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "pow2_extended" + min_power: 10 # Generates: 1024, 1536, 2048, 3072, etc. + max_power: 11 + +**Small Sweep (for heatmaps):** + +.. code-block:: yaml + + matrix_shapes: + - name: "small_sweep" + min_power: 10 + max_power: 15 + +**Full Sweep:** + +.. code-block:: yaml + + matrix_shapes: + - name: "sweep" + min_power: 8 + max_power: 9 + +3.4 Enable Profiling for Debugging +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For detailed performance analysis, enable profiling: + +.. code-block:: yaml + + model_params: + - name: "debug_model" + # ... other parameters ... + enable_profiler: true # Enable standard profiling + enable_memory_profiler: true # Enable CUDA memory profiling + +This will generate: + +- Standard PyTorch profiler traces +- CUDA memory snapshots and visualizations +- Memory usage analysis in the ``memory_profiler`` subdirectory + +3.5 Device Options +~~~~~~~~~~~~~~~~~~ + +Test on different devices: + +.. code-block:: yaml + + device: "cuda" # NVIDIA GPU + # device: "xpu" # Intel GPU + # device: "mps" # Apple Silicon GPU + # device: "cpu" # CPU fallback + +3.6 Compilation Options +~~~~~~~~~~~~~~~~~~~~~~ + +Control PyTorch compilation for performance tuning: + +.. code-block:: yaml + + use_torch_compile: true + torch_compile_mode: "max-autotune" # Options: "default", "max-autotune", "false" + +4. Add an API to Benchmarking CI Dashboard +------------------------------------------ + +To integrate your API with the continuous integration dashboard: + +4.1 Modify Existing CI Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Add your quantization method to the existing CI configuration file at ``benchmarks/dashboard/microbenchmark_quantization_config.yml``: + +.. 
code-block:: yaml + + # benchmarks/dashboard/microbenchmark_quantization_config.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "int8wo" + - "int8dq" + - "float8dq-tensor" + - "float8dq-row" + - "float8wo" + - "my_new_quantization" # Add your method here + + output_dir: "benchmarks/microbenchmarks/results" + + model_params: + - name: "small_bf16_linear" + matrix_shapes: + - name: "small_sweep" + min_power: 10 + max_power: 15 + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" + +4.2 Run CI Benchmarks +~~~~~~~~~~~~~~~~~~~~~ + +Use the CI runner to generate results in PyTorch OSS benchmark database format: + +.. code-block:: bash + + python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json + +4.3 CI Output Format +~~~~~~~~~~~~~~~~~~~~ + +The CI runner outputs results in a specific JSON format required by the PyTorch OSS benchmark database: + +.. code-block:: json + + [ + { + "benchmark": { + "name": "micro-benchmark api", + "mode": "inference", + "dtype": "int8wo", + "extra_info": { + "device": "cuda", + "arch": "NVIDIA A100-SXM4-80GB" + } + }, + "model": { + "name": "1024-1024-1024", + "type": "micro-benchmark custom layer", + "origins": ["torchao"] + }, + "metric": { + "name": "speedup(wrt bf16)", + "benchmark_values": [1.25], + "target_value": 0.0 + }, + "runners": [], + "dependencies": {} + } + ] + +4.4 Integration with CI Pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To integrate with your CI pipeline, add the benchmark step to your workflow: + +.. code-block:: yaml + + # Example GitHub Actions step + - name: Run Microbenchmarks + run: | + python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json + + - name: Upload Results + # Upload benchmark_results.json to your dashboard system + +Advanced Usage +-------------- + +Multiple Model Configurations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can benchmark multiple model configurations in a single run: + +.. code-block:: yaml + + model_params: + - name: "small_models" + matrix_shapes: + - name: "pow2" + min_power: 10 + max_power: 12 + model_type: "linear" + device: "cuda" + + - name: "transformer_models" + matrix_shapes: + - name: "llama" + model_type: "transformer_block" + device: "cuda" + + - name: "cpu_models" + matrix_shapes: + - name: "custom" + shapes: [[512, 512, 512]] + model_type: "linear" + device: "cpu" + +Running Tests +~~~~~~~~~~~~~ + +To verify your setup and run the test suite: + +.. code-block:: bash + + python -m unittest discover benchmarks/microbenchmarks/test + +Interpreting Results +~~~~~~~~~~~~~~~~~~~~ + +The benchmark results include: + +- **Speedup**: Performance improvement compared to baseline (bfloat16) +- **Memory Usage**: Peak memory consumption during inference +- **Latency**: Time taken for inference operations +- **Profiling Data**: Detailed performance traces (when enabled) + +Results are saved in CSV format with columns for: + +- Model configuration +- Quantization method +- Shape dimensions (M, K, N) +- Performance metrics +- Device information + +Troubleshooting +--------------- + +Common Issues +~~~~~~~~~~~~~ + +1. **CUDA Out of Memory**: Reduce batch size or matrix dimensions +2. **Compilation Errors**: Set ``use_torch_compile: false`` for debugging +3. 
**Missing Quantization Methods**: Ensure TorchAO is properly installed +4. **Device Not Available**: Check device availability and drivers + +Best Practices +~~~~~~~~~~~~~~ + +1. Always include a baseline configuration for comparison +2. Use ``small_sweep`` for initial testing, ``sweep`` for comprehensive analysis +3. Enable profiling only when needed (adds overhead) +4. Test on multiple devices when possible +5. Use consistent naming conventions for reproducibility + +For more detailed information about the framework components, see the README files in the ``benchmarks/microbenchmarks/`` directory. From a6a2ae0f959aabd30889b89f0900a9bf536a2aee Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 09:56:24 -0700 Subject: [PATCH 02/11] update tutorial --- docs/source/microbenchmarking.rst | 230 ++++-------------------------- 1 file changed, 29 insertions(+), 201 deletions(-) diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst index 3f5702abdb..90c26ecd85 100644 --- a/docs/source/microbenchmarking.rst +++ b/docs/source/microbenchmarking.rst @@ -11,89 +11,39 @@ This tutorial will guide you through using the TorchAO microbenchmarking framewo 1. Add an API to Benchmarking Recipes -------------------------------------- -To add a new quantization API to the benchmarking system, you need to ensure your quantization method is available in the TorchAO quantization recipes. +The framework currently supports quantization and sparsity recipes, which can be run using the quantize_() or sparsity_() functions: -1.1 Supported Quantization Methods -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To add a new recipe, add the corresponding string configuration to the function ``string_to_config()`` in ``benchmarks/microbenchmarks/utils.py``. -The framework currently supports these quantization types: - -- ``baseline``: No quantization (bfloat16 reference) -- ``int8wo``: 8-bit weight-only quantization -- ``int8dq``: 8-bit dynamic quantization -- ``int4wo-{group_size}``: 4-bit weight-only quantization with specified group size -- ``int4wo-{group_size}-hqq``: 4-bit weight-only quantization with HQQ -- ``float8wo``: Float8 weight-only quantization -- ``float8dq-tensor``: Float8 dynamic quantization (tensor-wise) -- ``float8dq-row``: Float8 dynamic quantization (row-wise) -- ``gemlitewo-{bit_width}-{group_size}``: 4 or 8 bit integer quantization with gemlite triton kernel - -1.2 Adding a New Quantization Recipe -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To add a new quantization method: - -1. **Implement your quantization function** in the appropriate TorchAO module (e.g., ``torchao/quantization/``) - -2. **Add the recipe to the quantization system** by ensuring it can be called with the same interface as existing methods - -3. **Test your quantization method** with a simple benchmark configuration: +.. code-block:: python -.. code-block:: yaml + def string_to_config( + quantization: Optional[str], sparsity: Optional[str], **kwargs + ) -> AOBaseConfig: - # test_my_quantization.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "baseline" - - "my_new_quantization" # Your new method + # ... existing code ... 
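    # (Descriptive note: the elided code above is where the framework's built-in
    # recipe strings -- e.g. "int8wo", "int8dq", "float8wo", "float8dq-tensor",
    # "float8dq-row" -- are mapped to their corresponding AOBaseConfig instances.)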
- output_dir: "test_results" + elif quantization == "my_new_quantization": + # If additional information needs to be passed as kwargs, process it here + return MyNewQuantizationConfig(**kwargs) + elif sparsity == "my_new_sparsity": + return MyNewSparsityConfig(**kwargs) - model_params: - - name: "test_linear" - matrix_shapes: - - name: "custom" - shapes: [[1024, 1024, 1024]] - high_precision_dtype: "torch.bfloat16" - use_torch_compile: false - device: "cuda" - model_type: "linear" + # ... rest of existing code ... -4. **Verify the integration** by running: +Now we can use this recipe throughout the benchmarking framework. -.. code-block:: bash +.. note:: - python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_quantization.yml + If the ``AOBaseConfig`` uses input parameters, like bit-width, group-size etc, you can pass them appended to the string config in input + For example, for ``GemliteUIntXWeightOnlyConfig`` we can pass it-width and group-size as ``gemlitewo--`` 2. Add a Model to Benchmarking Recipes --------------------------------------- To add a new model architecture to the benchmarking system, you need to modify ``torchao/testing/model_architectures.py``. -2.1 Current Model Types -~~~~~~~~~~~~~~~~~~~~~~~ - -The framework supports these model types: - -- ``linear``: Simple linear layer (``ToyLinearModel``) -- ``ln_linear_``: LayerNorm + Linear + Activation (``LNLinearActivationModel``) - - - ``ln_linear_sigmoid``: LayerNorm + Linear + Sigmoid - - ``ln_linear_relu``: LayerNorm + Linear + ReLU - - ``ln_linear_gelu``: LayerNorm + Linear + GELU - - ``ln_linear_silu``: LayerNorm + Linear + SiLU - - ``ln_linear_leakyrelu``: LayerNorm + Linear + LeakyReLU - - ``ln_linear_relu6``: LayerNorm + Linear + ReLU6 - - ``ln_linear_hardswish``: LayerNorm + Linear + Hardswish - -- ``transformer_block``: Transformer block with self-attention and MLP (``TransformerBlock``) - -2.2 Adding a New Model Architecture -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To add a new model type: - -1. **Define your model class** in ``torchao/testing/model_architectures.py``: +1. To add a new model type, define your model class in ``torchao/testing/model_architectures.py``: .. code-block:: python @@ -111,7 +61,7 @@ To add a new model type: x = self.layer2(x) return x -2. **Update the** ``create_model_and_input_data`` **function** to handle your new model type: +2. Update the ``create_model_and_input_data`` function to handle your new model type: .. code-block:: python @@ -132,36 +82,7 @@ To add a new model type: # ... rest of existing code ... -3. **Test your new model** with a benchmark configuration: - -.. code-block:: yaml - - # test_my_model.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "baseline" - - "int8wo" - - output_dir: "test_results" - - model_params: - - name: "test_my_custom_model" - matrix_shapes: - - name: "custom" - shapes: [[1024, 1024, 1024]] - high_precision_dtype: "torch.bfloat16" - use_torch_compile: false - device: "cuda" - model_type: "my_custom_model" # Your new model type - -4. **Verify the integration**: - -.. 
code-block:: bash - - python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_model.yml - -2.3 Model Design Considerations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +**Model Design Considerations** When adding new models: @@ -194,19 +115,26 @@ Create a minimal configuration for local testing: quantization_config_recipe_names: - "baseline" - "int8wo" + # Add your recipe here - output_dir: "local_results" + output_dir: "local_results" # Add your output directory here model_params: + # Add your model configurations here - name: "quick_test" matrix_shapes: + # Define a custom shape, or use one of the predefined shape generators - name: "custom" shapes: [[1024, 1024, 1024]] high_precision_dtype: "torch.bfloat16" - use_torch_compile: false # Disable for faster iteration + use_torch_compile: true device: "cuda" model_type: "linear" +.. note:: + - For a list of latest supported config recipes for quantization or sparsity, please refer to ``benchmarks/microbenchmarks/README.md``. + - For a list of all model types, please refer to ``torchao/testing/model_architectures.py``. + 3.2 Run Local Benchmark ~~~~~~~~~~~~~~~~~~~~~~~ @@ -214,106 +142,6 @@ Create a minimal configuration for local testing: python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml -3.3 Shape Generation Options -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can use different shape generation strategies: - -**Custom Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "custom" - shapes: [ - [1024, 1024, 1024], # [m, k, n] - [2048, 4096, 1024] - ] - -**LLaMa Model Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "llama" # Uses LLaMa 2 70B single-node weight shapes - -**Power of 2 Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "pow2" - min_power: 10 # 2^10 = 1024 - max_power: 12 # 2^12 = 4096 - -**Extended Power of 2 Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "pow2_extended" - min_power: 10 # Generates: 1024, 1536, 2048, 3072, etc. - max_power: 11 - -**Small Sweep (for heatmaps):** - -.. code-block:: yaml - - matrix_shapes: - - name: "small_sweep" - min_power: 10 - max_power: 15 - -**Full Sweep:** - -.. code-block:: yaml - - matrix_shapes: - - name: "sweep" - min_power: 8 - max_power: 9 - -3.4 Enable Profiling for Debugging -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -For detailed performance analysis, enable profiling: - -.. code-block:: yaml - - model_params: - - name: "debug_model" - # ... other parameters ... - enable_profiler: true # Enable standard profiling - enable_memory_profiler: true # Enable CUDA memory profiling - -This will generate: - -- Standard PyTorch profiler traces -- CUDA memory snapshots and visualizations -- Memory usage analysis in the ``memory_profiler`` subdirectory - -3.5 Device Options -~~~~~~~~~~~~~~~~~~ - -Test on different devices: - -.. code-block:: yaml - - device: "cuda" # NVIDIA GPU - # device: "xpu" # Intel GPU - # device: "mps" # Apple Silicon GPU - # device: "cpu" # CPU fallback - -3.6 Compilation Options -~~~~~~~~~~~~~~~~~~~~~~ - -Control PyTorch compilation for performance tuning: - -.. code-block:: yaml - - use_torch_compile: true - torch_compile_mode: "max-autotune" # Options: "default", "max-autotune", "false" - 4. 
Add an API to Benchmarking CI Dashboard ------------------------------------------ From cde732cae537f93f0f77dc99c8679184670d3d22 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 10:07:11 -0700 Subject: [PATCH 03/11] update tutorial --- docs/source/microbenchmarking.rst | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst index 90c26ecd85..9fcf48dda5 100644 --- a/docs/source/microbenchmarking.rst +++ b/docs/source/microbenchmarking.rst @@ -142,10 +142,19 @@ Create a minimal configuration for local testing: python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml +3.3 Analysing the Output +~~~~~~~~~~~~~~~~~~~~~~~~ + +The output generated after running the benchmarking script, is the form of a csv. It'll contain the following: + - time for inference for running baseline model and quantized model + - speedup in inference time in quantized model + - compile or eager mode + - if enabled, memory snapshot and gpu chrome trace + 4. Add an API to Benchmarking CI Dashboard ------------------------------------------ -To integrate your API with the continuous integration dashboard: +To integrate your API with the CI `dashboard `_: 4.1 Modify Existing CI Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From 9f994f1b46fd740bbd103059440a0406fec6913b Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 14:49:03 -0700 Subject: [PATCH 04/11] update tutorial --- docs/source/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/index.rst b/docs/source/index.rst index aac72590fd..60355f761b 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -21,6 +21,7 @@ for an overall introduction to the library and recent highlight and updates. quantization sparsity contributor_guide + microbenchmarking .. toctree:: :glob: From ed6f659806e009d8ad20ebd1f7c94216debbe90d Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 15:20:35 -0700 Subject: [PATCH 05/11] updates --- docs/source/microbenchmarking.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst index 9fcf48dda5..7849a345e9 100644 --- a/docs/source/microbenchmarking.rst +++ b/docs/source/microbenchmarking.rst @@ -3,10 +3,10 @@ Microbenchmarking Tutorial This tutorial will guide you through using the TorchAO microbenchmarking framework. The tutorial contains different use cases for benchmarking your API and integrating with the dashboard. -1. Add an API to benchmarking recipes -2. Add a model to benchmarking recipes -3. Benchmark your API locally -4. Add an API to benchmarking CI dashboard +1. :ref:`Add an API to benchmarking recipes` +2. :ref:`Add a model to benchmarking recipes` +3. :ref:`Benchmark your API locally` +4. :ref:`Add an API to benchmarking CI dashboard` 1. 
Add an API to Benchmarking Recipes -------------------------------------- From 41b99865ece08e95e49927dfcd5754c50079b1c6 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 9 Jul 2025 11:32:15 -0700 Subject: [PATCH 06/11] update tutorial to .md --- docs/source/microbenchmarking.rst | 330 ------------------------------ 1 file changed, 330 deletions(-) delete mode 100644 docs/source/microbenchmarking.rst diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst deleted file mode 100644 index 7849a345e9..0000000000 --- a/docs/source/microbenchmarking.rst +++ /dev/null @@ -1,330 +0,0 @@ -Microbenchmarking Tutorial -========================== - -This tutorial will guide you through using the TorchAO microbenchmarking framework. The tutorial contains different use cases for benchmarking your API and integrating with the dashboard. - -1. :ref:`Add an API to benchmarking recipes` -2. :ref:`Add a model to benchmarking recipes` -3. :ref:`Benchmark your API locally` -4. :ref:`Add an API to benchmarking CI dashboard` - -1. Add an API to Benchmarking Recipes --------------------------------------- - -The framework currently supports quantization and sparsity recipes, which can be run using the quantize_() or sparsity_() functions: - -To add a new recipe, add the corresponding string configuration to the function ``string_to_config()`` in ``benchmarks/microbenchmarks/utils.py``. - -.. code-block:: python - - def string_to_config( - quantization: Optional[str], sparsity: Optional[str], **kwargs - ) -> AOBaseConfig: - - # ... existing code ... - - elif quantization == "my_new_quantization": - # If additional information needs to be passed as kwargs, process it here - return MyNewQuantizationConfig(**kwargs) - elif sparsity == "my_new_sparsity": - return MyNewSparsityConfig(**kwargs) - - # ... rest of existing code ... - -Now we can use this recipe throughout the benchmarking framework. - -.. note:: - - If the ``AOBaseConfig`` uses input parameters, like bit-width, group-size etc, you can pass them appended to the string config in input - For example, for ``GemliteUIntXWeightOnlyConfig`` we can pass it-width and group-size as ``gemlitewo--`` - -2. Add a Model to Benchmarking Recipes ---------------------------------------- - -To add a new model architecture to the benchmarking system, you need to modify ``torchao/testing/model_architectures.py``. - -1. To add a new model type, define your model class in ``torchao/testing/model_architectures.py``: - -.. code-block:: python - - class MyCustomModel(torch.nn.Module): - def __init__(self, input_dim, output_dim, dtype=torch.bfloat16): - super().__init__() - # Define your model architecture - self.layer1 = torch.nn.Linear(input_dim, 512, bias=False).to(dtype) - self.activation = torch.nn.ReLU() - self.layer2 = torch.nn.Linear(512, output_dim, bias=False).to(dtype) - - def forward(self, x): - x = self.layer1(x) - x = self.activation(x) - x = self.layer2(x) - return x - -2. Update the ``create_model_and_input_data`` function to handle your new model type: - -.. code-block:: python - - def create_model_and_input_data( - model_type: str, - m: int, - k: int, - n: int, - high_precision_dtype: torch.dtype = torch.bfloat16, - device: str = "cuda", - activation: str = "relu", - ): - # ... existing code ... - - elif model_type == "my_custom_model": - model = MyCustomModel(k, n, high_precision_dtype).to(device) - input_data = torch.randn(m, k, device=device, dtype=high_precision_dtype) - - # ... rest of existing code ... 
- -**Model Design Considerations** - -When adding new models: - -- **Input/Output Dimensions**: Ensure your model handles the (m, k, n) dimension convention where: - - - ``m``: Batch size or sequence length - - ``k``: Input feature dimension - - ``n``: Output feature dimension - -- **Data Types**: Support the ``high_precision_dtype`` parameter (typically ``torch.bfloat16``) - -- **Device Compatibility**: Ensure your model works on CUDA, CPU, and other target devices - -- **Quantization Compatibility**: Design your model to work with TorchAO quantization methods - -3. Benchmark Your API Locally ------------------------------- - -For local development and testing: - -3.1 Quick Start -~~~~~~~~~~~~~~~ - -Create a minimal configuration for local testing: - -.. code-block:: yaml - - # local_test.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "baseline" - - "int8wo" - # Add your recipe here - - output_dir: "local_results" # Add your output directory here - - model_params: - # Add your model configurations here - - name: "quick_test" - matrix_shapes: - # Define a custom shape, or use one of the predefined shape generators - - name: "custom" - shapes: [[1024, 1024, 1024]] - high_precision_dtype: "torch.bfloat16" - use_torch_compile: true - device: "cuda" - model_type: "linear" - -.. note:: - - For a list of latest supported config recipes for quantization or sparsity, please refer to ``benchmarks/microbenchmarks/README.md``. - - For a list of all model types, please refer to ``torchao/testing/model_architectures.py``. - -3.2 Run Local Benchmark -~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml - -3.3 Analysing the Output -~~~~~~~~~~~~~~~~~~~~~~~~ - -The output generated after running the benchmarking script, is the form of a csv. It'll contain the following: - - time for inference for running baseline model and quantized model - - speedup in inference time in quantized model - - compile or eager mode - - if enabled, memory snapshot and gpu chrome trace - -4. Add an API to Benchmarking CI Dashboard ------------------------------------------- - -To integrate your API with the CI `dashboard `_: - -4.1 Modify Existing CI Configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Add your quantization method to the existing CI configuration file at ``benchmarks/dashboard/microbenchmark_quantization_config.yml``: - -.. code-block:: yaml - - # benchmarks/dashboard/microbenchmark_quantization_config.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "int8wo" - - "int8dq" - - "float8dq-tensor" - - "float8dq-row" - - "float8wo" - - "my_new_quantization" # Add your method here - - output_dir: "benchmarks/microbenchmarks/results" - - model_params: - - name: "small_bf16_linear" - matrix_shapes: - - name: "small_sweep" - min_power: 10 - max_power: 15 - high_precision_dtype: "torch.bfloat16" - use_torch_compile: true - torch_compile_mode: "max-autotune" - device: "cuda" - model_type: "linear" - -4.2 Run CI Benchmarks -~~~~~~~~~~~~~~~~~~~~~ - -Use the CI runner to generate results in PyTorch OSS benchmark database format: - -.. code-block:: bash - - python benchmarks/dashboard/ci_microbenchmark_runner.py \ - --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ - --output benchmark_results.json - -4.3 CI Output Format -~~~~~~~~~~~~~~~~~~~~ - -The CI runner outputs results in a specific JSON format required by the PyTorch OSS benchmark database: - -.. 
code-block:: json - - [ - { - "benchmark": { - "name": "micro-benchmark api", - "mode": "inference", - "dtype": "int8wo", - "extra_info": { - "device": "cuda", - "arch": "NVIDIA A100-SXM4-80GB" - } - }, - "model": { - "name": "1024-1024-1024", - "type": "micro-benchmark custom layer", - "origins": ["torchao"] - }, - "metric": { - "name": "speedup(wrt bf16)", - "benchmark_values": [1.25], - "target_value": 0.0 - }, - "runners": [], - "dependencies": {} - } - ] - -4.4 Integration with CI Pipeline -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To integrate with your CI pipeline, add the benchmark step to your workflow: - -.. code-block:: yaml - - # Example GitHub Actions step - - name: Run Microbenchmarks - run: | - python benchmarks/dashboard/ci_microbenchmark_runner.py \ - --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ - --output benchmark_results.json - - - name: Upload Results - # Upload benchmark_results.json to your dashboard system - -Advanced Usage --------------- - -Multiple Model Configurations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can benchmark multiple model configurations in a single run: - -.. code-block:: yaml - - model_params: - - name: "small_models" - matrix_shapes: - - name: "pow2" - min_power: 10 - max_power: 12 - model_type: "linear" - device: "cuda" - - - name: "transformer_models" - matrix_shapes: - - name: "llama" - model_type: "transformer_block" - device: "cuda" - - - name: "cpu_models" - matrix_shapes: - - name: "custom" - shapes: [[512, 512, 512]] - model_type: "linear" - device: "cpu" - -Running Tests -~~~~~~~~~~~~~ - -To verify your setup and run the test suite: - -.. code-block:: bash - - python -m unittest discover benchmarks/microbenchmarks/test - -Interpreting Results -~~~~~~~~~~~~~~~~~~~~ - -The benchmark results include: - -- **Speedup**: Performance improvement compared to baseline (bfloat16) -- **Memory Usage**: Peak memory consumption during inference -- **Latency**: Time taken for inference operations -- **Profiling Data**: Detailed performance traces (when enabled) - -Results are saved in CSV format with columns for: - -- Model configuration -- Quantization method -- Shape dimensions (M, K, N) -- Performance metrics -- Device information - -Troubleshooting ---------------- - -Common Issues -~~~~~~~~~~~~~ - -1. **CUDA Out of Memory**: Reduce batch size or matrix dimensions -2. **Compilation Errors**: Set ``use_torch_compile: false`` for debugging -3. **Missing Quantization Methods**: Ensure TorchAO is properly installed -4. **Device Not Available**: Check device availability and drivers - -Best Practices -~~~~~~~~~~~~~~ - -1. Always include a baseline configuration for comparison -2. Use ``small_sweep`` for initial testing, ``sweep`` for comprehensive analysis -3. Enable profiling only when needed (adds overhead) -4. Test on multiple devices when possible -5. Use consistent naming conventions for reproducibility - -For more detailed information about the framework components, see the README files in the ``benchmarks/microbenchmarks/`` directory. 
From b8564ca08680137603eac072f14f2e6107999da5 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 9 Jul 2025 12:06:04 -0700 Subject: [PATCH 07/11] update ondex.rst and tutorials --- docs/source/benchmarking_overview.md | 215 +++++++++++++++++++++++++++ docs/source/benchmarking_user_faq.md | 133 +++++++++++++++++ docs/source/index.rst | 3 +- 3 files changed, 350 insertions(+), 1 deletion(-) create mode 100644 docs/source/benchmarking_overview.md create mode 100644 docs/source/benchmarking_user_faq.md diff --git a/docs/source/benchmarking_overview.md b/docs/source/benchmarking_overview.md new file mode 100644 index 0000000000..fc415e297f --- /dev/null +++ b/docs/source/benchmarking_overview.md @@ -0,0 +1,215 @@ +# Benchmarking Overview + +This tutorial will guide you through using the TorchAO benchmarking framework. The tutorial contains integrating new APIs with the framework and dashboard. + +1. [Add an API to benchmarking recipes](#add-an-api-to-benchmarking-recipes) +2. [Add a model architecture for benchmarking recipes](#add-a-model-to-benchmarking-recipes) +3. [Add an HF model to benchmarking recipes](#add-an-hf-model-to-benchmarking-recipes) +4. [Add an API to micro-benchmarking CI dashboard](#add-an-api-to-benchmarking-ci-dashboard) + +## Add an API to Benchmarking Recipes + +The framework currently supports quantization and sparsity recipes, which can be run using the quantize_() or sparsity_() functions: + +To add a new recipe, add the corresponding string configuration to the function `string_to_config()` in `benchmarks/microbenchmarks/utils.py`. + +```python +def string_to_config( + quantization: Optional[str], sparsity: Optional[str], **kwargs +) -> AOBaseConfig: + +# ... existing code ... + +elif quantization == "my_new_quantization": + # If additional information needs to be passed as kwargs, process it here + return MyNewQuantizationConfig(**kwargs) +elif sparsity == "my_new_sparsity": + return MyNewSparsityConfig(**kwargs) + +# ... rest of existing code ... +``` + +Now we can use this recipe throughout the benchmarking framework. + +> **Note:** If the `AOBaseConfig` uses input parameters, like bit-width, group-size etc, you can pass them appended to the string config in input. For example, for `GemliteUIntXWeightOnlyConfig` we can pass bit-width and group-size as `gemlitewo--` + +## Add a Model to Benchmarking Recipes + +To add a new model architecture to the benchmarking system, you need to modify `torchao/testing/model_architectures.py`. + +1. To add a new model type, define your model class in `torchao/testing/model_architectures.py`: + +```python +class MyCustomModel(torch.nn.Module): + def __init__(self, input_dim, output_dim, dtype=torch.bfloat16): + super().__init__() + # Define your model architecture + self.layer1 = torch.nn.Linear(input_dim, 512, bias=False).to(dtype) + self.activation = torch.nn.ReLU() + self.layer2 = torch.nn.Linear(512, output_dim, bias=False).to(dtype) + + def forward(self, x): + x = self.layer1(x) + x = self.activation(x) + x = self.layer2(x) + return x +``` + +2. Update the `create_model_and_input_data` function to handle your new model type: + +```python +def create_model_and_input_data( + model_type: str, + m: int, + k: int, + n: int, + high_precision_dtype: torch.dtype = torch.bfloat16, + device: str = "cuda", + activation: str = "relu", +): + # ... existing code ... 
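    # (Descriptive note: the elided branches above build the built-in model types --
    # "linear", the "ln_linear_<activation>" variants and "transformer_block" --
    # along with matching random input data; a new branch is added alongside them.)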
+ + elif model_type == "my_custom_model": + model = MyCustomModel(k, n, high_precision_dtype).to(device) + input_data = torch.randn(m, k, device=device, dtype=high_precision_dtype) + + # ... rest of existing code ... +``` + +### Model Design Considerations + +When adding new models: + +- **Input/Output Dimensions**: Ensure your model handles the (m, k, n) dimension convention where: + - `m`: Batch size or sequence length + - `k`: Input feature dimension + - `n`: Output feature dimension + +- **Data Types**: Support the `high_precision_dtype` parameter (typically `torch.bfloat16`) + +- **Device Compatibility**: Ensure your model works on CUDA, CPU, and other target devices + +- **Quantization Compatibility**: Design your model to work with TorchAO quantization methods + +## Add an HF model to benchmarking recipes +(Coming soon!!!) + +## Add an API to Benchmarking CI Dashboard + +To integrate your API with the CI [dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fao&benchmarkName=micro-benchmark+api): + +### 1. Modify Existing CI Configuration + +Add your quantization method to the existing CI configuration file at `benchmarks/dashboard/microbenchmark_quantization_config.yml`: + +```yaml +# benchmarks/dashboard/microbenchmark_quantization_config.yml +benchmark_mode: "inference" +quantization_config_recipe_names: + - "int8wo" + - "int8dq" + - "float8dq-tensor" + - "float8dq-row" + - "float8wo" + - "my_new_quantization" # Add your method here + +output_dir: "benchmarks/microbenchmarks/results" + +model_params: + - name: "small_bf16_linear" + matrix_shapes: + - name: "small_sweep" + min_power: 10 + max_power: 15 + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" +``` + +### 2. Run CI Benchmarks + +Use the CI runner to generate results in PyTorch OSS benchmark database format: + +```bash +python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json +``` + +### 3. CI Output Format + +The CI runner outputs results in a specific JSON format required by the PyTorch OSS benchmark database: + +```json +[ + { + "benchmark": { + "name": "micro-benchmark api", + "mode": "inference", + "dtype": "int8wo", + "extra_info": { + "device": "cuda", + "arch": "NVIDIA A100-SXM4-80GB" + } + }, + "model": { + "name": "1024-1024-1024", + "type": "micro-benchmark custom layer", + "origins": ["torchao"] + }, + "metric": { + "name": "speedup(wrt bf16)", + "benchmark_values": [1.25], + "target_value": 0.0 + }, + "runners": [], + "dependencies": {} + } +] +``` + +### 4. Integration with CI Pipeline + +To integrate with your CI pipeline, add the benchmark step to your workflow: + +```yaml +# Example GitHub Actions step +- name: Run Microbenchmarks + run: | + python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json + +- name: Upload Results + # Upload benchmark_results.json to your dashboard system +``` + +## Troubleshooting + +### Running Tests + +To verify your setup and run the test suite: + +```bash +python -m unittest discover benchmarks/microbenchmarks/test +``` + +### Common Issues + +1. **CUDA Out of Memory**: Reduce batch size or matrix dimensions +2. **Compilation Errors**: Set `use_torch_compile: false` for debugging +3. **Missing Quantization Methods**: Ensure TorchAO is properly installed +4. 
**Device Not Available**: Check device availability and drivers + +### Best Practices + +1. Use `small_sweep` for basic testing, `custom shapes` for comprehensive or model specific analysis +2. Enable profiling only when needed (adds overhead) +3. Test on multiple devices when possible +4. Use consistent naming conventions for reproducibility + +For information on different use-cases for benchmarking, refer to [Benchmarking Use-Case FAQs](benchmarking_user_faq.md) + +For more detailed information about the framework components, see the README files in the `benchmarks/microbenchmarks/` directory. diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md new file mode 100644 index 0000000000..c862ca6c0a --- /dev/null +++ b/docs/source/benchmarking_user_faq.md @@ -0,0 +1,133 @@ +# Benchmarking Use-Case FAQs + +This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) + +## Run the benchmarking on your PR + +### 1. Add label to your PR +To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: + +1. Go to your pull request on GitHub. +2. On the right sidebar, find the "Labels" section. +3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. + +Adding this label will automatically trigger the benchmarking CI workflow for your pull request. + +### 2. Manually trigger benchmarking workflow on your github branch +To manually trigger the benchmarking workflow for your branch, follow these steps: + +1. Navigate to the "Actions" tab in your GitHub repository. +2. Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. +3. Click on the "Run workflow" button. +4. In the dropdown menu, select the branch. +5. Click the "Run workflow" button to start the benchmarking process. + +This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. + +## Benchmark Your API Locally + +For local development and testing: + +### 1. Quick Start + +Create a minimal configuration for local testing: + +```yaml +# local_test.yml +benchmark_mode: "inference" +quantization_config_recipe_names: + - "baseline" + - "int8wo" + # Add your recipe here + +output_dir: "local_results" # Add your output directory here + +model_params: + # Add your model configurations here + - name: "quick_test" + matrix_shapes: + # Define a custom shape, or use one of the predefined shape generators + - name: "custom" + shapes: [[1024, 1024, 1024]] + - name: "small_sweep" + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" + enable_profiler: true # Enable profiling for this model + enable_memory_profiler: true # Enable memory profiling for this model +``` + +> **Note:** +> - For a list of latest supported config recipes for quantization or sparsity, please refer to `benchmarks/microbenchmarks/README.md`. +> - For a list of all model types, please refer to `torchao/testing/model_architectures.py`. + +### 2. Run Local Benchmark + +```bash +python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml +``` + +### 3. 
Analysing the Output + +The output generated after running the benchmarking script, is the form of a csv. It'll contain some of the following: + - time for inference for running baseline model and quantized model + - speedup in inference time in quantized model + - compile or eager mode + - if enabled, memory snapshot and gpu chrome trace + + +## Advanced Usage + +### Multiple Model Configurations + +You can benchmark multiple model configurations in a single run: + +```yaml +model_params: + - name: "small_models" + matrix_shapes: + - name: "pow2" + min_power: 10 + max_power: 12 + model_type: "linear" + device: "cuda" + + - name: "transformer_models" + matrix_shapes: + - name: "llama" + model_type: "transformer_block" + device: "cuda" + + - name: "cpu_models" + matrix_shapes: + - name: "custom" + shapes: [[512, 512, 512]] + model_type: "linear" + device: "cpu" +``` + +### Interpreting Results + +The benchmark results include: + +- **Speedup**: Performance improvement compared to baseline (bfloat16) +- **Memory Usage**: Peak memory consumption during inference +- **Latency**: Time taken for inference operations +- **Profiling Data**: Detailed performance traces (when enabled) + +Results are saved in CSV format with columns for: + +- Model configuration +- Quantization method +- Shape dimensions (M, K, N) +- Performance metrics +- Memory metrics +- Device information + +### Best Practices + +1. Use `small_sweep` for initial testing, `sweep` for comprehensive analysis +2. Enable profiling only when needed (adds overhead) +3. Test on multiple devices when possible diff --git a/docs/source/index.rst b/docs/source/index.rst index 60355f761b..9c88143718 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -21,7 +21,8 @@ for an overall introduction to the library and recent highlight and updates. quantization sparsity contributor_guide - microbenchmarking + benchmarking_overview + benchmarking_user_faq .. toctree:: :glob: From 98eef307f8e0c33bbe7ab1627432f42767b59afe Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 9 Jul 2025 12:15:21 -0700 Subject: [PATCH 08/11] fix formatting --- docs/source/benchmarking_user_faq.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index c862ca6c0a..a0b2cd7486 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -2,7 +2,13 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) -## Run the benchmarking on your PR +## Table of Contents +- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) +- [Benchmark Your API Locally](#benchmark-your-api-locally) +- [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) +- [Advanced Usage](#advanced-usage) + +## Run the performance benchmarking on your PR ### 1. Add label to your PR To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: @@ -78,6 +84,9 @@ The output generated after running the benchmarking script, is the form of a csv - if enabled, memory snapshot and gpu chrome trace +## Generate evaluation metrics for your quantized model +(Coming soon!!!) 
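As a follow-up to the CSV output described under "Analysing the Output" above, the results file can be inspected with ordinary tooling. The snippet below is only a sketch: the file name and the `quantization` / `speedup` column names are hypothetical placeholders, so check the header row of the CSV actually written to your `output_dir`.

```python
import pandas as pd

# Hypothetical path and column names; adjust to the CSV written to your output_dir
df = pd.read_csv("local_results/results.csv")

# Inspect the raw rows, then the best speedup seen for each quantization recipe
print(df.head())
print(df.groupby("quantization")["speedup"].max())
```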
+ ## Advanced Usage ### Multiple Model Configurations From 8c05b7f0e7ee69fcc2785d6ac39f3e80950ff603 Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Wed, 9 Jul 2025 15:54:27 -0700 Subject: [PATCH 09/11] Remove second tutorial --- docs/source/benchmarking_user_faq.md | 139 +-------------------------- 1 file changed, 1 insertion(+), 138 deletions(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index a0b2cd7486..3920bf257d 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -2,141 +2,4 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) -## Table of Contents -- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) -- [Benchmark Your API Locally](#benchmark-your-api-locally) -- [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) -- [Advanced Usage](#advanced-usage) - -## Run the performance benchmarking on your PR - -### 1. Add label to your PR -To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: - -1. Go to your pull request on GitHub. -2. On the right sidebar, find the "Labels" section. -3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. - -Adding this label will automatically trigger the benchmarking CI workflow for your pull request. - -### 2. Manually trigger benchmarking workflow on your github branch -To manually trigger the benchmarking workflow for your branch, follow these steps: - -1. Navigate to the "Actions" tab in your GitHub repository. -2. Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. -3. Click on the "Run workflow" button. -4. In the dropdown menu, select the branch. -5. Click the "Run workflow" button to start the benchmarking process. - -This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. - -## Benchmark Your API Locally - -For local development and testing: - -### 1. Quick Start - -Create a minimal configuration for local testing: - -```yaml -# local_test.yml -benchmark_mode: "inference" -quantization_config_recipe_names: - - "baseline" - - "int8wo" - # Add your recipe here - -output_dir: "local_results" # Add your output directory here - -model_params: - # Add your model configurations here - - name: "quick_test" - matrix_shapes: - # Define a custom shape, or use one of the predefined shape generators - - name: "custom" - shapes: [[1024, 1024, 1024]] - - name: "small_sweep" - high_precision_dtype: "torch.bfloat16" - use_torch_compile: true - torch_compile_mode: "max-autotune" - device: "cuda" - model_type: "linear" - enable_profiler: true # Enable profiling for this model - enable_memory_profiler: true # Enable memory profiling for this model -``` - -> **Note:** -> - For a list of latest supported config recipes for quantization or sparsity, please refer to `benchmarks/microbenchmarks/README.md`. -> - For a list of all model types, please refer to `torchao/testing/model_architectures.py`. - -### 2. Run Local Benchmark - -```bash -python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml -``` - -### 3. 
Analysing the Output - -The output generated after running the benchmarking script, is the form of a csv. It'll contain some of the following: - - time for inference for running baseline model and quantized model - - speedup in inference time in quantized model - - compile or eager mode - - if enabled, memory snapshot and gpu chrome trace - - -## Generate evaluation metrics for your quantized model -(Coming soon!!!) - -## Advanced Usage - -### Multiple Model Configurations - -You can benchmark multiple model configurations in a single run: - -```yaml -model_params: - - name: "small_models" - matrix_shapes: - - name: "pow2" - min_power: 10 - max_power: 12 - model_type: "linear" - device: "cuda" - - - name: "transformer_models" - matrix_shapes: - - name: "llama" - model_type: "transformer_block" - device: "cuda" - - - name: "cpu_models" - matrix_shapes: - - name: "custom" - shapes: [[512, 512, 512]] - model_type: "linear" - device: "cpu" -``` - -### Interpreting Results - -The benchmark results include: - -- **Speedup**: Performance improvement compared to baseline (bfloat16) -- **Memory Usage**: Peak memory consumption during inference -- **Latency**: Time taken for inference operations -- **Profiling Data**: Detailed performance traces (when enabled) - -Results are saved in CSV format with columns for: - -- Model configuration -- Quantization method -- Shape dimensions (M, K, N) -- Performance metrics -- Memory metrics -- Device information - -### Best Practices - -1. Use `small_sweep` for initial testing, `sweep` for comprehensive analysis -2. Enable profiling only when needed (adds overhead) -3. Test on multiple devices when possible +[Coming Soon !!!] From 2cd3aae3248317e5de7bde88c9f9026b6378d89e Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Wed, 9 Jul 2025 16:06:17 -0700 Subject: [PATCH 10/11] End user benchmarking tutorial --- docs/source/benchmarking_user_faq.md | 139 ++++++++++++++++++++++++++- 1 file changed, 138 insertions(+), 1 deletion(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index 3920bf257d..a0b2cd7486 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -2,4 +2,141 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) -[Coming Soon !!!] +## Table of Contents +- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) +- [Benchmark Your API Locally](#benchmark-your-api-locally) +- [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) +- [Advanced Usage](#advanced-usage) + +## Run the performance benchmarking on your PR + +### 1. Add label to your PR +To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: + +1. Go to your pull request on GitHub. +2. On the right sidebar, find the "Labels" section. +3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. + +Adding this label will automatically trigger the benchmarking CI workflow for your pull request. + +### 2. Manually trigger benchmarking workflow on your github branch +To manually trigger the benchmarking workflow for your branch, follow these steps: + +1. Navigate to the "Actions" tab in your GitHub repository. +2. 
Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. +3. Click on the "Run workflow" button. +4. In the dropdown menu, select the branch. +5. Click the "Run workflow" button to start the benchmarking process. + +This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. + +## Benchmark Your API Locally + +For local development and testing: + +### 1. Quick Start + +Create a minimal configuration for local testing: + +```yaml +# local_test.yml +benchmark_mode: "inference" +quantization_config_recipe_names: + - "baseline" + - "int8wo" + # Add your recipe here + +output_dir: "local_results" # Add your output directory here + +model_params: + # Add your model configurations here + - name: "quick_test" + matrix_shapes: + # Define a custom shape, or use one of the predefined shape generators + - name: "custom" + shapes: [[1024, 1024, 1024]] + - name: "small_sweep" + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" + enable_profiler: true # Enable profiling for this model + enable_memory_profiler: true # Enable memory profiling for this model +``` + +> **Note:** +> - For a list of latest supported config recipes for quantization or sparsity, please refer to `benchmarks/microbenchmarks/README.md`. +> - For a list of all model types, please refer to `torchao/testing/model_architectures.py`. + +### 2. Run Local Benchmark + +```bash +python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml +``` + +### 3. Analysing the Output + +The output generated after running the benchmarking script, is the form of a csv. It'll contain some of the following: + - time for inference for running baseline model and quantized model + - speedup in inference time in quantized model + - compile or eager mode + - if enabled, memory snapshot and gpu chrome trace + + +## Generate evaluation metrics for your quantized model +(Coming soon!!!) + +## Advanced Usage + +### Multiple Model Configurations + +You can benchmark multiple model configurations in a single run: + +```yaml +model_params: + - name: "small_models" + matrix_shapes: + - name: "pow2" + min_power: 10 + max_power: 12 + model_type: "linear" + device: "cuda" + + - name: "transformer_models" + matrix_shapes: + - name: "llama" + model_type: "transformer_block" + device: "cuda" + + - name: "cpu_models" + matrix_shapes: + - name: "custom" + shapes: [[512, 512, 512]] + model_type: "linear" + device: "cpu" +``` + +### Interpreting Results + +The benchmark results include: + +- **Speedup**: Performance improvement compared to baseline (bfloat16) +- **Memory Usage**: Peak memory consumption during inference +- **Latency**: Time taken for inference operations +- **Profiling Data**: Detailed performance traces (when enabled) + +Results are saved in CSV format with columns for: + +- Model configuration +- Quantization method +- Shape dimensions (M, K, N) +- Performance metrics +- Memory metrics +- Device information + +### Best Practices + +1. Use `small_sweep` for initial testing, `sweep` for comprehensive analysis +2. Enable profiling only when needed (adds overhead) +3. 
Test on multiple devices when possible From 1e6ca62753cf161cb1ad402af430de6a339c3d8f Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Wed, 9 Jul 2025 16:22:02 -0700 Subject: [PATCH 11/11] Update CI run instructions --- docs/source/benchmarking_user_faq.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index a0b2cd7486..e4f546de30 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -3,32 +3,34 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) ## Table of Contents -- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) +- [Run the performance benchmarking in CI](#run-the-performance-benchmarking-in-ci) - [Benchmark Your API Locally](#benchmark-your-api-locally) - [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) - [Advanced Usage](#advanced-usage) -## Run the performance benchmarking on your PR +## Run the performance benchmarking in CI -### 1. Add label to your PR -To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: +### 1. Run the performance benchmarking on every commit in PR -1. Go to your pull request on GitHub. -2. On the right sidebar, find the "Labels" section. -3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. +To trigger the benchmarking CI workflow on your pull request, add the `ciflow/benchmark` label: -Adding this label will automatically trigger the benchmarking CI workflow for your pull request. +1. Open your pull request on GitHub. +2. In the right sidebar, locate the "Labels" section. +3. Click "Labels" and select `ciflow/benchmark`. + +This will automatically run the benchmarking workflow for every commit in your PR. + +### 2. Run performance benchmarking on the last commit in a GitHub branch -### 2. Manually trigger benchmarking workflow on your github branch To manually trigger the benchmarking workflow for your branch, follow these steps: 1. Navigate to the "Actions" tab in your GitHub repository. 2. Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. 3. Click on the "Run workflow" button. -4. In the dropdown menu, select the branch. +4. In the dropdown menu, select the branch you want to benchmark. 5. Click the "Run workflow" button to start the benchmarking process. -This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. +This will execute the benchmarking workflow on the last commit of the specified branch, allowing you to evaluate the performance of your changes. ## Benchmark Your API Locally
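The CI-triggering steps above use the GitHub web UI. The same two actions can also be performed from the command line; the sketch below assumes the GitHub CLI (`gh`) is installed and authenticated, and the PR number and branch name are placeholders to replace with your own.

```bash
# Add the benchmarking label to a pull request (hypothetical PR number)
gh pr edit 123 --add-label "ciflow/benchmark"

# Manually dispatch the microbenchmark workflow on a branch (hypothetical branch name)
gh workflow run "Microbenchmarks-Perf-Nightly" --ref my-feature-branch
```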