From e7b20cc6098db5ba31b18a98cd3af5137dae02f0 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Thu, 3 Jul 2025 13:06:49 -0700 Subject: [PATCH 01/11] A dummy tutorial structure --- docs/source/microbenchmarking.rst | 493 ++++++++++++++++++++++++++++++ 1 file changed, 493 insertions(+) create mode 100644 docs/source/microbenchmarking.rst diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst new file mode 100644 index 0000000000..3f5702abdb --- /dev/null +++ b/docs/source/microbenchmarking.rst @@ -0,0 +1,493 @@ +Microbenchmarking Tutorial +========================== + +This tutorial will guide you through using the TorchAO microbenchmarking framework. The tutorial contains different use cases for benchmarking your API and integrating with the dashboard. + +1. Add an API to benchmarking recipes +2. Add a model to benchmarking recipes +3. Benchmark your API locally +4. Add an API to benchmarking CI dashboard + +1. Add an API to Benchmarking Recipes +-------------------------------------- + +To add a new quantization API to the benchmarking system, you need to ensure your quantization method is available in the TorchAO quantization recipes. + +1.1 Supported Quantization Methods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The framework currently supports these quantization types: + +- ``baseline``: No quantization (bfloat16 reference) +- ``int8wo``: 8-bit weight-only quantization +- ``int8dq``: 8-bit dynamic quantization +- ``int4wo-{group_size}``: 4-bit weight-only quantization with specified group size +- ``int4wo-{group_size}-hqq``: 4-bit weight-only quantization with HQQ +- ``float8wo``: Float8 weight-only quantization +- ``float8dq-tensor``: Float8 dynamic quantization (tensor-wise) +- ``float8dq-row``: Float8 dynamic quantization (row-wise) +- ``gemlitewo-{bit_width}-{group_size}``: 4 or 8 bit integer quantization with gemlite triton kernel + +1.2 Adding a New Quantization Recipe +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To add a new quantization method: + +1. **Implement your quantization function** in the appropriate TorchAO module (e.g., ``torchao/quantization/``) + +2. **Add the recipe to the quantization system** by ensuring it can be called with the same interface as existing methods + +3. **Test your quantization method** with a simple benchmark configuration: + +.. code-block:: yaml + + # test_my_quantization.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "baseline" + - "my_new_quantization" # Your new method + + output_dir: "test_results" + + model_params: + - name: "test_linear" + matrix_shapes: + - name: "custom" + shapes: [[1024, 1024, 1024]] + high_precision_dtype: "torch.bfloat16" + use_torch_compile: false + device: "cuda" + model_type: "linear" + +4. **Verify the integration** by running: + +.. code-block:: bash + + python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_quantization.yml + +2. Add a Model to Benchmarking Recipes +--------------------------------------- + +To add a new model architecture to the benchmarking system, you need to modify ``torchao/testing/model_architectures.py``. 
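Before moving on to model architectures, it may help to see what "the same interface as existing methods" in step 2 of section 1.2 means in practice: every recipe ultimately resolves to a config object that is applied to a module with ``quantize_()``. The snippet below is only an illustrative sketch; it assumes a recent TorchAO release that exposes ``quantize_`` and ``Int8WeightOnlyConfig``, and for a new recipe you would substitute your own config class.

.. code-block:: python

    import torch
    from torchao.quantization import quantize_, Int8WeightOnlyConfig

    # A toy bfloat16 linear module, similar to the "linear" model type used above
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16)

    # The benchmarking framework does the equivalent of this once it has mapped a
    # recipe string (e.g. "int8wo") to its config object
    quantize_(model, Int8WeightOnlyConfig())

    # The quantized module can then be timed like any other nn.Module
    x = torch.randn(16, 1024, dtype=torch.bfloat16)
    y = model(x)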
+ +2.1 Current Model Types +~~~~~~~~~~~~~~~~~~~~~~~ + +The framework supports these model types: + +- ``linear``: Simple linear layer (``ToyLinearModel``) +- ``ln_linear_``: LayerNorm + Linear + Activation (``LNLinearActivationModel``) + + - ``ln_linear_sigmoid``: LayerNorm + Linear + Sigmoid + - ``ln_linear_relu``: LayerNorm + Linear + ReLU + - ``ln_linear_gelu``: LayerNorm + Linear + GELU + - ``ln_linear_silu``: LayerNorm + Linear + SiLU + - ``ln_linear_leakyrelu``: LayerNorm + Linear + LeakyReLU + - ``ln_linear_relu6``: LayerNorm + Linear + ReLU6 + - ``ln_linear_hardswish``: LayerNorm + Linear + Hardswish + +- ``transformer_block``: Transformer block with self-attention and MLP (``TransformerBlock``) + +2.2 Adding a New Model Architecture +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To add a new model type: + +1. **Define your model class** in ``torchao/testing/model_architectures.py``: + +.. code-block:: python + + class MyCustomModel(torch.nn.Module): + def __init__(self, input_dim, output_dim, dtype=torch.bfloat16): + super().__init__() + # Define your model architecture + self.layer1 = torch.nn.Linear(input_dim, 512, bias=False).to(dtype) + self.activation = torch.nn.ReLU() + self.layer2 = torch.nn.Linear(512, output_dim, bias=False).to(dtype) + + def forward(self, x): + x = self.layer1(x) + x = self.activation(x) + x = self.layer2(x) + return x + +2. **Update the** ``create_model_and_input_data`` **function** to handle your new model type: + +.. code-block:: python + + def create_model_and_input_data( + model_type: str, + m: int, + k: int, + n: int, + high_precision_dtype: torch.dtype = torch.bfloat16, + device: str = "cuda", + activation: str = "relu", + ): + # ... existing code ... + + elif model_type == "my_custom_model": + model = MyCustomModel(k, n, high_precision_dtype).to(device) + input_data = torch.randn(m, k, device=device, dtype=high_precision_dtype) + + # ... rest of existing code ... + +3. **Test your new model** with a benchmark configuration: + +.. code-block:: yaml + + # test_my_model.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "baseline" + - "int8wo" + + output_dir: "test_results" + + model_params: + - name: "test_my_custom_model" + matrix_shapes: + - name: "custom" + shapes: [[1024, 1024, 1024]] + high_precision_dtype: "torch.bfloat16" + use_torch_compile: false + device: "cuda" + model_type: "my_custom_model" # Your new model type + +4. **Verify the integration**: + +.. code-block:: bash + + python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_model.yml + +2.3 Model Design Considerations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When adding new models: + +- **Input/Output Dimensions**: Ensure your model handles the (m, k, n) dimension convention where: + + - ``m``: Batch size or sequence length + - ``k``: Input feature dimension + - ``n``: Output feature dimension + +- **Data Types**: Support the ``high_precision_dtype`` parameter (typically ``torch.bfloat16``) + +- **Device Compatibility**: Ensure your model works on CUDA, CPU, and other target devices + +- **Quantization Compatibility**: Design your model to work with TorchAO quantization methods + +3. Benchmark Your API Locally +------------------------------ + +For local development and testing: + +3.1 Quick Start +~~~~~~~~~~~~~~~ + +Create a minimal configuration for local testing: + +.. 
code-block:: yaml + + # local_test.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "baseline" + - "int8wo" + + output_dir: "local_results" + + model_params: + - name: "quick_test" + matrix_shapes: + - name: "custom" + shapes: [[1024, 1024, 1024]] + high_precision_dtype: "torch.bfloat16" + use_torch_compile: false # Disable for faster iteration + device: "cuda" + model_type: "linear" + +3.2 Run Local Benchmark +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml + +3.3 Shape Generation Options +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can use different shape generation strategies: + +**Custom Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "custom" + shapes: [ + [1024, 1024, 1024], # [m, k, n] + [2048, 4096, 1024] + ] + +**LLaMa Model Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "llama" # Uses LLaMa 2 70B single-node weight shapes + +**Power of 2 Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "pow2" + min_power: 10 # 2^10 = 1024 + max_power: 12 # 2^12 = 4096 + +**Extended Power of 2 Shapes:** + +.. code-block:: yaml + + matrix_shapes: + - name: "pow2_extended" + min_power: 10 # Generates: 1024, 1536, 2048, 3072, etc. + max_power: 11 + +**Small Sweep (for heatmaps):** + +.. code-block:: yaml + + matrix_shapes: + - name: "small_sweep" + min_power: 10 + max_power: 15 + +**Full Sweep:** + +.. code-block:: yaml + + matrix_shapes: + - name: "sweep" + min_power: 8 + max_power: 9 + +3.4 Enable Profiling for Debugging +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For detailed performance analysis, enable profiling: + +.. code-block:: yaml + + model_params: + - name: "debug_model" + # ... other parameters ... + enable_profiler: true # Enable standard profiling + enable_memory_profiler: true # Enable CUDA memory profiling + +This will generate: + +- Standard PyTorch profiler traces +- CUDA memory snapshots and visualizations +- Memory usage analysis in the ``memory_profiler`` subdirectory + +3.5 Device Options +~~~~~~~~~~~~~~~~~~ + +Test on different devices: + +.. code-block:: yaml + + device: "cuda" # NVIDIA GPU + # device: "xpu" # Intel GPU + # device: "mps" # Apple Silicon GPU + # device: "cpu" # CPU fallback + +3.6 Compilation Options +~~~~~~~~~~~~~~~~~~~~~~ + +Control PyTorch compilation for performance tuning: + +.. code-block:: yaml + + use_torch_compile: true + torch_compile_mode: "max-autotune" # Options: "default", "max-autotune", "false" + +4. Add an API to Benchmarking CI Dashboard +------------------------------------------ + +To integrate your API with the continuous integration dashboard: + +4.1 Modify Existing CI Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Add your quantization method to the existing CI configuration file at ``benchmarks/dashboard/microbenchmark_quantization_config.yml``: + +.. 
code-block:: yaml + + # benchmarks/dashboard/microbenchmark_quantization_config.yml + benchmark_mode: "inference" + quantization_config_recipe_names: + - "int8wo" + - "int8dq" + - "float8dq-tensor" + - "float8dq-row" + - "float8wo" + - "my_new_quantization" # Add your method here + + output_dir: "benchmarks/microbenchmarks/results" + + model_params: + - name: "small_bf16_linear" + matrix_shapes: + - name: "small_sweep" + min_power: 10 + max_power: 15 + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" + +4.2 Run CI Benchmarks +~~~~~~~~~~~~~~~~~~~~~ + +Use the CI runner to generate results in PyTorch OSS benchmark database format: + +.. code-block:: bash + + python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json + +4.3 CI Output Format +~~~~~~~~~~~~~~~~~~~~ + +The CI runner outputs results in a specific JSON format required by the PyTorch OSS benchmark database: + +.. code-block:: json + + [ + { + "benchmark": { + "name": "micro-benchmark api", + "mode": "inference", + "dtype": "int8wo", + "extra_info": { + "device": "cuda", + "arch": "NVIDIA A100-SXM4-80GB" + } + }, + "model": { + "name": "1024-1024-1024", + "type": "micro-benchmark custom layer", + "origins": ["torchao"] + }, + "metric": { + "name": "speedup(wrt bf16)", + "benchmark_values": [1.25], + "target_value": 0.0 + }, + "runners": [], + "dependencies": {} + } + ] + +4.4 Integration with CI Pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To integrate with your CI pipeline, add the benchmark step to your workflow: + +.. code-block:: yaml + + # Example GitHub Actions step + - name: Run Microbenchmarks + run: | + python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json + + - name: Upload Results + # Upload benchmark_results.json to your dashboard system + +Advanced Usage +-------------- + +Multiple Model Configurations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can benchmark multiple model configurations in a single run: + +.. code-block:: yaml + + model_params: + - name: "small_models" + matrix_shapes: + - name: "pow2" + min_power: 10 + max_power: 12 + model_type: "linear" + device: "cuda" + + - name: "transformer_models" + matrix_shapes: + - name: "llama" + model_type: "transformer_block" + device: "cuda" + + - name: "cpu_models" + matrix_shapes: + - name: "custom" + shapes: [[512, 512, 512]] + model_type: "linear" + device: "cpu" + +Running Tests +~~~~~~~~~~~~~ + +To verify your setup and run the test suite: + +.. code-block:: bash + + python -m unittest discover benchmarks/microbenchmarks/test + +Interpreting Results +~~~~~~~~~~~~~~~~~~~~ + +The benchmark results include: + +- **Speedup**: Performance improvement compared to baseline (bfloat16) +- **Memory Usage**: Peak memory consumption during inference +- **Latency**: Time taken for inference operations +- **Profiling Data**: Detailed performance traces (when enabled) + +Results are saved in CSV format with columns for: + +- Model configuration +- Quantization method +- Shape dimensions (M, K, N) +- Performance metrics +- Device information + +Troubleshooting +--------------- + +Common Issues +~~~~~~~~~~~~~ + +1. **CUDA Out of Memory**: Reduce batch size or matrix dimensions +2. **Compilation Errors**: Set ``use_torch_compile: false`` for debugging +3. 
**Missing Quantization Methods**: Ensure TorchAO is properly installed +4. **Device Not Available**: Check device availability and drivers + +Best Practices +~~~~~~~~~~~~~~ + +1. Always include a baseline configuration for comparison +2. Use ``small_sweep`` for initial testing, ``sweep`` for comprehensive analysis +3. Enable profiling only when needed (adds overhead) +4. Test on multiple devices when possible +5. Use consistent naming conventions for reproducibility + +For more detailed information about the framework components, see the README files in the ``benchmarks/microbenchmarks/`` directory. From a6a2ae0f959aabd30889b89f0900a9bf536a2aee Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 09:56:24 -0700 Subject: [PATCH 02/11] update tutorial --- docs/source/microbenchmarking.rst | 230 ++++-------------------------- 1 file changed, 29 insertions(+), 201 deletions(-) diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst index 3f5702abdb..90c26ecd85 100644 --- a/docs/source/microbenchmarking.rst +++ b/docs/source/microbenchmarking.rst @@ -11,89 +11,39 @@ This tutorial will guide you through using the TorchAO microbenchmarking framewo 1. Add an API to Benchmarking Recipes -------------------------------------- -To add a new quantization API to the benchmarking system, you need to ensure your quantization method is available in the TorchAO quantization recipes. +The framework currently supports quantization and sparsity recipes, which can be run using the quantize_() or sparsity_() functions: -1.1 Supported Quantization Methods -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To add a new recipe, add the corresponding string configuration to the function ``string_to_config()`` in ``benchmarks/microbenchmarks/utils.py``. -The framework currently supports these quantization types: - -- ``baseline``: No quantization (bfloat16 reference) -- ``int8wo``: 8-bit weight-only quantization -- ``int8dq``: 8-bit dynamic quantization -- ``int4wo-{group_size}``: 4-bit weight-only quantization with specified group size -- ``int4wo-{group_size}-hqq``: 4-bit weight-only quantization with HQQ -- ``float8wo``: Float8 weight-only quantization -- ``float8dq-tensor``: Float8 dynamic quantization (tensor-wise) -- ``float8dq-row``: Float8 dynamic quantization (row-wise) -- ``gemlitewo-{bit_width}-{group_size}``: 4 or 8 bit integer quantization with gemlite triton kernel - -1.2 Adding a New Quantization Recipe -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To add a new quantization method: - -1. **Implement your quantization function** in the appropriate TorchAO module (e.g., ``torchao/quantization/``) - -2. **Add the recipe to the quantization system** by ensuring it can be called with the same interface as existing methods - -3. **Test your quantization method** with a simple benchmark configuration: +.. code-block:: python -.. code-block:: yaml + def string_to_config( + quantization: Optional[str], sparsity: Optional[str], **kwargs + ) -> AOBaseConfig: - # test_my_quantization.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "baseline" - - "my_new_quantization" # Your new method + # ... existing code ... 
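    # (Descriptive note: the elided code above is where the framework's built-in
    # recipe strings -- e.g. "int8wo", "int8dq", "float8wo", "float8dq-tensor",
    # "float8dq-row" -- are mapped to their corresponding AOBaseConfig instances.)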
- output_dir: "test_results" + elif quantization == "my_new_quantization": + # If additional information needs to be passed as kwargs, process it here + return MyNewQuantizationConfig(**kwargs) + elif sparsity == "my_new_sparsity": + return MyNewSparsityConfig(**kwargs) - model_params: - - name: "test_linear" - matrix_shapes: - - name: "custom" - shapes: [[1024, 1024, 1024]] - high_precision_dtype: "torch.bfloat16" - use_torch_compile: false - device: "cuda" - model_type: "linear" + # ... rest of existing code ... -4. **Verify the integration** by running: +Now we can use this recipe throughout the benchmarking framework. -.. code-block:: bash +.. note:: - python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_quantization.yml + If the ``AOBaseConfig`` uses input parameters, like bit-width, group-size etc, you can pass them appended to the string config in input + For example, for ``GemliteUIntXWeightOnlyConfig`` we can pass it-width and group-size as ``gemlitewo--`` 2. Add a Model to Benchmarking Recipes --------------------------------------- To add a new model architecture to the benchmarking system, you need to modify ``torchao/testing/model_architectures.py``. -2.1 Current Model Types -~~~~~~~~~~~~~~~~~~~~~~~ - -The framework supports these model types: - -- ``linear``: Simple linear layer (``ToyLinearModel``) -- ``ln_linear_``: LayerNorm + Linear + Activation (``LNLinearActivationModel``) - - - ``ln_linear_sigmoid``: LayerNorm + Linear + Sigmoid - - ``ln_linear_relu``: LayerNorm + Linear + ReLU - - ``ln_linear_gelu``: LayerNorm + Linear + GELU - - ``ln_linear_silu``: LayerNorm + Linear + SiLU - - ``ln_linear_leakyrelu``: LayerNorm + Linear + LeakyReLU - - ``ln_linear_relu6``: LayerNorm + Linear + ReLU6 - - ``ln_linear_hardswish``: LayerNorm + Linear + Hardswish - -- ``transformer_block``: Transformer block with self-attention and MLP (``TransformerBlock``) - -2.2 Adding a New Model Architecture -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To add a new model type: - -1. **Define your model class** in ``torchao/testing/model_architectures.py``: +1. To add a new model type, define your model class in ``torchao/testing/model_architectures.py``: .. code-block:: python @@ -111,7 +61,7 @@ To add a new model type: x = self.layer2(x) return x -2. **Update the** ``create_model_and_input_data`` **function** to handle your new model type: +2. Update the ``create_model_and_input_data`` function to handle your new model type: .. code-block:: python @@ -132,36 +82,7 @@ To add a new model type: # ... rest of existing code ... -3. **Test your new model** with a benchmark configuration: - -.. code-block:: yaml - - # test_my_model.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "baseline" - - "int8wo" - - output_dir: "test_results" - - model_params: - - name: "test_my_custom_model" - matrix_shapes: - - name: "custom" - shapes: [[1024, 1024, 1024]] - high_precision_dtype: "torch.bfloat16" - use_torch_compile: false - device: "cuda" - model_type: "my_custom_model" # Your new model type - -4. **Verify the integration**: - -.. 
code-block:: bash - - python -m benchmarks.microbenchmarks.benchmark_runner --config test_my_model.yml - -2.3 Model Design Considerations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +**Model Design Considerations** When adding new models: @@ -194,19 +115,26 @@ Create a minimal configuration for local testing: quantization_config_recipe_names: - "baseline" - "int8wo" + # Add your recipe here - output_dir: "local_results" + output_dir: "local_results" # Add your output directory here model_params: + # Add your model configurations here - name: "quick_test" matrix_shapes: + # Define a custom shape, or use one of the predefined shape generators - name: "custom" shapes: [[1024, 1024, 1024]] high_precision_dtype: "torch.bfloat16" - use_torch_compile: false # Disable for faster iteration + use_torch_compile: true device: "cuda" model_type: "linear" +.. note:: + - For a list of latest supported config recipes for quantization or sparsity, please refer to ``benchmarks/microbenchmarks/README.md``. + - For a list of all model types, please refer to ``torchao/testing/model_architectures.py``. + 3.2 Run Local Benchmark ~~~~~~~~~~~~~~~~~~~~~~~ @@ -214,106 +142,6 @@ Create a minimal configuration for local testing: python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml -3.3 Shape Generation Options -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can use different shape generation strategies: - -**Custom Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "custom" - shapes: [ - [1024, 1024, 1024], # [m, k, n] - [2048, 4096, 1024] - ] - -**LLaMa Model Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "llama" # Uses LLaMa 2 70B single-node weight shapes - -**Power of 2 Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "pow2" - min_power: 10 # 2^10 = 1024 - max_power: 12 # 2^12 = 4096 - -**Extended Power of 2 Shapes:** - -.. code-block:: yaml - - matrix_shapes: - - name: "pow2_extended" - min_power: 10 # Generates: 1024, 1536, 2048, 3072, etc. - max_power: 11 - -**Small Sweep (for heatmaps):** - -.. code-block:: yaml - - matrix_shapes: - - name: "small_sweep" - min_power: 10 - max_power: 15 - -**Full Sweep:** - -.. code-block:: yaml - - matrix_shapes: - - name: "sweep" - min_power: 8 - max_power: 9 - -3.4 Enable Profiling for Debugging -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -For detailed performance analysis, enable profiling: - -.. code-block:: yaml - - model_params: - - name: "debug_model" - # ... other parameters ... - enable_profiler: true # Enable standard profiling - enable_memory_profiler: true # Enable CUDA memory profiling - -This will generate: - -- Standard PyTorch profiler traces -- CUDA memory snapshots and visualizations -- Memory usage analysis in the ``memory_profiler`` subdirectory - -3.5 Device Options -~~~~~~~~~~~~~~~~~~ - -Test on different devices: - -.. code-block:: yaml - - device: "cuda" # NVIDIA GPU - # device: "xpu" # Intel GPU - # device: "mps" # Apple Silicon GPU - # device: "cpu" # CPU fallback - -3.6 Compilation Options -~~~~~~~~~~~~~~~~~~~~~~ - -Control PyTorch compilation for performance tuning: - -.. code-block:: yaml - - use_torch_compile: true - torch_compile_mode: "max-autotune" # Options: "default", "max-autotune", "false" - 4. 
Add an API to Benchmarking CI Dashboard ------------------------------------------ From cde732cae537f93f0f77dc99c8679184670d3d22 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 10:07:11 -0700 Subject: [PATCH 03/11] update tutorial --- docs/source/microbenchmarking.rst | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst index 90c26ecd85..9fcf48dda5 100644 --- a/docs/source/microbenchmarking.rst +++ b/docs/source/microbenchmarking.rst @@ -142,10 +142,19 @@ Create a minimal configuration for local testing: python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml +3.3 Analysing the Output +~~~~~~~~~~~~~~~~~~~~~~~~ + +The output generated after running the benchmarking script, is the form of a csv. It'll contain the following: + - time for inference for running baseline model and quantized model + - speedup in inference time in quantized model + - compile or eager mode + - if enabled, memory snapshot and gpu chrome trace + 4. Add an API to Benchmarking CI Dashboard ------------------------------------------ -To integrate your API with the continuous integration dashboard: +To integrate your API with the CI `dashboard `_: 4.1 Modify Existing CI Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From 9f994f1b46fd740bbd103059440a0406fec6913b Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 14:49:03 -0700 Subject: [PATCH 04/11] update tutorial --- docs/source/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/index.rst b/docs/source/index.rst index aac72590fd..60355f761b 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -21,6 +21,7 @@ for an overall introduction to the library and recent highlight and updates. quantization sparsity contributor_guide + microbenchmarking .. toctree:: :glob: From ed6f659806e009d8ad20ebd1f7c94216debbe90d Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 7 Jul 2025 15:20:35 -0700 Subject: [PATCH 05/11] updates --- docs/source/microbenchmarking.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst index 9fcf48dda5..7849a345e9 100644 --- a/docs/source/microbenchmarking.rst +++ b/docs/source/microbenchmarking.rst @@ -3,10 +3,10 @@ Microbenchmarking Tutorial This tutorial will guide you through using the TorchAO microbenchmarking framework. The tutorial contains different use cases for benchmarking your API and integrating with the dashboard. -1. Add an API to benchmarking recipes -2. Add a model to benchmarking recipes -3. Benchmark your API locally -4. Add an API to benchmarking CI dashboard +1. :ref:`Add an API to benchmarking recipes` +2. :ref:`Add a model to benchmarking recipes` +3. :ref:`Benchmark your API locally` +4. :ref:`Add an API to benchmarking CI dashboard` 1. 
Add an API to Benchmarking Recipes -------------------------------------- From 41b99865ece08e95e49927dfcd5754c50079b1c6 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 9 Jul 2025 11:32:15 -0700 Subject: [PATCH 06/11] update tutorial to .md --- docs/source/microbenchmarking.rst | 330 ------------------------------ 1 file changed, 330 deletions(-) delete mode 100644 docs/source/microbenchmarking.rst diff --git a/docs/source/microbenchmarking.rst b/docs/source/microbenchmarking.rst deleted file mode 100644 index 7849a345e9..0000000000 --- a/docs/source/microbenchmarking.rst +++ /dev/null @@ -1,330 +0,0 @@ -Microbenchmarking Tutorial -========================== - -This tutorial will guide you through using the TorchAO microbenchmarking framework. The tutorial contains different use cases for benchmarking your API and integrating with the dashboard. - -1. :ref:`Add an API to benchmarking recipes` -2. :ref:`Add a model to benchmarking recipes` -3. :ref:`Benchmark your API locally` -4. :ref:`Add an API to benchmarking CI dashboard` - -1. Add an API to Benchmarking Recipes --------------------------------------- - -The framework currently supports quantization and sparsity recipes, which can be run using the quantize_() or sparsity_() functions: - -To add a new recipe, add the corresponding string configuration to the function ``string_to_config()`` in ``benchmarks/microbenchmarks/utils.py``. - -.. code-block:: python - - def string_to_config( - quantization: Optional[str], sparsity: Optional[str], **kwargs - ) -> AOBaseConfig: - - # ... existing code ... - - elif quantization == "my_new_quantization": - # If additional information needs to be passed as kwargs, process it here - return MyNewQuantizationConfig(**kwargs) - elif sparsity == "my_new_sparsity": - return MyNewSparsityConfig(**kwargs) - - # ... rest of existing code ... - -Now we can use this recipe throughout the benchmarking framework. - -.. note:: - - If the ``AOBaseConfig`` uses input parameters, like bit-width, group-size etc, you can pass them appended to the string config in input - For example, for ``GemliteUIntXWeightOnlyConfig`` we can pass it-width and group-size as ``gemlitewo--`` - -2. Add a Model to Benchmarking Recipes ---------------------------------------- - -To add a new model architecture to the benchmarking system, you need to modify ``torchao/testing/model_architectures.py``. - -1. To add a new model type, define your model class in ``torchao/testing/model_architectures.py``: - -.. code-block:: python - - class MyCustomModel(torch.nn.Module): - def __init__(self, input_dim, output_dim, dtype=torch.bfloat16): - super().__init__() - # Define your model architecture - self.layer1 = torch.nn.Linear(input_dim, 512, bias=False).to(dtype) - self.activation = torch.nn.ReLU() - self.layer2 = torch.nn.Linear(512, output_dim, bias=False).to(dtype) - - def forward(self, x): - x = self.layer1(x) - x = self.activation(x) - x = self.layer2(x) - return x - -2. Update the ``create_model_and_input_data`` function to handle your new model type: - -.. code-block:: python - - def create_model_and_input_data( - model_type: str, - m: int, - k: int, - n: int, - high_precision_dtype: torch.dtype = torch.bfloat16, - device: str = "cuda", - activation: str = "relu", - ): - # ... existing code ... - - elif model_type == "my_custom_model": - model = MyCustomModel(k, n, high_precision_dtype).to(device) - input_data = torch.randn(m, k, device=device, dtype=high_precision_dtype) - - # ... rest of existing code ... 
- -**Model Design Considerations** - -When adding new models: - -- **Input/Output Dimensions**: Ensure your model handles the (m, k, n) dimension convention where: - - - ``m``: Batch size or sequence length - - ``k``: Input feature dimension - - ``n``: Output feature dimension - -- **Data Types**: Support the ``high_precision_dtype`` parameter (typically ``torch.bfloat16``) - -- **Device Compatibility**: Ensure your model works on CUDA, CPU, and other target devices - -- **Quantization Compatibility**: Design your model to work with TorchAO quantization methods - -3. Benchmark Your API Locally ------------------------------- - -For local development and testing: - -3.1 Quick Start -~~~~~~~~~~~~~~~ - -Create a minimal configuration for local testing: - -.. code-block:: yaml - - # local_test.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "baseline" - - "int8wo" - # Add your recipe here - - output_dir: "local_results" # Add your output directory here - - model_params: - # Add your model configurations here - - name: "quick_test" - matrix_shapes: - # Define a custom shape, or use one of the predefined shape generators - - name: "custom" - shapes: [[1024, 1024, 1024]] - high_precision_dtype: "torch.bfloat16" - use_torch_compile: true - device: "cuda" - model_type: "linear" - -.. note:: - - For a list of latest supported config recipes for quantization or sparsity, please refer to ``benchmarks/microbenchmarks/README.md``. - - For a list of all model types, please refer to ``torchao/testing/model_architectures.py``. - -3.2 Run Local Benchmark -~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml - -3.3 Analysing the Output -~~~~~~~~~~~~~~~~~~~~~~~~ - -The output generated after running the benchmarking script, is the form of a csv. It'll contain the following: - - time for inference for running baseline model and quantized model - - speedup in inference time in quantized model - - compile or eager mode - - if enabled, memory snapshot and gpu chrome trace - -4. Add an API to Benchmarking CI Dashboard ------------------------------------------- - -To integrate your API with the CI `dashboard `_: - -4.1 Modify Existing CI Configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Add your quantization method to the existing CI configuration file at ``benchmarks/dashboard/microbenchmark_quantization_config.yml``: - -.. code-block:: yaml - - # benchmarks/dashboard/microbenchmark_quantization_config.yml - benchmark_mode: "inference" - quantization_config_recipe_names: - - "int8wo" - - "int8dq" - - "float8dq-tensor" - - "float8dq-row" - - "float8wo" - - "my_new_quantization" # Add your method here - - output_dir: "benchmarks/microbenchmarks/results" - - model_params: - - name: "small_bf16_linear" - matrix_shapes: - - name: "small_sweep" - min_power: 10 - max_power: 15 - high_precision_dtype: "torch.bfloat16" - use_torch_compile: true - torch_compile_mode: "max-autotune" - device: "cuda" - model_type: "linear" - -4.2 Run CI Benchmarks -~~~~~~~~~~~~~~~~~~~~~ - -Use the CI runner to generate results in PyTorch OSS benchmark database format: - -.. code-block:: bash - - python benchmarks/dashboard/ci_microbenchmark_runner.py \ - --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ - --output benchmark_results.json - -4.3 CI Output Format -~~~~~~~~~~~~~~~~~~~~ - -The CI runner outputs results in a specific JSON format required by the PyTorch OSS benchmark database: - -.. 
code-block:: json - - [ - { - "benchmark": { - "name": "micro-benchmark api", - "mode": "inference", - "dtype": "int8wo", - "extra_info": { - "device": "cuda", - "arch": "NVIDIA A100-SXM4-80GB" - } - }, - "model": { - "name": "1024-1024-1024", - "type": "micro-benchmark custom layer", - "origins": ["torchao"] - }, - "metric": { - "name": "speedup(wrt bf16)", - "benchmark_values": [1.25], - "target_value": 0.0 - }, - "runners": [], - "dependencies": {} - } - ] - -4.4 Integration with CI Pipeline -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To integrate with your CI pipeline, add the benchmark step to your workflow: - -.. code-block:: yaml - - # Example GitHub Actions step - - name: Run Microbenchmarks - run: | - python benchmarks/dashboard/ci_microbenchmark_runner.py \ - --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ - --output benchmark_results.json - - - name: Upload Results - # Upload benchmark_results.json to your dashboard system - -Advanced Usage --------------- - -Multiple Model Configurations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can benchmark multiple model configurations in a single run: - -.. code-block:: yaml - - model_params: - - name: "small_models" - matrix_shapes: - - name: "pow2" - min_power: 10 - max_power: 12 - model_type: "linear" - device: "cuda" - - - name: "transformer_models" - matrix_shapes: - - name: "llama" - model_type: "transformer_block" - device: "cuda" - - - name: "cpu_models" - matrix_shapes: - - name: "custom" - shapes: [[512, 512, 512]] - model_type: "linear" - device: "cpu" - -Running Tests -~~~~~~~~~~~~~ - -To verify your setup and run the test suite: - -.. code-block:: bash - - python -m unittest discover benchmarks/microbenchmarks/test - -Interpreting Results -~~~~~~~~~~~~~~~~~~~~ - -The benchmark results include: - -- **Speedup**: Performance improvement compared to baseline (bfloat16) -- **Memory Usage**: Peak memory consumption during inference -- **Latency**: Time taken for inference operations -- **Profiling Data**: Detailed performance traces (when enabled) - -Results are saved in CSV format with columns for: - -- Model configuration -- Quantization method -- Shape dimensions (M, K, N) -- Performance metrics -- Device information - -Troubleshooting ---------------- - -Common Issues -~~~~~~~~~~~~~ - -1. **CUDA Out of Memory**: Reduce batch size or matrix dimensions -2. **Compilation Errors**: Set ``use_torch_compile: false`` for debugging -3. **Missing Quantization Methods**: Ensure TorchAO is properly installed -4. **Device Not Available**: Check device availability and drivers - -Best Practices -~~~~~~~~~~~~~~ - -1. Always include a baseline configuration for comparison -2. Use ``small_sweep`` for initial testing, ``sweep`` for comprehensive analysis -3. Enable profiling only when needed (adds overhead) -4. Test on multiple devices when possible -5. Use consistent naming conventions for reproducibility - -For more detailed information about the framework components, see the README files in the ``benchmarks/microbenchmarks/`` directory. 
From b8564ca08680137603eac072f14f2e6107999da5 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 9 Jul 2025 12:06:04 -0700 Subject: [PATCH 07/11] update ondex.rst and tutorials --- docs/source/benchmarking_overview.md | 215 +++++++++++++++++++++++++++ docs/source/benchmarking_user_faq.md | 133 +++++++++++++++++ docs/source/index.rst | 3 +- 3 files changed, 350 insertions(+), 1 deletion(-) create mode 100644 docs/source/benchmarking_overview.md create mode 100644 docs/source/benchmarking_user_faq.md diff --git a/docs/source/benchmarking_overview.md b/docs/source/benchmarking_overview.md new file mode 100644 index 0000000000..fc415e297f --- /dev/null +++ b/docs/source/benchmarking_overview.md @@ -0,0 +1,215 @@ +# Benchmarking Overview + +This tutorial will guide you through using the TorchAO benchmarking framework. The tutorial contains integrating new APIs with the framework and dashboard. + +1. [Add an API to benchmarking recipes](#add-an-api-to-benchmarking-recipes) +2. [Add a model architecture for benchmarking recipes](#add-a-model-to-benchmarking-recipes) +3. [Add an HF model to benchmarking recipes](#add-an-hf-model-to-benchmarking-recipes) +4. [Add an API to micro-benchmarking CI dashboard](#add-an-api-to-benchmarking-ci-dashboard) + +## Add an API to Benchmarking Recipes + +The framework currently supports quantization and sparsity recipes, which can be run using the quantize_() or sparsity_() functions: + +To add a new recipe, add the corresponding string configuration to the function `string_to_config()` in `benchmarks/microbenchmarks/utils.py`. + +```python +def string_to_config( + quantization: Optional[str], sparsity: Optional[str], **kwargs +) -> AOBaseConfig: + +# ... existing code ... + +elif quantization == "my_new_quantization": + # If additional information needs to be passed as kwargs, process it here + return MyNewQuantizationConfig(**kwargs) +elif sparsity == "my_new_sparsity": + return MyNewSparsityConfig(**kwargs) + +# ... rest of existing code ... +``` + +Now we can use this recipe throughout the benchmarking framework. + +> **Note:** If the `AOBaseConfig` uses input parameters, like bit-width, group-size etc, you can pass them appended to the string config in input. For example, for `GemliteUIntXWeightOnlyConfig` we can pass bit-width and group-size as `gemlitewo--` + +## Add a Model to Benchmarking Recipes + +To add a new model architecture to the benchmarking system, you need to modify `torchao/testing/model_architectures.py`. + +1. To add a new model type, define your model class in `torchao/testing/model_architectures.py`: + +```python +class MyCustomModel(torch.nn.Module): + def __init__(self, input_dim, output_dim, dtype=torch.bfloat16): + super().__init__() + # Define your model architecture + self.layer1 = torch.nn.Linear(input_dim, 512, bias=False).to(dtype) + self.activation = torch.nn.ReLU() + self.layer2 = torch.nn.Linear(512, output_dim, bias=False).to(dtype) + + def forward(self, x): + x = self.layer1(x) + x = self.activation(x) + x = self.layer2(x) + return x +``` + +2. Update the `create_model_and_input_data` function to handle your new model type: + +```python +def create_model_and_input_data( + model_type: str, + m: int, + k: int, + n: int, + high_precision_dtype: torch.dtype = torch.bfloat16, + device: str = "cuda", + activation: str = "relu", +): + # ... existing code ... 
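    # (Descriptive note: the elided branches above build the built-in model types --
    # "linear", the "ln_linear_<activation>" variants and "transformer_block" --
    # along with matching random input data; a new branch is added alongside them.)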
+ + elif model_type == "my_custom_model": + model = MyCustomModel(k, n, high_precision_dtype).to(device) + input_data = torch.randn(m, k, device=device, dtype=high_precision_dtype) + + # ... rest of existing code ... +``` + +### Model Design Considerations + +When adding new models: + +- **Input/Output Dimensions**: Ensure your model handles the (m, k, n) dimension convention where: + - `m`: Batch size or sequence length + - `k`: Input feature dimension + - `n`: Output feature dimension + +- **Data Types**: Support the `high_precision_dtype` parameter (typically `torch.bfloat16`) + +- **Device Compatibility**: Ensure your model works on CUDA, CPU, and other target devices + +- **Quantization Compatibility**: Design your model to work with TorchAO quantization methods + +## Add an HF model to benchmarking recipes +(Coming soon!!!) + +## Add an API to Benchmarking CI Dashboard + +To integrate your API with the CI [dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fao&benchmarkName=micro-benchmark+api): + +### 1. Modify Existing CI Configuration + +Add your quantization method to the existing CI configuration file at `benchmarks/dashboard/microbenchmark_quantization_config.yml`: + +```yaml +# benchmarks/dashboard/microbenchmark_quantization_config.yml +benchmark_mode: "inference" +quantization_config_recipe_names: + - "int8wo" + - "int8dq" + - "float8dq-tensor" + - "float8dq-row" + - "float8wo" + - "my_new_quantization" # Add your method here + +output_dir: "benchmarks/microbenchmarks/results" + +model_params: + - name: "small_bf16_linear" + matrix_shapes: + - name: "small_sweep" + min_power: 10 + max_power: 15 + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" +``` + +### 2. Run CI Benchmarks + +Use the CI runner to generate results in PyTorch OSS benchmark database format: + +```bash +python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json +``` + +### 3. CI Output Format + +The CI runner outputs results in a specific JSON format required by the PyTorch OSS benchmark database: + +```json +[ + { + "benchmark": { + "name": "micro-benchmark api", + "mode": "inference", + "dtype": "int8wo", + "extra_info": { + "device": "cuda", + "arch": "NVIDIA A100-SXM4-80GB" + } + }, + "model": { + "name": "1024-1024-1024", + "type": "micro-benchmark custom layer", + "origins": ["torchao"] + }, + "metric": { + "name": "speedup(wrt bf16)", + "benchmark_values": [1.25], + "target_value": 0.0 + }, + "runners": [], + "dependencies": {} + } +] +``` + +### 4. Integration with CI Pipeline + +To integrate with your CI pipeline, add the benchmark step to your workflow: + +```yaml +# Example GitHub Actions step +- name: Run Microbenchmarks + run: | + python benchmarks/dashboard/ci_microbenchmark_runner.py \ + --config benchmarks/dashboard/microbenchmark_quantization_config.yml \ + --output benchmark_results.json + +- name: Upload Results + # Upload benchmark_results.json to your dashboard system +``` + +## Troubleshooting + +### Running Tests + +To verify your setup and run the test suite: + +```bash +python -m unittest discover benchmarks/microbenchmarks/test +``` + +### Common Issues + +1. **CUDA Out of Memory**: Reduce batch size or matrix dimensions +2. **Compilation Errors**: Set `use_torch_compile: false` for debugging +3. **Missing Quantization Methods**: Ensure TorchAO is properly installed +4. 
**Device Not Available**: Check device availability and drivers + +### Best Practices + +1. Use `small_sweep` for basic testing, `custom shapes` for comprehensive or model specific analysis +2. Enable profiling only when needed (adds overhead) +3. Test on multiple devices when possible +4. Use consistent naming conventions for reproducibility + +For information on different use-cases for benchmarking, refer to [Benchmarking Use-Case FAQs](benchmarking_user_faq.md) + +For more detailed information about the framework components, see the README files in the `benchmarks/microbenchmarks/` directory. diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md new file mode 100644 index 0000000000..c862ca6c0a --- /dev/null +++ b/docs/source/benchmarking_user_faq.md @@ -0,0 +1,133 @@ +# Benchmarking Use-Case FAQs + +This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) + +## Run the benchmarking on your PR + +### 1. Add label to your PR +To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: + +1. Go to your pull request on GitHub. +2. On the right sidebar, find the "Labels" section. +3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. + +Adding this label will automatically trigger the benchmarking CI workflow for your pull request. + +### 2. Manually trigger benchmarking workflow on your github branch +To manually trigger the benchmarking workflow for your branch, follow these steps: + +1. Navigate to the "Actions" tab in your GitHub repository. +2. Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. +3. Click on the "Run workflow" button. +4. In the dropdown menu, select the branch. +5. Click the "Run workflow" button to start the benchmarking process. + +This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. + +## Benchmark Your API Locally + +For local development and testing: + +### 1. Quick Start + +Create a minimal configuration for local testing: + +```yaml +# local_test.yml +benchmark_mode: "inference" +quantization_config_recipe_names: + - "baseline" + - "int8wo" + # Add your recipe here + +output_dir: "local_results" # Add your output directory here + +model_params: + # Add your model configurations here + - name: "quick_test" + matrix_shapes: + # Define a custom shape, or use one of the predefined shape generators + - name: "custom" + shapes: [[1024, 1024, 1024]] + - name: "small_sweep" + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" + enable_profiler: true # Enable profiling for this model + enable_memory_profiler: true # Enable memory profiling for this model +``` + +> **Note:** +> - For a list of latest supported config recipes for quantization or sparsity, please refer to `benchmarks/microbenchmarks/README.md`. +> - For a list of all model types, please refer to `torchao/testing/model_architectures.py`. + +### 2. Run Local Benchmark + +```bash +python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml +``` + +### 3. 
Analysing the Output + +The output generated after running the benchmarking script, is the form of a csv. It'll contain some of the following: + - time for inference for running baseline model and quantized model + - speedup in inference time in quantized model + - compile or eager mode + - if enabled, memory snapshot and gpu chrome trace + + +## Advanced Usage + +### Multiple Model Configurations + +You can benchmark multiple model configurations in a single run: + +```yaml +model_params: + - name: "small_models" + matrix_shapes: + - name: "pow2" + min_power: 10 + max_power: 12 + model_type: "linear" + device: "cuda" + + - name: "transformer_models" + matrix_shapes: + - name: "llama" + model_type: "transformer_block" + device: "cuda" + + - name: "cpu_models" + matrix_shapes: + - name: "custom" + shapes: [[512, 512, 512]] + model_type: "linear" + device: "cpu" +``` + +### Interpreting Results + +The benchmark results include: + +- **Speedup**: Performance improvement compared to baseline (bfloat16) +- **Memory Usage**: Peak memory consumption during inference +- **Latency**: Time taken for inference operations +- **Profiling Data**: Detailed performance traces (when enabled) + +Results are saved in CSV format with columns for: + +- Model configuration +- Quantization method +- Shape dimensions (M, K, N) +- Performance metrics +- Memory metrics +- Device information + +### Best Practices + +1. Use `small_sweep` for initial testing, `sweep` for comprehensive analysis +2. Enable profiling only when needed (adds overhead) +3. Test on multiple devices when possible diff --git a/docs/source/index.rst b/docs/source/index.rst index 60355f761b..9c88143718 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -21,7 +21,8 @@ for an overall introduction to the library and recent highlight and updates. quantization sparsity contributor_guide - microbenchmarking + benchmarking_overview + benchmarking_user_faq .. toctree:: :glob: From 98eef307f8e0c33bbe7ab1627432f42767b59afe Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 9 Jul 2025 12:15:21 -0700 Subject: [PATCH 08/11] fix formatting --- docs/source/benchmarking_user_faq.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index c862ca6c0a..a0b2cd7486 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -2,7 +2,13 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) -## Run the benchmarking on your PR +## Table of Contents +- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) +- [Benchmark Your API Locally](#benchmark-your-api-locally) +- [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) +- [Advanced Usage](#advanced-usage) + +## Run the performance benchmarking on your PR ### 1. Add label to your PR To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: @@ -78,6 +84,9 @@ The output generated after running the benchmarking script, is the form of a csv - if enabled, memory snapshot and gpu chrome trace +## Generate evaluation metrics for your quantized model +(Coming soon!!!) 
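As a follow-up to the CSV output described under "Analysing the Output" above, the results file can be inspected with ordinary tooling. The snippet below is only a sketch: the file name and the `quantization` / `speedup` column names are hypothetical placeholders, so check the header row of the CSV actually written to your `output_dir`.

```python
import pandas as pd

# Hypothetical path and column names; adjust to the CSV written to your output_dir
df = pd.read_csv("local_results/results.csv")

# Inspect the raw rows, then the best speedup seen for each quantization recipe
print(df.head())
print(df.groupby("quantization")["speedup"].max())
```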
+ ## Advanced Usage ### Multiple Model Configurations From 8c05b7f0e7ee69fcc2785d6ac39f3e80950ff603 Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Wed, 9 Jul 2025 15:54:27 -0700 Subject: [PATCH 09/11] Remove second tutorial --- docs/source/benchmarking_user_faq.md | 139 +-------------------------- 1 file changed, 1 insertion(+), 138 deletions(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index a0b2cd7486..3920bf257d 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -2,141 +2,4 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) -## Table of Contents -- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) -- [Benchmark Your API Locally](#benchmark-your-api-locally) -- [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) -- [Advanced Usage](#advanced-usage) - -## Run the performance benchmarking on your PR - -### 1. Add label to your PR -To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: - -1. Go to your pull request on GitHub. -2. On the right sidebar, find the "Labels" section. -3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. - -Adding this label will automatically trigger the benchmarking CI workflow for your pull request. - -### 2. Manually trigger benchmarking workflow on your github branch -To manually trigger the benchmarking workflow for your branch, follow these steps: - -1. Navigate to the "Actions" tab in your GitHub repository. -2. Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. -3. Click on the "Run workflow" button. -4. In the dropdown menu, select the branch. -5. Click the "Run workflow" button to start the benchmarking process. - -This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. - -## Benchmark Your API Locally - -For local development and testing: - -### 1. Quick Start - -Create a minimal configuration for local testing: - -```yaml -# local_test.yml -benchmark_mode: "inference" -quantization_config_recipe_names: - - "baseline" - - "int8wo" - # Add your recipe here - -output_dir: "local_results" # Add your output directory here - -model_params: - # Add your model configurations here - - name: "quick_test" - matrix_shapes: - # Define a custom shape, or use one of the predefined shape generators - - name: "custom" - shapes: [[1024, 1024, 1024]] - - name: "small_sweep" - high_precision_dtype: "torch.bfloat16" - use_torch_compile: true - torch_compile_mode: "max-autotune" - device: "cuda" - model_type: "linear" - enable_profiler: true # Enable profiling for this model - enable_memory_profiler: true # Enable memory profiling for this model -``` - -> **Note:** -> - For a list of latest supported config recipes for quantization or sparsity, please refer to `benchmarks/microbenchmarks/README.md`. -> - For a list of all model types, please refer to `torchao/testing/model_architectures.py`. - -### 2. Run Local Benchmark - -```bash -python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml -``` - -### 3. 
Analysing the Output - -The output generated after running the benchmarking script, is the form of a csv. It'll contain some of the following: - - time for inference for running baseline model and quantized model - - speedup in inference time in quantized model - - compile or eager mode - - if enabled, memory snapshot and gpu chrome trace - - -## Generate evaluation metrics for your quantized model -(Coming soon!!!) - -## Advanced Usage - -### Multiple Model Configurations - -You can benchmark multiple model configurations in a single run: - -```yaml -model_params: - - name: "small_models" - matrix_shapes: - - name: "pow2" - min_power: 10 - max_power: 12 - model_type: "linear" - device: "cuda" - - - name: "transformer_models" - matrix_shapes: - - name: "llama" - model_type: "transformer_block" - device: "cuda" - - - name: "cpu_models" - matrix_shapes: - - name: "custom" - shapes: [[512, 512, 512]] - model_type: "linear" - device: "cpu" -``` - -### Interpreting Results - -The benchmark results include: - -- **Speedup**: Performance improvement compared to baseline (bfloat16) -- **Memory Usage**: Peak memory consumption during inference -- **Latency**: Time taken for inference operations -- **Profiling Data**: Detailed performance traces (when enabled) - -Results are saved in CSV format with columns for: - -- Model configuration -- Quantization method -- Shape dimensions (M, K, N) -- Performance metrics -- Memory metrics -- Device information - -### Best Practices - -1. Use `small_sweep` for initial testing, `sweep` for comprehensive analysis -2. Enable profiling only when needed (adds overhead) -3. Test on multiple devices when possible +[Coming Soon !!!] From 2cd3aae3248317e5de7bde88c9f9026b6378d89e Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Wed, 9 Jul 2025 16:06:17 -0700 Subject: [PATCH 10/11] End user benchmarking tutorial --- docs/source/benchmarking_user_faq.md | 139 ++++++++++++++++++++++++++- 1 file changed, 138 insertions(+), 1 deletion(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index 3920bf257d..a0b2cd7486 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -2,4 +2,141 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) -[Coming Soon !!!] +## Table of Contents +- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) +- [Benchmark Your API Locally](#benchmark-your-api-locally) +- [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) +- [Advanced Usage](#advanced-usage) + +## Run the performance benchmarking on your PR + +### 1. Add label to your PR +To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: + +1. Go to your pull request on GitHub. +2. On the right sidebar, find the "Labels" section. +3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. + +Adding this label will automatically trigger the benchmarking CI workflow for your pull request. + +### 2. Manually trigger benchmarking workflow on your github branch +To manually trigger the benchmarking workflow for your branch, follow these steps: + +1. Navigate to the "Actions" tab in your GitHub repository. +2. 
Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. +3. Click on the "Run workflow" button. +4. In the dropdown menu, select the branch. +5. Click the "Run workflow" button to start the benchmarking process. + +This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. + +## Benchmark Your API Locally + +For local development and testing: + +### 1. Quick Start + +Create a minimal configuration for local testing: + +```yaml +# local_test.yml +benchmark_mode: "inference" +quantization_config_recipe_names: + - "baseline" + - "int8wo" + # Add your recipe here + +output_dir: "local_results" # Add your output directory here + +model_params: + # Add your model configurations here + - name: "quick_test" + matrix_shapes: + # Define a custom shape, or use one of the predefined shape generators + - name: "custom" + shapes: [[1024, 1024, 1024]] + - name: "small_sweep" + high_precision_dtype: "torch.bfloat16" + use_torch_compile: true + torch_compile_mode: "max-autotune" + device: "cuda" + model_type: "linear" + enable_profiler: true # Enable profiling for this model + enable_memory_profiler: true # Enable memory profiling for this model +``` + +> **Note:** +> - For a list of latest supported config recipes for quantization or sparsity, please refer to `benchmarks/microbenchmarks/README.md`. +> - For a list of all model types, please refer to `torchao/testing/model_architectures.py`. + +### 2. Run Local Benchmark + +```bash +python -m benchmarks.microbenchmarks.benchmark_runner --config local_test.yml +``` + +### 3. Analysing the Output + +The output generated after running the benchmarking script, is the form of a csv. It'll contain some of the following: + - time for inference for running baseline model and quantized model + - speedup in inference time in quantized model + - compile or eager mode + - if enabled, memory snapshot and gpu chrome trace + + +## Generate evaluation metrics for your quantized model +(Coming soon!!!) + +## Advanced Usage + +### Multiple Model Configurations + +You can benchmark multiple model configurations in a single run: + +```yaml +model_params: + - name: "small_models" + matrix_shapes: + - name: "pow2" + min_power: 10 + max_power: 12 + model_type: "linear" + device: "cuda" + + - name: "transformer_models" + matrix_shapes: + - name: "llama" + model_type: "transformer_block" + device: "cuda" + + - name: "cpu_models" + matrix_shapes: + - name: "custom" + shapes: [[512, 512, 512]] + model_type: "linear" + device: "cpu" +``` + +### Interpreting Results + +The benchmark results include: + +- **Speedup**: Performance improvement compared to baseline (bfloat16) +- **Memory Usage**: Peak memory consumption during inference +- **Latency**: Time taken for inference operations +- **Profiling Data**: Detailed performance traces (when enabled) + +Results are saved in CSV format with columns for: + +- Model configuration +- Quantization method +- Shape dimensions (M, K, N) +- Performance metrics +- Memory metrics +- Device information + +### Best Practices + +1. Use `small_sweep` for initial testing, `sweep` for comprehensive analysis +2. Enable profiling only when needed (adds overhead) +3. 
Test on multiple devices when possible From 1e6ca62753cf161cb1ad402af430de6a339c3d8f Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Wed, 9 Jul 2025 16:22:02 -0700 Subject: [PATCH 11/11] Update CI run instructions --- docs/source/benchmarking_user_faq.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/docs/source/benchmarking_user_faq.md b/docs/source/benchmarking_user_faq.md index a0b2cd7486..e4f546de30 100644 --- a/docs/source/benchmarking_user_faq.md +++ b/docs/source/benchmarking_user_faq.md @@ -3,32 +3,34 @@ This guide is intended to provide instructions for the most fequent benchmarking use-case. If you have any use-case that is not answered here, please create an issue here: [TorchAO Issues](https://github.com/pytorch/ao/issues) ## Table of Contents -- [Run the performance benchmarking on your PR](#run-the-performance-benchmarking-on-your-pr) +- [Run the performance benchmarking in CI](#run-the-performance-benchmarking-in-ci) - [Benchmark Your API Locally](#benchmark-your-api-locally) - [Generate evaluation metrics for your quantized model](#generate-evaluation-metrics-for-your-quantized-model) - [Advanced Usage](#advanced-usage) -## Run the performance benchmarking on your PR +## Run the performance benchmarking in CI -### 1. Add label to your PR -To trigger the benchmarking CI workflow on your pull request, you need to add a specific label to your PR. Follow these steps: +### 1. Run the performance benchmarking on every commit in PR -1. Go to your pull request on GitHub. -2. On the right sidebar, find the "Labels" section. -3. Click on the "Labels" dropdown and select "ciflow/benchmark" from the list of available labels. +To trigger the benchmarking CI workflow on your pull request, add the `ciflow/benchmark` label: -Adding this label will automatically trigger the benchmarking CI workflow for your pull request. +1. Open your pull request on GitHub. +2. In the right sidebar, locate the "Labels" section. +3. Click "Labels" and select `ciflow/benchmark`. + +This will automatically run the benchmarking workflow for every commit in your PR. + +### 2. Run performance benchmarking on the last commit in a GitHub branch -### 2. Manually trigger benchmarking workflow on your github branch To manually trigger the benchmarking workflow for your branch, follow these steps: 1. Navigate to the "Actions" tab in your GitHub repository. 2. Select the benchmarking workflow from the list of available workflows. For microbenchmarks, it's `Microbenchmarks-Perf-Nightly`. 3. Click on the "Run workflow" button. -4. In the dropdown menu, select the branch. +4. In the dropdown menu, select the branch you want to benchmark. 5. Click the "Run workflow" button to start the benchmarking process. -This will execute the benchmarking workflow on the specified branch, allowing you to evaluate the performance of your changes. +This will execute the benchmarking workflow on the last commit of the specified branch, allowing you to evaluate the performance of your changes. ## Benchmark Your API Locally
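The CI-triggering steps above use the GitHub web UI. The same two actions can also be performed from the command line; the sketch below assumes the GitHub CLI (`gh`) is installed and authenticated, and the PR number and branch name are placeholders to replace with your own.

```bash
# Add the benchmarking label to a pull request (hypothetical PR number)
gh pr edit 123 --add-label "ciflow/benchmark"

# Manually dispatch the microbenchmark workflow on a branch (hypothetical branch name)
gh workflow run "Microbenchmarks-Perf-Nightly" --ref my-feature-branch
```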