[ET-VK] linear_qta8a_qga4w graph pass #12574

ahmtox · 2025-07-17T00:43:08Z

Stack from ghstack (oldest at bottom):

Changes

* Introduce `linear_qta8a_qga4w` custom operator in `custom_ops_lib.py` to handle dynamic activation + grouped weight quantized linear operations
* Add pattern matching and fusion logic in `FuseQuantizedOpsTransform` to detect and replace dequant + dequant + linear sequences with the new fused operator
* Implement comprehensive test coverage in `test_vulkan_passes.py` for the QTA8A_QGA4W fusion pattern validation
* Add 4-bit weight packing utilities and grouped quantization support for efficient memory usage
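
For readers unfamiliar with 4-bit packing, here is a minimal sketch of the idea behind the last item. It assumes signed 4-bit values stored two per byte, offset by 8, with the even-indexed value in the low nibble. The function names and the nibble layout are hypothetical and chosen only for illustration; the packing utilities added in this PR may use a different layout.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte.

    Assumed layout (illustrative, not taken from the PR): each value is
    shifted to 0..15 by adding 8; even-indexed values go in the low nibble,
    odd-indexed values in the high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in signed 4 bits")
        packed.append(((hi + 8) << 4) | (lo + 8))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values in order."""
    out = []
    for byte in packed:
        out.append((byte & 0x0F) - 8)  # low nibble: even-indexed value
        out.append((byte >> 4) - 8)    # high nibble: odd-indexed value
    return out
```

Packing halves the storage relative to one byte per weight: eight 4-bit weights fit in four bytes, and `unpack_int4(pack_int4(w))` returns `w` unchanged.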

Motivation

The existing quantization workflow in the Vulkan backend processes dynamic activation + grouped weight quantized linear operations as separate quantize/dequantize/linear steps, which creates performance overhead through:

* Multiple kernel dispatches instead of a single fused operation
* Intermediate tensor allocations for dequantized weights and activations
* Suboptimal memory bandwidth utilization

The new linear_qta8a_qga4w operator fuses the entire sequence into a single operation that:

* Directly processes 8-bit quantized activations with per-token scales/zero-points
* Handles 4-bit grouped quantized weights with configurable group sizes
* Eliminates intermediate dequantization steps by performing dequantization inline
* Reduces memory footprint through packed 4-bit weight storage
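
The equivalence between the unfused dequant + dequant + linear sequence and a fused computation can be sketched in plain Python. This is a reference-style illustration, not the operator's implementation: the function names, argument order, and the `[out_features][num_groups]` layout of the weight scales and zero-points are assumptions made for the example, and weights are shown unpacked for readability.

```python
def dequantize_per_token(q_act, scales, zero_points):
    """Dequantize activations: one scale/zero-point per token (row)."""
    return [[(q - zp) * s for q in row]
            for row, s, zp in zip(q_act, scales, zero_points)]


def dequantize_grouped(q_weight, scales, zero_points, group_size):
    """Dequantize weights: one scale/zero-point per group of input columns."""
    return [[(q - zero_points[o][i // group_size]) * scales[o][i // group_size]
             for i, q in enumerate(row)]
            for o, row in enumerate(q_weight)]


def linear(x, w):
    """y[t][o] = sum_i x[t][i] * w[o][i] (no bias)."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


def fused_linear_sketch(q_act, a_scales, a_zps,
                        q_weight, w_scales, w_zps, group_size):
    """Same result as the three steps above, dequantizing inline so that no
    intermediate dequantized tensors are materialized."""
    out = []
    for row, a_s, a_zp in zip(q_act, a_scales, a_zps):
        out_row = []
        for o, w_row in enumerate(q_weight):
            acc = 0.0
            for i, (qa, qw) in enumerate(zip(row, w_row)):
                g = i // group_size  # group index of input column i
                acc += ((qa - a_zp) * a_s) * ((qw - w_zps[o][g]) * w_scales[o][g])
            out_row.append(acc)
        out.append(out_row)
    return out
```

With one token `q_act = [[10, -4, 2, 0]]` (scale 0.5, zero-point 2), two output rows of 4-bit weights, and `group_size = 2`, both paths produce the same output; the fused path simply never builds the dequantized activation and weight lists.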

This aligns with the broader goal of optimizing quantized model inference in the Vulkan backend by leveraging graph-level transformations to improve computational efficiency while maintaining numerical accuracy.

Differential Revision: D78291269

pytorch-bot · 2025-07-17T00:43:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12574

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 340530c with merge base b6b7a16:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-07-17T00:43:38Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

facebook-github-bot · 2025-07-17T00:43:41Z