Conversation

ahmtox
Contributor
@ahmtox ahmtox commented Jul 17, 2025

Stack from ghstack (oldest at bottom):

# Changes

* Implement `torchao.dequantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support
* Add block-wise dequantization mode in `dequantize_texture.glsl` and `dequantize_buffer.glsl` shaders for configurable tensor block dequantization
* Extend dequantization infrastructure in `Dequantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters
* Support integer-to-floating-point conversion with precise reconstruction of original values

BE: Improved the documentation in the shader logic to be more detailed and clear

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.dequantize_affine` operator, which is essential for completing the quantization-dequantization cycle in dynamic quantization workflows. The `dequantize_affine` operator provides flexible block-wise dequantization that reconstructs floating-point values from quantized integer blocks, enabling:

* **Block-wise Dequantization**: Reconstructs floating-point values from configurable tensor blocks using separate scale and zero-point parameters, enabling precise recovery of original data distributions
* **Affine Transformation**: Uses the formula `value = (qvalue - zero_point) * scale` for accurate integer-to-floating-point mapping
* **TorchAO Integration**: Seamless compatibility with TorchAO quantization workflows, completing the quantization-dequantization round-trip

# Operator Description

The `dequantize_affine` operator converts n-bit integer tensor values back to floating-point representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise dequantization divides tensors into blocks and applies separate dequantization parameters to each block, allowing fine-grained reconstruction of the original floating-point precision.

The dequantization formula is: `value = (qvalue - zero_point) * scale`
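A minimal host-side C++ sketch of this formula, for reference only; the function name mirrors the shader's `dequantize_val()` helper, but the signature here is illustrative rather than the actual backend API:

```cpp
#include <cstdint>

// Affine dequantization: value = (qvalue - zero_point) * scale
inline float dequantize_val(int32_t qvalue, float scale, int32_t zero_point) {
  return static_cast<float>(qvalue - zero_point) * scale;
}
```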

**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping. Input tensors must be integer types (kByte, kChar, kInt).

# Block-wise Dequantization Implementation

Block-wise dequantization enables fine-grained reconstruction by dividing tensors into blocks and applying separate dequantization parameters to each block. The implementation uses the same key data structures computed in `Dequantize.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension, calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing, `{1, #W, #W*#H, #W*#H*#C}`, enabling efficient block ID calculation

The block coordinate calculation uses `bcoord = tidx / blockSize`, where `tidx` is the tensor coordinate in WHCN layout; the linear block ID is then computed as `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
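A minimal host-side C++ sketch of this indexing math (a plain `std::array` stands in for the backend's `ivec4` type, and the helper names are illustrative):

```cpp
#include <array>

using IVec4 = std::array<int, 4>;  // extents in WHCN order: {W, H, C, N}

struct BlockLayout {
  IVec4 num_blocks;    // num_blocks_vec = tensor_size_whcn / block_size_vec
  IVec4 block_stride;  // block_stride_vec = {1, #W, #W*#H, #W*#H*#C}
};

BlockLayout make_block_layout(const IVec4& tensor_size_whcn, const IVec4& block_size_vec) {
  BlockLayout layout{};
  for (int d = 0; d < 4; ++d) {
    layout.num_blocks[d] = tensor_size_whcn[d] / block_size_vec[d];
  }
  layout.block_stride = {
      1,
      layout.num_blocks[0],
      layout.num_blocks[0] * layout.num_blocks[1],
      layout.num_blocks[0] * layout.num_blocks[1] * layout.num_blocks[2]};
  return layout;
}

int block_id_for(const IVec4& tidx, const IVec4& block_size_vec, const BlockLayout& layout) {
  int block_id = 0;
  for (int d = 0; d < 4; ++d) {
    const int bcoord = tidx[d] / block_size_vec[d];  // bcoord = tidx / blockSize
    block_id += bcoord * layout.block_stride[d];     // dot(bcoord, blockStride)
  }
  return block_id;
}
```

For example, a WHCN tensor of size [6,6,4,1] with block size [3,3,2,1] yields `num_blocks = [2,2,2,1]` and `block_stride = {1, 2, 4, 8}`, so the element at WHCN coordinate (4,5,3,0) falls in block 1 + 2 + 4 = 7.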

# Shader Algorithm Overview

## Texture Storage Implementation (`dequantize_texture.glsl`)

**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension dequantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)

**Block-wise Mode Algorithm**:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed integer components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing.

For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))`, where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension.

The block coordinate is calculated using integer division, `bcoord = tidx / blockSize`; the linear block ID then uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.

Each integer component is dequantized using its corresponding block's parameters, `value = dequantize_val(qvalue, t_scale[block_id], t_zero_point[block_id])`, where `dequantize_val()` applies the formula `(qvalue - zero_point) * scale`. The reconstructed floating-point values are written to the output texel with proper type handling for double-precision outputs.
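A host-side C++ sketch mirroring the per-texel arithmetic described above; it reuses the illustrative `IVec4`, `BlockLayout`, `block_id_for()`, and `dequantize_val()` helpers from the earlier sketches and is not the actual GLSL:

```cpp
// pos: the 3D texture position {pos.x, pos.y, pos.z}; qvals: the 4 width-packed
// integer components of the texel; out_vals: the reconstructed floating-point texel.
void dequantize_texel(
    const std::array<int, 3>& pos,
    const int qvals[4],
    const IVec4& block_size_vec,
    const BlockLayout& layout,
    const float* t_scale,
    const int* t_zero_point,
    float out_vals[4]) {
  const int C_total = layout.num_blocks[2] * block_size_vec[2];  // total channel extent
  const int foldedZ = pos[2];  // batch and channel are folded into the texture Z axis
  for (int i = 0; i < 4; ++i) {
    // WHCN tensor coordinate of component i, accounting for width-packing.
    const IVec4 tidx = {pos[0] * 4 + i, pos[1], foldedZ % C_total, foldedZ / C_total};
    const int block_id = block_id_for(tidx, block_size_vec, layout);
    out_vals[i] = dequantize_val(qvals[i], t_scale[block_id], t_zero_point[block_id]);
  }
}
```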

## Buffer Storage Implementation (`dequantize_buffer.glsl`)

**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints

**Block-wise Mode Algorithm**:
The shader processes linear buffer indices, using `gl_GlobalInvocationID.x` as the output buffer index. It converts this index to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)`, which handles the buffer-to-tensor index mapping with proper stride calculations.

For each element, it computes the block coordinate directly, `bcoord = out_tidx / blockSize`, where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.

The quantized integer value is loaded using the corresponding input buffer index, `qvalue = t_in[in_bufi]`, where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Dequantization applies the block-specific parameters, `value = dequantize_val(qvalue, t_scale[block_id], t_zero_point[block_id])`, to reconstruct the original floating-point value.
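A simplified host-side C++ sketch of the buffer path; it assumes contiguous WHCN strides for both tensors instead of the shader's general dim-order-aware `bufi_to_tidx()` / `tidx_to_bufi()` mapping, and reuses the illustrative helpers from the earlier sketches:

```cpp
void dequantize_buffer_element(
    int out_bufi,  // linear output index (gl_GlobalInvocationID.x in the shader)
    const IVec4& tensor_size_whcn,
    const IVec4& block_size_vec,
    const BlockLayout& layout,
    const int* t_in,
    const float* t_scale,
    const int* t_zero_point,
    float* t_out) {
  // Recover the WHCN tensor coordinate from the linear index (contiguous layout assumed).
  IVec4 out_tidx{};
  int rem = out_bufi;
  for (int d = 0; d < 4; ++d) {
    out_tidx[d] = rem % tensor_size_whcn[d];
    rem /= tensor_size_whcn[d];
  }
  const int block_id = block_id_for(out_tidx, block_size_vec, layout);
  // With identical contiguous layouts, the input index equals the output index.
  t_out[out_bufi] = dequantize_val(t_in[out_bufi], t_scale[block_id], t_zero_point[block_id]);
}
```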

**Future Improvements**: Dynamic workgroup sizing based on block dimensions

Differential Revision: [D78435552](https://our.internmc.facebook.com/intern/diff/D78435552/)

cc @SS-JIA @manuelcandales @cbilgin

@ahmtox ahmtox requested a review from SS-JIA as a code owner July 17, 2025 00:43
@pytorch-bot pytorch-bot bot added the module: vulkan Issues related to the Vulkan delegate and code under backends/vulkan/ label Jul 17, 2025

pytorch-bot bot commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12576

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e2e6bf2 with merge base b6b7a16:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 17, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78435552


@facebook-github-bot facebook-github-bot merged commit 9ba0d5e into gh/ahmtox/44/base Jul 30, 2025
102 of 103 checks passed
@facebook-github-bot facebook-github-bot deleted the gh/ahmtox/44/head branch July 30, 2025 16:15
Gasoonjia pushed a commit that referenced this pull request Jul 30, 2025
…anup (#13002)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #12576 by
@ahmtox
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/44/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/44/head
Merge bot PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/43/orig
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/44/orig
@diff-train-skip-merge

cc @SS-JIA @manuelcandales @cbilgin

Co-authored-by: morelos <morelos@devvm4573.ash0.facebook.com>
Co-authored-by: ahmtox <69552192+ahmtox@users.noreply.github.com>