[Feature] support c16 prefix_cache in flash_attention_v3 #2766

lizhenyun01 · 2025-07-09T03:44:07Z

support c16 prefix_cache in flash_attention_v3

paddle-bot · 2025-07-09T03:44:18Z

Thanks for your contribution!

Copilot

Pull Request Overview

This PR adds support for 16-bit (c16) prefix cache in flash_attention_v3 by introducing a new dequantization kernel and updating the dispatch logic.

Added append_dequant_cache_kv_c16 kernel for c16 (no-quant) cache paths.
Extended AppendDequantCache to launch the c16 kernel when cache_quant_type == "none".
Removed an unused include for flash_attn_v3_kernel.h.
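
As a rough illustration of the dispatch change summarized above, the host-side sketch below maps the cache_quant_type string to a kernel choice. Only the "none" value comes from the PR overview; the enum, the function name select_dequant_kernel, and the rule that every other value means a quantized path are hypothetical and are not taken from the actual AppendDequantCache code.

```cpp
#include <string>

// Hypothetical tags for the kernels a dispatcher could choose between.
enum class DequantKernel { kC16, kQuantized };

// Illustrative only: pick a kernel family from the cache_quant_type string.
// "none" is the value named in the PR overview; treating every other value
// as a quantized cache path is an assumption made for this sketch.
inline DequantKernel select_dequant_kernel(const std::string& cache_quant_type) {
  if (cache_quant_type == "none") {
    return DequantKernel::kC16;  // 16-bit cache, values are copied without scaling
  }
  return DequantKernel::kQuantized;
}
```

A real launcher would then branch on the returned tag to configure and launch the matching kernel.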

Copilot · 2025-07-09T04:22:12Z

custom_ops/gpu_ops/append_attn/gqa_rope_write_cache.cu

+  const uint32_t batch_id = batch_ids[tile_idx];
+  const uint32_t start_kv_idx = tile_ids_per_batch[tile_idx] * BLOCK_SIZE;
+  const uint32_t end_idx = seq_lens_decoder[batch_id] - start_kv_idx;
+  if (seq_lens_this_time <= 0) {


The condition compares the pointer seq_lens_this_time itself rather than an element of the array it points to, so it does not test any sequence length. You likely meant to check a sequence length element, e.g., seq_lens_this_time[tile_idx] <= 0 or seq_lens_this_time[batch_id] <= 0.

Suggested change

-  if (seq_lens_this_time <= 0) {
+  if (seq_lens_this_time[batch_id] <= 0) {
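
The difference between testing the pointer and testing an element can be sketched on the host as follows. The helper name batch_is_empty and the int element type are assumptions for illustration, and indexing by batch_id follows the suggestion above rather than anything confirmed by the final code.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the given batch contributes no tokens in this step.
// The check reads the element for this batch; a non-null pointer on its own
// says nothing about the lengths stored in the array.
inline bool batch_is_empty(const int* seq_lens_this_time, uint32_t batch_id) {
  assert(seq_lens_this_time != nullptr);
  return seq_lens_this_time[batch_id] <= 0;
}
```

With lengths {4, 0, 7}, only batch 1 would be skipped, whereas a check on the pointer itself would skip no batch at all.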

Copilot · 2025-07-09T04:22:12Z

custom_ops/gpu_ops/append_attn/gqa_rope_write_cache.cu

+        k_tile_ptr0[8] = frag_dq_T[6];
+        k_tile_ptr0[9] = frag_dq_T[7];


This write is using k_tile_ptr0 instead of k_tile_ptr1 for the second half of the fragment. It should be k_tile_ptr1[8] = frag_dq_T[6]; (and similarly for index 9).

Suggested change

-        k_tile_ptr0[8] = frag_dq_T[6];
-        k_tile_ptr0[9] = frag_dq_T[7];
+        k_tile_ptr1[8] = frag_dq_T[6];
+        k_tile_ptr1[9] = frag_dq_T[7];
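
To make the intent of this suggestion concrete, here is a host-side sketch of scattering one 8-value fragment across two output rows. Only the last two writes correspond to lines quoted in this review; the mapping of fragment values 0 to 5 and the name scatter_fragment are assumptions, modeled on the common layout where a thread owns two rows with two column pairs each.

```cpp
#include <array>
#include <cassert>

// Scatter one thread's 8-value fragment into two output rows.
// Assumed layout (not confirmed by the PR): values 0-1 and 4-5 belong to
// row0, values 2-3 and 6-7 belong to row1, and the second pair in each row
// sits 8 columns after the first.
template <typename T>
void scatter_fragment(const std::array<T, 8>& frag, T* row0, T* row1) {
  assert(row0 != nullptr && row1 != nullptr);
  row0[0] = frag[0];
  row0[1] = frag[1];
  row1[0] = frag[2];
  row1[1] = frag[3];
  row0[8] = frag[4];
  row0[9] = frag[5];
  row1[8] = frag[6];  // the write the review comment targets
  row1[9] = frag[7];
}
```

Under this layout, writing frag[6] and frag[7] through the row0 pointer would overwrite the values stored from frag[4] and frag[5] and leave columns 8 and 9 of row1 unset.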

Copilot · 2025-07-09T04:22:13Z

custom_ops/gpu_ops/append_attn/gqa_rope_write_cache.cu

+        v_tile_ptr0[8] = frag_dq_T[6];
+        v_tile_ptr0[9] = frag_dq_T[7];


Similar to the K path, this write uses v_tile_ptr0 instead of v_tile_ptr1 for the second row. It should be v_tile_ptr1[8] = frag_dq_T[6]; (and likewise for element 9).

Suggested change

-        v_tile_ptr0[8] = frag_dq_T[6];
-        v_tile_ptr0[9] = frag_dq_T[7];
+        v_tile_ptr1[8] = frag_dq_T[6];
+        v_tile_ptr1[9] = frag_dq_T[7];

support c16 prompt_cache in fa3

1b1ecb3

Jiang-Jia-Jun requested a review from Copilot July 9, 2025 04:20

Copilot AI reviewed Jul 9, 2025

fix prefix_cache in fa3

e477ccf
