Skip to content

Conversation

zluda-violet
Copy link
Collaborator

Uses the HIP implementation of __byte_perm.

Uses the HIP implementation of `__byte_perm`.
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements support for the PTX prmt.slow instruction by utilizing the HIP implementation of __byte_perm. The change adds proper handling for the PrmtSlow instruction that was previously unimplemented.

  • Adds a C++ implementation using HIP's __byte_perm function
  • Routes PrmtSlow instructions through the function call mechanism instead of direct LLVM emission
  • Includes test coverage for the new functionality

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
ptx/src/test/spirv_run/prmt_slow.ptx Test case PTX assembly for prmt.slow instruction
ptx/src/test/spirv_run/mod.rs Test registration with input/output validation
ptx/src/test/ll/prmt_slow.ll Expected LLVM IR output for the test
ptx/src/pass/replace_instructions_with_functions.rs Routes PrmtSlow to function call mechanism
ptx/src/pass/llvm/emit.rs Removes direct emission handling, marks as unreachable
ptx/lib/zluda_ptx_impl.cpp Implements prmt_b32 function using HIP's __byte_perm

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@zluda-violet zluda-violet requested a review from vosen September 19, 2025 18:06
@vosen
Copy link
Owner

vosen commented Sep 19, 2025

Did you check with ptx_tests? I think ptx_tests actually cover PrmtSlow

@zluda-violet
Copy link
Collaborator Author

Seems like __byte_perm doesn't handle the sign extend bit. I'm working on fixing it.

Passes the prmt default mode ptx_tests (other modes are still unimplemented).
@zluda-violet
Copy link
Collaborator Author

With the latest change this now passes 100% of the prmt default mode test cases from ptx_tests.

The assembly looks like:

prmt_slow:
	s_load_b128 s[0:3], s[0:1], null                           // 000000001600: F4080000 F8000000
	s_waitcnt lgkmcnt(0)                                       // 000000001608: BF89FC07
	v_mov_b32_e32 v8, 0xff0000                                 // 00000000160C: 7E1002FF 00FF0000
	v_mov_b32_e32 v9, 0xff00                                   // 000000001614: 7E1202FF 0000FF00
	v_dual_mov_b32 v7, 0xff000000 :: v_dual_mov_b32 v0, s0     // 00000000161C: CA1000FF 07000000 FF000000
	v_mov_b32_e32 v1, s1                                       // 000000001628: 7E020201
	flat_load_b96 v[0:2], v[0:1]                               // 00000000162C: DC580000 007C0000
	s_waitcnt vmcnt(0) lgkmcnt(0)                              // 000000001634: BF890007
	v_and_b32_e32 v3, 7, v2                                    // 000000001638: 36060487
	v_lshlrev_b32_e32 v6, 8, v2                                // 00000000163C: 300C0488
	v_lshlrev_b32_e32 v4, 12, v2                               // 000000001640: 3008048C
	v_and_b32_e32 v10, 0x800, v2                               // 000000001644: 361404FF 00000800
	v_and_b32_e32 v11, 0x80, v2                                // 00000000164C: 361604FF 00000080
	s_delay_alu instid0(VALU_DEP_4) | instskip(SKIP_2) | instid1(VALU_DEP_2)// 000000001654: BF870134
	v_and_or_b32 v3, 0x70000, v6, v3                           // 000000001658: D6570003 040E0CFF 00070000
	v_and_b32_e32 v6, 0x8000, v2                               // 000000001664: 360C04FF 00008000
	v_and_b32_e32 v4, 0x7000000, v4                            // 00000000166C: 360808FF 07000000
	v_cmp_eq_u32_e32 vcc_lo, 0, v6                             // 000000001674: 7C940C80
	v_lshlrev_b32_e32 v5, 4, v2                                // 000000001678: 300A0484
	v_and_b32_e32 v2, 8, v2                                    // 00000000167C: 36040488
	s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)// 000000001680: BF870092
	v_and_b32_e32 v5, 0x700, v5                                // 000000001684: 360A0AFF 00000700
	v_or3_b32 v3, v3, v4, v5                                   // 00000000168C: D6580003 04160903
	v_cndmask_b32_e32 v4, 0x80000000, v7, vcc_lo               // 000000001694: 02080EFF 80000000
	v_cmp_eq_u32_e32 vcc_lo, 0, v10                            // 00000000169C: 7C941480
	s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_2) | instid1(VALU_DEP_3)// 0000000016A0: BF8701B3
	v_perm_b32 v0, v1, v0, v3                                  // 0000000016A4: D6440000 040E0101
	v_cndmask_b32_e32 v5, 0x800000, v8, vcc_lo                 // 0000000016AC: 020A10FF 00800000
	v_cmp_eq_u32_e32 vcc_lo, 0, v11                            // 0000000016B4: 7C941680
	v_and_b32_e32 v3, 0x80, v0                                 // 0000000016B8: 360600FF 00000080
	v_cndmask_b32_e32 v1, 0x8000, v9, vcc_lo                   // 0000000016C0: 020212FF 00008000
	v_cmp_eq_u32_e32 vcc_lo, 0, v2                             // 0000000016C8: 7C940480
	s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)// 0000000016CC: BF870092
	v_or3_b32 v1, v4, v5, v1                                   // 0000000016D0: D6580001 04060B04
	v_dual_cndmask_b32 v2, v3, v0 :: v_dual_and_b32 v3, v0, v1 // 0000000016D8: CA640103 02020300
	v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, s3             // 0000000016E0: CA100002 00000003
	s_delay_alu instid0(VALU_DEP_2)                            // 0000000016E8: BF870002
	v_and_or_b32 v2, 0xff, v2, v3                              // 0000000016EC: D6570002 040E04FF 000000FF
	flat_store_b32 v[0:1], v2                                  // 0000000016F8: DC680000 007C0200
	s_waitcnt lgkmcnt(0)                                       // 000000001700: BF89FC07
	s_waitcnt_vscnt null, 0x0                                  // 000000001704: BC7C0000
	s_endpgm                                                   // 000000001708: BFB00000

This is a lot, but the semantics of v_perm_b32 differ from prmt.b32 significantly enough that I'm not sure if there's any better way to do this. I could add a branch that only handles the sign bits if s & 0x8888 – I don't know if that would be an improvement.

I believe LLVM should be able to constant fold this to a single v_perm_b32 in the case of a constant selector, but I'll double check to make sure.

@zluda-violet
Copy link
Collaborator Author

I've verified that identical assembly is generated by Prmt and PrmtSlow, so I'll remove the distinction.

@vosen
Copy link
Owner

vosen commented Sep 23, 2025

Looks ok, I don't think there's any usage of non-constant selectors in the wild (though it's not always passed as an immediate). Question, what happens if there's a sign extension (in constant selector)? Do we get folder into a single v_perm_b32 if it's supported by the HW? What happens if the pattern is not support by the HW? My reading of the RDNA ISA and PTX is that not every source byte can get sign extended

@zluda-violet
Copy link
Collaborator Author

I've added a change that improves constant folding of a constant selector such that it will compile to a single v_perm_b32 instruction if the AMD hardware supports it. It also makes non-constant selectors even slower.

@vosen vosen merged commit 93820e3 into vosen:master Sep 23, 2025
6 checks passed
@zluda-violet zluda-violet deleted the prmt-slow branch September 23, 2025 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants