
Conversation


@pfultz2 pfultz2 commented Aug 1, 2025

No description provided.

@pfultz2 pfultz2 requested a review from causten as a code owner August 1, 2025 15:15
@pfultz2 pfultz2 self-assigned this Aug 1, 2025
@causten causten requested review from aarushjain29 and Copilot August 1, 2025 16:44

Copilot AI left a comment


Pull Request Overview

This PR fixes a bug with the invert_permutation function on GPU by implementing a new algorithm that correctly inverts permutations. The previous implementation incorrectly used reorder_dims which led to wrong results.

  • Replaces incorrect invert_permutation implementation with proper algorithm
  • Adds static assertions to validate shape transformations work correctly
  • Includes new test case for reshape-transpose-reshape-broadcast-sub operations

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File | Description
test/verify/test_reshape_transpose_reshape_broadcase_sub.cpp | Adds a test case to verify the fix for the reshape-transpose-reshape-broadcast-sub pattern
src/targets/gpu/kernels/include/migraphx/kernels/shape.hpp | Adds a static assertion to validate shape transformations
src/targets/gpu/kernels/include/migraphx/kernels/permutation.hpp | Implements the corrected invert_permutation algorithm
src/targets/gpu/kernels/include/migraphx/kernels/copy.hpp | Adds a static assertion and removes an unused include


#include <cassert>

struct test_reshape_transpose_reshape_broadcast_sub

Copilot AI Aug 1, 2025


The struct name contains a typo: 'broadcase' should be 'broadcast' to match the actual operation being tested.

auto x1 =
mm->add_parameter("x1", migraphx::shape{migraphx::shape::float_type, {1, 512, 16, 16}});
auto reshape1 =
mm->add_instruction(migraphx::make_op("reshape", {{"dims", {1, 32, 16, 16, 16}}}), x1);

Copilot AI Aug 1, 2025


The reshape operation transforms a tensor from shape {1, 512, 16, 16} to {1, 32, 16, 16, 16}, but the element count doesn't match: 1×512×16×16 = 131072 vs 1×32×16×16×16 = 131072. However, 512 should be factored as 32×16, not result in an extra dimension.

@@ -0,0 +1,55 @@
/*
Contributor


This filename likely has a typo: broadcase. It should be renamed in that case. Thanks.


@CharlieL7 CharlieL7 left a comment


Explain the original bug and testcase. Otherwise looks OK.

@pfultz2

pfultz2 commented Aug 4, 2025

Explain the original bug and testcase. Otherwise looks OK.

The input is transposed with a permutation for which invert_permutation returned an incorrect result. Since the input is transposed, tiling was enabled to copy it to LDS, which then tried to pack the tensor using find_permutation and invert_permutation. This put the tensor in the wrong order, producing incorrect results. I also added static_asserts in the kernel for a couple of functions to check this at compile time, just in case we see a similar problem in the future.

@TedThemistokleous

@pfultz2 onnx failure has already been resolved due to something in upstream. Your changes didn't break that at all.

@migraphx-bot

Test | Batch | Rate new (f1817c) | Rate old (284037) | Diff
torchvision-resnet50 64 3,248.81 3,247.43 0.04%
torchvision-resnet50_fp16 64 6,949.51 6,948.21 0.02%
torchvision-densenet121 32 2,450.96 2,450.59 0.02%
torchvision-densenet121_fp16 32 4,167.00 4,171.75 -0.11%
torchvision-inceptionv3 32 1,635.87 1,634.45 0.09%
torchvision-inceptionv3_fp16 32 2,762.65 2,761.48 0.04%
cadene-inceptionv4 16 771.00 771.24 -0.03%
cadene-resnext64x4 16 814.08 814.23 -0.02%
slim-mobilenet 64 7,460.42 7,457.62 0.04%
slim-nasnetalarge 64 211.18 211.13 0.02%
slim-resnet50v2 64 3,340.71 3,343.56 -0.09%
bert-mrpc-onnx 8 1,146.33 1,144.72 0.14%
bert-mrpc-tf 1 445.76 442.76 0.68%
pytorch-examples-wlang-gru 1 294.06 353.41 -16.79% 🔴
pytorch-examples-wlang-lstm 1 408.73 406.54 0.54%
torchvision-resnet50_1 1 759.90 762.51 -0.34%
cadene-dpn92_1 1 386.25 389.54 -0.84%
cadene-resnext101_1 1 387.26 393.49 -1.58%
onnx-taau-downsample 1 395.77 396.45 -0.17%
dlrm-criteoterabyte 1 33.78 33.77 0.04%
dlrm-criteoterabyte_fp16 1 51.21 51.22 -0.02%
agentmodel 1 8,952.63 9,068.49 -1.28%
unet_fp16 2 59.19 59.14 0.09%
resnet50v1_fp16 1 989.13 966.77 2.31%
resnet50v1_int8 1 1,029.07 1,044.17 -1.45%
bert_base_cased_fp16 64 1,107.35 1,107.24 0.01%
bert_large_uncased_fp16 32 345.33 345.26 0.02%
bert_large_fp16 1 197.12 196.54 0.30%
distilgpt2_fp16 16 2,118.20 2,117.52 0.03%
yolov5s 1 576.24 575.45 0.14%
tinyllama 1 43.97 43.97 0.00%
vicuna-fastchat 1 45.23 45.35 -0.26%
whisper-tiny-encoder 1 417.65 417.21 0.11%
whisper-tiny-decoder 1 400.39 399.93 0.11%
llama2_7b 1 19.16 19.15 0.02%
qwen1.5-7b 1 23.53 23.53 0.00%
phi3-3.8b 1 26.68 26.66 0.05%
mask-rcnn 1 12.40 12.39 0.08%
llama3-8b 1 21.74 21.77 -0.13%
whisper-large-encoder 1 10.22 10.22 0.03%
whisper-large-decoder 1 96.57 96.06 0.54%
mistral-7b 1 23.73 23.72 0.02%
FLUX.1-schnell 1 741.12 745.73 -0.62%

This build is not recommended to merge 🔴

@migraphx-bot


     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

❌bert-mrpc-tf: ERROR - check error output

error: unknown warning option '-Wnrvo' [-Werror,-Wunknown-warning-option]

2025-08-08 01:03:07.179307: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1754632992.466190 173439 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 62951 MB memory: -> device: 0, name: AMD Instinct MI250X/MI250, pci bus id: 0000:b3:00.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1754632993.339199 173439 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2025-08-08 01:03:21.917913: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918059: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918106: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918158: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918205: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918256: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918307: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918338: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
2025-08-08 01:03:21.919352: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:228] INTERNAL: Generating device code failed.
2025-08-08 01:03:21.920537: W tensorflow/core/framework/op_kernel.cc:1829] UNKNOWN: JIT compilation failed.
2025-08-08 01:03:21.920557: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
2025-08-08 01:03:21.920567: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
2025-08-08 01:03:21.920582: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 11217777527359497193
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1407, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1390, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1483, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
(1) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 359, in
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 335, in main
    y_out = sess.run(y, feed_dict=tf_dict)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 977, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1220, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1400, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1426, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'import/bert/embeddings/LayerNorm/moments/SquaredDifference' defined at (most recent call last):
Node: 'import/bert/embeddings/LayerNorm/moments/SquaredDifference'
Detected at node 'import/bert/embeddings/LayerNorm/moments/SquaredDifference' defined at (most recent call last):
Node: 'import/bert/embeddings/LayerNorm/moments/SquaredDifference'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
(1) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'import/bert/embeddings/LayerNorm/moments/SquaredDifference':


     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

🔴bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

     ✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

     ✅ llama2_7b: PASSED: MIGraphX meets tolerance

     ✅ qwen1.5-7b: PASSED: MIGraphX meets tolerance

     ✅ phi3-3.8b: PASSED: MIGraphX meets tolerance

🔴mask-rcnn: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ llama3-8b: PASSED: MIGraphX meets tolerance

     ✅ whisper-large-decoder: PASSED: MIGraphX meets tolerance

     ✅ mistral-7b: PASSED: MIGraphX meets tolerance

     ✅ FLUX.1-schnell: PASSED: MIGraphX meets tolerance

@pfultz2 pfultz2 added the high priority A PR with high priority for review and merging. label Aug 11, 2025

@lakhinderwalia lakhinderwalia left a comment


(Thanks for renaming the test file. Approving this PR).

@pfultz2 pfultz2 merged commit beb6bee into develop Aug 14, 2025
45 of 49 checks passed
@pfultz2 pfultz2 deleted the gpu-invert-perm-bug branch August 14, 2025 02:09
