
Conversation


@pfultz2 pfultz2 commented Aug 1, 2025

No description provided.

@pfultz2 pfultz2 requested a review from causten as a code owner August 1, 2025 15:15
@pfultz2 pfultz2 self-assigned this Aug 1, 2025
@causten causten requested review from aarushjain29 and Copilot August 1, 2025 16:44

Copilot AI left a comment


Pull Request Overview

This PR fixes a bug with the invert_permutation function on GPU by implementing a new algorithm that correctly inverts permutations. The previous implementation incorrectly used reorder_dims which led to wrong results.

  • Replaces incorrect invert_permutation implementation with proper algorithm
  • Adds static assertions to validate shape transformations work correctly
  • Includes new test case for reshape-transpose-reshape-broadcast-sub operations

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File | Description
test/verify/test_reshape_transpose_reshape_broadcase_sub.cpp | Adds a test case to verify the fix for the reshape-transpose-reshape-broadcast-sub pattern
src/targets/gpu/kernels/include/migraphx/kernels/shape.hpp | Adds a static assertion to validate shape transformations
src/targets/gpu/kernels/include/migraphx/kernels/permutation.hpp | Implements the corrected invert_permutation algorithm
src/targets/gpu/kernels/include/migraphx/kernels/copy.hpp | Adds a static assertion and removes an unused include


#include <cassert>

struct test_reshape_transpose_reshape_broadcast_sub

Copilot AI Aug 1, 2025


The struct name contains a typo: 'broadcase' should be 'broadcast' to match the actual operation being tested.

auto x1 =
mm->add_parameter("x1", migraphx::shape{migraphx::shape::float_type, {1, 512, 16, 16}});
auto reshape1 =
mm->add_instruction(migraphx::make_op("reshape", {{"dims", {1, 32, 16, 16, 16}}}), x1);

Copilot AI Aug 1, 2025


The reshape operation transforms a tensor from shape {1, 512, 16, 16} to {1, 32, 16, 16, 16}, but the element count doesn't match: 1×512×16×16 = 131072 vs 1×32×16×16×16 = 131072. However, 512 should be factored as 32×16, not result in an extra dimension.

@@ -0,0 +1,55 @@
/*
Contributor


This filename likely has a typo: broadcase. It should be renamed in that case. Thanks.


@CharlieL7 CharlieL7 left a comment


Explain the original bug and testcase. Otherwise looks OK.

@pfultz2

pfultz2 commented Aug 4, 2025

Explain the original bug and testcase. Otherwise looks OK.

The input is transposed with a permutation for which invert_permutation returned an incorrect result. Since the input is transposed, tiling was enabled to copy it to LDS, which then tried to pack the tensor using find_permutation and invert_permutation. This put the tensor in the wrong order, producing incorrect results. I also added static_asserts in the kernel for a couple of functions to check this at compile time, just in case we see a similar problem in the future.

@TedThemistokleous

@pfultz2 onnx failure has already been resolved due to something in upstream. Your changes didn't break that at all.

@migraphx-bot

Test | Batch | Rate new (f1817c) | Rate old (284037) | Diff
torchvision-resnet50 64 3,248.81 3,247.43 0.04%
torchvision-resnet50_fp16 64 6,949.51 6,948.21 0.02%
torchvision-densenet121 32 2,450.96 2,450.59 0.02%
torchvision-densenet121_fp16 32 4,167.00 4,171.75 -0.11%
torchvision-inceptionv3 32 1,635.87 1,634.45 0.09%
torchvision-inceptionv3_fp16 32 2,762.65 2,761.48 0.04%
cadene-inceptionv4 16 771.00 771.24 -0.03%
cadene-resnext64x4 16 814.08 814.23 -0.02%
slim-mobilenet 64 7,460.42 7,457.62 0.04%
slim-nasnetalarge 64 211.18 211.13 0.02%
slim-resnet50v2 64 3,340.71 3,343.56 -0.09%
bert-mrpc-onnx 8 1,146.33 1,144.72 0.14%
bert-mrpc-tf 1 445.76 442.76 0.68%
pytorch-examples-wlang-gru 1 294.06 353.41 -16.79% 🔴
pytorch-examples-wlang-lstm 1 408.73 406.54 0.54%
torchvision-resnet50_1 1 759.90 762.51 -0.34%
cadene-dpn92_1 1 386.25 389.54 -0.84%
cadene-resnext101_1 1 387.26 393.49 -1.58%
onnx-taau-downsample 1 395.77 396.45 -0.17%
dlrm-criteoterabyte 1 33.78 33.77 0.04%
dlrm-criteoterabyte_fp16 1 51.21 51.22 -0.02%
agentmodel 1 8,952.63 9,068.49 -1.28%
unet_fp16 2 59.19 59.14 0.09%
resnet50v1_fp16 1 989.13 966.77 2.31%
resnet50v1_int8 1 1,029.07 1,044.17 -1.45%
bert_base_cased_fp16 64 1,107.35 1,107.24 0.01%
bert_large_uncased_fp16 32 345.33 345.26 0.02%
bert_large_fp16 1 197.12 196.54 0.30%
distilgpt2_fp16 16 2,118.20 2,117.52 0.03%
yolov5s 1 576.24 575.45 0.14%
tinyllama 1 43.97 43.97 0.00%
vicuna-fastchat 1 45.23 45.35 -0.26%
whisper-tiny-encoder 1 417.65 417.21 0.11%
whisper-tiny-decoder 1 400.39 399.93 0.11%
llama2_7b 1 19.16 19.15 0.02%
qwen1.5-7b 1 23.53 23.53 0.00%
phi3-3.8b 1 26.68 26.66 0.05%
mask-rcnn 1 12.40 12.39 0.08%
llama3-8b 1 21.74 21.77 -0.13%
whisper-large-encoder 1 10.22 10.22 0.03%
whisper-large-decoder 1 96.57 96.06 0.54%
mistral-7b 1 23.73 23.72 0.02%
FLUX.1-schnell 1 741.12 745.73 -0.62%

This build is not recommended to merge 🔴

@migraphx-bot


     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

❌bert-mrpc-tf: ERROR - check error output

error: unknown warning option '-Wnrvo' [-Werror,-Wunknown-warning-option]

2025-08-08 01:03:07.179307: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1754632992.466190 173439 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 62951 MB memory: -> device: 0, name: AMD Instinct MI250X/MI250, pci bus id: 0000:b3:00.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1754632993.339199 173439 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2025-08-08 01:03:21.917913: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918059: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918106: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918158: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918205: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918256: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918307: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-08-08 01:03:21.918338: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
2025-08-08 01:03:21.919352: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:228] INTERNAL: Generating device code failed.
2025-08-08 01:03:21.920537: W tensorflow/core/framework/op_kernel.cc:1829] UNKNOWN: JIT compilation failed.
2025-08-08 01:03:21.920557: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
2025-08-08 01:03:21.920567: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
2025-08-08 01:03:21.920582: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 11217777527359497193
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1407, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1390, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1483, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
(1) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 359, in
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 335, in main
    y_out = sess.run(y, feed_dict=tf_dict)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 977, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1220, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1400, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1426, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'import/bert/embeddings/LayerNorm/moments/SquaredDifference' defined at (most recent call last):
Node: 'import/bert/embeddings/LayerNorm/moments/SquaredDifference'
Detected at node 'import/bert/embeddings/LayerNorm/moments/SquaredDifference' defined at (most recent call last):
Node: 'import/bert/embeddings/LayerNorm/moments/SquaredDifference'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
(1) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'import/bert/embeddings/LayerNorm/moments/SquaredDifference':


     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

🔴bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

     ✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

     ✅ llama2_7b: PASSED: MIGraphX meets tolerance

     ✅ qwen1.5-7b: PASSED: MIGraphX meets tolerance

     ✅ phi3-3.8b: PASSED: MIGraphX meets tolerance

🔴mask-rcnn: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ llama3-8b: PASSED: MIGraphX meets tolerance

     ✅ whisper-large-decoder: PASSED: MIGraphX meets tolerance

     ✅ mistral-7b: PASSED: MIGraphX meets tolerance

     ✅ FLUX.1-schnell: PASSED: MIGraphX meets tolerance

@pfultz2 pfultz2 added the high priority A PR with high priority for review and merging. label Aug 11, 2025

@lakhinderwalia lakhinderwalia left a comment


(Thanks for renaming the test file. Approving this PR).

@pfultz2 pfultz2 merged commit beb6bee into develop Aug 14, 2025
45 of 49 checks passed
@pfultz2 pfultz2 deleted the gpu-invert-perm-bug branch August 14, 2025 02:09
