Optimize tbe_input_combine_with_length_cuda on AMD #4430

JChunX · 2025-07-01T21:40:31Z

Summary:
The DPER frontend benchmark shows tbe_input_combine_with_length_cuda as one of the top contributors to local net latency on the CMF fully remote model. The effect is most pronounced on AMD, where latency for this kernel is ~2x that of NVIDIA (albeit with AMD executing more kernels in parallel).

Setting VEC_WIDTH=32 increases the number of items each thread processes on AMD, improving memory access patterns and taking advantage of the larger memory bandwidth of AMD GPUs.
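
For readers unfamiliar with the idea, here is a minimal CPU-side sketch of what "more items per thread" means. It is not the FBGEMM kernel: the function names `threads_needed` and `combine_copy_sim`, the `dst_offset` parameter, and the element type are all hypothetical, and the loop over `tid` only simulates the GPU threads that a real CUDA/HIP launch would run in parallel. Each simulated thread copies one contiguous chunk of `vec_width` elements, so a wider chunk means fewer threads and longer contiguous accesses per thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: how many simulated threads are needed to cover
// n elements when each thread handles vec_width contiguous elements.
std::size_t threads_needed(std::size_t n, std::size_t vec_width) {
  assert(vec_width > 0);
  return (n + vec_width - 1) / vec_width;  // ceiling division
}

// Hypothetical CPU simulation of a copy kernel. The outer loop stands in
// for GPU threads; each "thread" copies its own contiguous chunk of src
// into dst, starting at dst_offset.
void combine_copy_sim(const std::vector<std::int64_t>& src,
                      std::vector<std::int64_t>& dst,
                      std::size_t dst_offset,
                      std::size_t vec_width) {
  assert(dst.size() >= dst_offset + src.size());
  const std::size_t n = src.size();
  const std::size_t num_threads = threads_needed(n, vec_width);
  for (std::size_t tid = 0; tid < num_threads; ++tid) {
    const std::size_t begin = tid * vec_width;
    // The last chunk may be shorter than vec_width, so clamp to n.
    const std::size_t end = std::min(begin + vec_width, n);
    for (std::size_t i = begin; i < end; ++i) {
      dst[dst_offset + i] = src[i];
    }
  }
}
```

Under these assumptions, covering 100 elements takes 13 simulated threads at a width of 8 but only 4 at a width of 32. Whether a wider chunk actually helps on a given GPU depends on the hardware and access pattern, which is what the benchmark in this PR measures.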

Differential Revision: D75886673

netlify · 2025-07-01T21:40:36Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`c7d94bb`
🔍 Latest deploy log	https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68657e01dff42b000800d88a
😎 Deploy Preview	https://deploy-preview-4430--pytorch-fbgemm-docs.netlify.app

facebook-github-bot · 2025-07-01T21:40:42Z

This pull request was exported from Phabricator. Differential Revision: D75886673

facebook-github-bot · 2025-07-02T18:27:52Z

This pull request was exported from Phabricator. Differential Revision: D75886673

Summary:
Pull Request resolved: pytorch#4430

X-link: facebookresearch/FBGEMM#1496

The DPER frontend benchmark shows tbe_input_combine_with_length_cuda as one of the top contributors to local net latency on the CMF fully remote model. The effect is most pronounced on AMD, where latency for this kernel is ~2x that of NVIDIA (albeit with AMD executing more kernels in parallel).

Setting VEC_WIDTH=32 increases the number of items each thread processes on AMD, improving memory access patterns and taking advantage of the larger memory bandwidth of AMD GPUs.

Reviewed By: q10

Differential Revision: D75886673

facebook-github-bot · 2025-07-02T18:44:10Z

This pull request was exported from Phabricator. Differential Revision: D75886673

facebook-github-bot added the cla signed label Jul 1, 2025

facebook-github-bot added the fb-exported label Jul 1, 2025

JChunX force-pushed the export-D75886673 branch 2 times, most recently from 3cd28ce to 0f2c26c Compare July 2, 2025 18:24

JChunX force-pushed the export-D75886673 branch from 0f2c26c to 50880c9 Compare July 2, 2025 18:28

JChunX force-pushed the export-D75886673 branch from 50880c9 to c7d94bb Compare July 2, 2025 18:44
