Invoke AMD specific kernel reorder_batched_ad_indices_kernel_vec (#4412)
Summary:
Pull Request resolved: #4412
X-link: facebookresearch/FBGEMM#1483
For the benchmark in the codebase, the larger the product of lengths and num-ads, the better the performance.
Two optimizations:
1. Vector loading in a warp.
2. The product of batch-size and table-size determines the number of thread blocks (https://www.internalfb.com/code/fbsource/[cecfed562b79afad0eb9c44259141f50352da342]/fbcode/deeplearning/fbgemm/fbgemm_gpu/src/sparse_ops/sparse_reorder_batched_ad.cu?lines=361). In MRS models, we expect more thread blocks in our use cases. As such, we shrink the block size to launch more thread blocks, improving compute utilization.
Performance results and local test benchmarks: D77066925
Reviewed By: jwfromm, jianyuh, q10
Differential Revision: D77459476
fbshipit-source-id: 178a111cbcc67a59986410027bacbe75fc92ab26