Replies: 2 comments 1 reply
-
@lijh5 What UCX version are you using? Could you run "perf top" on the sender and receiver sides to see what takes the most CPU time?
-
@yosefe 1. I am using hpcx-v2.14 with ucx-v1.15.0. receiver: ucp_am_bw: receiver: From this perspective, tag_bw does indeed do a lot of memory copying on the receiver.
-
The test commands are as follows.
Using tag_bw:
UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1 UCX_ZCOPY_THRESH=16384 UCX_RNDV_THRESH=16384 numactl -N 0 ucx_perftest -t tag_bw -s 4088
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0] 1596961 0.230 0.626 0.626 6224.36 6224.36 1596554 1596554
[thread 0] 3188257 0.230 0.629 0.627 6203.02 6213.69 1591081 1593818
[thread 0] 4785185 0.230 0.626 0.627 6224.57 6217.32 1596607 1594747
[thread 0] 6381601 0.230 0.626 0.627 6223.32 6218.82 1596287 1595132
[thread 0] 7979553 0.230 0.626 0.627 6229.42 6220.94 1597851 1595676
[thread 0] 9576481 0.230 0.626 0.627 6224.44 6221.52 1596575 1595826
Final: 10000000 0.230 0.626 0.627 6229.85 6221.87 1597961 1595916
Using ucp_am_bw:
UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1 UCX_MAX_EAGER_LANES=4 UCX_MAX_RNDV_LANES=4 UCX_ZCOPY_THRESH=16384 UCX_RNDV_THRESH=16384 numactl -N 0 ucx_perftest -t ucp_am_bw -s 4088
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0] 2340027 0.260 0.427 0.427 9133.24 9133.24 2342684 2342684
[thread 0] 4999162 0.240 0.376 0.400 10378.73 9755.98 2662154 2502419
[thread 0] 7635845 0.280 0.379 0.392 10291.09 9934.35 2639674 2548170
Final: 10000000 0.240 0.378 0.389 10311.26 10020.95 2644847 2570383
Question:
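The tag_bw and ucp_am_bw results differ significantly: for 4088-byte messages, tag_bw reaches about 6222 MB/s overall, while ucp_am_bw reaches about 10021 MB/s. Are there any parameters that can be tuned so that tag_bw achieves the performance of ucp_am_bw?
This matters because MPI applications, such as osu_bw, use tag matching and would therefore see much lower performance.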
Thank you.
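
For reference, the size of the gap can be worked out from the two "Final" rows above. The sketch below is only an arithmetic check, not part of UCX: it assumes the "MB/s" column is message rate times message size divided by 2^20 bytes (an inference from the numbers, which line up under that assumption), and the function and variable names are made up for illustration. Note also that the two commands are not identical: the ucp_am_bw run additionally sets UCX_MAX_EAGER_LANES=4 and UCX_MAX_RNDV_LANES=4, so part of the difference may come from that rather than from tag matching alone.

```python
# Arithmetic check on the "Final" rows of the two ucx_perftest runs.
# Assumption: the reported "MB/s" is msg_rate * msg_size / 2**20.

MSG_SIZE = 4088          # bytes, from "-s 4088"
MIB = 1024 * 1024        # assumed unit behind the "MB/s" column


def bandwidth_mib_s(msg_rate: float, msg_size: int = MSG_SIZE) -> float:
    """Convert a message rate (msg/s) into bandwidth (MiB/s)."""
    return msg_rate * msg_size / MIB


# Overall message rates taken from the "Final" rows.
tag_rate = 1595916       # tag_bw
am_rate = 2570383        # ucp_am_bw

tag_bw = bandwidth_mib_s(tag_rate)
am_bw = bandwidth_mib_s(am_rate)
ratio = am_bw / tag_bw

print(f"tag_bw:    {tag_bw:.2f} MiB/s")
print(f"ucp_am_bw: {am_bw:.2f} MiB/s")
print(f"ucp_am_bw / tag_bw = {ratio:.2f}x")
```

Under these assumptions, the computed bandwidths match the reported overall figures, and ucp_am_bw comes out roughly 1.6 times faster than tag_bw at this message size.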