Description
Hi, we have a CPU-only setup and are trying to build a custom program that uses Gloo's point-to-point API (i.e. `createSendBuffer`, `send`, etc.) with the ibverbs transport. However, we're unable to saturate the NIC bandwidth. Specifically:
HW setup: 1 host, 2 CX-5 NICs with 2 ports each (100G per port, 4 ports total). All ports are connected through a single Tofino switch. We run 4 duplicate processes, with 1 process mapped to 1 port. This setup is intended to 'mock' 4 hosts with 1 NIC each.
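For reference, our custom program drives the point-to-point API roughly as follows. This is only a minimal sketch: the rendezvous details mirror the benchmark flags below (Redis on 127.0.0.1, prefix 000), the device attributes (mlx5_0, port 1) are per-process placeholders, and error handling is omitted.

```cpp
#include <memory>
#include <vector>

#include <gloo/rendezvous/context.h>
#include <gloo/rendezvous/prefix_store.h>
#include <gloo/rendezvous/redis_store.h>
#include <gloo/transport/ibverbs/device.h>

// Minimal sketch of how we use the pair API; rank/size/peer and the
// device attributes are placeholders for our 4-process, 1-port-per-process setup.
void sendToPeer(int rank, int size, int peer) {
  gloo::transport::ibverbs::attr attr;
  attr.name = "mlx5_0";  // each process is pinned to one NIC port
  attr.port = 1;
  attr.index = 0;
  auto device = gloo::transport::ibverbs::CreateDevice(attr);

  auto redis = std::make_shared<gloo::rendezvous::RedisStore>("127.0.0.1");
  gloo::rendezvous::PrefixStore store("000", *redis);

  auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
  context->connectFullMesh(store, device);

  std::vector<char> data(1 << 20);
  auto& pair = context->getPair(peer);
  auto buf = pair->createSendBuffer(/*slot=*/0, data.data(), data.size());
  buf->send();      // post the send
  buf->waitSend();  // block until the send completes
}
```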
As part of debugging this issue, we tried running gloo-benchmark and saw the same issue (not being able to saturate the NIC bandwidth):
```
username@hostname:(path)/gloo$ ./build/gloo/benchmark/benchmark --size 4 --rank 0 --redis-host 127.0.0.1 --redis-port 6379 --prefix 000 --transport ibverbs --ib-device mlx5_0 allreduce_ring --iteration-count 10 --warmup-iters 5 --no-verify --elements 1048576
My addr for i: 1 is LID: 0 QPN: 3320 PSN: 9127271
My addr for i: 2 is LID: 0 QPN: 3321 PSN: 8070086
My addr for i: 3 is LID: 0 QPN: 3322 PSN: 3971177
====================================================================================================
ALLREDUCE_RING
Device:  ibverbs, pci=0000:41:00.0, dev=mlx5_0, port=1, index=0
Options: processes=4, inputs=1, threads=1, verify=false
====================================================================================================
BENCHMARK RESULTS
size (B)   elements   min (us)   p50 (us)   p99 (us)   max (us)   bandwidth (GB/s)   iterations
 4194304    1048576      22090      25028      29469      29469              0.155           10
```
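For scale, assuming the reported bandwidth is roughly message size divided by per-iteration latency, this is around 1% of what a single 100G port can carry:

$$
\frac{4194304\ \text{B}}{25028\ \mu\text{s}} \approx 0.156\ \text{GiB/s}
\qquad\text{vs.}\qquad
100\ \text{Gb/s} \approx 12.5\ \text{GB/s per port}
$$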
We think there are largely 3 possible reasons:
1. The algorithm (`AllreduceRing`) is faulty,
2. The way the ibverbs backend uses the libibverbs API is faulty, or
3. Our HW/SW setup is faulty.

We do not think 3 is the case, because with perftest we are able to saturate the bandwidth fairly easily:
```
---------------------------------------------------------------------------------------
username@hostname:~/perftest$ taskset -c 16-31 ib_write_bw -d mlx5_1 -D 20 10.10.10.1 --report_gbits
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF     Device : mlx5_1
Number of qps : 1   Transport type : IB
Connection type : RC   Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0bab PSN 0x1f096e RKey 0x04379b VAddr 0x007f57c5673000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:02
remote address: LID 0000 QPN 0x0cdf PSN 0xf3396e RKey 0x007a02 VAddr 0x007fe65d37d000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:01
---------------------------------------------------------------------------------------
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
Conflicting CPU frequency values detected: 1498.322000 != 3150.070000. CPU Frequency is not max.
65536      1551679        0.00               81.35                 0.155167
```
Concerning 2., we see the following differences from perftest. We are not sure whether they are significant enough to account for the performance difference.
- Gloo uses a single thread that polls the CQ and communicates via condition variables when it sees a completion event.
- Gloo allocates one big MR and one big WR, whereas perftest allocates one big MR but many small WRs (see the sketch below).
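To illustrate the WR difference: perftest keeps a deep pipeline of outstanding work requests (TX depth 128 in the output above), roughly like the sketch below. This is only a hedged sketch of the generic libibverbs pattern, not Gloo's or perftest's actual code; the qp, cq, mr, remote address, and rkey are assumed to have been created and exchanged elsewhere.

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Sketch: keep up to txDepth RDMA_WRITE work requests in flight instead of
// waiting for each one to complete before posting the next.
void writePipelined(ibv_qp* qp, ibv_cq* cq, ibv_mr* mr,
                    uint64_t raddr, uint32_t rkey,
                    size_t msgSize, int iterations, int txDepth = 128) {
  int outstanding = 0;
  int posted = 0;
  int completed = 0;
  while (completed < iterations) {
    // Fill the pipeline up to txDepth outstanding WRs.
    while (posted < iterations && outstanding < txDepth) {
      ibv_sge sge = {};
      sge.addr = reinterpret_cast<uint64_t>(mr->addr);
      sge.length = static_cast<uint32_t>(msgSize);
      sge.lkey = mr->lkey;

      ibv_send_wr wr = {};
      ibv_send_wr* bad = nullptr;
      wr.wr_id = posted;
      wr.sg_list = &sge;
      wr.num_sge = 1;
      wr.opcode = IBV_WR_RDMA_WRITE;
      wr.send_flags = IBV_SEND_SIGNALED;
      wr.wr.rdma.remote_addr = raddr;
      wr.wr.rdma.rkey = rkey;
      if (ibv_post_send(qp, &wr, &bad) != 0) {
        return;  // error handling omitted in this sketch
      }
      ++posted;
      ++outstanding;
    }
    // Reap completions; each one frees a slot in the pipeline.
    ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);
    if (n > 0) {
      completed += n;    // status checks omitted in this sketch
      outstanding -= n;
    }
  }
}
```

With only one signaled WR in flight at a time, every message pays the full post/poll round trip before the next one starts, whereas a deep pipeline keeps the QP busy and hides that latency.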
Could you help provide some leads/insight into debugging and understanding this issue?
Thank you.