Description
Hi, we have a CPU-only setup and are trying to build a custom program that uses Gloo's point-to-point API (i.e. `createSendBuffer`, `send`, etc.) with the ibverbs transport. However, we're unable to saturate the NIC bandwidth. Specifically:
HW setup: 1 host, 2 CX-5 NICs with 2 ports each (100G per port, 4 ports total). All ports are connected through a single Tofino switch. We run 4 duplicate processes, with 1 process mapped to 1 port. This setup is intended to 'mock' 4 hosts with 1 NIC each.
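For reference, our custom program drives the point-to-point API roughly as follows. This is only a minimal sketch: the rendezvous details mirror the benchmark flags below (Redis on 127.0.0.1, prefix 000), the device attributes (mlx5_0, port 1) are per-process placeholders, and error handling is omitted.

```cpp
#include <memory>
#include <vector>

#include <gloo/rendezvous/context.h>
#include <gloo/rendezvous/prefix_store.h>
#include <gloo/rendezvous/redis_store.h>
#include <gloo/transport/ibverbs/device.h>

// Minimal sketch of how we use the pair API; rank/size/peer and the
// device attributes are placeholders for our 4-process, 1-port-per-process setup.
void sendToPeer(int rank, int size, int peer) {
  gloo::transport::ibverbs::attr attr;
  attr.name = "mlx5_0";  // each process is pinned to one NIC port
  attr.port = 1;
  attr.index = 0;
  auto device = gloo::transport::ibverbs::CreateDevice(attr);

  auto redis = std::make_shared<gloo::rendezvous::RedisStore>("127.0.0.1");
  gloo::rendezvous::PrefixStore store("000", *redis);

  auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
  context->connectFullMesh(store, device);

  std::vector<char> data(1 << 20);
  auto& pair = context->getPair(peer);
  auto buf = pair->createSendBuffer(/*slot=*/0, data.data(), data.size());
  buf->send();      // post the send
  buf->waitSend();  // block until the send completes
}
```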
As part of debugging this issue, we tried running gloo-benchmark and saw the same issue (not being able to saturate the NIC bandwidth):
```
username@hostname:(path)/gloo$ ./build/gloo/benchmark/benchmark --size 4 --rank 0 --redis-host 127.0.0.1 --redis-port 6379 --prefix 000 --transport ibverbs --ib-device mlx5_0 allreduce_ring --iteration-count 10 --warmup-iters 5 --no-verify --elements 1048576
My addr for i: 1 is LID: 0 QPN: 3320 PSN: 9127271
My addr for i: 2 is LID: 0 QPN: 3321 PSN: 8070086
My addr for i: 3 is LID: 0 QPN: 3322 PSN: 3971177
====================================================================================================
ALLREDUCE_RING
Device:  ibverbs, pci=0000:41:00.0, dev=mlx5_0, port=1, index=0
Options: processes=4, inputs=1, threads=1, verify=false
====================================================================================================
BENCHMARK RESULTS
size (B)   elements   min (us)   p50 (us)   p99 (us)   max (us)   bandwidth (GB/s)   iterations
 4194304    1048576      22090      25028      29469      29469              0.155           10
```
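For scale, assuming the reported bandwidth is roughly message size divided by per-iteration latency, this is around 1% of what a single 100G port can carry:

$$
\frac{4194304\ \text{B}}{25028\ \mu\text{s}} \approx 0.156\ \text{GiB/s}
\qquad\text{vs.}\qquad
100\ \text{Gb/s} \approx 12.5\ \text{GB/s per port}
$$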
We think there are largely 3 possible reasons:
1. The algorithm (`AllreduceRing`) is faulty,
2. The way the ibverbs backend uses the libibverbs API is faulty, or
3. Our HW/SW setup is faulty.

We do not think 3 is the case, because with perftest we are able to saturate the bandwidth fairly easily:
```
---------------------------------------------------------------------------------------
username@hostname:~/perftest$ taskset -c 16-31 ib_write_bw -d mlx5_1 -D 20 10.10.10.1 --report_gbits
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF     Device : mlx5_1
Number of qps : 1   Transport type : IB
Connection type : RC   Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0bab PSN 0x1f096e RKey 0x04379b VAddr 0x007f57c5673000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:02
remote address: LID 0000 QPN 0x0cdf PSN 0xf3396e RKey 0x007a02 VAddr 0x007fe65d37d000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:01
---------------------------------------------------------------------------------------
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
Conflicting CPU frequency values detected: 1498.322000 != 3150.070000. CPU Frequency is not max.
65536      1551679        0.00               81.35                 0.155167
```
Concerning 2., we see the following differences from perftest. We are not sure whether they are significant enough to account for the performance difference.
- Gloo uses a single thread that polls the CQ and communicates via condition variables when it sees a completion event.
- Gloo allocates one big MR and one big WR, whereas perftest allocates one big MR but many small WRs (see the sketch below).
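To illustrate the WR difference: perftest keeps a deep pipeline of outstanding work requests (TX depth 128 in the output above), roughly like the sketch below. This is only a hedged sketch of the generic libibverbs pattern, not Gloo's or perftest's actual code; the qp, cq, mr, remote address, and rkey are assumed to have been created and exchanged elsewhere.

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Sketch: keep up to txDepth RDMA_WRITE work requests in flight instead of
// waiting for each one to complete before posting the next.
void writePipelined(ibv_qp* qp, ibv_cq* cq, ibv_mr* mr,
                    uint64_t raddr, uint32_t rkey,
                    size_t msgSize, int iterations, int txDepth = 128) {
  int outstanding = 0;
  int posted = 0;
  int completed = 0;
  while (completed < iterations) {
    // Fill the pipeline up to txDepth outstanding WRs.
    while (posted < iterations && outstanding < txDepth) {
      ibv_sge sge = {};
      sge.addr = reinterpret_cast<uint64_t>(mr->addr);
      sge.length = static_cast<uint32_t>(msgSize);
      sge.lkey = mr->lkey;

      ibv_send_wr wr = {};
      ibv_send_wr* bad = nullptr;
      wr.wr_id = posted;
      wr.sg_list = &sge;
      wr.num_sge = 1;
      wr.opcode = IBV_WR_RDMA_WRITE;
      wr.send_flags = IBV_SEND_SIGNALED;
      wr.wr.rdma.remote_addr = raddr;
      wr.wr.rdma.rkey = rkey;
      if (ibv_post_send(qp, &wr, &bad) != 0) {
        return;  // error handling omitted in this sketch
      }
      ++posted;
      ++outstanding;
    }
    // Reap completions; each one frees a slot in the pipeline.
    ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);
    if (n > 0) {
      completed += n;    // status checks omitted in this sketch
      outstanding -= n;
    }
  }
}
```

With only one signaled WR in flight at a time, every message pays the full post/poll round trip before the next one starts, whereas a deep pipeline keeps the QP busy and hides that latency.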
Could you help provide some leads/insight into debugging and understanding this issue?
Thank you.