Trying to understand why IBVerbs transport performs badly #436

@jinsun-yoo

Hi, we have a CPU-only setup and are trying to build a custom program that uses Gloo's point-to-point API (e.g. createSendBuffer, send) with the ibverbs transport. However, we are unable to saturate the NIC bandwidth. Specifically:

HW setup: 1 host with 2 CX-5 NICs, 2 ports each (100G per port, 4 ports in total). All ports are connected through a single Tofino switch. We run 4 identical processes, each mapped to one port. This setup is intended to mock 4 hosts with 1 NIC each.

As part of debugging this issue, we ran gloo-benchmark and saw the same problem: it could not saturate the NIC bandwidth either.

username@hostname:(path)/gloo$ ./build/gloo/benchmark/benchmark --size 4 --rank 0 --redis-host 127.0.0.1 --redis-port 6379 --prefix 000 --transport ibverbs --ib-device mlx5_0 allreduce_ring --iteration-count 10 --warmup-iters 5 --no-verify --elements 1048576
My addr for i: 1 is LID: 0 QPN: 3320 PSN: 9127271
My addr for i: 2 is LID: 0 QPN: 3321 PSN: 8070086
My addr for i: 3 is LID: 0 QPN: 3322 PSN: 3971177
====================================================================================================
                                          ALLREDUCE_RING

Device:      ibverbs, pci=0000:41:00.0, dev=mlx5_0, port=1, index=0
Options:     processes=4, inputs=1, threads=1, verify=false

====================================================================================================
                                        BENCHMARK RESULTS

   size (B)   elements   min (us)   p50 (us)   p99 (us)   max (us)   bandwidth (GB/s)   iterations
    4194304    1048576      22090      25028      29469      29469              0.155           10
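
As a rough sanity check on the row above, the sketch below compares the measured p50 latency with the time it would take just to put one 4 MiB buffer on a 100G link. The constants are copied from the benchmark output and the HW setup; the helper name `wire_time_us` is made up for illustration, and the calculation ignores protocol overhead and the fact that an allreduce moves more than one buffer's worth of data.

```python
SIZE_BYTES = 4_194_304   # "size (B)" column of the benchmark output
P50_US = 25_028          # "p50 (us)" column of the benchmark output
LINK_GBIT_PER_S = 100    # per-port line rate from the HW setup


def wire_time_us(size_bytes, link_gbit_per_s):
    """Time to serialize size_bytes onto the link at line rate, in microseconds."""
    bits = size_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds * 1e6


line_rate_us = wire_time_us(SIZE_BYTES, LINK_GBIT_PER_S)
slowdown = P50_US / line_rate_us

print(f"wire time for one buffer at line rate: {line_rate_us:.1f} us")
print(f"measured p50 is {slowdown:.1f}x that")
```

Under these assumptions one buffer needs about 336 us on the wire, so the measured p50 is roughly 75 times longer than a single line-rate transfer of the same size.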

We see three broad possibilities: 1. the algorithm (AllreduceRing) is at fault, 2. the way the ibverbs backend uses the libibverbs API is at fault, or 3. our HW/SW setup is at fault.
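
For possibility 1, one thing that can be checked on paper is how many bytes each rank has to send per allreduce. The sketch below covers two textbook ring schedules: a chunked ring (reduce-scatter followed by allgather, 2(N-1)/N buffers per rank) and a naive ring that forwards the whole buffer N-1 times. Which of these `allreduce_ring` actually implements is an assumption here and should be confirmed in the Gloo source; the function names are illustrative only.

```python
def ring_bytes_sent_per_rank(size_bytes, ranks, chunked):
    """Bytes one rank sends in a single ring allreduce.

    chunked=True:  reduce-scatter + allgather, 2 * (N - 1) chunks of size/N.
    chunked=False: the whole buffer is forwarded N - 1 times.
    """
    if chunked:
        return 2 * (ranks - 1) * size_bytes // ranks
    return (ranks - 1) * size_bytes


def implied_gbit_per_s(bytes_sent, elapsed_us):
    """Average per-rank send rate if bytes_sent left the NIC in elapsed_us."""
    return bytes_sent * 8 / (elapsed_us * 1e-6) / 1e9


SIZE_BYTES = 4_194_304  # from the benchmark output
RANKS = 4               # processes=4
P50_US = 25_028         # p50 latency from the benchmark output

for chunked in (True, False):
    sent = ring_bytes_sent_per_rank(SIZE_BYTES, RANKS, chunked)
    rate = implied_gbit_per_s(sent, P50_US)
    label = "chunked ring" if chunked else "full-buffer ring"
    print(f"{label}: {sent} bytes sent per rank, about {rate:.2f} Gb/s on the wire")
```

Under either schedule the implied per-rank wire rate stays in the low single-digit Gb/s range, so the choice of ring schedule alone does not account for the distance from 100 Gb/s.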

We do not think 3 is the case, because perftest comes much closer to saturating the link (81.35 Gb/s average on a 100G port in the run below).

---------------------------------------------------------------------------------------
username@hostname:~/perftest$ taskset -c 16-31 ib_write_bw -d mlx5_1  -D 20 10.10.10.1  --report_gbits
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0bab PSN 0x1f096e RKey 0x04379b VAddr 0x007f57c5673000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:02
 remote address: LID 0000 QPN 0x0cdf PSN 0xf3396e RKey 0x007a02 VAddr 0x007fe65d37d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:01
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1498.322000 != 3150.070000. CPU Frequency is not max.
 65536      1551679          0.00               81.35              0.155167
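
To put the two tools on the same scale, the sketch below first checks that perftest's message rate and message size reproduce its reported average bandwidth, then converts the gloo-benchmark figure to Gb/s. It assumes the benchmark's "GB/s" means 10^9 bytes per second; if it means GiB/s the result shifts by about 7%, which does not change the picture.

```python
MSG_BYTES = 65_536              # "#bytes" column of the perftest output
MSG_RATE_MPPS = 0.155167        # "MsgRate[Mpps]" column
PERFTEST_AVG_GBIT = 81.35       # "BW average[Gb/sec]" column
GLOO_REPORTED_GB_PER_S = 0.155  # "bandwidth (GB/s)" from gloo-benchmark

# Bandwidth implied by message size times message rate, in Gb/s.
derived_gbit = MSG_BYTES * 8 * MSG_RATE_MPPS * 1e6 / 1e9

# gloo-benchmark figure in Gb/s (assumes GB = 10**9 bytes).
gloo_gbit = GLOO_REPORTED_GB_PER_S * 8

gap = PERFTEST_AVG_GBIT / gloo_gbit

print(f"perftest derived bandwidth: {derived_gbit:.2f} Gb/s")
print(f"gloo-benchmark reported:    {gloo_gbit:.2f} Gb/s")
print(f"gap: about {gap:.0f}x")
```

The two perftest columns agree with each other, and the gloo-benchmark figure comes out around 1.24 Gb/s, roughly 66 times lower than the perftest average.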

Regarding 2, we see the following differences between Gloo and perftest. We are not sure whether they are significant enough to explain the performance gap.

  • Gloo uses a single thread that polls the CQ and signals waiters through condition variables when it sees a completion event.
  • Gloo allocates one big MR and posts one big WR, whereas perftest allocates one big MR but posts many small WRs.
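
One way to get a feel for the first difference is to time a condition-variable handoff in isolation. The sketch below is a toy model, not Gloo's code: a "poller" thread posts fake completions and a waiting thread is woken through a condition variable, with the delay between post and wake-up recorded. All class and function names are hypothetical. Python's threading overhead differs greatly from C++'s, so the absolute numbers are not meaningful for Gloo; the same structure would need to be ported to `std::condition_variable` for a realistic measurement.

```python
import threading
import time
from collections import deque


class CompletionMailbox:
    """Toy stand-in for a CQ-polling thread handing completions to waiters."""

    def __init__(self):
        self._cond = threading.Condition()
        self._completed = deque()

    def post(self, wr_id):
        # Called by the polling thread when it "sees" a completion.
        with self._cond:
            self._completed.append((wr_id, time.perf_counter_ns()))
            self._cond.notify()

    def wait(self):
        # Called by the thread blocked waiting for a send/recv to finish.
        with self._cond:
            while not self._completed:
                self._cond.wait()
            wr_id, posted_ns = self._completed.popleft()
        return wr_id, time.perf_counter_ns() - posted_ns


def measure_handoff(count=1000):
    """Return (wr_id, delay_ns) pairs for `count` one-at-a-time handoffs."""
    box = CompletionMailbox()
    credit = threading.Semaphore(1)  # allow one completion in flight at a time

    def poller():
        for wr_id in range(count):
            credit.acquire()
            box.post(wr_id)

    thread = threading.Thread(target=poller)
    thread.start()
    samples = []
    for _ in range(count):
        samples.append(box.wait())
        credit.release()
    thread.join()
    return samples


if __name__ == "__main__":
    delays_ns = sorted(delay for _, delay in measure_handoff())
    median_us = delays_ns[len(delays_ns) // 2] / 1000
    print(f"median handoff delay: {median_us:.1f} us")
```

If a C++ version of this measurement shows wake-up delays that are small compared with the per-iteration latency in the benchmark, the condition-variable handoff is unlikely to be the main cause on its own.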

Could you offer any leads or insight for debugging and understanding this issue?
Thank you.
