Replies: 1 comment
Here is my answer to the ML:

Thank you for the insights. How about taking a look at #1570 to enhance the LB algorithms? I personally use `leastconn` almost all the time, because I like the idea of balancing between all the backend servers. But my use case is quite simple and I normally have only 2 or 3 servers. Since I always refer to the excellent blog post "Test driving 'power of two random choices' load balancing", maybe re-running that test with the current HAProxy version could be interesting, in case something has changed? I'm referring to the block at the end.

From my point of view, how about creating an alias like "power-of-two" for the algorithm setting, so that people out there get the idea why `random(2)` isn't that bad in HAProxy's case? Jm2c.
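For readers unfamiliar with the current syntax: there is no "power-of-two" alias today, so the behavior is selected with `balance random(2)`. A minimal backend sketch (backend name and server addresses are made up):

```
backend be_app
    balance random(2)
    server s1 192.0.2.11:8080 check
    server s2 192.0.2.12:8080 check
    server s3 192.0.2.13:8080 check
```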
Problem statement
In 3.2, a lot of work was done to reduce the locking cost in the LB algorithms, making `roundrobin` and `leastconn` way more performant than they were before, particularly on machines with many cores and multiple core complexes like EPYC CPUs, where `leastconn` used to be really bad in the past.
However, this work raised a question that's still valid: why do we still keep `roundrobin` as the default algorithm, except for historical reasons? In many of our tests we're using `random(2)`, aka `random`, because it's much better: it's not purely random, it also considers the server's load (it picks the server with the fewest outstanding requests between two random ones, thus stepping a little bit on `leastconn`'s feet). This is something that `roundrobin` cannot consider, by definition.

I've run a few tests to compare `roundrobin`, `leastconn` and `random`. The tests were run on an AMD EPYC 64-core/128-thread system, connected to one load generator over 25 Gbps and to 4 servers (25 Gbps each). The tests consisted in retrieving a 1 kB object from the servers. The request rate in requests/s, the bandwidth in bits/s, and the CPU usage of the LB algorithm / locking are reported, along with each algorithm's performance ratio to `roundrobin` (measured in req/s).

First test, using 2 thread groups (64 threads each). It can correspond to systems started with just `thread-groups 2` without other NUMA-specific directives, and may illustrate to some extent how it would behave on unified-cache CPUs (e.g. Intel), with the limitation that locking costs are generally lower on such systems:

We can see that in this case the locking of the RR algo was still quite dominant, resulting in lower overall performance.
Second test, enabling `cpu-policy performance`, which starts 8 thread groups of 16 threads each:

Here the LB algo has the lowest locking cost, thanks to the fact that there are in fact 8 independent round-robin schedulers running.
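The two setups above correspond roughly to global sections like the following (a sketch: `thread-groups` and `cpu-policy performance` are the directives quoted above, everything else is left at defaults):

```
# First test's setup (sketch):
global
    thread-groups 2

# Second test's setup (sketch):
global
    cpu-policy performance   # derives 8 groups of 16 threads from the EPYC topology
```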
Third test, with one of the 4 servers' connections degraded from 25 Gbps to 1 Gbps:
This last test, even if cheated by degrading one server's connection, perfectly illustrates what happens when a server's processing capacity is not stable over time. It might be taking varying loads, or dealing with unequal requests, or simply experiencing bursts of garbage-collector activity that slow it down for a few hundred milliseconds every few seconds. `roundrobin` isn't optimal in this case, as it will methodically keep assigning requests to such servers. `leastconn` is optimal in dealing with such cases, but it's more suited to long-lived connections and will not fairly distribute short-request traffic among servers weighted differently. For example, if some servers have a `slowstart` parameter, it will progressively raise their weight when they're added, but this only affects their ability to accept new concurrent requests/connections, not directly their share of the traffic.

And here `random` excels. By default it makes two draws (the number of draws is the argument and defaults to 2) and compares the servers' current number of connections to pick the least loaded one. This way, if a server experiences a temporary slowdown, it will keep its current requests longer and have a higher count than the others, making it less eligible for new requests under the random algorithm.

Proposal
Currently, the default load balancing algorithm is `roundrobin` when not specified. The proposal to improve the situation would be, in 3.3, to simply switch to `random` when not specified; the point being that if users do not specify the algorithm they want to use, they're fine with the one chosen for them by the product's designers. This one gives a better distribution on irregular farms and uses less CPU across varying hardware.

Any objection / support?
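Should the default change, anyone preferring the historical behavior could simply pin it explicitly; a minimal sketch (backend name and address are made up):

```
backend be_legacy
    balance roundrobin
    server s1 192.0.2.21:8080 check
```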
Note: also posted here on the mailing list.
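For readers curious about the mechanics, the two-draw selection described above can be sketched in a few lines of Python. This is an illustration of the power-of-two-choices idea, not HAProxy's actual C implementation; the connection bookkeeping is invented for the example:

```python
import random

def pick_p2c(conns, draws=2):
    """Power-of-two-choices: sample `draws` distinct server indices at
    random and return the one with the fewest outstanding connections."""
    candidates = random.sample(range(len(conns)), k=min(draws, len(conns)))
    return min(candidates, key=lambda i: conns[i])

# A server that is backed up (here index 3, with 70 outstanding requests)
# loses every draw it takes part in, so it naturally receives no new
# traffic until its count comes back down.
conns = [3, 9, 4, 70]
```

With a single draw this degenerates into plain random selection, and raising the number of draws makes it behave more and more like `leastconn`, which matches how the argument to `random(<draws>)` is described.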