Try to write code with fewer branches? #126
Replies: 13 comments
-
Wouldn't that miss out on short-circuiting?
-
Indeed, that's the whole point -- if you have a condition where short-circuiting is not essential, you may be better off combining the two flags with a single non-short-circuiting operation. Of course this isn't an option if evaluating the second condition isn't always safe.
-
Also, the branch prediction table is a finite-size cache, right? So more branches mean more evictions.
-
Kind of, though there also seem to be ways to separately detect loops and "unimportant" branches. How these work is all very complicated... (Try reading about it in this paper: https://www.agner.org/optimize/microarchitecture.pdf)
-
For anyone curious what this looks like in practice:
-
My takeaway from that video is that the compiler is smarter than us 😢 ...and it seems that branchless is sometimes slower: when I reduced branches in one place, it wasn't a clear win. We could also consider using the likely/unlikely hints, but we will need very careful profiling first -- e.g. only where I know that 90% of the time a condition goes one way. Note: these don't emit hints to the CPU; they hint the compiler to rearrange the branches to aid branch prediction for us.
-
Some clarification: Clang outright ignores likely/unlikely during PGO, as it already collects branch weights. GCC seems to combine the hints with PGO data (no clue if that information is still up to date). MSVC also seems to collect branch weights during PGO. It seems that good PGO training is ever more important, linking back to #99.
-
Interesting. He leaves out Itanium, where the compiler is expected to use predication instead of branching for small code blocks. The predicate value needs to be known only at the end of the pipeline, before the instruction commits its effects; if it's false there's nothing to flush -- the instruction simply doesn't commit. https://www.cs.nmsu.edu/~rvinyard/itanium/predication.htm
-
The textbook I'm reading indicates that Itanium was a multi-billion-dollar mistake. Probably in part because it was relying too much on the compiler. |
-
Hmm, interesting. That was the whole idea - why make the chip do at runtime things that the compiler can do? |
-
The cost of branching is almost entirely down to mispredictions. A correctly predicted branch, when the comparison involves values in registers or small constants, has basically zero cost. However, if the branch is not predictable, then branchless is likely to be a win. As Guido says, one way to remove a branch is to combine two of them; all the major compilers will do this for you if they can. The code

```
if cond:
    A
else:
    B
```

would normally be laid out as:

```
    test cond
    branch-if-false to ELSE
    A
    jump to END
ELSE:
    B
END:
```
-
The two are connected. At least on x86 the convention is that, absent any other information, forward branches are predicted non-taken and backwards branches are taken. Further, if a condition really is likely then the compiler can shift the else block to a different page (improving effective code density).
Linus Torvalds has a good rant on CMOV and friends and why branchless code is often not all it's cracked up to be on modern CPUs. Modern branch predictors are extremely good, even for interpreters such as Python. See "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore", with Table 3 being of particular interest. From the paper, we observe that the MPKI rate (Mispredictions Per 1000 Instructions) for the Python 3 interpreter has dropped by the best part of an order of magnitude due to improvements in CPU branch predictors, to the point where you're looking at ~50 cycles of branch misprediction stall per 1,000 instructions.

In fact there are good arguments for more branchy code. A good example of this is devirtualization where, from PGO data, compilers will see if a function pointer call tends to go a certain way and, if so, rewrite the code to explicitly check the function pointer against a set of known targets so that they can be called directly rather than through a pointer.

What matters nowadays is cache and data layout. CPUs have the ILP to handle more work per cycle and are good at spotting trends. What they don't like is pointer chasing.
-
The more I read about processor architecture the more I realize that branch prediction is fickle and the penalty for a failed prediction is high (10-20 clock cycles). Even when branch prediction is effective, having too many branches close together can slow things down (I read something about branch prediction being slower on some i7 version if there is more than 1 branch per 16 bytes of instructions).
So perhaps it would behoove us to use idioms that combine multiple (easily-tested) conditions in one word before branching. Something like `a | b` (where it works, say) might actually be faster than `a || b`.
Then again, maybe the compiler (optimizer) also knows this and will do it for us?