This is not really an issue, more of an FYI and a question whether someone else has looked into these things. It could maybe fit on the forum, but I feel it's more visible to put it here.
Anyway, just a couple of days ago I added a batch API to Pymunk (a Python 2D physics library built on Chipmunk) to get some data out quicker, since it's quite expensive to call C code from Python. In my simple test case I was mainly bottlenecked by Chipmunk performance and not by Python, as is usually the case. So I started to look into how I could increase the performance of Chipmunk (on desktop), if possible. This is a short report so far:
I did all tests on Windows 11 in WSL (Ubuntu) on my ThinkPad X1 gen 7 laptop with an i5-8265U CPU (~Skylake). To compare performance I used bench.c, but shortened it to 1000 steps.
I tried to reorder the structs:
- More or less half the time is spent in `cpArbiterApplyImpulse` if I run the demo bench.c through `perf record`.
- Using `pahole` I could see that both the `cpArbiter` struct and the `cpBody` struct are not cache line aligned for how they are used in the apply impulse function.
- As a quick and easy test of whether making it more cache friendly could help, I reordered the struct fields to put all the fields used in `cpArbiterApplyImpulse` first in those two structs.
- This resulted in a 4.5% saving of total time on the benchmarks (averaged over 6 runs)!
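To make the reordering idea concrete, here is a minimal sketch. The two structs are hypothetical stand-ins, not the real `cpBody` layout, and the split into "hot" and "cold" fields is an assumption made for illustration. It only shows how grouping the fields used in the inner loop shrinks the number of cache lines they span, assuming 64-byte cache lines and a struct that starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size (typical for x86-64) */

/* Hypothetical body struct: fields the solver reads or writes ("hot")
   are interleaved with fields it never touches ("cold"). */
typedef struct {
    double m;         /* cold */
    double m_inv;     /* hot  */
    double i;         /* cold */
    double i_inv;     /* hot  */
    double cog[2];    /* cold */
    double p[2];      /* cold */
    double v[2];      /* hot  */
    double f[2];      /* cold */
    double a;         /* cold */
    double w;         /* hot  */
    double t;         /* cold */
    double v_bias[2]; /* hot  */
    double w_bias;    /* hot  */
} BodyScattered;

/* Same fields, hot ones first: 8 doubles = 64 bytes = one cache line. */
typedef struct {
    double m_inv, i_inv;
    double v[2];
    double w;
    double v_bias[2];
    double w_bias;
    double m, i, cog[2], p[2], f[2], a, t; /* cold tail */
} BodyGrouped;

/* Number of cache lines spanned by the byte range [first, end),
   assuming the struct itself starts on a cache line boundary. */
size_t lines_spanned(size_t first, size_t end) {
    assert(end > first);
    return (end - 1) / CACHE_LINE - first / CACHE_LINE + 1;
}

/* Span from the first hot byte to the last hot byte in each layout. */
size_t hot_lines_scattered(void) {
    return lines_spanned(offsetof(BodyScattered, m_inv),
                         offsetof(BodyScattered, w_bias) + sizeof(double));
}

size_t hot_lines_grouped(void) {
    return lines_spanned(offsetof(BodyGrouped, m_inv),
                         offsetof(BodyGrouped, w_bias) + sizeof(double));
}
```

Both structs have the same size; only the order differs. The actual offsets in a real build can be checked with `pahole`, and the benefit also depends on how the allocator aligns each struct.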
I also tried to compile with `-march=skylake`. Not sure how I would use this in a real case with Pymunk, but worth testing at least. It saved another 5% of the remaining time, for a total saving of about 9%.
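As a small aid for this kind of experiment, the sketch below reports whether a build actually had the wider instruction sets enabled. It relies on the `__AVX2__` and `__FMA__` macros that GCC and Clang predefine when those extensions are turned on (which `-march=skylake` does); the function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if this translation unit was compiled with AVX2 enabled,
   for example via -mavx2 or -march=skylake, and 0 otherwise. */
int built_with_avx2(void) {
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}

/* Same check for fused multiply-add instructions. */
int built_with_fma(void) {
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

/* Prints the compile-time feature flags; returns how many are enabled. */
int print_build_features(void) {
    int avx2 = built_with_avx2();
    int fma = built_with_fma();
    assert((avx2 == 0 || avx2 == 1) && (fma == 0 || fma == 1));
    printf("AVX2: %d, FMA: %d\n", avx2, fma);
    return avx2 + fma;
}
```

A binary built this way can crash with an illegal instruction on CPUs that lack these extensions, which is the main obstacle to shipping such a build in a general-purpose package.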
These two things were the easiest I could think of to try (after I researched a bit how easy SIMD for x86 would be).
Some other things I thought about, to put all the data needed closer together in memory (whether they help or not I do not know yet):
- Inline the cpContact struct into cpArbiter
- Separate arbiters with 1 and 2 contacts
- Read out body fields and put into arbiter on collision, and then use them instead of going to body in apply impulse function.
- Collect the resulting velocities in a separate array that is written back to bodies afterwards
- Reorder things in `cpArbiterApplyImpulse`
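One of the ideas above, collecting the velocities in a separate array and writing them back afterwards, could look roughly like the sketch below. All types and function names here are hypothetical and not part of Chipmunk; `Body` is just a stand-in for a large struct where the velocity fields sit far apart. The impulse update is the standard rigid-body formula (linear velocity changes by `j * m_inv`, angular velocity by `i_inv * cross(r, j)`).

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double x, y; } Vec2;

/* Hypothetical stand-in for a large body struct; only v and w matter here. */
typedef struct {
    double other_before[8];
    Vec2   v;               /* linear velocity */
    double other_between[8];
    double w;               /* angular velocity */
} Body;

/* Compact per-body record used only during the solver iterations. */
typedef struct { Vec2 v; double w; } SolverVel;

/* Copy velocities out of the bodies into a dense array before solving. */
void gather_velocities(const Body *bodies, SolverVel *vel, size_t n) {
    assert(bodies != NULL && vel != NULL);
    for (size_t k = 0; k < n; k++) {
        vel[k].v = bodies[k].v;
        vel[k].w = bodies[k].w;
    }
}

/* Apply impulse j at offset r from the center of gravity,
   touching only the dense velocity record. */
void apply_impulse(SolverVel *vel, double m_inv, double i_inv, Vec2 r, Vec2 j) {
    vel->v.x += j.x * m_inv;
    vel->v.y += j.y * m_inv;
    vel->w   += i_inv * (r.x * j.y - r.y * j.x); /* 2D cross product */
}

/* Write the solved velocities back to the bodies once, after all iterations. */
void scatter_velocities(Body *bodies, const SolverVel *vel, size_t n) {
    assert(bodies != NULL && vel != NULL);
    for (size_t k = 0; k < n; k++) {
        bodies[k].v = vel[k].v;
        bodies[k].w = vel[k].w;
    }
}
```

In this layout an arbiter would refer to its two bodies by index into the dense array rather than by pointer. Whether the extra gather and scatter passes pay off depends on the number of solver iterations, so it would need to be measured.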
I should note that I had 0 experience of optimizing C code before this. Actually I have almost 0 experience writing C code at all.
Any input is welcome!