Perf improvements #237

@viblo

This is not really an issue, more of an FYI and a question about whether anyone else has looked into these things. It might fit better on the forum, but I feel it's more visible here.

Anyway, just a couple of days ago I added a batch API to Pymunk (a Python 2D physics library built on Chipmunk) to get some data out quicker, since it's quite expensive to call C code from Python. In my simple test case I was mainly bottlenecked by Chipmunk performance and not by Python, as is the usual case. So I started to look into how I could increase the performance of Chipmunk (on desktop), if possible. This is a short report so far:

I did all tests on Windows 11 in WSL (Ubuntu) on my ThinkPad X1 gen 7 laptop with an i5-8265U CPU (~Skylake). To compare performance I used bench.c, but shortened it to 1000 steps.

I tried to reorder the structs:

  1. More or less half the time is spent in cpArbiterApplyImpulse when I run the bench.c demo through perf record.
  2. Using pahole I could see that neither the cpArbiter struct nor the cpBody struct is laid out along cache lines for how they are used in the apply impulse function.
  3. As a quick and easy test of whether a more cache-friendly layout could help, I reordered the struct fields to put all the fields used in cpArbiterApplyImpulse first in those two structs.
  4. This resulted in a 4.5% saving of total time on the benchmarks (averaged over 6 runs)!
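
To illustrate the reordering idea, here is a minimal sketch in C. The two structs are hypothetical stand-ins, not the real cpBody definition, and the 64-byte cache line size and the choice of "hot" fields are assumptions. It counts how many cache lines the hot fields touch in each layout, assuming the struct itself starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines, typical for x86 */

typedef struct { double x, y; } vec2;

/* Hypothetical layout: hot fields scattered among cold ones. */
typedef struct body_scattered {
    void *user_data;  /* cold */
    double mass;      /* cold */
    vec2 v;           /* hot */
    char cold_a[48];  /* stands in for rarely used fields */
    double m_inv;     /* hot */
    double i_inv;     /* hot */
    double w;         /* hot */
    char cold_b[48];  /* more rarely used fields */
    vec2 v_bias;      /* hot */
    double w_bias;    /* hot */
} body_scattered;

/* Same fields, with everything the hot loop reads placed first. */
typedef struct body_hot_first {
    vec2 v;
    double w;
    double m_inv;
    double i_inv;
    vec2 v_bias;
    double w_bias;    /* the hot fields end here, 64 bytes in */
    void *user_data;
    double mass;
    char cold_a[48];
    char cold_b[48];
} body_hot_first;

typedef struct { size_t offset, size; } field_span;

#define SPAN(type, field) { offsetof(type, field), sizeof(((type *)0)->field) }

/* Count the distinct cache lines covered by the given fields. */
static size_t lines_touched(const field_span *f, size_t n) {
    unsigned long long mask = 0;
    for (size_t i = 0; i < n; i++) {
        assert(f[i].size > 0);
        size_t first = f[i].offset / CACHE_LINE;
        size_t last = (f[i].offset + f[i].size - 1) / CACHE_LINE;
        assert(last < 64); /* one bit per line in the mask */
        for (size_t l = first; l <= last; l++) mask |= 1ULL << l;
    }
    size_t count = 0;
    for (; mask; mask &= mask - 1) count++; /* count set bits */
    return count;
}

static size_t hot_lines_scattered(void) {
    const field_span hot[] = {
        SPAN(body_scattered, v), SPAN(body_scattered, m_inv),
        SPAN(body_scattered, i_inv), SPAN(body_scattered, w),
        SPAN(body_scattered, v_bias), SPAN(body_scattered, w_bias),
    };
    return lines_touched(hot, sizeof hot / sizeof hot[0]);
}

static size_t hot_lines_hot_first(void) {
    const field_span hot[] = {
        SPAN(body_hot_first, v), SPAN(body_hot_first, m_inv),
        SPAN(body_hot_first, i_inv), SPAN(body_hot_first, w),
        SPAN(body_hot_first, v_bias), SPAN(body_hot_first, w_bias),
    };
    return lines_touched(hot, sizeof hot / sizeof hot[0]);
}
```

Calling hot_lines_scattered() and hot_lines_hot_first() from a small main gives the two counts. In the hot-first layout the six hot fields fill exactly the first 64 bytes, so they fit in a single line, while the scattered layout spreads them over several.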

I also tried compiling with -march=skylake. Not sure how I would use this in a real case with Pymunk, but worth testing at least. It saved another 5% of the remaining time, for a total saving of about 9%.
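
One possible way to get part of this benefit in a distributed binary, where a global -march=skylake is not an option, is runtime dispatch: compile a hot function twice and pick a version based on the CPU at startup. The sketch below is an assumption about how that could look, not something Chipmunk or Pymunk does. It relies on the GCC/Clang extensions `__attribute__((target("avx2")))` and `__builtin_cpu_supports`, and the axpy kernel and all function names are hypothetical placeholders for a real hot loop.

```c
#include <assert.h>
#include <stddef.h>

/* Dispatch is only attempted on x86 with a GCC-compatible compiler. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

typedef void (*axpy_fn)(double *y, const double *x, double a, size_t n);

/* Baseline version, compiled for whatever the global flags allow. */
static void axpy_generic(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

#if HAVE_X86_DISPATCH
/* Same source, but the compiler may use AVX2 inside this one function. */
__attribute__((target("avx2")))
static void axpy_avx2(double *y, const double *x, double a, size_t n) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}
#endif

/* Choose an implementation based on what the running CPU supports. */
static axpy_fn pick_axpy(void) {
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) return axpy_avx2;
#endif
    return axpy_generic;
}

/* Public entry point: y[i] += a * x[i] for i in [0, n). */
static void axpy(double *y, const double *x, double a, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
    pick_axpy()(y, x, a, n);
}
```

A real library would likely call pick_axpy() once and cache the pointer instead of checking on every call. Whether the per-function target gives the same gain as a global -march=skylake would need to be measured.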

These two things were the easiest I could think of to try (after I researched a bit how easy SIMD for x86 would be).

Some other things I thought about to put all the data needed closer together in memory (whether they help or not I do not know yet):

  1. Inline the cpContact struct into cpArbiter
  2. Separate arbiters with 1 and 2 contacts
  3. Read out body fields and put into arbiter on collision, and then use them instead of going to body in apply impulse function.
  4. Collect the resulting velocities in a separate array that is written back to bodies afterwards
  5. Reorder things in cpArbiterApplyImpulse
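
Ideas 3 and 4 above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Chipmunk code: fat_body, hot_body and contact_pair are hypothetical types, rotation is ignored, and the impulse j is fixed per pair, whereas a real solver recomputes it from the relative velocity on every iteration. The point is only the data flow: copy the hot fields into a compact array, iterate on that array, then write the velocities back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { double x, y; } vec2;

/* Hypothetical "fat" body: hot fields separated by rarely used data. */
typedef struct fat_body {
    char cold_before[80];
    vec2 v;       /* linear velocity (hot) */
    char cold_mid[120];
    double m_inv; /* inverse mass (hot) */
} fat_body;

/* Compact copy holding only what the impulse loop needs. */
typedef struct hot_body {
    vec2 v;
    double m_inv;
} hot_body;

/* Hypothetical contact: impulse j is applied to body b and -j to body a. */
typedef struct contact_pair {
    size_t a, b;
    vec2 j;
} contact_pair;

/* Gather hot fields, run the impulse loop on the compact array, scatter back.
   Returns 0 on success, -1 if the temporary array cannot be allocated. */
static int solve_with_hot_copy(fat_body *bodies, size_t n,
                               const contact_pair *pairs, size_t n_pairs,
                               int iterations) {
    if (n == 0) return 0;
    hot_body *hot = malloc(n * sizeof *hot);
    if (hot == NULL) return -1;

    for (size_t i = 0; i < n; i++) { /* gather */
        hot[i].v = bodies[i].v;
        hot[i].m_inv = bodies[i].m_inv;
    }

    for (int it = 0; it < iterations; it++) {
        for (size_t k = 0; k < n_pairs; k++) {
            assert(pairs[k].a < n && pairs[k].b < n);
            hot_body *a = &hot[pairs[k].a];
            hot_body *b = &hot[pairs[k].b];
            vec2 j = pairs[k].j;
            a->v.x -= j.x * a->m_inv; /* a receives -j */
            a->v.y -= j.y * a->m_inv;
            b->v.x += j.x * b->m_inv; /* b receives +j */
            b->v.y += j.y * b->m_inv;
        }
    }

    for (size_t i = 0; i < n; i++) /* scatter velocities back */
        bodies[i].v = hot[i].v;

    free(hot);
    return 0;
}
```

The gather and scatter passes add their own cost, so this can only pay off if the solver runs enough iterations over the compact array to amortize the copies.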

I should note that I had zero experience optimizing C code before this. Actually, I have almost zero experience writing C code at all.

Any input is welcome!
