Replies: 1 comment
Very cool! I agree that some way to access homogeneous C buffers from Pallene would be nice. LuaJIT does this as part of its FFI, so maybe we should also put it in the (yet to be written) FFI for Pallene?
-
Hi, I just wanted to share the results of another small Pallene benchmark I ran, which turned out very positive.
For reference I was implementing the algorithm/equation found here, which computes something called a "Money Flow Index" at a specific point on a stock:
https://tulipindicators.org/mfi
In my version, I am calculating the MFI for all candle bars in a stock chart history, which means I put this inside a for-loop.
The slowest part of the formula is the summation. Essentially, this is a summation of values in an array within a small sliding window (the default window size is 14, but it is a variable parameter).
So this is basically a smaller nested for-loop inside the main for-loop, i.e. O(N*W) for N bars and window size W, which approaches O(N^2) as the window grows toward N.
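For anyone who wants to see the shape of the loop, here is a minimal C sketch of the nested version. It is not my Pallene code: the function names, the array layout, and the choice of returning 50 when the window has no money flow at all are assumptions made for illustration. It follows the usual MFI definition (typical price times volume, split into up and down flows, then 100 * up / (up + down)).

```c
#include <assert.h>
#include <stddef.h>

/* Typical price of one bar: the mean of high, low and close. */
static double typical_price(double high, double low, double close) {
    return (high + low + close) / 3.0;
}

/*
 * Naive Money Flow Index over a whole price history (hypothetical helper).
 * out[i] is written for every i >= window; earlier slots are left untouched
 * because they do not have enough history yet.
 * The inner loop re-sums the whole window for every bar, so the total cost
 * is O(n * window).
 */
static void mfi_naive(const double *high, const double *low,
                      const double *close, const double *volume,
                      size_t n, size_t window, double *out) {
    assert(window > 0);
    for (size_t i = window; i < n; i++) {
        double up = 0.0, down = 0.0;
        for (size_t j = i - window + 1; j <= i; j++) {
            double tp = typical_price(high[j], low[j], close[j]);
            double prev = typical_price(high[j - 1], low[j - 1], close[j - 1]);
            double flow = tp * volume[j];
            if (tp > prev) {
                up += flow;        /* price rose: positive money flow */
            } else if (tp < prev) {
                down += flow;      /* price fell: negative money flow */
            }
        }
        double total = up + down;
        /* Assumption: report 50 when there is no flow in the window. */
        out[i] = total > 0.0 ? 100.0 * up / total : 50.0;
    }
}
```

The inner loop over `j` is the sliding-window summation described above, and it is the part that dominates the run time.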
So in a quick benchmark test I whipped up using os.clock():
In C-land this seems like a terrific candidate for loop unrolling and vectorization, so I tried enabling those flags. -O3 by itself wasn't sufficient to change anything, so my flags were:
So, about another 6% decrease.
I separated out the flags and found that the vectorizer did nothing, and it was the -funroll-loops that had the impact.
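To illustrate what unrolling buys on a loop like this, here is a hand-written C sketch. It is only an approximation of what the compiler does, and the function names are made up for this example. The unrolled version keeps the additions in the same order and just pays for the loop counter and branch once per four elements.

```c
#include <stddef.h>

/* Straightforward reduction: one add and one loop test per element. */
static double window_sum(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/*
 * The same reduction unrolled by 4 by hand. The additions happen in the
 * original order, so the result is bit-for-bit identical; only the loop
 * overhead is reduced.
 */
static double window_sum_unrolled4(const double *x, size_t n) {
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += x[i];
        s += x[i + 1];
        s += x[i + 2];
        s += x[i + 3];
    }
    for (; i < n; i++) {   /* leftover elements when n is not a multiple of 4 */
        s += x[i];
    }
    return s;
}
```

One possible reason the vectorizer did nothing here (a guess, I have not confirmed it against the generated code): a SIMD sum of doubles adds the elements in a different order, and GCC and Clang do not reorder floating-point additions by default, only when reassociation is explicitly allowed (for example via -ffast-math). The Lua table accesses in the generated code probably get in the way as well.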
Also, FYI, these verbose flags show all the missed optimizations the optimizer couldn't handle. Lots of Pallene/Lua C functions are highlighted as "clobbering memory". Probably all expected.
Anyway, I thought this was overall very positive for Pallene, and I thought you would enjoy seeing it.
But this does make me wonder if Pallene in the future could detect cases like this where it might be a good candidate for SIMD/vectorization, and generate better code that could help the autovectorizer do its thing.
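As a rough idea of what "autovectorizer-friendly" code looks like, here is a C sketch of the elementwise part of the MFI (typical price times volume) over plain contiguous arrays of doubles. This is hand-written C, not something Pallene emits today; the function name and the separate-array layout are assumptions. As far as I understand, Lua arrays hold tagged values, so generated code has to check tags on each read, which is a large part of why loops over them are harder to vectorize.

```c
#include <stddef.h>

/*
 * Raw money flow for every bar: typical price times volume.
 * Each iteration is independent and the restrict qualifiers promise the
 * compiler that the buffers do not overlap, which is the kind of loop
 * autovectorizers generally handle well.
 */
static void raw_money_flow(const double *restrict high,
                           const double *restrict low,
                           const double *restrict close,
                           const double *restrict volume,
                           double *restrict out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (high[i] + low[i] + close[i]) / 3.0 * volume[i];
    }
}
```

With the flows precomputed like this, the windowed sums only have to add up values that are already sitting in a flat buffer.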
Also, on a semi-related tangent, my program has to load lots of stock history datasets, and that loading has become a bottleneck. I started playing with serialized binary formats, such as Google Protobufs. While my performance is better, I know that it still has to use the standard Lua C API to push each element into Lua. It got me wondering if Pallene could offer a more efficient C API to:
These are not nearly my most pressing concerns, but I thought I would stir the pot of ideas before moving on.
Thanks