-
Notifications
You must be signed in to change notification settings - Fork 7.6k
LPSPI optimizations 5/19/25 #90182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LPSPI optimizations 5/19/25 #90182
Conversation
ef4a9a4
to
253d691
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@hakehuang can you help test this one |
@decsny , I meet issue with mimxrt1020_evk, which is OK in mainline code.
then timeout |
As for your screenshot of the SCKPCS and PCSSCK, I'll see if I can find an RT1170 board and reproduce, that delay of 5 us should have been possible I would think. It looks like the algorithm should get a scaler of 239, so I'll see if it actually gets that or not on hardware. |
I'm not sure you're clear on what I'm saying. I did note that you found it "works" if you use a higher The fact that you configured the max frequency of the spi device to be 1 MHz is irrelevant to what I'm saying. But, I saw from your screenshot that the frequency of your SCK was 1 MHz and you said CCR was written as 0x30, so that's how I calculated that your source clock frequency to LPSPI is probably around 48 MHz. Like I said, it's this source clock frequency, not the SCK frequency, that determines how the delays are calculated. So if transfer-delay is set to the default of 0, the driver will configure the absolute minimum delay, which is 2 cycles of the source clock, i.e. 41 ns, so you need triple digits of MS/s to see that pulse, probably at least 100 MS/s according to nyquist theorem. |
f24b628
to
528458d
Compare
@pdgendt thanks for your suggestion, here is the diff, please try when you get the chance. |
528458d
to
fa41b3a
Compare
What I just realized is that I have no idea why even do a looping algorithm at all, it's literally just a single unknown algebra equation, we can figure out the delays in O(1) time ??? updated PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @decsny! The changes do make a lot of sense!
Will test them soon and report the results.
@@ -126,62 +326,55 @@ int spi_mcux_configure(const struct device *dev, const struct spi_config *spi_cf | |||
return ret; | |||
} | |||
|
|||
/* For the purpose of configuring the LPSPI, 8 is the minimum frame size for the hardware */ | |||
word_size = word_size < 8 ? 8 : word_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not changed yet?
fa41b3a
to
9b7bc0b
Compare
@pdgendt I pushed a minor change for you to review based on the comments above, but at this time of the day in the week, I didn't test on hardware yet, I'll do it on monday |
For optimization purpose, remove calls to SDK. Since we know exactly what we want, this results in smaller code size. Also, this code calculates the SCK parameters more efficiently than the SDK driver did it by using a binary (instead of linear) search. Lastly, remove call to LPSPI_Reset in the init call and replace with native driver code, and remove inclusion of SDK header. Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
Since the LPSPI drivers no long use MCUX at all, remove the MCUX branding, to avoid confusion. In the future if an implementation uses the MCUX SDK driver, it should specifically be called by MCUX in the name. Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
There is no need to update the tx context in interrupt instead of directly after the fill, this just makes the code more complex. Also, the spi context header already handled iterating over buffers so we can remove that code too. Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
Fixing some issues with the TX fifo fill logic in two places: First in the normal fill function, it didn't take into account a situation where the TX fifo is already partially filled. This currently doesn't cause a problem because the driver is written in a way that the watermark is always 0 for TDF, but in case the watermark were anything else it would cause a problem. Second, when filling the TX fifo with NOPS in order to clock the rest of the RX in from the bus, the calculation regarding the current TX fifo length was just wrong and was leading to a bug in some cases where there was a subtraction underflow and billions of NOPs were being filled. Also, there could be a problem where a few extra NOPs are put in the TX fifo if we don't count what we already have in the TX fifo, so fixing that as well. Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
The buf_len parameter of lpspi_fill_tx_fifo is supposed to be bytes, so we do not need to convert it. This could cause an issue if the end of the buffer is less bytes than the word size. Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
This stupid errata will not leave me alone, here is another bandaid to deal with an issue where an extra byte was being sent on version 1 LPSPIs due to the algorithm of filling NOPs when only RX is left was not expecting the situation where the LPSPI actually consumed everything from the fifo but is not sending it due to this ridiculous stalling errata. Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
9b7bc0b
to
ccd7222
Compare
just realized the sckdiv algorithm was completely unnecessary as well... |
Great, will try it out tomorrow |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great work!
if (arbitrary_cycle_limit-- < 0) { | ||
LOG_WRN("Failed waiting for TX fifo empty"); | ||
return -EIO; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use k_busy_wait()
here for maybe 1usec per iteration and use the Kconfig to determine total the wait time in usec. This way the behavior will not vary by the processor speed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's supposed to vary by processor speed. It's just a number of attempts to try before quitting. It's simply a basic safety mechanism, which is just about being more safe than sorry, not a unique feature. It shouldn't even be needed. Most likely this consumption by the fifo happens immediately, and this loop isn't necessary. But people wanted to expand the scope of this PR, so here it is. I'm not overcomplicating it any more by making it time -based based on CPU frequency calculations.
Also the specific suggestion to use k_busy_wait is not a good idea. That function is not isr safe. This function is called in isr context. k_busy_wait is implemented by polling kernel time, which can't increment due to this function would be blocking in isr context, causing a likely infinite loop in the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is polling a register waiting for some truth expression with a timeout, WAIT_FOR is the appropriate tool. It uses cycles not ticks while accepting us timeouts. If you see any bugs in this do send fixes.
https://docs.zephyrproject.org/apidoc/latest/group__sys-util.html#ga68eb35df6b4715714312df90209accee
Update to optimize the LPSPI configuration latency and code size, as well as fix a few bugs. This is the revival of #85010 plus a few other things.
Changelist:
Removing SDK calls in common file results:
The measurement of 3.7 in #85010 is obviously still valid, but most of the other stuff from that PR already merged, and the comparison from the theoretical 4.1 to this PR is very close, maybe only 1 or 2 hundred bytes more now due to many of the other PRs that have merged over the last few months. The only thing left was removing this LPSPI_MasterInit call in the configure function, so I will hone in on that, which is also very important to the latency of transfers as well.
All measurements taken on same platform (MCXN947) with same configuration (the one in the spi loopback test). The exact numbers will vary slightly based on platform and configuration (just as before, but should be better in all cases now)
Code size improved overall by almost 2x for configure code from around 2000 to 1300 bytes. Or you could consider it to be almost 3x since the new code added is about 400 bytes in the zephyr driver while we remove the 1KB of the SDK doing the same. This is mostly due to SDK having generic code for taking parameters to configure LPSPI in many ways, and having shim code needed to call on this API. With native zephyr driver, we know already for the most part which configuration we need, so we can hardcode it. And the parts we don't know either come from DT (close to hardcoded, just need to read a struct member), or from the spi_config in the API (just needing to decode a bitmask from a struct member). So there removes a lot of the overhead to use a two layer driver.
Latency improved for configure (and the configure step is the biggest part of software to hardware transfer start latency in the driver) by also about 2x, from almost 5000 to close to 2500 clock cycles. This is mostly due to switching to use a binary search algorithm for the clock divs instead of a linear search. So this improvement is actually algorithmic going from (O(n)) to O(log(n)) depending on which div is optimal. I have suggested to the MCUX SDK team to also make a similar change on that codebase. But I will admit that this is the type of function which would benefit from a direct unit test, which I do not know if that is possible with zephyr. The algorithm is more complex.
Fixes #91015