Skip to content

LPSPI optimizations 5/19/25 #90182

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jun 18, 2025

Conversation

decsny
Copy link
Member

@decsny decsny commented May 19, 2025

Update to optimize the LPSPI configuration latency and code size, as well as fix a few bugs. This is the revival of #85010 plus a few other things.

Changelist:

  • Removes call to SDK functions. More info on that below
  • Add a wait time limit for the wait_tx_fifo_empty to avoid infinite loop in ISR
  • Add major version ID to driver data
  • Remove the check for already configured that was preventing register write due to clocking being done in SDK call, now we clocked the LPSPI in the zephyr driver init already, so this was no longer necessary. Also remove a redundant CR write.
  • Add clarifying comments to common file for future developers.
  • Removes a useless pointer from the driver data to save 4 bytes of RAM per instance.

Removing SDK calls in common file results:
The measurement of 3.7 in #85010 is obviously still valid, but most of the other stuff from that PR already merged, and the comparison from the theoretical 4.1 to this PR is very close, maybe only 1 or 2 hundred bytes more now due to many of the other PRs that have merged over the last few months. The only thing left was removing this LPSPI_MasterInit call in the configure function, so I will hone in on that, which is also very important to the latency of transfers as well.

All measurements taken on same platform (MCXN947) with same configuration (the one in the spi loopback test). The exact numbers will vary slightly based on platform and configuration (just as before, but should be better in all cases now)

Before this PR:
spi_mcux_configure: 900 bytes
fsl_lpspi.c: 1020 bytes
Total: **1920 bytes**
Configuration took 4942 clock cycles

After this PR:
lpspi_configure: 604 bytes
fsl_lpspi: 0 bytes (removed from build)
Total: **604 bytes**
Configuration took 2070 clock cycles

Code size improved overall by almost 2x for configure code from around 2000 to 1300 bytes. Or you could consider it to be almost 3x since the new code added is about 400 bytes in the zephyr driver while we remove the 1KB of the SDK doing the same. This is mostly due to SDK having generic code for taking parameters to configure LPSPI in many ways, and having shim code needed to call on this API. With native zephyr driver, we know already for the most part which configuration we need, so we can hardcode it. And the parts we don't know either come from DT (close to hardcoded, just need to read a struct member), or from the spi_config in the API (just needing to decode a bitmask from a struct member). So there removes a lot of the overhead to use a two layer driver.

Latency improved for configure (and the configure step is the biggest part of software to hardware transfer start latency in the driver) by also about 2x, from almost 5000 to close to 2500 clock cycles. This is mostly due to switching to use a binary search algorithm for the clock divs instead of a linear search. So this improvement is actually algorithmic going from (O(n)) to O(log(n)) depending on which div is optimal. I have suggested to the MCUX SDK team to also make a similar change on that codebase. But I will admit that this is the type of function which would benefit from a direct unit test, which I do not know if that is possible with zephyr. The algorithm is more complex.

Fixes #91015

@decsny decsny force-pushed the feature/lpspi_remove_sdk branch 2 times, most recently from ef4a9a4 to 253d691 Compare May 19, 2025 21:06
@decsny decsny marked this pull request as ready for review May 20, 2025 17:13
@github-actions github-actions bot added Release Notes To be mentioned in the release notes platform: NXP NXP platform: NXP Drivers NXP Semiconductors, drivers area: SPI SPI bus area: Clock Control labels May 20, 2025
tbursztyka
tbursztyka previously approved these changes May 20, 2025
Copy link
Contributor

@tbursztyka tbursztyka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@decsny decsny added the DNM This PR should not be merged (Do Not Merge) label May 21, 2025
@decsny
Copy link
Member Author

decsny commented May 21, 2025

@hakehuang can you help test this one

@hakehuang
Copy link
Contributor

hakehuang commented May 21, 2025

@hakehuang can you help test this one

@decsny , I meet issue with mimxrt1020_evk, which is OK in mainline code.

*** Booting Zephyr OS build v4.1.0-4229-g67b105059d84 ***
SPI test on buffers TX/RX 0x800040a0/0x80004080, frame size = 8, DMA enabled
Polling...Running TESTSUITE spi_extra_api_features
===================================================================
START - test_spi_lock_release

then timeout

@decsny
Copy link
Member Author

decsny commented Jun 13, 2025

As for your screenshot of the SCKPCS and PCSSCK, I'll see if I can find an RT1170 board and reproduce, that delay of 5 us should have been possible I would think. It looks like the algorithm should get a scaler of 239, so I'll see if it actually gets that or not on hardware.

@decsny
Copy link
Member Author

decsny commented Jun 13, 2025

@pdgendt the issue of the "missing clock pulse" is most likely I think a signal analyzer sampling rate issue, or maybe even a pinctrl issue, or both combined. I'm assuming that since you're using the RT1170 that the function clock frequency of the lpspi is actually extremely fast (probably something like 48 MHz i'm guessing is how you have it configured?) and the default delay is 2 cycles of that, so you would need a sampling rate of probably a few hundred MHz to see it cleanly assuming the pad can even reach the low threshold in that time

The target device is an nRF7002, which has a max clock frequency of 8 MHz, but with the logic analyzer I set it to 1 MHz and sample at 40MS/s, so I don't think the analyzer is the issue.

Also note my previous comment that it works if I use a high enough transfer-delay, see (8 MHz)

I'm not sure you're clear on what I'm saying. I did note that you found it "works" if you use a higher transfer-delay, which is the main reason that led me to say what I said. I'm saying the fact that the transfer delay is so short is making it look like it didn't happen due to the sampling resolution or could be not reaching low threshold due to pinctrl capability.

The fact that you configured the max frequency of the spi device to be 1 MHz is irrelevant to what I'm saying. But, I saw from your screenshot that the frequency of your SCK was 1 MHz and you said CCR was written as 0x30, so that's how I calculated that your source clock frequency to LPSPI is probably around 48 MHz. Like I said, it's this source clock frequency, not the SCK frequency, that determines how the delays are calculated. So if transfer-delay is set to the default of 0, the driver will configure the absolute minimum delay, which is 2 cycles of the source clock, i.e. 41 ns, so you need triple digits of MS/s to see that pulse, probably at least 100 MS/s according to nyquist theorem.

@decsny decsny force-pushed the feature/lpspi_remove_sdk branch from f24b628 to 528458d Compare June 13, 2025 18:07
@decsny
Copy link
Member Author

decsny commented Jun 13, 2025

@pdgendt thanks for your suggestion, here is the diff, please try when you get the chance.

@decsny decsny force-pushed the feature/lpspi_remove_sdk branch from 528458d to fa41b3a Compare June 13, 2025 18:31
@decsny
Copy link
Member Author

decsny commented Jun 13, 2025

What I just realized is that I have no idea why even do a looping algorithm at all, it's literally just a single unknown algebra equation, we can figure out the delays in O(1) time ??? updated PR

Copy link
Contributor

@pdgendt pdgendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @decsny! The changes do make a lot of sense!

Will test them soon and report the results.

@@ -126,62 +326,55 @@ int spi_mcux_configure(const struct device *dev, const struct spi_config *spi_cf
return ret;
}

/* For the purpose of configuring the LPSPI, 8 is the minimum frame size for the hardware */
word_size = word_size < 8 ? 8 : word_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not changed yet?

@decsny decsny force-pushed the feature/lpspi_remove_sdk branch from fa41b3a to 9b7bc0b Compare June 13, 2025 21:20
@decsny
Copy link
Member Author

decsny commented Jun 13, 2025

@pdgendt I pushed a minor change for you to review based on the comments above, but at this time of the day in the week, I didn't test on hardware yet, I'll do it on monday

pdgendt
pdgendt previously approved these changes Jun 16, 2025
decsny added 6 commits June 16, 2025 15:23
For optimization purpose, remove calls to SDK. Since we know exactly
what we want, this results in smaller code size.

Also, this code calculates the SCK parameters more efficiently than the
SDK driver did it by using a binary (instead of linear) search.

Lastly, remove call to LPSPI_Reset in the init call and replace with
native driver code, and remove inclusion of SDK header.

Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
Since the LPSPI drivers no long use MCUX at all, remove the MCUX
branding, to avoid confusion. In the future if an implementation uses
the MCUX SDK driver, it should specifically be called by MCUX in the
name.

Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
There is no need to update the tx context in interrupt instead of
directly after the fill, this just makes the code more complex. Also,
the spi context header already handled iterating over buffers so we can
remove that code too.

Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
Fixing some issues with the TX fifo fill logic in two places:

First in the normal fill function, it didn't take into account a
situation where the TX fifo is already partially filled. This currently
doesn't cause a problem because the driver is written in a way that the
watermark is always 0 for TDF, but in case the watermark were anything
else it would cause a problem.

Second, when filling the TX fifo with NOPS in order to clock the rest of
the RX in from the bus, the calculation regarding the current TX fifo
length was just wrong and was leading to a bug in some cases where there
was a subtraction underflow and billions of NOPs were being filled.
Also, there could be a problem where a few extra NOPs are put in the TX
fifo if we don't count what we already have in the TX fifo, so fixing
that as well.

Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
The buf_len parameter of lpspi_fill_tx_fifo is supposed to be bytes, so
we do not need to convert it. This could cause an issue if the end of
the buffer is less bytes than the word size.

Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
This stupid errata will not leave me alone, here is another bandaid to
deal with an issue where an extra byte was being sent on version 1
LPSPIs due to the algorithm of filling NOPs when only RX is left was not
expecting the situation where the LPSPI actually consumed everything
from the fifo but is not sending it due to this ridiculous stalling errata.

Signed-off-by: Declan Snyder <declan.snyder@nxp.com>
@decsny
Copy link
Member Author

decsny commented Jun 16, 2025

just realized the sckdiv algorithm was completely unnecessary as well...

@decsny decsny linked an issue Jun 16, 2025 that may be closed by this pull request
@pdgendt
Copy link
Contributor

pdgendt commented Jun 16, 2025

just realized the sckdiv algorithm was completely unnecessary as well...

Great, will try it out tomorrow

Copy link

@decsny decsny added this to the v4.2.0 milestone Jun 16, 2025
Copy link
Contributor

@EmilioCBen EmilioCBen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work!

if (arbitrary_cycle_limit-- < 0) {
LOG_WRN("Failed waiting for TX fifo empty");
return -EIO;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use k_busy_wait() here for maybe 1usec per iteration and use the Kconfig to determine total the wait time in usec. This way the behavior will not vary by the processor speed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's supposed to vary by processor speed. It's just a number of attempts to try before quitting. It's simply a basic safety mechanism, which is just about being more safe than sorry, not a unique feature. It shouldn't even be needed. Most likely this consumption by the fifo happens immediately, and this loop isn't necessary. But people wanted to expand the scope of this PR, so here it is. I'm not overcomplicating it any more by making it time -based based on CPU frequency calculations.

Also the specific suggestion to use k_busy_wait is not a good idea. That function is not isr safe. This function is called in isr context. k_busy_wait is implemented by polling kernel time, which can't increment due to this function would be blocking in isr context, causing a likely infinite loop in the kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is polling a register waiting for some truth expression with a timeout, WAIT_FOR is the appropriate tool. It uses cycles not ticks while accepting us timeouts. If you see any bugs in this do send fixes.

https://docs.zephyrproject.org/apidoc/latest/group__sys-util.html#ga68eb35df6b4715714312df90209accee

@nashif nashif merged commit 84d2f7f into zephyrproject-rtos:main Jun 18, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Clock Control area: SPI SPI bus platform: NXP Drivers NXP Semiconductors, drivers platform: NXP NXP Release Notes To be mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nxp: lpspi: regression accessing spi flash tests: spi: mimxrt1180_evk_cm33: test fail with spi_async case