SIMD StoreAlignedNonTemporal vs StoreAligned help/question #95386

dbriard · 2023-11-29T08:21:39Z


Hi,
I am not familiar with the NonTemporal SIMD functions, but according to my benchmarks, they give me a big performance boost.

However I read on the Internet:

"Non-Temporal SSE instructions (MOVNTI, MOVNTQ, etc.), don't follow the normal cache-coherency rules. Therefore non-temporal stores must be followed by an SFENCE instruction in order for their results to be seen by other processors in a timely fashion."

So I would like to know whether there is anything to take care of when using StoreAlignedNonTemporal. Do I have to call an additional instruction at the end of my loops? SFENCE? Where is this instruction in C#?

My use cases are image processing functions.

For example, converting a 26-megapixel image from RGBA32 (byte) to RGBA128 (float) takes 30 ms with StoreAligned and only 13 ms with StoreAlignedNonTemporal... so that is really interesting, but I am not sure about side effects...
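For illustration only, a conversion kernel of the kind described above could be sketched as follows. This is not the poster's actual code: the class and method names (`PixelConvert`, `BytesToFloats`, `Convert`) are hypothetical, the sketch assumes .NET 7 or later with unsafe code enabled, and it assumes a 16-byte-aligned destination whose length is a multiple of 16 bytes. The `nonTemporal` flag only switches between the two store flavors so they can be benchmarked against each other.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static unsafe class PixelConvert
{
    // Hypothetical kernel: widens each byte channel to a float in [0, 255].
    // Assumes dst is 16-byte aligned and count is a multiple of 16.
    public static void BytesToFloats(byte* src, float* dst, int count, bool nonTemporal)
    {
        for (int i = 0; i < count; i += 16)
        {
            // 16 bytes = 4 RGBA32 pixels, which become 16 floats = 4 RGBA128 pixels.
            (Vector128<ushort> lo, Vector128<ushort> hi) = Vector128.Widen(Vector128.Load(src + i));
            (Vector128<uint> a, Vector128<uint> b) = Vector128.Widen(lo);
            (Vector128<uint> c, Vector128<uint> d) = Vector128.Widen(hi);

            Store(Vector128.ConvertToSingle(a.AsInt32()), dst + i, nonTemporal);
            Store(Vector128.ConvertToSingle(b.AsInt32()), dst + i + 4, nonTemporal);
            Store(Vector128.ConvertToSingle(c.AsInt32()), dst + i + 8, nonTemporal);
            Store(Vector128.ConvertToSingle(d.AsInt32()), dst + i + 12, nonTemporal);
        }
    }

    private static void Store(Vector128<float> v, float* p, bool nonTemporal)
    {
        if (nonTemporal) v.StoreAlignedNonTemporal(p); // hints that the data should bypass the caches
        else v.StoreAligned(p);                        // regular (temporal) store
    }

    // Safe wrapper for experimenting: copies through 64-byte-aligned native buffers.
    public static float[] Convert(byte[] input, bool nonTemporal)
    {
        int n = input.Length;
        if (n == 0 || n % 16 != 0)
            throw new ArgumentException("Length must be a positive multiple of 16.");

        byte* src = (byte*)NativeMemory.AlignedAlloc((nuint)n, (nuint)64);
        float* dst = (float*)NativeMemory.AlignedAlloc((nuint)(n * sizeof(float)), (nuint)64);
        try
        {
            input.CopyTo(new Span<byte>(src, n));
            BytesToFloats(src, dst, n, nonTemporal);
            return new Span<float>(dst, n).ToArray();
        }
        finally
        {
            NativeMemory.AlignedFree(src);
            NativeMemory.AlignedFree(dst);
        }
    }
}
```

Calling `PixelConvert.Convert(bytes, nonTemporal: true)` and `PixelConvert.Convert(bytes, nonTemporal: false)` on the same input should give identical results; only the timing on large buffers is expected to differ.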

For example, can I chain several functions that use NonTemporal store/load?
Can I read any pixel in my image after the last NonTemporal store in the loop?

Thanks for helping!


gfoidl · 2023-11-29T10:43:49Z


"if there is something to take care when using StoreAlignedNonTemporal"

It depends on when you need the data again.
When you need it soon (whatever that may mean in CPU terms), you'll need to reload it from RAM -> caches (L3, L2, L1) -> CPU.
Then it's about visibility. Say another core of the CPU has also fetched the data; it is then not notified about a change in that data due to the non-temporal store (which, for temporal stores, protocols like MOESI take care of). Therefore that core may operate on stale data. To prevent that, you have to issue fence instructions in order to tell the processors (put simply) when to sync the data.

There's a good write-up on this (so I don't have to write something similar myself):

"SFENCE? Where is this instruction in c#?"

Sse.StoreFence
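As a minimal sketch of where the fence could go, the following fills a buffer with non-temporal stores and then issues `Sse.StoreFence()` once, after the loop, so that the earlier stores become globally visible before any store that follows the fence. The names `NonTemporalFill`, `Fill` and `FillNew` are hypothetical, and the `Interlocked.MemoryBarrier()` fallback for non-x86 hardware is an assumption made for this sketch, not something stated in this thread. It assumes .NET 7 or later with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Threading;

public static unsafe class NonTemporalFill
{
    // Fills `count` floats with `value` using non-temporal stores, then fences once.
    // Assumes dst is 16-byte aligned and count is a multiple of 4.
    public static void Fill(float* dst, int count, float value)
    {
        Vector128<float> v = Vector128.Create(value);
        for (int i = 0; i < count; i += 4)
            v.StoreAlignedNonTemporal(dst + i);

        if (Sse.IsSupported)
            Sse.StoreFence();            // SFENCE on x86/x64
        else
            Interlocked.MemoryBarrier(); // assumption: full barrier as a portable stand-in
    }

    // Safe wrapper: fills a 64-byte-aligned native buffer and copies it to a managed array.
    public static float[] FillNew(int count, float value)
    {
        if (count <= 0 || count % 4 != 0)
            throw new ArgumentException("count must be a positive multiple of 4.");

        float* dst = (float*)NativeMemory.AlignedAlloc((nuint)(count * sizeof(float)), (nuint)64);
        try
        {
            Fill(dst, count, value);
            return new Span<float>(dst, count).ToArray();
        }
        finally
        {
            NativeMemory.AlignedFree(dst);
        }
    }
}
```

Calling `Sse.StoreFence()` without checking `Sse.IsSupported` throws `PlatformNotSupportedException` on hardware without SSE, which matters for the cross-platform goal mentioned in this thread.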

Personally I'd use the regular (temporal) stores, unless benchmarking shows that non-temporal stores are beneficial. But then one needs a good real-world test case and an acceptable load to cover all potential troubles (which will most likely result from the different visibility).

5 replies

dbriard Nov 29, 2023
Author

Many thanks for your good explanations @gfoidl

If my understanding is correct, if I'm absolutely certain that the data will only be read/written by the same CPU Core, using the Non-Temporal version should theoretically have no side effects, and the StoreFence might not be necessary, correct?

I run loop (e.g. convert pixel format) -> loop (e.g. exposure adjust) -> loop (e.g. sharpen), one after the other, in the same thread. Each thread works on a different tile of the source image, and so uses different intermediate buffers.

My goal is to replace the C++ Intel IPP library, designed for x86/x64, with C# SIMD to achieve cross-platform functionality (including targeting the Mac M processors). I can't check the Intel IPP code, but in benchmarking, I noticed that a basic function converting an image from byte to float was slower in my C# implementation than its Intel IPP counterpart (30 ms vs. 15 ms). Even a dummy C# function that just stores zeros in the destination image using Vector128.StoreAligned was slower than Intel IPP. Eventually, I tried StoreAlignedNonTemporal, and it matched or even outperformed the Intel IPP function (12 ms vs. 15 ms).

So I suspect that Intel IPP uses non-temporal stores (I'm using their DLL in single-threaded mode).

gfoidl Nov 30, 2023

"...be read/written by the same CPU Core, using the Non-Temporal version should theoretically have no side effects, and the StoreFence might not be necessary, correct?"

Imagine empty caches, and you load the data as usual (i.e. a temporal load). Then it goes RAM -> L3 -> L2 -> L1 -> CPU.
Then you perform a non-temporal store (CPU -> RAM), and after that the same CPU needs the data again, so it will try to read from L1, and if the data is there, it uses it without noticing that the data changed in the meantime.

So as for "correct?": not in every case, especially if the same CPU reads the data again.

Non-temporal stores can be used when data is read once and then written back to memory, bypassing the caches. That fits a workload where the data isn't read back (by any CPU, including the one that does the write), so updating the caches is pure overhead and there's no need to keep the (updated) data in them.

"So I suspect that Intel IPP uses non-temporal stores"

Yep, that seems reasonable. That's a workload where non-temporal stores fit.
Read -> process -> write back (without the need to update the caches).
But when, for instance, you want to store the image to disk, you need fencing to guarantee the freshness of the data. Otherwise stale data from a cache might be used (but this is hard to say without knowing the code, types, etc. involved).

Answer selected by dbriard

dbriard Nov 30, 2023
Author

Thank you @gfoidl
Hmm, it looks like using non-temporal stores can be risky... as I am not an expert on cache issues, and as I will of course read the data again on the same CPU.

My typical workflow is:
Load (buffer1) -> process 1 -> Store (buffer2, or buffer1 if in place) -> Load (result of previous process) -> process 2 -> Store (in the same or another buffer) -> Load (previous buffer)... many times in sequence.
I generally work on buffers of 512x512x16 bytes (about 4MB each), but sometimes on several buffers at the same time.
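One way to read the advice in this thread for such a chain is to keep regular stores for intermediate buffers that the next stage reads right away, and to consider non-temporal stores only for a final output that is not read again soon. The sketch below illustrates that split with two made-up stages (a gain and an offset) standing in for the real processing steps. `TilePipeline`, `Run` and `Apply` are hypothetical names, and whether the final buffer really is "not needed soon" is an assumption that only benchmarking on the real workload can confirm. It assumes .NET 7 or later with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static unsafe class TilePipeline
{
    // Assumes all pointers are 16-byte aligned and count is a multiple of 4.
    public static void Run(float* src, float* tmp, float* dst, int count, float gain, float offset)
    {
        Vector128<float> g = Vector128.Create(gain);
        Vector128<float> o = Vector128.Create(offset);

        // Stage 1: tmp is read again immediately by stage 2, so use regular stores.
        for (int i = 0; i < count; i += 4)
            (Vector128.LoadAligned(src + i) * g).StoreAligned(tmp + i);

        // Stage 2: assumes dst is not read again soon, so non-temporal stores may pay off.
        for (int i = 0; i < count; i += 4)
            (Vector128.LoadAligned(tmp + i) + o).StoreAlignedNonTemporal(dst + i);
    }

    // Safe wrapper for experimenting: copies through 64-byte-aligned native buffers.
    public static float[] Apply(float[] input, float gain, float offset)
    {
        int n = input.Length;
        if (n == 0 || n % 4 != 0)
            throw new ArgumentException("Length must be a positive multiple of 4.");

        nuint bytes = (nuint)(n * sizeof(float));
        float* src = (float*)NativeMemory.AlignedAlloc(bytes, (nuint)64);
        float* tmp = (float*)NativeMemory.AlignedAlloc(bytes, (nuint)64);
        float* dst = (float*)NativeMemory.AlignedAlloc(bytes, (nuint)64);
        try
        {
            input.CopyTo(new Span<float>(src, n));
            Run(src, tmp, dst, n, gain, offset);
            return new Span<float>(dst, n).ToArray();
        }
        finally
        {
            NativeMemory.AlignedFree(src);
            NativeMemory.AlignedFree(tmp);
            NativeMemory.AlignedFree(dst);
        }
    }
}
```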

That's a very interesting topic that I will need to understand better one day or another!

tannergooding Nov 30, 2023
Collaborator

👍 to what @gfoidl said. One additional thing I'd note, however, is that non-temporal operations are generally meant to be used when you're processing large amounts of data and where that data is unlikely to be needed in the near future.

The optimization manuals specifically document "large" as something that would fill more than 50% of the largest cache (typically the L3). So if you have 1MB of L3 per core, you want "large" to be roughly 512KB. A lot of CRT functions set this cutoff at around 256KB so they don't need to query the size at runtime and to best support more common hardware. Newer machines, like the AMD Zen processors, can often have much larger caches (like 64MB total, or roughly 2-4MB per core/hardware thread).
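The rule of thumb above can be written down as a small helper. This is only a sketch: `StorePolicy`, `UseNonTemporal` and `FixedCutoffBytes` are hypothetical names, and the cache size has to be supplied by the caller, since querying it at runtime is outside the scope of this example.

```csharp
using System;

public static class StorePolicy
{
    // Fixed cutoff of the kind described above for code that does not query the cache size.
    public const long FixedCutoffBytes = 256 * 1024;

    // Returns true when the buffer counts as "large" under the 50% rule of thumb.
    // largestCacheBytes <= 0 means "unknown", in which case the fixed cutoff is used.
    public static bool UseNonTemporal(long bufferBytes, long largestCacheBytes = 0)
    {
        long cutoff = largestCacheBytes > 0 ? largestCacheBytes / 2 : FixedCutoffBytes;
        return bufferBytes > cutoff;
    }
}
```

For the 512x512x16-byte buffers mentioned earlier (4,194,304 bytes), `StorePolicy.UseNonTemporal(4_194_304)` is true with the fixed cutoff, while `StorePolicy.UseNonTemporal(4_194_304, 64L * 1024 * 1024)` is false when a 64MB cache is assumed. Either way, the size test is only half of the criterion: the data should also be unlikely to be needed again soon.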

I then wouldn't necessarily classify data that's likely to be accessed shortly on another core as "non-temporal". That (and almost any case where you might consider needing sfence) is likely a case where you want to use regular loads/stores.

dbriard Dec 1, 2023
Author

Thank you for your additional information!
