[FAQ] Why are openEMS simulations slow, and how to make them faster #29
Replies: 14 comments 27 replies
-
Thank you for that very long and very detailed answer to this indeed frequently asked question.
-
Awesome! - a new user
-
@biergaizi if you use data from gprMax you should be aware that they obviously have a CPU FDTD code that is almost as slow as openEMS... really impressive ;) If you want to see what is possible on a CPU you should have a look here. That said, I'm looking forward to your speed improvements on openEMS :D
-
I thought people might like to see my results just as a point of reference. This is for the Microstrip notch example as-is (I didn't modify it at all), using the "Create FDTD operator (compressed SSE + multi-threading)" engine. My memory is 2x16 GB, 4800 MHz DDR5.
-
Status update: I'm now trying to implement space-time tiling (multi-timestep) techniques for FDTD. For now it's just a simple experiment using the FDTD kernel to test its feasibility. Eventually I hope to integrate it into openEMS.
-
Initial result: with diamond tiling on the X/Y axes, memory traffic is reduced to ~50%, thus the theoretical speedup is ~100%. This is consistent with the data reported in the paper by Fukaya and Iwashita (which is the basis of my implementation). The "10x" speedup in EMPIRE is likely a combination of many methods and is beyond the scope of what simple space-time tiling can achieve. But beware that this is just a "laboratory" prototype. Whether integrating it into openEMS will show the same speedup is still an open question, due to the same problems I already mentioned in the GPU computing post. Tiling correctness is also not verified yet; I suspect there are still off-by-one or out-of-bounds errors in the tiling calculations. This will be the subject of my future experiments. Another curious question is whether diamond tiling and GPU computing can be used at the same time, so the 5x-10x GPU speedup would become a 10x-20x speedup.
-
I've just identified another significant and rather unexpected bottleneck in openEMS's multi-threaded engine - excessive synchronization. I haven't checked it yet, but I now suspect that this bottleneck alone is responsible for a significant slowdown and partially explains why we're seeing diminishing speedup as the number of threads goes up, even on a machine with extremely high memory bandwidth. Currently, the synchronization is implemented as the following code:
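(A paraphrased sketch of the structure rather than the verbatim source; the hook names follow the engine-extension interface, and the barrier placement is as I understand it.)

```cpp
// What every worker thread does, roughly, in the E-field ("voltage") half of a
// time step - paraphrased for illustration, not copied from the openEMS source:
for (auto* ext : extensions) { ext->DoPreVoltageUpdates(threadID);  barrier.wait(); }
UpdateVoltages(threadID);                                           barrier.wait();
for (auto* ext : extensions) { ext->DoPostVoltageUpdates(threadID); barrier.wait(); }
for (auto* ext : extensions) { ext->Apply2Voltages(threadID);       barrier.wait(); }
// ...and the same pattern repeats for the H-field ("current") half:
// DoPreCurrentUpdates / UpdateCurrents / DoPostCurrentUpdates / Apply2Current.
```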
In other words, there are blocking synchronizations not only between each substep, but also within each substep, between every extension. Thus, if we're using 3 extensions, there can be 18 barriers. Why is it designed this way? I believe the reason is that an extension uses its own partitioning, and this partitioning is not aligned with the partitioning of the main engine. Thus, there can be "action at a distance": an extension may modify memory that belongs to another thread, so we need synchronization between every extension, even within a single substep. By using a consistent partitioning for both the main loop and the extensions, one could remove these barriers and improve performance. If that's not possible for every single extension - for example, it looks very tricky to modify the Cylindrical extension - then at least each extension can use a flag to indicate whether the global partitioning is supported. One can enforce the barrier only for far-reaching extensions, and skip it for local-reaching extensions.
-
Progress preview: I'm seeing a 2.2x speedup from the first draft of my patch, which uses only spatial tiling, reducing a 1-hour simulation run to just 25 minutes. The removal of synchronization is probably a huge contributing factor to this gain. With temporal tiling I expect another 2x speedup. Nothing is final; the results are still not quite correct, but I should eventually be able to debug and fix that.
-
Another update: using diamond tiling for temporal tiling (time skewing), I see another 2x to 4x speedup on top of the existing 2.2x speedup from the previous spatial tiling. So the total speedup is up to 600% (depending on the simulation setup). A slow 1-hour simulation run with PML now takes just 10 minutes! This is partially due to reduced memory traffic; I think another important factor is thread synchronization. With diamond tiling, each tile within a stage is inherently parallel, thus we can eliminate almost all thread synchronization within a stage. My openEMS optimizations probably give a limited speedup, nowhere near 600%, in a bare-minimum simulation setup with no extensions enabled, but they are going to tremendously increase the practicality of the Perfectly Matched Layer, which currently completely trashes performance if enabled and incurs an up-to-3x slowdown. PML is what actually makes FDTD simulations useful.
-
My preliminary result of implementing the diamond tiling optimization for openEMS: 2x to 6x speedup. The GCPW example saw a massive 600% speedup, with time-to-solution reduced from 1 hour to 10 minutes. It's a very small PCB surrounded by a Perfectly Matched Layer extension of comparable size, and in this case the overhead of poor data locality and thread synchronization seemed to be especially high. There are limitations. Only the PML and lossy conductor extensions have been converted to support tiling, not Mur's ABC or the Cylindrical coordinate system, so I couldn't run all the tests yet. Supporting Mur's ABC shouldn't be too hard, and I should be able to write the code soon. On the other hand, adapting the optimization to the Cylindrical coordinate system looks fairly difficult; that's a discussion for the future. For now my focus is the Cartesian mesh.
-
I just announced the first test release of this tiling engine with 200% to 600% speedup, see #92.
-
Hello, currently I am experimenting with Lorentz materials. I have noticed a significant increase in simulation time as soon as a Lorentz material is included in the simulation. The number of terms or poles has no significant impact. Does anyone know the reason for this? The simulation duration is increased by a factor of 20. Thanks in advance! Regards,
-
I follow this discussion with great interest. Regards,
-
What about post-processing for antenna simulations? I'm currently running a simulation that went through all the timesteps in about 2 hours, and I have been waiting (for CalcPort, I think?) for 3 hours after that (and counting)... It's all on a single core the whole time. Can this aspect possibly be parallelized?
-
The question, "Why are openEMS simulations slow, and how to make them faster?" is often the first question that many newcomers ask - I, too, asked myself this question, and found the answer independently. Thus, I wrote this FAQ to answer it so it doesn't have to be repeated.
Understanding the Memory-Bound FDTD Kernel
At its heart, FDTD is a surprisingly simple algorithm. Its core is just 3 nested "for" loops that iterate over all cells in 3D space. To obtain the electric field at each cell, it takes the magnetic field of the current cell and the magnetic fields of the adjacent cells on the X, Y, and Z axes, takes their differences, multiplies them by some constants, and sums them up. Once the loop is finished, the same process is repeated to obtain the magnetic field from the electric field.
It's a magnificent fact that 20 lines of code is enough to model all classical electromagnetism in the universe - but not surprising since it's literally what Maxwell's equations say - a changing electric field creates a magnetic field, and a changing magnetic field in turn creates an electric field.
The following code is a highly simplified version of a 3D FDTD kernel.
It shows everything about the inherent slow performance of FDTD simulations.
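(A reconstructed sketch - array names, coefficients, and the exact Yee staggering are illustrative, not the actual openEMS engine code.)

```cpp
// Illustrative sketch only: made-up array/coefficient names, uniform mesh,
// boundaries and per-cell material coefficients omitted. Not openEMS code.
void update_E(float ***Ex, float ***Ey, float ***Ez,
              float ***Hx, float ***Hy, float ***Hz,
              float CE, int Nx, int Ny, int Nz)
{
    for (int x = 1; x < Nx; ++x)
        for (int y = 1; y < Ny; ++y)
            for (int z = 1; z < Nz; ++z) {
                // E at each cell is updated from differences of H at the cell
                // and its neighbors along the X, Y and Z axes.
                Ex[x][y][z] += CE * ((Hz[x][y][z] - Hz[x][y-1][z]) - (Hy[x][y][z] - Hy[x][y][z-1]));
                Ey[x][y][z] += CE * ((Hx[x][y][z] - Hx[x][y][z-1]) - (Hz[x][y][z] - Hz[x-1][y][z]));
                Ez[x][y][z] += CE * ((Hy[x][y][z] - Hy[x-1][y][z]) - (Hx[x][y][z] - Hx[x][y-1][z]));
            }
}
// update_H() is the mirror image: three more nested loops that update Hx/Hy/Hz
// from differences of E, completing one time step.
```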
The kernel has an extremely low arithmetic intensity; that is, there's very little for the CPU to do - only 4 additions and 3 multiplications on the X, Y, Z polarizations of the electric and magnetic fields at each cell, and that is all. The rest of the work is simply reading and writing values from and to memory, over and over again - millions of times in each time step. Most of the time, the CPU core is idle (even when the task manager shows "100%" usage).
To make it worse, this memory access pattern has a moderate level of memory spatial locality but nonexistent memory temporal locality.
Despite the name "random" in Random Access Memory, accessing a random location in memory has significant latency, since a new DRAM row and column must be selected. To minimize latency and maximize throughput, a linear walk through memory is much preferred; this is known as a stride-1 memory access pattern. For this kind of access, the DRAM can transmit sequential data in a burst, and the CPU notices the pattern and prefetches the next addresses into its cache while the current computation is still being processed. This property is known as memory spatial locality.
Similarly, once data is loaded from DRAM into the CPU, it's temporarily stored inside the CPU cache on the assumption that it will be used again soon; this is known as memory temporal locality. Accessing the L1 cache is ~100x faster than accessing DRAM. Thus, only if the algorithm repeatedly works on the same data values can the CPU operate at full speed. Otherwise, it's not possible to reach the peak FLOPS performance of the CPU as measured by LINPACK, and performance suffers.
And speaking of LINPACK... LINPACK's main author Jack Dongarra is well aware of this problem:
How does FDTD perform in terms of memory spatial and temporal locality?
Spatial locality: not ideal, but okay. An FDTD simulation is not a pure linear walk, but large stride-N memory accesses - to access the adjacent cell on the Y axis, the entire Z dimension must be skipped, and to access the adjacent cell on the X axis, the entire Y x Z plane must be skipped (see the index sketch after this list). Throughout the simulation, these large jumps can slow down memory accesses. This is not ideal, but fortunately, the overhead is not too high - it occurs only at the beginning of each new X and Y dimension; all later accesses are stride-1. (In openEMS, some extensions may increase the overhead, as some iterate over the cells along the X and Y directions.)
Temporal locality: poor. When the electric field values of all cells in the simulation domain have been calculated, the FDTD kernel immediately repeats the loop to calculate the magnetic field values, starting again at the beginning of the simulation domain - at this point there's nothing useful left in the cache anymore; it only contains data from the end of the simulation domain, but now we're processing the beginning of it.
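To make the stride sizes concrete, here is the index arithmetic for a row-major [x][y][z] array (an illustration of the access pattern, not openEMS's actual array classes):

```cpp
#include <cstddef>

// Flat index of cell (x, y, z) in a row-major Nx * Ny * Nz array.
inline std::size_t cell_index(std::size_t x, std::size_t y, std::size_t z,
                              std::size_t Ny, std::size_t Nz)
{
    return (x * Ny + y) * Nz + z;
}
// The Z-neighbor is 1 element away (stride-1), the Y-neighbor is Nz elements
// away, and the X-neighbor is Ny * Nz elements away. For a 300 x 300 x 300
// mesh of floats, one step in X jumps 300 * 300 * 4 bytes = 360 KB - far
// beyond a 64-byte cache line, so the hardware prefetcher only helps along Z.
```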
As a result, FDTD simulations are inherently bottlenecked by memory bandwidth.
To quote Fabian “ryg” Giesen's words (the original context was a naive implementation of FFT):
In scientific computing, this kind of code is known as Iterative Stencil Loops and has been the subject of studies for decades. As far as I know, the FDTD is one of the more difficult problems due to its 3D nature.
Unworkable Solutions
Due to the nature of the FDTD algorithm, many conventional software and hardware improvements have little to no effect. For CPU-bound applications, assembly-level optimization is often the solution - one would unroll the loop, hand-tune the assembly, use longer SIMD vectors, etc., to increase performance. This is common in video encoding, file compression, and cryptography. Unfortunately, for the FDTD kernel, the bottleneck is a CPU pipeline stalled by data starvation, not slow CPU computation, so these techniques have only a small effect.
Increasing the clock frequency of the CPU or upgrading to a faster CPU core is also unproductive, for the same reason.
The current trend in the industry is ever-increasing CPU core count. A single high-end server can have 100 CPU cores these days. Unfortunately, conventional FDTD also scales poorly on modern multicore CPUs and cannot take advantage of this development. Using multiple threads creates a speedup, but beyond a couple of threads the entire CPU's memory bandwidth is saturated. At this point, running more threads can only slow it down.
Do and Don't
Special notice to FreeBSD and macOS users - please compile openEMS with GCC, not LLVM/clang. I have identified a to-be-investigated technical problem that makes clang generate inefficient object code for the SSE/SIMD engine, reducing the simulation to a fraction of its nominal speed!
Use the latest memory generation. The memory bandwidth of DDR memory increases with each generation. Going from DDR3 to DDR4, or from DDR4 to DDR5, should produce a significant speedup due to the increased bandwidth. This means picking an up-to-date CPU, not for the CPU core performance, but for the memory support.
Use dual-channel, quad-channel, or octa-channel memory. Don't run a desktop with single-channel memory; this causes a significant slowdown due to the halved memory bandwidth. Upgrading from single-channel to dual-channel memory can create a 150% to 200% speedup in openEMS simulations. On workstation and server hardware, multi-channel memory configurations are available - use them.
Here's an example with DDR3, but similar results apply to later generations as well.
Use multiple CPU sockets. A dual-socket server has the potential to make openEMS's multi-threading scale further. However, currently the code is not NUMA-aware (I plan to improve that in the future), so the scaling may be limited. In this case, it's better to run two openEMS simulations simultaneously, each pinned to one CPU socket. Make sure to test your simulations before ordering hardware - multiple small single-socket servers are often more reasonable than one big dual-socket server. For tasks like parametric sweeps, each server can execute independent jobs. For extremely large simulations, creating an MPI cluster is an option.
Disable "powersave" CPU governors and policies. Experiments have found that a "powersave" CPU frequency governor or energy policy can cause slowdowns in some simulations.
Don't run openEMS simulations with the maximum number of threads; similarly, don't run multiple openEMS simulations simultaneously without reducing the number of threads. A single CPU core often cannot saturate the memory bandwidth, thus using multi-threading increases simulation speed, but only up to a point. Beyond this point, increasing the thread count further can only slow it down (having higher memory bandwidth or multiple CPU sockets may allow you to push it further). Blindly setting the number of threads to the number of CPU cores (or hardware threads) can create a significant slowdown. Because the memory access pattern in each simulation is different, depending on the extensions used, finding the optimal number of threads requires some experimentation. openEMS now automatically detects the fastest number of threads, but the result is occasionally wrong.
Use a dedicated and fast machine for simulation. For some workloads, using the desktop at the same time they're running only has a moderate performance hit, as long as you're not using all the CPU cores. But for openEMS, during my development I found that other applications can significantly influence the simulation speed. I assume this is the result of L3 cache pollution. For the best result, use a dedicated machine, perhaps a headless server. Don't browse the web, use CAD, or run games on the same machine while waiting for the simulation results.
Use the minimum number of cells in a simulation. For the fastest simulation speed, don't create a mesh finer than what's strictly necessary; careful meshing can produce more accurate results with fewer cells. For example, by correctly applying the 1/3-2/3 rule (sketched below), a microstrip transmission line can be accurately simulated without a very fine mesh.
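As a quick illustration of the rule (a hypothetical helper, not part of openEMS; in practice the mesh lines are set from the Python or Octave front end): around a metal edge, place one mesh line one third of the local resolution inside the metal and one two thirds outside it.

```cpp
#include <array>

// Hypothetical helper: mesh line positions around a metal edge at `x_edge`,
// given the local mesh resolution `res`. `sign` is +1 if the metal extends
// towards +x from the edge, -1 if it extends towards -x.
std::array<double, 2> thirds_rule(double x_edge, double res, int sign)
{
    return { x_edge + sign * res / 3.0,         // line 1/3 of `res` inside the metal
             x_edge - sign * 2.0 * res / 3.0 }; // line 2/3 of `res` outside the metal
}
```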
If possible, try using the Mur ABC instead of the PML ABC as the boundary condition. The PML ABC works much better than the Mur ABC and it's much easier to use, but it comes with a higher computational overhead. One must make a tradeoff between ease of use, accuracy, and simulation speed. Sometimes, the Mur ABC is acceptable.
Disable field dumps when they're not necessary. Field dumps can create a significant overhead. If the simulation has been fully debugged, field dumps can be disabled in later runs to speed them up. Unfortunately, this is not possible for antenna simulations, as the field dumps are required.
Potentially Workable Solutions
Better Memory Access Pattern
Although tuning the assembly code is unlikely to produce a significant speedup, tuning the memory access pattern may help.
For example, the current multi-dimensional array has a suboptimal memory layout; improving it may create a 10%-30% speedup (I've seen 70% in an extreme case, though one not relevant to practical applications). Work is currently ongoing at thliebig/openEMS#100.
Another change worth considering is the engine architecture. The current engine implementation creates more memory traffic than strictly necessary. To run a simulation, the flowchart is basically: run most points in 3D space through the plugins to do the pre-update, run all points in 3D space again through the main FDTD kernel, and finally run most points in 3D space again through the plugins to do the post-update. These extra redundant loads and stores create significant memory bandwidth overhead...
Instead of going through the entire 3D space at a time, perhaps the update can be split into multiple blocks: instead of running the pre-update, main update, and post-update across the entire 3D space at once, they can be done in smaller chunks to promote cache reuse. Another potential solution is a "fused" engine extension, allowing a plugin to insert itself dynamically into the main field update loop of the main kernel to "fuse" the computation together, rather than using its own redundant loop. This way a few memory accesses can be saved. Both are being discussed at thliebig/openEMS#100 (a rough sketch of the chunked idea is shown below).
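A rough sketch of the chunked idea, with hypothetical names (SplitIntoBlocks, PreUpdate, PostUpdate are invented for illustration and are not the actual interfaces proposed in #100). Note that this only works when every extension touches only cells inside the block it is given - which is the same locality issue discussed for the barrier problem above.

```cpp
// Hypothetical structure - not actual or proposed openEMS code.
for (const Block& block : SplitIntoBlocks(domain, /*fit within*/ L2_CACHE_BYTES)) {
    for (auto* ext : extensions) ext->PreUpdate(block);   // e.g. PML auxiliary terms
    UpdateVoltages(block);                                // main E-field kernel, this block only
    for (auto* ext : extensions) ext->PostUpdate(block);  // extension corrections
}
// The block stays in cache across all three phases, instead of being streamed
// from DRAM three times per time step.
```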
Long-term Solutions
Some solutions have been discussed in the literature and they may produce a significant speedup, but the implementation is difficult. They may be worth exploring and brainstorming, but don't expect to see them any time soon.
High-Order FDTD
The most "straightforward" solution is simply to accept defeat - FDTD's inherent memory-bound problem is unsolvable - and instead exploit it to do more work on the CPU by switching to high-order update equations. This makes the computation even slower, but at least the CPU is being used for productive work to compute more accurate results. And for some kinds of problems, it may even allow one to reduce the number of cells, achieving a net speedup.
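For example (one commonly used variant; details differ between schemes), a fourth-order-accurate staggered spatial difference widens the stencil from two samples to four:

$$
\left.\frac{\partial u}{\partial x}\right|_{i} \approx \frac{1}{\Delta x}\left[\frac{9}{8}\left(u_{i+\frac{1}{2}} - u_{i-\frac{1}{2}}\right) - \frac{1}{24}\left(u_{i+\frac{3}{2}} - u_{i-\frac{3}{2}}\right)\right]
$$

Each derivative now reads four field samples and does more arithmetic per byte loaded, raising the arithmetic intensity - exactly the property the memory-bound second-order kernel lacks.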
See also:
Multi-timestep & Time Skewing Techniques
In conventional FDTD, the memory bottleneck is created mainly by the cache-unfriendly access pattern. However, modified FDTD algorithms have been invented to overcome this problem. Theoretically, one possible solution is time skewing, also known as time-space tiling. The main idea is that, instead of advancing the entire 3D space by one time step at a time for all points, the domain can be time-stepped asynchronously: once a cell has been time-stepped, one can immediately time-step the surrounding cells without throwing the register and cache values away. So at any moment during the simulation, different points in space are at different time steps, forming a diamond-shaped region in space-time (a simplified sketch of the idea is shown below). The possible speedup is reportedly 10x or even higher.
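To make the idea concrete, here is a minimal, self-contained sketch in 1D. It uses simple overlapped (halo-recomputing) temporal tiling rather than proper diamond tiling, and a toy 3-point stencil rather than FDTD, but it shows the core trick: advance a cache-sized tile several time steps before moving on to the next tile.

```cpp
// Minimal 1D demonstration of temporal tiling (overlapped/halo variant, not
// diamond tiling; toy stencil, not FDTD). Illustration only.
// Build (assumed): g++ -O3 -std=c++17 temporal_tiling_demo.cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One time step: out[i] = 0.5*in[i] + 0.25*(in[i-1] + in[i+1]) on interior cells.
static void step(const std::vector<double>& in, std::vector<double>& out)
{
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = 0.5 * in[i] + 0.25 * (in[i - 1] + in[i + 1]);
}

int main()
{
    const int N = 1 << 20;    // cells (8 MiB per array - larger than a typical L2 cache)
    const int T = 64;         // total time steps
    const int TILE = 1 << 14; // spatial tile width
    const int TB = 8;         // time steps fused per tile (= halo width)

    std::vector<double> init(N);
    for (int i = 0; i < N; ++i) init[i] = std::sin(0.001 * i);

    // Reference: one full sweep over the whole domain per time step.
    std::vector<double> ref = init, tmp = init;
    for (int t = 0; t < T; ++t) { step(ref, tmp); std::swap(ref, tmp); }

    // Tiled: each tile (plus a TB-cell halo on each side) is advanced TB time
    // steps while it is still hot in cache; only the tile interior is written back.
    std::vector<double> cur = init, nxt = init;
    for (int t0 = 0; t0 < T; t0 += TB) {
        const int steps = std::min(TB, T - t0);
        for (int s = 0; s < N; s += TILE) {
            const int e  = std::min(s + TILE, N);
            const int lo = std::max(s - TB, 0);
            const int hi = std::min(e + TB, N);
            std::vector<double> a(cur.begin() + lo, cur.begin() + hi), b = a;
            for (int k = 0; k < steps; ++k) { step(a, b); std::swap(a, b); }
            // The halo absorbed the shrinking region of validity, so [s, e) is exact.
            std::copy(a.begin() + (s - lo), a.begin() + (e - lo), nxt.begin() + s);
        }
        std::swap(cur, nxt);
    }

    double err = 0.0;
    for (int i = 0; i < N; ++i) err = std::max(err, std::abs(cur[i] - ref[i]));
    std::printf("max difference vs. naive sweeps: %g\n", err);  // expect 0
    return 0;
}
```

The tile plus its halo stays in cache for all TB steps, so the main arrays are streamed from DRAM once per TB time steps instead of once per time step; diamond tiling achieves the same effect without the redundant halo recomputation.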
See:
Fukaya, T., & Iwashita, T. (2018). Time-space tiling with tile-level parallelism for the 3D FDTD method. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region - HPC Asia 2018. doi:10.1145/3149457.3149478
Andrey Zakirov et al., Using memory-efficient algorithm for large-scale time-domain modeling of surface plasmon polaritons propagation in organic light emitting diodes.
Andrey Zakirov et al., High performance FDTD code implementation for GPGPU supercomputers.
Takayuki Muranushi and Junichiro Makino, Optimal Temporal Blocking for Stencil Computation, RIKEN AICS.
Andrey Zakirov et al.'s work, the DTmaxwell4 engine, is available on GitHub: https://github.com/zakirovandrey/DTmaxwell4
Moreover, the EMPIRE solver used a multi-timestep algorithm that was marketed as the "XPU" technology; it's likely a similar idea.
However, openEMS, like other conventional FDTD engines, depends on synchronous time stepping, and this technique can't be implemented without completely reinventing the algorithm and engine. So this is a long-term project at best and requires a huge implementation effort; the implementation of such an engine could be someone's PhD project...
GPU Computing
In the 2010s, a consumer-grade CPU had a memory bandwidth of ~10 GiB/s, while a GPU had ~100 GiB/s. Today's GPUs are even faster, so perhaps it's possible to side-step the entire memory access bottleneck simply by using the GPU. Still, communication and the memory access pattern remain challenges. In CPU programming we worry a lot about memory access patterns and cacheline alignment; on a GPU there's memory coalescing, which is trickier to get right. The VRAM size limits the maximum possible simulation domain, and the host-device communication overhead can be a dealbreaker in that case (though, considering that modern PC games are already using 8 to 12 GiB of VRAM, this is no longer a problem. The future of GPU computing is bright).
According to gprMax's results, a 10x speedup is possible on a common consumer-grade GPU like the Nvidia GeForce GTX 1080 Ti. When an HPC-class GPU like the Nvidia Tesla V100 ($15000) is used, an even higher speedup is possible.
See:
Again, plausible but not going to happen in the short term...