[FAQ] Why are openEMS simulations slow, and how to make them faster #29
Replies: 14 comments 27 replies
-
Thank you for that very long and very detailed answer to this indeed frequently asked question.
-
Awesome! - a new user
-
@biergaizi if you use data from gprMax you should be aware that they obviously have a CPU FDTD code that is almost as slow as openEMS... really impressive ;) If you want to see what is possible on a CPU you should have a look here. That said, I'm looking forward to your speed improvements on openEMS :D
-
I thought people might like to see my results just as a point of reference. This is for the Microstrip notch example as-is (I didn't modify it at all), using the "Create FDTD operator (compressed SSE + multi-threading)" engine. My memory is 2x16 GB, 4800 MHz DDR5.
-
Status update: I'm now trying to implement space-time tiling (multi-timestep) techniques for FDTD. For now it's just a simple experiment using the FDTD kernel to test its feasibility. Eventually I hope to integrate it into openEMS.
-
Initial result: with diamond tiling on the X/Y axes, memory traffic is reduced to ~50%, thus the theoretical speedup is ~100%. This is consistent with the data reported in the paper by Fukaya and Iwashita (which is the basis of my implementation). The "10x" speedup in EMPIRE is likely a combination of many methods and is beyond the scope of what simple space-time tiling can achieve. But beware that this is just a "laboratory" prototype. Whether integrating it into openEMS will show the same speedup is still an open question, due to the same problems I already mentioned in the GPU computing post. Tiling correctness is also not verified yet; I suspect there are still off-by-one or out-of-bounds errors in the tiling calculations. This will be the subject of my future experiments. Another curious question is whether diamond tiling and GPU computing can be used at the same time, so the 5x-10x GPU speedup would become a 10x-20x speedup.
-
I've just identified another significant and rather unexpected bottleneck in openEMS's multi-threaded engine - excessive synchronization. I haven't checked it yet, but I now suspect that this bottleneck alone is responsible for a significant slowdown and partially explains why we're seeing diminishing speedup as the number of threads goes up, even on a machine with extremely high memory bandwidth. Currently, the synchronization is implemented as the following code:
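(A paraphrased sketch of the structure rather than the verbatim source; the hook names follow the engine-extension interface, and the barrier placement is as I understand it.)

```cpp
// What every worker thread does, roughly, in the E-field ("voltage") half of a
// time step - paraphrased for illustration, not copied from the openEMS source:
for (auto* ext : extensions) { ext->DoPreVoltageUpdates(threadID);  barrier.wait(); }
UpdateVoltages(threadID);                                           barrier.wait();
for (auto* ext : extensions) { ext->DoPostVoltageUpdates(threadID); barrier.wait(); }
for (auto* ext : extensions) { ext->Apply2Voltages(threadID);       barrier.wait(); }
// ...and the same pattern repeats for the H-field ("current") half:
// DoPreCurrentUpdates / UpdateCurrents / DoPostCurrentUpdates / Apply2Current.
```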
In other words, there are blocking synchronizations not only between each substep, but also within each substep, between every extension. Thus, if we're using 3 extensions, there can be 18 barriers. Why is it designed this way? I believe the reason is that an extension uses its own partitioning, and this partitioning is not aligned with the partitioning of the main engine. Thus, there can be "action at a distance": an extension may modify memory that belongs to another thread, so we need synchronization between every extension, even within a single substep. By using a consistent partitioning for both the main loop and the extensions, one could remove these barriers and improve performance. If that's not possible for every single extension - for example, it looks very tricky to modify the Cylindrical extension - then at least each extension can use a flag to indicate whether the global partitioning is supported. One can enforce the barrier only for far-reaching extensions, and skip it for local-reaching extensions.
-
Progress preview: I'm seeing a 2.2x speedup from the first draft of my patch, which uses only spatial tiling, reducing a 1-hour simulation run to just 25 minutes. The removal of synchronization is probably a huge contributing factor to this gain. With temporal tiling I expect another 2x speedup. Nothing is final; the results are still not quite correct, but I should eventually be able to debug and fix that.
-
Another update: using diamond tiling for temporal tiling (time skewing), I see another 2x to 4x speedup on top of the existing 2.2x speedup from the previous spatial tiling. So the total speedup is up to 600% (depending on the simulation setup). A slow 1-hour simulation run with PML now takes just 10 minutes! This is partially due to reduced memory traffic; I think another important factor is thread synchronization. With diamond tiling, each tile within a stage is inherently parallel, thus we can eliminate almost all thread synchronization within a stage. My openEMS optimizations probably give a limited speedup, nowhere near 600%, in a bare-minimum simulation setup with no extensions enabled, but they are going to tremendously increase the practicality of the Perfectly Matched Layer, which currently completely trashes performance if enabled and incurs an up-to-3x slowdown. PML is what actually makes FDTD simulations useful.
-
My preliminary result of implementing the diamond tiling optimization for openEMS: 2x to 6x speedup. The GCPW example saw a massive 600% speedup, with time-to-solution reduced from 1 hour to 10 minutes. It's a very small PCB surrounded by a Perfectly Matched Layer extension of comparable size, and in this case the overhead of poor data locality and thread synchronization seemed to be especially high. There are limitations. Only the PML and lossy conductor extensions have been converted to support tiling, not Mur's ABC or the Cylindrical coordinate system, so I couldn't run all the tests yet. Supporting Mur's ABC shouldn't be too hard, and I should be able to write the code soon. On the other hand, adapting the optimization to the Cylindrical coordinate system looks fairly difficult; that's a discussion for the future. For now my focus is the Cartesian mesh.
-
I just announced the first test release of this tiling engine with 200% to 600% speedup, see #92.
-
Hello, currently I am experimenting with Lorentz materials. I have noticed a significant increase in simulation time as soon as a Lorentz material is included in the simulation. The number of terms or poles has no significant impact. Does anyone know the reason for this? The simulation duration is increased by a factor of 20. Thanks in advance! Regards,
-
I follow this discussion with great interest. Regards,
-
What about post-processing for antenna simulations? I'm currently running a simulation that went through all the timesteps in about 2 hours, and I have been waiting (for CalcPort, I think?) for 3 hours after that (and counting)... It's all on a single core the whole time. Can this aspect possibly be parallelized?
-
The question, "Why are openEMS simulations slow, and how to make them faster?" is often the first question that many newcomers ask - I, too, asked myself this question, and found the answer independently. Thus, I wrote this FAQ to answer it so it doesn't have to be repeated.
Understanding the Memory-Bound FDTD Kernel
At its heart, FDTD is a surprisingly simple algorithm. Its core is just 3 nested "for" loops that iterate over all cells in 3D space. To obtain the electric field at each cell, it takes the magnetic field of the current cell and the magnetic fields of the adjacent cells on the X, Y, and Z axes, takes their differences, multiplies them by some constants, and sums them up. Once the loop is finished, the same process is repeated to obtain the magnetic field from the electric field.
It's a magnificent fact that 20 lines of code is enough to model all classical electromagnetism in the universe - but not surprising since it's literally what Maxwell's equations say - a changing electric field creates a magnetic field, and a changing magnetic field in turn creates an electric field.
The following code is a highly simplified version of a 3D FDTD kernel.
It shows everything about the inherent slow performance of FDTD simulations.
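(A reconstructed sketch - array names, coefficients, and the exact Yee staggering are illustrative, not the actual openEMS engine code.)

```cpp
// Illustrative sketch only: made-up array/coefficient names, uniform mesh,
// boundaries and per-cell material coefficients omitted. Not openEMS code.
void update_E(float ***Ex, float ***Ey, float ***Ez,
              float ***Hx, float ***Hy, float ***Hz,
              float CE, int Nx, int Ny, int Nz)
{
    for (int x = 1; x < Nx; ++x)
        for (int y = 1; y < Ny; ++y)
            for (int z = 1; z < Nz; ++z) {
                // E at each cell is updated from differences of H at the cell
                // and its neighbors along the X, Y and Z axes.
                Ex[x][y][z] += CE * ((Hz[x][y][z] - Hz[x][y-1][z]) - (Hy[x][y][z] - Hy[x][y][z-1]));
                Ey[x][y][z] += CE * ((Hx[x][y][z] - Hx[x][y][z-1]) - (Hz[x][y][z] - Hz[x-1][y][z]));
                Ez[x][y][z] += CE * ((Hy[x][y][z] - Hy[x-1][y][z]) - (Hx[x][y][z] - Hx[x][y-1][z]));
            }
}
// update_H() is the mirror image: three more nested loops that update Hx/Hy/Hz
// from differences of E, completing one time step.
```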
The kernel has an extremely low arithmetic intensity; that is, there's very little for the CPU to do - only 4 additions and 3 multiplications on the X, Y, Z polarizations of the electric and magnetic fields at each cell, and that is all. The rest of the work is simply reading and writing values from and to memory, over and over again - millions of times in each time step. Most of the time, the CPU core is idle (even when the task manager shows "100%" usage).
To make it worse, this memory access pattern has a moderate level of memory spatial locality but nonexistent memory temporal locality.
Despite the name "random" in Random Access Memory, accessing a random location in memory has significant latency, since a new DRAM row and column must be selected. To minimize latency and maximize throughput, a linear walk through memory is much preferred; this is known as a stride-1 memory access pattern. For this kind of access, the DRAM can transmit sequential data in a burst, and the CPU notices the pattern and prefetches the next addresses into its cache while the current computation is still being processed. This property is known as memory spatial locality.
Similarly, once data is loaded from DRAM into the CPU, it's temporarily stored inside the CPU cache on the assumption that it will be used again soon; this is known as memory temporal locality. Accessing the L1 cache is ~100x faster than accessing DRAM. Thus, only if the algorithm repeatedly works on the same data values can the CPU operate at full speed. Otherwise, it's not possible to reach the peak FLOPS performance of the CPU as measured by LINPACK, and performance suffers.
And speaking of LINPACK... LINPACK's main author Jack Dongarra is well aware of this problem:
How does FDTD perform in terms of memory spatial and temporal locality?
Spatial locality: not ideal, but okay. An FDTD simulation is not a pure linear walk, but large stride-N memory accesses - to access the adjacent cell on the Y axis, the entire Z dimension must be skipped, and to access the adjacent cell on the X axis, the entire Y x Z plane must be skipped (see the index sketch after this list). Throughout the simulation, these large jumps can slow down memory accesses. This is not ideal, but fortunately, the overhead is not too high - it occurs only at the beginning of each new X and Y dimension; all later accesses are stride-1. (In openEMS, some extensions may increase the overhead, as some iterate over the cells along the X and Y directions.)
Temporal locality: poor. When the electric field values of all cells in the simulation domain have been calculated, the FDTD kernel immediately repeats the loop to calculate the magnetic field values, starting again at the beginning of the simulation domain - at this point there's nothing useful left in the cache anymore; it only contains data from the end of the simulation domain, but now we're processing the beginning of it.
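To make the stride sizes concrete, here is the index arithmetic for a row-major [x][y][z] array (an illustration of the access pattern, not openEMS's actual array classes):

```cpp
#include <cstddef>

// Flat index of cell (x, y, z) in a row-major Nx * Ny * Nz array.
inline std::size_t cell_index(std::size_t x, std::size_t y, std::size_t z,
                              std::size_t Ny, std::size_t Nz)
{
    return (x * Ny + y) * Nz + z;
}
// The Z-neighbor is 1 element away (stride-1), the Y-neighbor is Nz elements
// away, and the X-neighbor is Ny * Nz elements away. For a 300 x 300 x 300
// mesh of floats, one step in X jumps 300 * 300 * 4 bytes = 360 KB - far
// beyond a 64-byte cache line, so the hardware prefetcher only helps along Z.
```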
As a result, FDTD simulations are inherently bottlenecked by memory bandwidth.
To quote Fabian “ryg” Giesen's words (the original context was a naive implementation of FFT):
In scientific computing, this kind of code is known as Iterative Stencil Loops and has been the subject of studies for decades. As far as I know, the FDTD is one of the more difficult problems due to its 3D nature.
Unworkable Solutions
Due to the nature of the FDTD algorithm, many conventional software and hardware improvements have little to no effect. For CPU-bound applications, assembly-level optimization is often the solution - one would unroll the loop, hand-tune the assembly, use longer SIMD vectors, etc., to increase performance. This is common in video encoding, file compression, and cryptography. Unfortunately, for the FDTD kernel, the bottleneck is a CPU pipeline stalled by data starvation, not slow CPU computation, so these techniques have only a small effect.
Increasing the clock frequency of the CPU or upgrading to a faster CPU core is also unproductive, for the same reason.
The current trend in the industry is ever-increasing CPU core count. A single high-end server can have 100 CPU cores these days. Unfortunately, conventional FDTD also scales poorly on modern multicore CPUs and cannot take advantage of this development. Using multiple threads creates a speedup, but beyond a couple of threads the entire CPU's memory bandwidth is saturated. At this point, running more threads can only slow it down.
Do and Don't
Special notice to FreeBSD and macOS users - please compile openEMS with GCC, not LLVM/clang. I have identified a to-be-investigated technical problem that makes clang generate inefficient object code for the SSE/SIMD engine, reducing the simulation to a fraction of its nominal speed!
Use the latest memory generation. The memory bandwidth of DDR memory increases with each generation. Going from DDR3 to DDR4, or from DDR4 to DDR5, should produce a significant speedup due to the increased bandwidth. This means picking an up-to-date CPU, not for the CPU core performance, but for the memory support.
Use dual-channel, quad-channel, or octa-channel memory. Don't run a desktop with single-channel memory; this causes a significant slowdown due to the halved memory bandwidth. Upgrading from single-channel to dual-channel memory can create a 150% to 200% speedup in openEMS simulations. On workstation and server hardware, multi-channel memory configurations are available - use them.
Here's an example with DDR3, but similar results apply to later generations as well.
Use multiple CPU sockets. A dual-socket server has the potential to make openEMS's multi-threading scale further. However, currently the code is not NUMA-aware (I plan to improve that in the future), so the scaling may be limited. In this case, it's better to run two openEMS simulations simultaneously, each pinned to one CPU socket. Make sure to test your simulations before ordering hardware - multiple small single-socket servers are often more reasonable than one big dual-socket server. For tasks like parametric sweeps, each server can execute independent jobs. For extremely large simulations, creating an MPI cluster is an option.
Disable "powersave" CPU governors and policies. Experiments have found that a "powersave" CPU frequency governor or energy policy can cause slowdowns in some simulations.
Don't run openEMS simulations with the maximum number of threads; similarly, don't run multiple openEMS simulations simultaneously without reducing the number of threads. A single CPU core often cannot saturate the memory bandwidth, thus using multi-threading increases simulation speed, but only up to a point. Beyond this point, increasing the thread count further can only slow it down (having higher memory bandwidth or multiple CPU sockets may allow you to push it further). Blindly setting the number of threads to the number of CPU cores (or hardware threads) can create a significant slowdown. Because the memory access pattern in each simulation is different, depending on the extensions used, finding the optimal number of threads requires some experimentation. openEMS now automatically detects the fastest number of threads, but the result is occasionally wrong.
Use a dedicated and fast machine for simulation. For some workloads, using the desktop at the same time they're running only has a moderate performance hit, as long as you're not using all the CPU cores. But for openEMS, during my development I found that other applications can significantly influence the simulation speed. I assume this is the result of L3 cache pollution. For the best result, use a dedicated machine, perhaps a headless server. Don't browse the web, use CAD, or run games on the same machine while waiting for the simulation results.
Use the minimum number of cells in a simulation. For the fastest simulation speed, don't create a mesh finer than what's strictly necessary; careful meshing can produce more accurate results with fewer cells. For example, by correctly applying the 1/3-2/3 rule (sketched below), a microstrip transmission line can be accurately simulated without a very fine mesh.
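As a quick illustration of the rule (a hypothetical helper, not part of openEMS; in practice the mesh lines are set from the Python or Octave front end): around a metal edge, place one mesh line one third of the local resolution inside the metal and one two thirds outside it.

```cpp
#include <array>

// Hypothetical helper: mesh line positions around a metal edge at `x_edge`,
// given the local mesh resolution `res`. `sign` is +1 if the metal extends
// towards +x from the edge, -1 if it extends towards -x.
std::array<double, 2> thirds_rule(double x_edge, double res, int sign)
{
    return { x_edge + sign * res / 3.0,         // line 1/3 of `res` inside the metal
             x_edge - sign * 2.0 * res / 3.0 }; // line 2/3 of `res` outside the metal
}
```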
If possible, try using the Mur ABC instead of the PML ABC as the boundary condition. The PML ABC works much better than the Mur ABC and it's much easier to use, but it comes with a higher computational overhead. One must make a tradeoff between ease of use, accuracy, and simulation speed. Sometimes, the Mur ABC is acceptable.
Disable field dumps when they're not necessary. Field dumps can create a significant overhead. If the simulation has been fully debugged, field dumps can be disabled in later runs to speed them up. Unfortunately, this is not possible for antenna simulations, as the field dumps are required.
Potentially Workable Solutions
Better Memory Access Pattern
Although tuning the assembly code is unlikely to produce a significant speedup, tuning the memory access pattern may help.
For example, the current multi-dimensional array has a suboptimal memory layout; improving it may create a 10%-30% speedup (I've seen 70% in an extreme case, though one not relevant to practical applications). Work is currently ongoing at thliebig/openEMS#100.
Another change worth considering is the engine architecture. The current engine implementation creates more memory traffic than strictly necessary. To run a simulation, the flowchart is basically: run most points in 3D space through the plugins to do the pre-update, run all points in 3D space again through the main FDTD kernel, and finally run most points in 3D space again through the plugins to do the post-update. These extra redundant loads and stores create significant memory bandwidth overhead...
Instead of going through the entire 3D space at a time, perhaps the update can be split into multiple blocks: instead of running the pre-update, main update, and post-update across the entire 3D space at once, they can be done in smaller chunks to promote cache reuse. Another potential solution is a "fused" engine extension, allowing a plugin to insert itself dynamically into the main field update loop of the main kernel to "fuse" the computation together, rather than using its own redundant loop. This way a few memory accesses can be saved. Both are being discussed at thliebig/openEMS#100 (a rough sketch of the chunked idea is shown below).
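A rough sketch of the chunked idea, with hypothetical names (SplitIntoBlocks, PreUpdate, PostUpdate are invented for illustration and are not the actual interfaces proposed in #100). Note that this only works when every extension touches only cells inside the block it is given - which is the same locality issue discussed for the barrier problem above.

```cpp
// Hypothetical structure - not actual or proposed openEMS code.
for (const Block& block : SplitIntoBlocks(domain, /*fit within*/ L2_CACHE_BYTES)) {
    for (auto* ext : extensions) ext->PreUpdate(block);   // e.g. PML auxiliary terms
    UpdateVoltages(block);                                // main E-field kernel, this block only
    for (auto* ext : extensions) ext->PostUpdate(block);  // extension corrections
}
// The block stays in cache across all three phases, instead of being streamed
// from DRAM three times per time step.
```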
Long-term Solutions
Some solutions have been discussed in the literature and they may produce a significant speedup, but the implementation is difficult. They may be worth exploring and brainstorming, but don't expect to see them any time soon.
High-Order FDTD
The most "straightforward" solution is simply to accept defeat - FDTD's inherent memory-bound problem is unsolvable - and instead exploit it to do more work on the CPU by switching to high-order update equations. This makes the computation even slower, but at least the CPU is being used for productive work to compute more accurate results. And for some kinds of problems, it may even allow one to reduce the number of cells, achieving a net speedup.
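For example (one commonly used variant; details differ between schemes), a fourth-order-accurate staggered spatial difference widens the stencil from two samples to four:

$$
\left.\frac{\partial u}{\partial x}\right|_{i} \approx \frac{1}{\Delta x}\left[\frac{9}{8}\left(u_{i+\frac{1}{2}} - u_{i-\frac{1}{2}}\right) - \frac{1}{24}\left(u_{i+\frac{3}{2}} - u_{i-\frac{3}{2}}\right)\right]
$$

Each derivative now reads four field samples and does more arithmetic per byte loaded, raising the arithmetic intensity - exactly the property the memory-bound second-order kernel lacks.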
See also:
Multi-timestep & Time Skewing Techniques
In conventional FDTD, the memory bottleneck is created mainly by the cache-unfriendly access pattern. However, modified FDTD algorithms have been invented to overcome this problem. Theoretically, one possible solution is time skewing, also known as time-space tiling. The main idea is that, instead of advancing the entire 3D space by one time step at a time for all points, the domain can be time-stepped asynchronously: once a cell has been time-stepped, one can immediately time-step the surrounding cells without throwing the register and cache values away. So at any moment during the simulation, different points in space are at different time steps, forming a diamond-shaped region in space-time (a simplified sketch of the idea is shown below). The possible speedup is reportedly 10x or even higher.
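To make the idea concrete, here is a minimal, self-contained sketch in 1D. It uses simple overlapped (halo-recomputing) temporal tiling rather than proper diamond tiling, and a toy 3-point stencil rather than FDTD, but it shows the core trick: advance a cache-sized tile several time steps before moving on to the next tile.

```cpp
// Minimal 1D demonstration of temporal tiling (overlapped/halo variant, not
// diamond tiling; toy stencil, not FDTD). Illustration only.
// Build (assumed): g++ -O3 -std=c++17 temporal_tiling_demo.cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One time step: out[i] = 0.5*in[i] + 0.25*(in[i-1] + in[i+1]) on interior cells.
static void step(const std::vector<double>& in, std::vector<double>& out)
{
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = 0.5 * in[i] + 0.25 * (in[i - 1] + in[i + 1]);
}

int main()
{
    const int N = 1 << 20;    // cells (8 MiB per array - larger than a typical L2 cache)
    const int T = 64;         // total time steps
    const int TILE = 1 << 14; // spatial tile width
    const int TB = 8;         // time steps fused per tile (= halo width)

    std::vector<double> init(N);
    for (int i = 0; i < N; ++i) init[i] = std::sin(0.001 * i);

    // Reference: one full sweep over the whole domain per time step.
    std::vector<double> ref = init, tmp = init;
    for (int t = 0; t < T; ++t) { step(ref, tmp); std::swap(ref, tmp); }

    // Tiled: each tile (plus a TB-cell halo on each side) is advanced TB time
    // steps while it is still hot in cache; only the tile interior is written back.
    std::vector<double> cur = init, nxt = init;
    for (int t0 = 0; t0 < T; t0 += TB) {
        const int steps = std::min(TB, T - t0);
        for (int s = 0; s < N; s += TILE) {
            const int e  = std::min(s + TILE, N);
            const int lo = std::max(s - TB, 0);
            const int hi = std::min(e + TB, N);
            std::vector<double> a(cur.begin() + lo, cur.begin() + hi), b = a;
            for (int k = 0; k < steps; ++k) { step(a, b); std::swap(a, b); }
            // The halo absorbed the shrinking region of validity, so [s, e) is exact.
            std::copy(a.begin() + (s - lo), a.begin() + (e - lo), nxt.begin() + s);
        }
        std::swap(cur, nxt);
    }

    double err = 0.0;
    for (int i = 0; i < N; ++i) err = std::max(err, std::abs(cur[i] - ref[i]));
    std::printf("max difference vs. naive sweeps: %g\n", err);  // expect 0
    return 0;
}
```

The tile plus its halo stays in cache for all TB steps, so the main arrays are streamed from DRAM once per TB time steps instead of once per time step; diamond tiling achieves the same effect without the redundant halo recomputation.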
See:
Fukaya, T., & Iwashita, T. (2018). Time-space tiling with tile-level parallelism for the 3D FDTD method. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region - HPC Asia 2018. doi:10.1145/3149457.3149478
Andrey Zakirov et al., Using memory-efficient algorithm for large-scale time-domain modeling of surface plasmon polaritons propagation in organic light emitting diodes.
Andrey Zakirov et al., High performance FDTD code implementation for GPGPU supercomputers.
Takayuki Muranushi and Junichiro Makino, Optimal Temporal Blocking for Stencil Computation, RIKEN AICS.
Andrey Zakirov et al.'s work, the DTmaxwell4 engine, is available on GitHub: https://github.com/zakirovandrey/DTmaxwell4
Moreover, the EMPIRE solver used a multi-timestep algorithm that was marketed as the "XPU" technology; it's likely a similar idea.
However, openEMS, like other conventional FDTD engines, depends on synchronous time stepping, and this technique can't be implemented without completely reinventing the algorithm and engine. So this is a long-term project at best and requires a huge implementation effort; the implementation of such an engine could be someone's PhD project...
GPU Computing
In the 2010s, a consumer-grade CPU had a memory bandwidth of ~10 GiB/s, while a GPU had ~100 GiB/s. Today's GPUs are even faster, so perhaps it's possible to side-step the entire memory access bottleneck simply by using the GPU. Still, communication and the memory access pattern remain challenges. In CPU programming we worry a lot about memory access patterns and cacheline alignment; on a GPU there's memory coalescing, which is trickier to get right. The VRAM size limits the maximum possible simulation domain, and the host-device communication overhead can be a dealbreaker in that case (though, considering that modern PC games are already using 8 to 12 GiB of VRAM, this is no longer a problem. The future of GPU computing is bright).
According to gprMax's results, a 10x speedup is possible on a common consumer-grade GPU like the Nvidia GeForce GTX 1080 Ti. When an HPC-class GPU like the Nvidia Tesla V100 ($15000) is used, an even higher speedup is possible.
See:
Again, plausible but not going to happen in the short term...