Notes on HPC Programming and openEMS Optimizations #154
Replies: 8 comments 30 replies
I'm very interested in your work and eventually would like to contribute to this effort when I "retire" next year. I recently attended an online seminar by the company behind Tidy3D, which sells FDTD EM simulation products aimed at photonic simulations. I did ask them how they're getting around the "memory wall", but I doubt they'll want to give me much information since it's likely proprietary.
-
Also, I found it amusing that a commercial FDTD product such as Tidy3D still uses a rectangular grid, just like openEMS. This makes me feel much better about openEMS. I've seen papers in which the authors used variable-sized triangular grids and other methods to improve on the inefficiencies of rectangular grids. I'm imagining that, since the timestep is set by the minimum grid dimension, a scheme such as triangular gridding for FDTD would significantly reduce the memory footprint but probably not bring great improvements in performance?
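For reference, the standard Courant (CFL) stability limit for FDTD on a rectangular grid ties the timestep to the smallest cell:

$$\Delta t \le \frac{1}{c\,\sqrt{\dfrac{1}{\Delta x^2}+\dfrac{1}{\Delta y^2}+\dfrac{1}{\Delta z^2}}}$$

so a single very small cell anywhere in the mesh limits the timestep of the entire simulation.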
-
Recently, some flawed openEMS engine benchmarks were attempted. The conclusions of those benchmarks were misleading due to a lack of knowledge of FDTD and openEMS's performance characteristics - this includes some of my own early benchmarks made in early 2023. Thus, to avoid a repetition of more misleading benchmarks, it's necessary to clarify some of the basic issues. The following is the second article of the series *Notes on HPC Programming and openEMS Optimizations*.

## Understanding the Performance Characteristics of FDTD and openEMS
-
The following is the third article of the series *Notes on HPC Programming and openEMS Optimizations*.

## A Short Note on Operator Compression

In FDTD, a simulation involves reading two types of data - electromagnetic field arrays and material property arrays. In openEMS, the electromagnetic fields are stored in the voltage and current arrays, while the material properties are baked into the operator (update coefficient) arrays. In practice, many simulations have large empty air boxes, or boxes filled with a uniform material. As a result, many values in the operator arrays are repeated many times, wasting memory bandwidth. Thus, openEMS removes duplicate operators using a simple de-duplication algorithm along the Z axis, then assigns an index to each of the remaining unique values. This process is called "Operator Compression".

Naturally, the question is how much improvement would be possible with a better de-duplication or compression algorithm - for example, @thliebig previously suggested a Run-Length Encoding compression method as a possible improvement. So here's a quick calculation. During the electric field update, the memory accesses include reads of the field arrays, reads of the (already compressed) operator data, and the write-back of the updated field, so the operator data accounts for only part of the total traffic.

Conclusion: only a 1.2x to 1.5x improvement is possible over the existing compression method, even with a 100% ideal compression algorithm. Optimizing operator compression further has diminishing returns and will only bring a very limited speedup, hence it's not recommended. It cannot replace the use of time tiling techniques for multi-timestep calculations (still a work in progress).

I've already experimented with a "3D block" based de-duplication algorithm in my rectangular tiling engine. Although the algorithm itself achieves better compression in some cases, in practice the performance is comparable to the official engine, because the gain is only barely able to overcome the tiling overhead.
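To illustrate the idea, here is a minimal conceptual sketch of a compressed update loop (hypothetical names and a simplified update equation, not the actual openEMS data structures):

```cpp
#include <cstdint>
#include <cstddef>

// Simplified: a real FDTD update needs more coefficients and the curl terms.
struct Coefficients { float vv, vi; };

void update_E_compressed(float* volt, const float* curr,
                         const uint32_t* op_index,        // one small index per cell
                         const Coefficients* unique_ops,  // de-duplicated coefficient table
                         size_t num_cells)
{
    for (size_t n = 0; n < num_cells; n++) {
        // The table of unique coefficients is small and stays in cache;
        // the field arrays (volt/curr) still dominate the DRAM traffic.
        const Coefficients& op = unique_ops[op_index[n]];
        volt[n] = op.vv * volt[n] + op.vi * curr[n];
    }
}
```

Even if `op_index` could be compressed away entirely, the field reads and writes would remain, which is where the 1.2x to 1.5x upper bound comes from.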
-
The following is the fourth article of the series *Notes on HPC Programming and openEMS Optimizations*. It describes the theory of operation of both my published Tiling engine and my unpublished, work-in-progress rewrite.

## Temporal Tiling: The Key to Fast FDTD Simulations, Explained

### Background

#### Stencil Computation

In scientific and engineering computing, simulating physical systems governed by Partial Differential Equations (PDEs) is a common task. These numerical PDE solvers usually calculate the state of the system at the next timestep from its state at the current timestep, cell by cell, using a fixed neighbor access pattern known as a stencil.

The following are some examples of stencils: (1) A Jacobi solver of the 2D heat equation depends on the cell itself and its top, bottom, left, and right neighbors, similar to the D-pad on a game controller (von Neumann neighborhood). (2) An implementation of Conway's Game of Life depends on the north, south, east, west, northeast, northwest, southeast and southwest neighbors (Moore neighborhood). (3) A 2D convolution kernel in image processing.

For FDTD, the stencil shape is more complex (because of the leapfrog update between the electric and magnetic fields), but it's otherwise rather similar to the 7-point 3D stencil - which is itself a generalization of the 3-point 1D stencil. Thus, for clarity, the rest of this article focuses on the basic 3-point 1D stencil (plus its generalizations to higher dimensions, and the necessary modifications for FDTD), as shown below. We also assume all elements outside the boundary are 0 (Dirichlet boundary condition).
A naive implementation looks something like the following minimal sketch (a generic 3-point averaging kernel for illustration, not the exact code from the original post):
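```cpp
#include <vector>

// Naive 3-point 1D stencil: every timestep makes a full pass over the whole array.
// Boundary cells are never updated (Dirichlet boundary, assumed to be 0).
void naive_stencil(std::vector<float>& a, int timesteps)
{
    std::vector<float> b(a);                       // double buffer
    for (int t = 0; t < timesteps; t++) {
        for (size_t i = 1; i + 1 < a.size(); i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
        std::swap(a, b);                           // new values become the current state
    }
}
```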
### Memory Wall

It's a well-known fact that numerical simulations based on stencil computations have an extreme memory bandwidth bottleneck, especially in a naive implementation. The naive implementation creates a tremendous amount of DRAM traffic. Consider a 1 GB simulation domain with 10000 timesteps on a desktop computer with 40 GB/s memory bandwidth (dual-channel DDR4-3200). Because the simulation domain is too large to fit in cache, each timestep involves a full memory scan: reading and writing the whole domain moves roughly 2 GB per timestep, so 10000 timesteps move about 20 TB of data, which takes on the order of 500 seconds at 40 GB/s no matter how fast the CPU is. The arithmetic intensity is extremely low, and the roofline model shows the achievable performance sits far below the CPU's peak FLOPS.

We can visualize the iteration space of the loop using a 2D spacetime diagram (1D space + 1D time), where each color represents a unit of work.

In ordinary programming, loop tiling (loop blocking) is the standard solution to this problem: the array is broken into multiple rectangular blocks and calculated tile by tile. However, in the field of stencil computation there are several unique challenges. First, conventional loop tiling is spatial (space) tiling, but a physics simulation often applies just a single operation to each tile, not multiple operations, so there is little to no data reuse even if the loop is blocked. Instead, we have to use temporal (time) tiling to calculate multiple timesteps at once. Next, stencil computations have data dependencies that extend beyond the tile itself into its neighboring tiles, so preserving the correct data dependencies becomes the real problem. In particular, time tiling means the simulation domain becomes asynchronous - every element can sit at a different timestep. Without an appropriate tile shape that brings natural synchronization, this would be a nightmare to handle.

### Parallelogram Tiling

Historically, parallelogram tiling was the first solution to this problem. First, the space is broken into small rectangular tiles. Then, at each timestep, the range of updated elements shrinks by one element on the right, because that element cannot be updated due to missing dependencies; on the other hand, the range also grows by one element on the left. By using a sliding window of two tiles, updating the next tile automatically "fixes" the elements with lagging timesteps on the left. This tiling algorithm is visualized in the following spacetime diagram.

One can see the desirable properties of this tiling: each tile performs many timesteps of work while its data stays in cache, the memory traversal remains essentially sequential, and no work is ever recomputed. A double-buffered 1D sketch of this scheme is shown below.
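The following is hypothetical illustration code (same 3-point averaging stencil and zero boundaries as before), not the openEMS engine:

```cpp
#include <vector>
#include <algorithm>

// 1D parallelogram (time-skewed) tiling with double buffering. Tiles are processed
// strictly left to right; within a tile, the updated range slides one cell to the
// left per timestep, so one pass over a cache-sized tile advances it many timesteps.
void skewed_stencil(std::vector<float>& even, int timesteps, long tile_width)
{
    const long n = (long)even.size();
    std::vector<float> odd(even);                  // buffer holding the odd timesteps
    std::vector<float>* buf[2] = { &even, &odd };

    for (long x0 = 0; x0 < n + timesteps; x0 += tile_width) {
        for (int t = 0; t < timesteps; t++) {
            const std::vector<float>& src = *buf[t % 2];
            std::vector<float>&       dst = *buf[(t + 1) % 2];
            long lo = std::max(1L, x0 - t);                     // grows to the left
            long hi = std::min(n - 1, x0 + tile_width - t);     // shrinks on the right
            for (long i = lo; i < hi; i++)
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
        }
    }
    if (timesteps % 2)                             // final values end up in the odd buffer
        even = odd;
}
```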
#### Modification for FDTD

For 1D FDTD, the E field has H field dependencies on the left, so a small modification makes the tiling suitable for FDTD (diagram omitted here). Note that for simplicity, only the updated field is labeled in the diagram, but both the E and the H fields are accessed in each update step. Also note that in my published tiling code, 3D parallelogram tiling was incorrectly called "trapezoid tiling"; this was a mistake due to my confusion of terminology during early development.

#### Fatal Flaw: No Parallelism

However, this tiling algorithm also has a fatal flaw: the lack of inter-tile parallelism. To preserve the dependencies, all tiles must be processed from left to right. There are workarounds, but none of them recovers full parallelism, which motivates the trapezoid tiling described next.
### Trapezoid Tiling

To overcome the lack of parallelism in parallelogram tiling, researchers invented an alternative tiling algorithm known as trapezoid tiling (or trapezoidal tiling). The innovation is that, instead of using a single tile shape, it uses split-tiling with two different tile shapes: a trapezoid that shrinks over time, known as a "mountain", and a trapezoid that grows over time, known as a "valley". The 1D case is visualized in the following 2D spacetime diagram.

As one can see, different mountains are completely independent and can be processed in parallel, because the trapezoid tiles provide a natural spacing that avoids the dependencies. After all mountains are processed, all valleys are then processed in parallel (in the picture, valleys at the boundary can either be treated as incomplete valleys, or as part of the first two mountains). In the 1D case, only two synchronizations are needed for many timesteps, providing great parallelization on massively parallel computers.

#### Modification for FDTD

Similar to parallelogram tiling, trapezoid tiling can also be modified for FDTD (diagram omitted here). Note that in the research paper by Fukaya, T., & Iwashita, T., "trapezoid tiling" and "diamond tiling" were used interchangeably. Strictly speaking, this is an abuse of terminology which should have been avoided: diamond tiling is an extension of trapezoid tiling and should not be confused with the original algorithm. My published tiling code also refers to this method as "diamond tiling", which is incorrect and stems from the same confusion of terminology.

#### Generalization to Higher Dimensions

Trapezoid tiling generalizes naturally to higher dimensions. For example, in 2D, each dimension can be considered as 1D separately, and the 1D tiles from each dimension are combined to form a 2D plane (and a 3D spacetime), or a 3D cube (and a 4D spacetime). Reasoning about this geometrically can be extremely difficult or impossible, but fortunately it's easy to reason about algebraically: the order of combinations follows the Cartesian product of the tile types from each dimension. To generalize trapezoid tiling to 2D space (3D spacetime), one processes the four stages mountain x mountain, valley x mountain, mountain x valley, and valley x valley. In other words, 2D needs 4 stages of processing, and 3D needs 8 stages. A minimal 1D sketch of the mountain and valley passes is shown below.
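Here is the corresponding minimal 1D sketch (hypothetical helper code under the same assumptions as the earlier sketches; it requires `2 * timesteps <= tile_width` so the mountains never shrink below zero width):

```cpp
#include <vector>
#include <algorithm>

// 1D trapezoid (split) tiling: phase 1 computes all "mountains", which shrink by one
// cell on each side per timestep and are fully independent of each other; phase 2
// computes all "valleys", which grow by one cell per side and fill the gaps between
// the mountains. Each phase is embarrassingly parallel across its outer loop.
void trapezoid_stencil(std::vector<float>& even, int timesteps, long tile_width)
{
    const long n = (long)even.size();
    std::vector<float> odd(even);
    std::vector<float>* buf[2] = { &even, &odd };

    for (long x0 = 0; x0 < n; x0 += tile_width)        // phase 1: mountains (parallel)
        for (int t = 0; t < timesteps; t++) {
            const std::vector<float>& src = *buf[t % 2];
            std::vector<float>&       dst = *buf[(t + 1) % 2];
            long lo = std::max(1L, x0 + t);
            long hi = std::min(n - 1, x0 + tile_width - t);
            for (long i = lo; i < hi; i++)
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
        }

    for (long c = 0; c <= n; c += tile_width)          // phase 2: valleys (parallel)
        for (int t = 0; t < timesteps; t++) {
            const std::vector<float>& src = *buf[t % 2];
            std::vector<float>&       dst = *buf[(t + 1) % 2];
            long lo = std::max(1L, c - t);
            long hi = std::min(n - 1, c + t);
            for (long i = lo; i < hi; i++)
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
        }

    if (timesteps % 2)                                 // final values end up in the odd buffer
        even = odd;
}
```

The 2D and 3D generalizations apply this per-axis structure and combine the phases in the Cartesian-product order described above.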
Note that there's only a single time axis shared by all dimensions, and the timesteps of different tiles are always aligned. For example, in 3D space, the x, y and z ranges grow or shrink simultaneously at each timestep, which allows the clean generalization to higher-dimensional space.

#### Fatal Flaw: Redundant Memory Accesses

Unfortunately, as great as trapezoid tiling initially seems, it also has a fatal flaw: redundant memory accesses in the overlapped regions of different tiles. To see how this happens, recall the 2D spacetime diagram of 1D trapezoid tiling. Although each tile has a trapezoidal shape in spacetime, the tiles are overlapping 1D line segments in terms of memory access. In other words, trapezoid tiling creates parallelism at the expense of redundant memory accesses. In fact, calculating as many timesteps as possible per tile creates the highest overhead from redundant traffic, because in that case the "valleys" in the middle mainly exist to fix up the mess left by the existing work; they themselves get no chance to perform productive original work (sounds like an office-space analogy...). The overlapped region can be reduced to minimize the waste - for example by limiting the number of timesteps per tile or by enlarging the spatial tiles - but only at the cost of temporal reuse.
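As a rough back-of-the-envelope model (my own estimate, not a number from the original benchmarks): if a tile spans $W$ cells per dimension and computes $T$ timesteps, its read footprint grows to roughly $W + 2T$ cells per dimension, so the relative memory traffic overhead scales like

$$\left(1 + \frac{2T}{W}\right)^{d} - 1$$

in $d$ space dimensions.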
In 1D space, the overlapped memory accesses are only a small overhead, but the overhead grows rapidly in higher dimensions - roughly quadratically in 2D space and cubically in 3D space. For example, if the simulation domain is split into 10 non-overlapped rectangles and around 20 overlapped rectangles in the worst case (common when each tile is small due to cache size limitations), there's an 800% memory traffic penalty, which is sufficient to eliminate any benefit from trapezoid tiling. The penalty of 2D trapezoid tiling is smaller but still significant, especially on GPUs with their tiny local memories. In the currently published tiling code, trapezoid tiling is applied to the X and Y dimensions with a hardcoded tile size of 10, while parallelogram tiling is applied to the Z dimension. This (combined with operator compression) is sufficient to explain the inconsistent speedups. I plan to address the problem in the future.

### Diamond Tiling and Hexagon Tiling

To overcome the problem of redundant memory accesses in trapezoid tiling, researchers later proposed a small modification of trapezoid tiling, known as diamond tiling. The insight of diamond tiling is that when a valley finishes its calculation, it already contains all the data dependencies needed to immediately start constructing another mountain on top of it. After this step, diamond tiling finishes initialization: all mountains and all valleys swap their roles, and since each valley together with the new mountain stacked on top of it forms a diamond, we now have a simulation domain tiled exclusively by diamonds. There are two types of diamonds, one type running several timesteps ahead of the other; I'll call them "slow diamonds" and "fast diamonds". To terminate diamond tiling, the last diamond tiles are truncated in the middle at the final timestep, so the diamonds degenerate back into ordinary mountains or valleys (omitted in the diagram); this is essentially the startup process in reverse. Also note that in the special case where the diamond begins and ends with a width of more than one element, the scheme is also known as hexagon tiling (hexagonal tiling).

So diamond tiling solves the problem of redundant memory access overhead in a simple and elegant manner - the problem is solved, right?

#### Fatal Flaw: Problematic Generalization to Higher Dimensions

Unfortunately, the literature says that diamond tiling cannot be generalized to higher-dimensional space in a clean manner. Personally, I don't even know how it can be done at all. The difficulty is that in diamond tiling, tiles are misaligned on the time axis, so the Cartesian product trick no longer works. During startup, suppose we attempt to process the tiling with the straightforward Cartesian-product schedule, combining the early-phase and late-phase tiles of each axis in all four ways.
As one can see, 3 of these 4 combinations have mismatched timesteps, which means those diamonds degenerate back into mountains; only the last combination forms a true diamond. After diamond tiling finishes initialization and is fully started, the dependencies become even more puzzling, because the slow and fast diamonds of each axis can again be combined in four ways.
At this point, the tile dependencies become extremely problematic and puzzling: only two combinations are time-aligned, namely the ones where both axes are in the same phase (slow with slow, and fast with fast).

### Mysterious Russian Magic

The only hint I can find in the literature on how it may be done comes from a Russian research group at the Keldysh Institute of Applied Mathematics. Over the last 20 years, this group has developed a series of parallel spacetime decomposition algorithms using a methodology called Locally-Recursive non-Locally Asynchronous (LRnLA) algorithms. LRnLA is not a single algorithm but a general guideline for designing them. Basically: (1) The tessellation should be a recursive process, so that the entire memory hierarchy is utilized (not necessarily automatically - manual parameter tuning for different machines is allowed, as long as the tiling algorithm itself generalizes). (2) Parallelism should exist between different tiles, and the dependency conflict problem should be solvable in some natural way. Using both requirements as a starting point, the researchers manually examine the stencil dependencies and use their intuition in solid geometry to design custom algorithms that satisfy these goals. Unlike polyhedral compilers, these are custom solutions designed by human domain experts for human use, with geometric interpretations that ease implementation (but only from the perspective of mathematicians and physicists...).

2D and 3D diamond tiling were known to them as the ConeTur algorithm, which was used in their first generation of HPC code in the late 1990s. The only visualization I could find appears in the paper that introduced the DiamondCandy algorithm, which depicts the ConeTur decomposition as a combination of pyramids, octahedrons and tetrahedrons.
Unfortunately, there's no step-by-step procedure on how it may be implemented, and I don't understand how these pyramids, octahedrons and tetrahedrons can be combined together. Furthermore, this algorithm is already obsolete from their perspective, so they likely have no interest in explaining it better in the future, as they have since developed even faster algorithms such as ConeFold, DiamondTorre, and DiamondCandy. Since these operate directly within 4D spacetime, they're even more difficult to understand. As far as I know, these algorithms are original and are not used or explained by anyone else. For a physicist who can already picture a 4D Minkowski spacetime and even do math inside it, they may be obvious; for everyone else, even the simplest case of ConeTur is difficult. This group is at least 10 years ahead of the rest of the world in this field of research.

### Conclusion

As a result, in practice diamond tiling is often applied to only one dimension, while the other dimensions are parallelized using parallelogram tiling or other more conventional methods. However, this greatly reduces the program's natural parallelism - a 2D 10x10 trapezoid tiling has 100 independent blocks, while a 1D tiling has only 10, making it unsuitable for massively parallel computation. Whether diamond tiling can be generalized to 2D and 3D for practical FDTD simulations is an open question; if you're good at solid geometry, you're welcome to attack this problem. Strictly speaking, it has already been solved, but the literature is unreadable for outsiders. Stencil computation is a classic field in HPC research, and tiling techniques have been studied since the 1970s. As a result, the papers are full of definitions, axioms, lemmas, and proofs about the difficult mathematical properties of these optimizations, or about a universal compiler framework that would work on any tile shape (e.g. see polyhedral compilers) - often without telling you what the tiles even look like graphically, because everyone working in the field already knows. There's a serious lack of introductory material on temporal tiling suitable for the purpose of human optimization, rather than code generation.

Update: I asked the question elsewhere. A polyhedral compiler researcher saw it and told me that even though I'm looking for intuitive geometric and algebraic interpretations suitable for hand coding, an automatic stencil or polyhedral compiler may still be helpful: it can be insightful to feed the 2D Jacobi case into a ready-made polyhedral compiler (with many options), dump the generated dependency chains or loop schedules, then visualize the output as a 3D model. If there's no alternative solution, I'm going to take a try at this approach.

### Solution of 2D Diamond Tiling

After looking at the ConeTur visualization again today, I finally found the key that leads to its solution: only 2D spacetime can be tessellated exclusively with diamonds. In 3D spacetime, at every pass, exactly one diamond can be created, not four; the remaining three all degenerate into trapezoids. Thus, the best one can do is create only one diamond per iteration - but this still gives a slight speedup over pure trapezoid tiling.

First, recall the 2D spacetime diagram of 1D diamond tiling at startup. The startup algorithm is based on the same Cartesian product trick (I'll show how the timestep misalignment is handled later): the early-phase and late-phase tiles of each axis are combined in all four ways.
As one can see, 3 of these 4 combinations have mismatched timesteps, which means those diamonds degenerate back into valleys; only the last combination forms a true diamond. When the diamond tiling finishes initialization, as usual, the original mountains take over the valleys' roles, so they too become diamonds, and the 2D spacetime diagram changes accordingly. The calculation then continues by repeating the same schedule on the new configuration.
Further calculations are possible by applying the algorithm to the symmetrical case. As we can see, in each iteration one (and only one) new group of diamonds is created. In the second and all future iterations (excluding the last, truncated one before stopping), the combinations from the original Cartesian product are reduced from 4 to 3.

With the theory of operation explained, the ConeTur visualization is now perfectly clear. All 4 combinations, with the exception of the last one, degenerate back into trapezoids. Thus, the main difficulty that seemed to prevent the generalization of diamond tiling to 2D - the "diamonds misaligned in time" - is avoided and shown to be non-existent: they are never created in the first place, due to the trapezoid degeneration. As a result, at every pass exactly one diamond can be created, not four; the remaining three can only be trapezoids. So the best one can do is create one diamond per iteration, all the time - which still gives a slight speedup over pure trapezoid tiling. In higher dimensions it becomes even more complicated, but assuming there's no control overhead (there will be), the savings can still be significant. This kind of behavior reminds me of Karatsuba multiplication, which also replaces 4 multiplications with 3.

This must also be why the ConeTur algorithm was soon abandoned by the Russian research group at the Keldysh Institute of Applied Mathematics - a straight generalization of diamond tiling to 2D and higher dimensions only has limited gain - which was the motivation for their later improvements. For example, ConeFold appears to be their second-generation algorithm and seems to be a modification of ConeTur that recombines the shapes, while DiamondTorre and DiamondCandy are completely new designs that use 2D as their starting point and are thus more efficient. Now the real problem is understanding ConeFold, which is not within the scope of this question.

### 2D diamond tiling visualization

Update: As usual in life, you can't find something when you need it the most, and it pops up later when you don't need it anymore... Just after solving the problem myself, while looking for something else, I found a detailed step-by-step description of the solution in a research paper, complete with 3D diagrams.
But at least I know that my solution was correct, and now there's a reference for future readers... Sigh...

### Insight behind the DiamondTorre algorithm

Update: I was finally able to work out a preliminary understanding of the key idea behind the mysterious DiamondTorre spacetime decomposition algorithm today. Unlike conventional diamond blocking, the DiamondTorre algorithm is a native 2D algorithm, not a direct generalization of the 1D algorithm. Thus, we apply diamond tiling to the XY plane (not the XT or YT plane) - that is, to the simulation space itself; see the figure below (only two layers of tiles are shown, and the current timestep of every cell is marked).

After applying diamond tiling, the processing "cursor" moves as a wavefront from right to left. In each stage, we select a column of diamond tiles along the same Y axis - different diamond tiles can be calculated in parallel, similar to the rules of 1D diamond tiling. For the first wavefront, we would select the green, red, pink and purple tiles and calculate each selected tile one timestep ahead. After finishing this step comes the key of the algorithm: shift each tile one unit to the right along the X axis (truncation is also allowed). One then finds that calculating a new timestep is possible within the shifted tiles - they perfectly avoid the grid cells that cannot be calculated due to missing dependencies, leaving only cells with complete dependencies. Similarly, the missed grid cells that cannot be calculated in the current wavefront will be "fixed" after a new wavefront is calculated and shifted (we're using double buffering with two timesteps, so it's legal to access a cell at the same timestep or a cell exactly one timestep behind us). For example, in the second wavefront we select the blue, purple and yellow tiles. More and more tile shifts become possible as the wavefront moves further to the left.

In 2D space, the partitioning of the space into diamond tiles only determines their initial positions; during execution they keep moving to the right. Meanwhile, in 3D XYT spacetime, sweeping a diamond tile over time creates a leaning tower (Torre) - the horizontal motion of the tile is just the projection of the 3D spacetime onto 2D space when the towers are viewed from the top. Thus, the DiamondTorre algorithm can be seen as a creative combination of two algorithms: 1D diamond tiling, and 1D parallelogram tiling with wavefront parallelism - this is the key insight for understanding it. This is the best explanation I currently have; refer to the DiamondTorre references at the end of the post for more information.

We can also see that DiamondTorre is inherently a 2D algorithm, and full parallelism only exists along the Y axis (wavefront parallelism along the X axis is still possible, at the expense of a pipelined startup). Thus, parallel scalability and efficiency are only maximized when one dimension is significantly longer than the others - which is exactly the HPC cluster test case shown in the papers. To overcome this problem, the team later proposed an even more powerful algorithm named DiamondCandy, which is a native 3D algorithm and is even more difficult to understand. If you're reading this and you're great at solid geometry, a tutorial-format explanation similar to my answer would be greatly appreciated...

### References

#### Basis of this article
#### Introduction to Stencils

#### Theory and Applications

#### Thesis

#### Case Studies

#### Mysterious Russian Magic

These results come from the Russian Keldysh Institute of Applied Mathematics. Based on their own design methodology called Locally-Recursive non-Locally Asynchronous (LRnLA) algorithms and their intuition in solid geometry, they've designed and programmed multiple spacetime domain decomposition algorithms by hand, all based on tessellations of 3D and 4D spacetime. The latest two generations of algorithms are the 2D1T DiamondTorre and the 3D1T DiamondCandy. As far as I know, these algorithms are original and differ greatly from everyone else's. Not only that, they seem to be able to solve any application using the same logic - FDTD for electromagnetism, LBM for fluid mechanics, the discontinuous Galerkin method, the Runge-Kutta method... You can find them by searching for the keyword LRnLA. I can't understand even one article. Here are all the descriptions I could find about these algorithms; if anyone understands them, please write an easy-to-follow tutorial in the style of my articles - it would be greatly appreciated.

##### DiamondCandy (latest)

##### DiamondTorre

##### Others
-
The following is the fifth article of the series *Notes on HPC Programming and openEMS Optimizations*.

## Annotated Source Code of openEMS's "SSE engine"

This article is an annotated walkthrough of the SSE engine's source file.

P.S.: I'm now experimenting with different vectorization methods, because I believe the memory transposition is unsuitable for tiling.
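For readers unfamiliar with this style of vectorization, here is a generic illustration of a packed field update using SSE intrinsics - hypothetical array names and a deliberately simplified update equation, not the actual engine code:

```cpp
#include <immintrin.h>
#include <cstddef>

// Four adjacent cells are packed into one 128-bit vector (SoA-style layout),
// so each iteration updates four cells at once. Curl terms are omitted for brevity.
void update_voltages_sse(float* volt, const float* curr,
                         const float* vv, const float* vi, size_t num_vectors)
{
    for (size_t n = 0; n < num_vectors; n++) {
        __m128 v = _mm_loadu_ps(&volt[4 * n]);   // four packed voltage cells
        __m128 c = _mm_loadu_ps(&curr[4 * n]);   // four packed current cells
        __m128 a = _mm_loadu_ps(&vv[4 * n]);     // per-cell update coefficients
        __m128 b = _mm_loadu_ps(&vi[4 * n]);
        v = _mm_add_ps(_mm_mul_ps(a, v), _mm_mul_ps(b, c));   // volt = vv*volt + vi*curr
        _mm_storeu_ps(&volt[4 * n], v);
    }
}
```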
-
This got me thinking that gprMax uses CUDA to accelerate EM simulations. See link.
-
As everyone may have noticed, over the past several months I've been investigating different potential methods to accelerate openEMS simulations in my spare time, including loop blocking, time skewing, NUMA awareness, GPU acceleration, better operator compression, and symbolic verification of optimization correctness.
So far I've been reluctant to publish any details from the project, since they're still early prototypes in the form of many micro-benchmarks (almost 100 code variants have been tested). If the openEMS project were a complete circuit board, then what I'm mostly doing right now is merely the characterization of some capacitors and resistors in a feedback loop - there's no new schematic or layout (yet). Thus, most of the code is not functional or usable. Moreover, a comprehensive description of all the results would take significant time and effort to write, similar to a thesis for a college project.
However, several members of the community have expressed great interest in participating in the optimization effort, but lack familiarity with the codebase and the relevant optimization techniques. Thus, I felt it would be useful to selectively publish my results as a series of short, self-contained research notes and tutorials. This avoids the difficulty of writing a full thesis; instead, this thread can be updated from time to time whenever I have the motivation to do so.
Table of Contents

Because this thread is long, with interleaved discussions, click the following links to jump straight to the topics: