Notes on HPC Programming and openEMS Optimizations #154
Replies: 8 comments 30 replies
I'm very interested in your work and eventually would like to contribute to this effort when I "retire" next year. I recently attended an online seminar by the company behind Tidy3D, which sells FDTD EM simulation products aimed at photonic simulations. I did ask them how they're getting around the "memory wall", but I doubt they'll want to give me much information since it's likely proprietary.
-
Also, I found it amusing that a commercial FDTD product such as Tidy3D still uses a rectangular grid, just like openEMS. This makes me feel much better about openEMS. I've seen papers in which the authors used variable-sized triangular grids and other methods to improve on the inefficiencies of rectangular grids. I'm imagining that, since the timestep is set by the minimum grid dimension, a scheme such as triangular gridding for FDTD would significantly reduce the memory footprint but probably not bring great improvements in performance?
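For reference, the standard Courant (CFL) stability limit for FDTD on a rectangular grid ties the timestep to the smallest cell:

$$\Delta t \le \frac{1}{c\,\sqrt{\dfrac{1}{\Delta x^2}+\dfrac{1}{\Delta y^2}+\dfrac{1}{\Delta z^2}}}$$

so a single very small cell anywhere in the mesh limits the timestep of the entire simulation.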
-
Recently, some flawed openEMS engine benchmarks were attempted. The conclusions of those benchmarks were misleading due to a lack of knowledge of FDTD and openEMS's performance characteristics - this includes some of my own early benchmarks made in early 2023. Thus, to avoid a repetition of more misleading benchmarks, it's necessary to clarify some of the basic issues. The following is the second article of the series *Notes on HPC Programming and openEMS Optimizations*.

## Understanding the Performance Characteristics of FDTD and openEMS
-
The following is the third article of the series *Notes on HPC Programming and openEMS Optimizations*.

## A Short Note on Operator Compression

In FDTD, a simulation involves reading two types of data - electromagnetic field arrays and material property arrays. In openEMS, the electromagnetic fields are stored in the voltage and current arrays, while the material properties are baked into the operator (update coefficient) arrays. In practice, many simulations have large empty air boxes, or boxes filled with a uniform material. As a result, many values in the operator arrays are repeated many times, wasting memory bandwidth. Thus, openEMS removes duplicate operators using a simple de-duplication algorithm along the Z axis, then assigns an index to each of the remaining unique values. This process is called "Operator Compression".

Naturally, the question is how much improvement would be possible with a better de-duplication or compression algorithm - for example, @thliebig previously suggested a Run-Length Encoding compression method as a possible improvement. So here's a quick calculation. During the electric field update, the memory accesses include reads of the field arrays, reads of the (already compressed) operator data, and the write-back of the updated field, so the operator data accounts for only part of the total traffic.

Conclusion: only a 1.2x to 1.5x improvement is possible over the existing compression method, even with a 100% ideal compression algorithm. Optimizing operator compression further has diminishing returns and will only bring a very limited speedup, hence it's not recommended. It cannot replace the use of time tiling techniques for multi-timestep calculations (still a work in progress).

I've already experimented with a "3D block" based de-duplication algorithm in my rectangular tiling engine. Although the algorithm itself achieves better compression in some cases, in practice the performance is comparable to the official engine, because the gain is only barely able to overcome the tiling overhead.
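To illustrate the idea, here is a minimal conceptual sketch of a compressed update loop (hypothetical names and a simplified update equation, not the actual openEMS data structures):

```cpp
#include <cstdint>
#include <cstddef>

// Simplified: a real FDTD update needs more coefficients and the curl terms.
struct Coefficients { float vv, vi; };

void update_E_compressed(float* volt, const float* curr,
                         const uint32_t* op_index,        // one small index per cell
                         const Coefficients* unique_ops,  // de-duplicated coefficient table
                         size_t num_cells)
{
    for (size_t n = 0; n < num_cells; n++) {
        // The table of unique coefficients is small and stays in cache;
        // the field arrays (volt/curr) still dominate the DRAM traffic.
        const Coefficients& op = unique_ops[op_index[n]];
        volt[n] = op.vv * volt[n] + op.vi * curr[n];
    }
}
```

Even if `op_index` could be compressed away entirely, the field reads and writes would remain, which is where the 1.2x to 1.5x upper bound comes from.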
-
The following is the fourth article of the series *Notes on HPC Programming and openEMS Optimizations*. It describes the theory of operation of both my published Tiling engine and my unpublished, work-in-progress rewrite.

## Temporal Tiling: The Key to Fast FDTD Simulations, Explained

### Background

#### Stencil Computation

In scientific and engineering computing, simulating physical systems governed by Partial Differential Equations (PDEs) is a common task. These numerical PDE solvers usually calculate the state of the system at the next timestep from its state at the current timestep, cell by cell, using a fixed neighbor access pattern known as a stencil.

The following are some examples of stencils: (1) A Jacobi solver of the 2D heat equation depends on the cell itself and its top, bottom, left, and right neighbors, similar to the D-pad on a game controller (von Neumann neighborhood). (2) An implementation of Conway's Game of Life depends on the north, south, east, west, northeast, northwest, southeast and southwest neighbors (Moore neighborhood). (3) A 2D convolution kernel in image processing.

For FDTD, the stencil shape is more complex (because of the leapfrog update between the electric and magnetic fields), but it's otherwise rather similar to the 7-point 3D stencil - which is itself a generalization of the 3-point 1D stencil. Thus, for clarity, the rest of this article focuses on the basic 3-point 1D stencil (plus its generalizations to higher dimensions, and the necessary modifications for FDTD), as shown below. We also assume all elements outside the boundary are 0 (Dirichlet boundary condition).
A naive implementation looks something like the following minimal sketch (a generic 3-point averaging kernel for illustration, not the exact code from the original post):
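```cpp
#include <vector>

// Naive 3-point 1D stencil: every timestep makes a full pass over the whole array.
// Boundary cells are never updated (Dirichlet boundary, assumed to be 0).
void naive_stencil(std::vector<float>& a, int timesteps)
{
    std::vector<float> b(a);                       // double buffer
    for (int t = 0; t < timesteps; t++) {
        for (size_t i = 1; i + 1 < a.size(); i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
        std::swap(a, b);                           // new values become the current state
    }
}
```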
### Memory Wall

It's a well-known fact that numerical simulations based on stencil computations have an extreme memory bandwidth bottleneck, especially in a naive implementation. The naive implementation creates a tremendous amount of DRAM traffic. Consider a 1 GB simulation domain with 10000 timesteps on a desktop computer with 40 GB/s memory bandwidth (dual-channel DDR4-3200). Because the simulation domain is too large to fit in cache, each timestep involves a full memory scan: reading and writing the whole domain moves roughly 2 GB per timestep, so 10000 timesteps move about 20 TB of data, which takes on the order of 500 seconds at 40 GB/s no matter how fast the CPU is. The arithmetic intensity is extremely low, and the roofline model shows the achievable performance sits far below the CPU's peak FLOPS.

We can visualize the iteration space of the loop using a 2D spacetime diagram (1D space + 1D time), where each color represents a unit of work.

In ordinary programming, loop tiling (loop blocking) is the standard solution to this problem: the array is broken into multiple rectangular blocks and calculated tile by tile. However, in the field of stencil computation there are several unique challenges. First, conventional loop tiling is spatial (space) tiling, but a physics simulation often applies just a single operation to each tile, not multiple operations, so there is little to no data reuse even if the loop is blocked. Instead, we have to use temporal (time) tiling to calculate multiple timesteps at once. Next, stencil computations have data dependencies that extend beyond the tile itself into its neighboring tiles, so preserving the correct data dependencies becomes the real problem. In particular, time tiling means the simulation domain becomes asynchronous - every element can sit at a different timestep. Without an appropriate tile shape that brings natural synchronization, this would be a nightmare to handle.

### Parallelogram Tiling

Historically, parallelogram tiling was the first solution to this problem. First, the space is broken into small rectangular tiles. Then, at each timestep, the range of updated elements shrinks by one element on the right, because that element cannot be updated due to missing dependencies; on the other hand, the range also grows by one element on the left. By using a sliding window of two tiles, updating the next tile automatically "fixes" the elements with lagging timesteps on the left. This tiling algorithm is visualized in the following spacetime diagram.

One can see the desirable properties of this tiling: each tile performs many timesteps of work while its data stays in cache, the memory traversal remains essentially sequential, and no work is ever recomputed. A double-buffered 1D sketch of this scheme is shown below.
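The following is hypothetical illustration code (same 3-point averaging stencil and zero boundaries as before), not the openEMS engine:

```cpp
#include <vector>
#include <algorithm>

// 1D parallelogram (time-skewed) tiling with double buffering. Tiles are processed
// strictly left to right; within a tile, the updated range slides one cell to the
// left per timestep, so one pass over a cache-sized tile advances it many timesteps.
void skewed_stencil(std::vector<float>& even, int timesteps, long tile_width)
{
    const long n = (long)even.size();
    std::vector<float> odd(even);                  // buffer holding the odd timesteps
    std::vector<float>* buf[2] = { &even, &odd };

    for (long x0 = 0; x0 < n + timesteps; x0 += tile_width) {
        for (int t = 0; t < timesteps; t++) {
            const std::vector<float>& src = *buf[t % 2];
            std::vector<float>&       dst = *buf[(t + 1) % 2];
            long lo = std::max(1L, x0 - t);                     // grows to the left
            long hi = std::min(n - 1, x0 + tile_width - t);     // shrinks on the right
            for (long i = lo; i < hi; i++)
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
        }
    }
    if (timesteps % 2)                             // final values end up in the odd buffer
        even = odd;
}
```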
#### Modification for FDTD

For 1D FDTD, the E field has H field dependencies on the left, so a small modification makes the tiling suitable for FDTD (diagram omitted here). Note that for simplicity, only the updated field is labeled in the diagram, but both the E and the H fields are accessed in each update step. Also note that in my published tiling code, 3D parallelogram tiling was incorrectly called "trapezoid tiling"; this was a mistake due to my confusion of terminology during early development.

#### Fatal Flaw: No Parallelism

However, this tiling algorithm also has a fatal flaw: the lack of inter-tile parallelism. To preserve the dependencies, all tiles must be processed from left to right. There are workarounds, but none of them recovers full parallelism, which motivates the trapezoid tiling described next.
### Trapezoid Tiling

To overcome the lack of parallelism in parallelogram tiling, researchers invented an alternative tiling algorithm known as trapezoid tiling (or trapezoidal tiling). The innovation is that, instead of using a single tile shape, it uses split-tiling with two different tile shapes: a trapezoid that shrinks over time, known as a "mountain", and a trapezoid that grows over time, known as a "valley". The 1D case is visualized in the following 2D spacetime diagram.

As one can see, different mountains are completely independent and can be processed in parallel, because the trapezoid tiles provide a natural spacing that avoids the dependencies. After all mountains are processed, all valleys are then processed in parallel (in the picture, valleys at the boundary can either be treated as incomplete valleys, or as part of the first two mountains). In the 1D case, only two synchronizations are needed for many timesteps, providing great parallelization on massively parallel computers.

#### Modification for FDTD

Similar to parallelogram tiling, trapezoid tiling can also be modified for FDTD (diagram omitted here). Note that in the research paper by Fukaya, T., & Iwashita, T., "trapezoid tiling" and "diamond tiling" were used interchangeably. Strictly speaking, this is an abuse of terminology which should have been avoided: diamond tiling is an extension of trapezoid tiling and should not be confused with the original algorithm. My published tiling code also refers to this method as "diamond tiling", which is incorrect and stems from the same confusion of terminology.

#### Generalization to Higher Dimensions

Trapezoid tiling generalizes naturally to higher dimensions. For example, in 2D, each dimension can be considered as 1D separately, and the 1D tiles from each dimension are combined to form a 2D plane (and a 3D spacetime), or a 3D cube (and a 4D spacetime). Reasoning about this geometrically can be extremely difficult or impossible, but fortunately it's easy to reason about algebraically: the order of combinations follows the Cartesian product of the tile types from each dimension. To generalize trapezoid tiling to 2D space (3D spacetime), one processes the four stages mountain x mountain, valley x mountain, mountain x valley, and valley x valley. In other words, 2D needs 4 stages of processing, and 3D needs 8 stages. A minimal 1D sketch of the mountain and valley passes is shown below.
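Here is the corresponding minimal 1D sketch (hypothetical helper code under the same assumptions as the earlier sketches; it requires `2 * timesteps <= tile_width` so the mountains never shrink below zero width):

```cpp
#include <vector>
#include <algorithm>

// 1D trapezoid (split) tiling: phase 1 computes all "mountains", which shrink by one
// cell on each side per timestep and are fully independent of each other; phase 2
// computes all "valleys", which grow by one cell per side and fill the gaps between
// the mountains. Each phase is embarrassingly parallel across its outer loop.
void trapezoid_stencil(std::vector<float>& even, int timesteps, long tile_width)
{
    const long n = (long)even.size();
    std::vector<float> odd(even);
    std::vector<float>* buf[2] = { &even, &odd };

    for (long x0 = 0; x0 < n; x0 += tile_width)        // phase 1: mountains (parallel)
        for (int t = 0; t < timesteps; t++) {
            const std::vector<float>& src = *buf[t % 2];
            std::vector<float>&       dst = *buf[(t + 1) % 2];
            long lo = std::max(1L, x0 + t);
            long hi = std::min(n - 1, x0 + tile_width - t);
            for (long i = lo; i < hi; i++)
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
        }

    for (long c = 0; c <= n; c += tile_width)          // phase 2: valleys (parallel)
        for (int t = 0; t < timesteps; t++) {
            const std::vector<float>& src = *buf[t % 2];
            std::vector<float>&       dst = *buf[(t + 1) % 2];
            long lo = std::max(1L, c - t);
            long hi = std::min(n - 1, c + t);
            for (long i = lo; i < hi; i++)
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
        }

    if (timesteps % 2)                                 // final values end up in the odd buffer
        even = odd;
}
```

The 2D and 3D generalizations apply this per-axis structure and combine the phases in the Cartesian-product order described above.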
Note that there's only a single time axis shared by all dimensions, and the timesteps of different tiles are always aligned. For example, in 3D space, the x, y and z ranges grow or shrink simultaneously at each timestep, which allows the clean generalization to higher-dimensional space.

#### Fatal Flaw: Redundant Memory Accesses

Unfortunately, as great as trapezoid tiling initially seems, it also has a fatal flaw: redundant memory accesses in the overlapped regions of different tiles. To see how this happens, recall the 2D spacetime diagram of 1D trapezoid tiling. Although each tile has a trapezoidal shape in spacetime, the tiles are overlapping 1D line segments in terms of memory access. In other words, trapezoid tiling creates parallelism at the expense of redundant memory accesses. In fact, calculating as many timesteps as possible per tile creates the highest overhead from redundant traffic, because in that case the "valleys" in the middle mainly exist to fix up the mess left by the existing work; they themselves get no chance to perform productive original work (sounds like an office-space analogy...). The overlapped region can be reduced to minimize the waste - for example by limiting the number of timesteps per tile or by enlarging the spatial tiles - but only at the cost of temporal reuse.
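As a rough back-of-the-envelope model (my own estimate, not a number from the original benchmarks): if a tile spans $W$ cells per dimension and computes $T$ timesteps, its read footprint grows to roughly $W + 2T$ cells per dimension, so the relative memory traffic overhead scales like

$$\left(1 + \frac{2T}{W}\right)^{d} - 1$$

in $d$ space dimensions.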
In 1D space, the overlapped memory accesses are only a small overhead, but the overhead grows rapidly in higher dimensions - roughly quadratically in 2D space and cubically in 3D space. For example, if the simulation domain is split into 10 non-overlapped rectangles and around 20 overlapped rectangles in the worst case (common when each tile is small due to cache size limitations), there's an 800% memory traffic penalty, which is sufficient to eliminate any benefit from trapezoid tiling. The penalty of 2D trapezoid tiling is smaller but still significant, especially on GPUs with their tiny local memories. In the currently published tiling code, trapezoid tiling is applied to the X and Y dimensions with a hardcoded tile size of 10, while parallelogram tiling is applied to the Z dimension. This (combined with operator compression) is sufficient to explain the inconsistent speedups. I plan to address the problem in the future.

### Diamond Tiling and Hexagon Tiling

To overcome the problem of redundant memory accesses in trapezoid tiling, researchers later proposed a small modification of trapezoid tiling, known as diamond tiling. The insight of diamond tiling is that when a valley finishes its calculation, it already contains all the data dependencies needed to immediately start constructing another mountain on top of it. After this step, diamond tiling finishes initialization: all mountains and all valleys swap their roles, and since each valley together with the new mountain stacked on top of it forms a diamond, we now have a simulation domain tiled exclusively by diamonds. There are two types of diamonds, one type running several timesteps ahead of the other; I'll call them "slow diamonds" and "fast diamonds". To terminate diamond tiling, the last diamond tiles are truncated in the middle at the final timestep, so the diamonds degenerate back into ordinary mountains or valleys (omitted in the diagram); this is essentially the startup process in reverse. Also note that in the special case where the diamond begins and ends with a width of more than one element, the scheme is also known as hexagon tiling (hexagonal tiling).

So diamond tiling solves the problem of redundant memory access overhead in a simple and elegant manner - the problem is solved, right?

#### Fatal Flaw: Problematic Generalization to Higher Dimensions

Unfortunately, the literature says that diamond tiling cannot be generalized to higher-dimensional space in a clean manner. Personally, I don't even know how it can be done at all. The difficulty is that in diamond tiling, tiles are misaligned on the time axis, so the Cartesian product trick no longer works. During startup, suppose we attempt to process the tiling with the straightforward Cartesian-product schedule, combining the early-phase and late-phase tiles of each axis in all four ways.
As one can see, 3 of these 4 combinations have mismatched timesteps, which means those diamonds degenerate back into mountains; only the last combination forms a true diamond. After diamond tiling finishes initialization and is fully started, the dependencies become even more puzzling, because the slow and fast diamonds of each axis can again be combined in four ways.
At this point, the tile dependencies become extremely problematic and puzzling: only two combinations are time-aligned, namely the ones where both axes are in the same phase (slow with slow, and fast with fast).

### Mysterious Russian Magic

The only hint I can find in the literature on how it may be done comes from a Russian research group at the Keldysh Institute of Applied Mathematics. Over the last 20 years, this group has developed a series of parallel spacetime decomposition algorithms using a methodology called Locally-Recursive non-Locally Asynchronous (LRnLA) algorithms. LRnLA is not a single algorithm but a general guideline for designing them. Basically: (1) The tessellation should be a recursive process, so that the entire memory hierarchy is utilized (not necessarily automatically - manual parameter tuning for different machines is allowed, as long as the tiling algorithm itself generalizes). (2) Parallelism should exist between different tiles, and the dependency conflict problem should be solvable in some natural way. Using both requirements as a starting point, the researchers manually examine the stencil dependencies and use their intuition in solid geometry to design custom algorithms that satisfy these goals. Unlike polyhedral compilers, these are custom solutions designed by human domain experts for human use, with geometric interpretations that ease implementation (but only from the perspective of mathematicians and physicists...).

2D and 3D diamond tiling were known to them as the ConeTur algorithm, which was used in their first generation of HPC code in the late 1990s. The only visualization I could find appears in the paper that introduced the DiamondCandy algorithm, which depicts the ConeTur decomposition as a combination of pyramids, octahedrons and tetrahedrons.
Unfortunately, there's no step-by-step procedure on how it may be implemented, and I don't understand how these pyramids, octahedrons and tetrahedrons can be combined together. Furthermore, this algorithm is already obsolete from their perspective, so they likely have no interest in explaining it better in the future, as they have since developed even faster algorithms such as ConeFold, DiamondTorre, and DiamondCandy. Since these operate directly within 4D spacetime, they're even more difficult to understand. As far as I know, these algorithms are original and are not used or explained by anyone else. For a physicist who can already picture a 4D Minkowski spacetime and even do math inside it, they may be obvious; for everyone else, even the simplest case of ConeTur is difficult. This group is at least 10 years ahead of the rest of the world in this field of research.

### Conclusion

As a result, in practice diamond tiling is often applied to only one dimension, while the other dimensions are parallelized using parallelogram tiling or other more conventional methods. However, this greatly reduces the program's natural parallelism - a 2D 10x10 trapezoid tiling has 100 independent blocks, while a 1D tiling has only 10, making it unsuitable for massively parallel computation. Whether diamond tiling can be generalized to 2D and 3D for practical FDTD simulations is an open question; if you're good at solid geometry, you're welcome to attack this problem. Strictly speaking, it has already been solved, but the literature is unreadable for outsiders. Stencil computation is a classic field in HPC research, and tiling techniques have been studied since the 1970s. As a result, the papers are full of definitions, axioms, lemmas, and proofs about the difficult mathematical properties of these optimizations, or about a universal compiler framework that would work on any tile shape (e.g. see polyhedral compilers) - often without telling you what the tiles even look like graphically, because everyone working in the field already knows. There's a serious lack of introductory material on temporal tiling suitable for the purpose of human optimization, rather than code generation.

Update: I asked the question elsewhere. A polyhedral compiler researcher saw it and told me that even though I'm looking for intuitive geometric and algebraic interpretations suitable for hand coding, an automatic stencil or polyhedral compiler may still be helpful: it can be insightful to feed the 2D Jacobi case into a ready-made polyhedral compiler (with many options), dump the generated dependency chains or loop schedules, then visualize the output as a 3D model. If there's no alternative solution, I'm going to take a try at this approach.

### Solution of 2D Diamond Tiling

After looking at the ConeTur visualization again today, I finally found the key that leads to its solution: only 2D spacetime can be tessellated exclusively with diamonds. In 3D spacetime, at every pass, exactly one diamond can be created, not four; the remaining three all degenerate into trapezoids. Thus, the best one can do is create only one diamond per iteration - but this still gives a slight speedup over pure trapezoid tiling.

First, recall the 2D spacetime diagram of 1D diamond tiling at startup. The startup algorithm is based on the same Cartesian product trick (I'll show how the timestep misalignment is handled later): the early-phase and late-phase tiles of each axis are combined in all four ways.
As one can see, 3 of these 4 combinations have mismatched timesteps, which means those diamonds degenerate back into valleys; only the last combination forms a true diamond. When the diamond tiling finishes initialization, as usual, the original mountains take over the valleys' roles, so they too become diamonds, and the 2D spacetime diagram changes accordingly. The calculation then continues by repeating the same schedule on the new configuration.
Further calculations are possible by applying the algorithm to the symmetrical case. As we can see, in each iteration one (and only one) new group of diamonds is created. In the second and all future iterations (excluding the last, truncated one before stopping), the combinations from the original Cartesian product are reduced from 4 to 3.

With the theory of operation explained, the ConeTur visualization is now perfectly clear. All 4 combinations, with the exception of the last one, degenerate back into trapezoids. Thus, the main difficulty that seemed to prevent the generalization of diamond tiling to 2D - the "diamonds misaligned in time" - is avoided and shown to be non-existent: they are never created in the first place, due to the trapezoid degeneration. As a result, at every pass exactly one diamond can be created, not four; the remaining three can only be trapezoids. So the best one can do is create one diamond per iteration, all the time - which still gives a slight speedup over pure trapezoid tiling. In higher dimensions it becomes even more complicated, but assuming there's no control overhead (there will be), the savings can still be significant. This kind of behavior reminds me of Karatsuba multiplication, which also replaces 4 multiplications with 3.

This must also be why the ConeTur algorithm was soon abandoned by the Russian research group at the Keldysh Institute of Applied Mathematics - a straight generalization of diamond tiling to 2D and higher dimensions only has limited gain - which was the motivation for their later improvements. For example, ConeFold appears to be their second-generation algorithm and seems to be a modification of ConeTur that recombines the shapes, while DiamondTorre and DiamondCandy are completely new designs that use 2D as their starting point and are thus more efficient. Now the real problem is understanding ConeFold, which is not within the scope of this question.

### 2D diamond tiling visualization

Update: As usual in life, you can't find something when you need it the most, and it pops up later when you don't need it anymore... Just after solving the problem myself, while looking for something else, I found a detailed step-by-step description of the solution in a research paper, complete with 3D diagrams.
But at least I know that my solution was correct, and now there's a reference for future readers... Sigh...

### Insight behind the DiamondTorre algorithm

Update: I was finally able to work out a preliminary understanding of the key idea behind the mysterious DiamondTorre spacetime decomposition algorithm today. Unlike conventional diamond blocking, the DiamondTorre algorithm is a native 2D algorithm, not a direct generalization of the 1D algorithm. Thus, we apply diamond tiling to the XY plane (not the XT or YT plane) - that is, to the simulation space itself; see the figure below (only two layers of tiles are shown, and the current timestep of every cell is marked).

After applying diamond tiling, the processing "cursor" moves as a wavefront from right to left. In each stage, we select a column of diamond tiles along the same Y axis - different diamond tiles can be calculated in parallel, similar to the rules of 1D diamond tiling. For the first wavefront, we would select the green, red, pink and purple tiles and calculate each selected tile one timestep ahead. After finishing this step comes the key of the algorithm: shift each tile one unit to the right along the X axis (truncation is also allowed). One then finds that calculating a new timestep is possible within the shifted tiles - they perfectly avoid the grid cells that cannot be calculated due to missing dependencies, leaving only cells with complete dependencies. Similarly, the missed grid cells that cannot be calculated in the current wavefront will be "fixed" after a new wavefront is calculated and shifted (we're using double buffering with two timesteps, so it's legal to access a cell at the same timestep or a cell exactly one timestep behind us). For example, in the second wavefront we select the blue, purple and yellow tiles. More and more tile shifts become possible as the wavefront moves further to the left.

In 2D space, the partitioning of the space into diamond tiles only determines their initial positions; during execution they keep moving to the right. Meanwhile, in 3D XYT spacetime, sweeping a diamond tile over time creates a leaning tower (Torre) - the horizontal motion of the tile is just the projection of the 3D spacetime onto 2D space when the towers are viewed from the top. Thus, the DiamondTorre algorithm can be seen as a creative combination of two algorithms: 1D diamond tiling, and 1D parallelogram tiling with wavefront parallelism - this is the key insight for understanding it. This is the best explanation I currently have; refer to the DiamondTorre references at the end of the post for more information.

We can also see that DiamondTorre is inherently a 2D algorithm, and full parallelism only exists along the Y axis (wavefront parallelism along the X axis is still possible, at the expense of a pipelined startup). Thus, parallel scalability and efficiency are only maximized when one dimension is significantly longer than the others - which is exactly the HPC cluster test case shown in the papers. To overcome this problem, the team later proposed an even more powerful algorithm named DiamondCandy, which is a native 3D algorithm and is even more difficult to understand. If you're reading this and you're great at solid geometry, a tutorial-format explanation similar to my answer would be greatly appreciated...

### References

#### Basis of this article
#### Introduction to Stencils

#### Theory and Applications

#### Thesis

#### Case Studies

#### Mysterious Russian Magic

These results come from the Russian Keldysh Institute of Applied Mathematics. Based on their own design methodology called Locally-Recursive non-Locally Asynchronous (LRnLA) algorithms and their intuition in solid geometry, they've designed and programmed multiple spacetime domain decomposition algorithms by hand, all based on tessellations of 3D and 4D spacetime. The latest two generations of algorithms are the 2D1T DiamondTorre and the 3D1T DiamondCandy. As far as I know, these algorithms are original and differ greatly from everyone else's. Not only that, they seem to be able to solve any application using the same logic - FDTD for electromagnetism, LBM for fluid mechanics, the discontinuous Galerkin method, the Runge-Kutta method... You can find them by searching for the keyword LRnLA. I can't understand even one article. Here are all the descriptions I could find about these algorithms; if anyone understands them, please write an easy-to-follow tutorial in the style of my articles - it would be greatly appreciated.

##### DiamondCandy (latest)

##### DiamondTorre

##### Others
-
The following is the fifth article of the series *Notes on HPC Programming and openEMS Optimizations*.

## Annotated Source Code of openEMS's "SSE engine"

This article is an annotated walkthrough of the SSE engine's source file.

P.S.: I'm now experimenting with different vectorization methods, because I believe the memory transposition is unsuitable for tiling.
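For readers unfamiliar with this style of vectorization, here is a generic illustration of a packed field update using SSE intrinsics - hypothetical array names and a deliberately simplified update equation, not the actual engine code:

```cpp
#include <immintrin.h>
#include <cstddef>

// Four adjacent cells are packed into one 128-bit vector (SoA-style layout),
// so each iteration updates four cells at once. Curl terms are omitted for brevity.
void update_voltages_sse(float* volt, const float* curr,
                         const float* vv, const float* vi, size_t num_vectors)
{
    for (size_t n = 0; n < num_vectors; n++) {
        __m128 v = _mm_loadu_ps(&volt[4 * n]);   // four packed voltage cells
        __m128 c = _mm_loadu_ps(&curr[4 * n]);   // four packed current cells
        __m128 a = _mm_loadu_ps(&vv[4 * n]);     // per-cell update coefficients
        __m128 b = _mm_loadu_ps(&vi[4 * n]);
        v = _mm_add_ps(_mm_mul_ps(a, v), _mm_mul_ps(b, c));   // volt = vv*volt + vi*curr
        _mm_storeu_ps(&volt[4 * n], v);
    }
}
```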
-
This got me thinking that gprMax uses CUDA to accelerate EM simulations. See link.
-
As everyone may have noticed, over the past several months I've been investigating different potential methods to accelerate openEMS simulations in my spare time, including loop blocking, time skewing, NUMA awareness, GPU acceleration, better operator compression, and symbolic verification of optimization correctness.
So far I've been reluctant to publish any details from the project, since they're still early prototypes in the form of many micro-benchmarks (almost 100 code variants have been tested). If the openEMS project were a complete circuit board, then what I'm mostly doing right now is merely the characterization of some capacitors and resistors in a feedback loop - there's no new schematic or layout (yet). Thus, most of the code is not functional or usable. Moreover, a comprehensive description of all the results would take significant time and effort to write, similar to a thesis for a college project.
However, several members of the community have expressed great interest in participating in the optimization effort, but lack familiarity with the codebase and the relevant optimization techniques. Thus, I felt it would be useful to selectively publish my results as a series of short, self-contained research notes and tutorials. This avoids the difficulty of writing a full thesis; instead, this thread can be updated from time to time whenever I have the motivation to do so.
Table of Contents

Because this thread is long, with interleaved discussions, click the following links to jump straight to the topics: