Only OpenMP
, CMake 3.22
and C++ >= 11
are required to build the library.
- For those who want to use the library:
You can add the following lines to
git clone --recursive-submodules https://github.com/AMSC-24-25/20-fft-20-fft.git # rename folder clone mv 20-fft-20-fft signal_processing
CMakeLists.txt
:# suppose you have cloned the repository into an external folder add_subdirectory(external/signal_processing) target_link_libraries(your_executable_name PRIVATE signal_processing)
- For those who want to run examples or benchmarks:
git clone --recursive-submodules https://github.com/AMSC-24-25/20-fft-20-fft.git # rename folder clone mv 20-fft-20-fft signal_processing && cd signal_processing ./build-essentials.sh # to install the dependencies ./build-examples.sh # to build the examples ./build-benchmarks.sh # to build the benchmarks
The Signal Processing library is a C++ educational project implementing famous transform algorithms.
The transforms implemented in this repository are:
- The Fast Fourier Transform (FFT) and its inverse are both implemented in parallel using the OpenMP framework and in sequential. The implementation is based on the Cooley-Tukey algorithm. It supports N-dimensional signals, but obviously the signal length must be a power of 2 (radix-2 case).
- The Discrete Cosine Transform (DCT) and its inverse are both implemented in parallel using the OpenMP framework and in sequential. The DCT is implemented as DCT-II (type II), and its inverse is implemented as DCT-III (type III). It is believed to be used for image compression.
- The Haar Wavelet Transform (HWT) (only sequential).
Furthermore, the DCT and the HWT are implemented for image compression.
Important
The library itself has only been tested on Linux. It is not guaranteed to work on other operating systems. The benchmarks and examples may not work on Windows.
Clone the repository:
git clone --recursive-submodules https://github.com/AMSC-24-25/20-fft-20-fft.git
# rename folder clone
mv 20-fft-20-fft signal_processing
Now you are ready to go.
However, some prerequisites are very common. The library is written in C++11 and uses the OpenMP framework for parallelization. In order to use it, you need to have:
- A C++ compiler that supports the C++11 standard.
- Build tools are CMake
$\ge 3.22$ , Make, or Ninja (strongly recommended). - The OpenMP framework.
However, you can run the following script to install the dependencies (linux only):
./build-essentials.sh
Keep in mind that if you are a developer, you may already have the necessary dependencies installed because build-essential, CMake, and Ninja are very common packages.
Note
If you have already cloned the repository without the --recursive
flag,
you can clone the submodules using the following command (from the repository folder):
git submodule update --init --recursive
This project uses CMake Presets to simplify configuration.
To list available presets:
cmake --list-presets
The recommended way to build the library is to use the ninja-lib
preset.
This preset only builds the library and uses Ninja as the build system.
Therefore, no examples or benchmarks are built.
# assuming you are in the repository folder where the CMakeLists.txt file is located
# create a build directory
mkdir build && cd build
# configure the project using the ninja-lib preset
cmake .. --preset ninja-lib
cd ninja-lib
# build the project
ninja all
Other useful presets:
Preset Name | Description |
---|---|
ninja-dev |
Build with examples (requires OpenCV and Matplot++) |
ninja-dev-benchmark |
Build with benchmarks (requires Google Benchmark too) |
make-dev |
Same as ninja-dev but uses Makefiles instead of Ninja |
make-lib |
Build library only with Makefiles |
The presets are defined in the CMakePresets.json file.
The external libraries required for ninja-dev
, ninja-dev-benchmark
, and make-dev
can be installed using the corresponding scripts:
./build-examples.sh
./build-benchmarks.sh
These scripts will install the dependencies required for the examples and benchmarks, respectively. However, they also build the examples and benchmarks. You will find the examples and benchmarks in the build folder.
Warning
If you are using WSL (Windows Subsystem for Linux), the sh
scripts may not work due to
end-of-line issues.
This is because Windows uses CRLF
(Carriage Return + Line Feed) as the end-of-line character,
whereas Linux uses LF
(Line Feed) only.
For this reason, we recommend converting the end-of-line characters to Unix format using the following command:
# install dos2unix
sudo apt update && sudo apt install dos2unix
# convert all sh scripts to Unix format
dos2unix clean-all.sh build-*.sh
If you want to use the library in your own project, you can add the following lines to your CMakeLists.txt
file:
# suppose you have cloned the repository into an external folder
add_subdirectory(external/signal_processing)
target_link_libraries(your_executable_name PRIVATE signal_processing)
It will add the library to your project and link it to your executable. During the build process, it will also automatically issue a warning if some dependencies are missing.
The library is designed to be user-friendly.
Simply import the main header file, and you're ready to start using it.
For example, to use the FFT solver, follow these steps:
#include <iostream>
#include <cmath>
#include <complex>
#include <vector>
#include <signal_processing/signal_processing.hpp>
int main() {
// 2D signal, ordered in row-major order
const size_t dims = 2;
const size_t rows = 4;
const size_t cols = 4;
std::vector<std::complex<double>> rand_signal = {
{1.0, 0.0}, {2.0, 0.0}, {3.0, 0.0}, {4.0, 0.0},
{5.0, 0.0}, {6.0, 0.0}, {7.0, 0.0}, {8.0, 0.0},
{9.0, 0.0}, {10.0, 0.0}, {11.0, 0.0}, {12.0, 0.0},
{13.0, 0.0}, {14.0, 0.0}, {15.0, 0.0}, {16.0, 0.0}
};
// solver, where dims is the number of dimensions and it is a template parameter
sp::fft::solver::FastFourierTransform<dims> solver(std::array{rows, cols});
// solve in-place (sequential)
solver.compute(rand_signal, sp::fft::solver::ComputationMode::SEQUENTIAL);
// print the result
for (const auto& s : rand_signal) {
printf("(%.2f + %.2fi)", s.real(), s.imag());
// print a new line every 4 elements using the address of the element
if ((&s - &rand_signal[0] + 1) % 4 == 0) {
printf("\n");
} else {
printf(", ");
}
}
}
You can also use the parallel version of the FFT solver:
solver.compute(rand_signal, sp::fft::solver::ComputationMode::OPENMP);
Or solve the FFT not in-place:
std::vector<std::complex<double>> result(rand_signal.size());
// rand_signal will not be modified
solver.compute(rand_signal, result, sp::fft::solver::ComputationMode::SEQUENTIAL);
To restore the original signal, use the inverse FFT solver:
sp::fft::solver::InverseFastFourierTransform<dims> inverse_solver(std::array{rows, cols});
// solve in-place
inverse_solver.compute(rand_signal, sp::fft::solver::ComputationMode::SEQUENTIAL);
// print the result
for (const auto& s : rand_signal) {
printf("(%.2f + %.2fi)", s.real(), s.imag());
// print a new line every 4 elements using the address of the element
if ((&s - &rand_signal[0] + 1) % 4 == 0) {
printf("\n");
} else {
printf(", ");
}
}
Check the examples section for more examples.
The library also provides some utility functions to help you with the signal generation and saving.
- The signal_generator folder contains two classes that generate signals in the time and space domains.
- The signal_saver folder contains a class that saves the generated signal to a CSV file.
- The config_loader folder
contains a class that loads the configuration from a JSON file.
This class follows the JSON schema defined in the resources folder.
Example: JSON Configuration File
{ "signal_domain": "time", "signal_length": 2048, "hz_frequency": 5, "phase": 0, "noise": 5 }
An example of JSON file that describes a signal with the following characteristics:
- Signal Domain:
"time"
- The signal is represented in the time domain. - Signal Length:
2048
- The duration or length of the signal (number of samples). - Frequency:
5 Hz
- The frequency of the signal in Hertz (cycles per second). - Phase:
0
- The phase shift of the signal, which is 0 in this case. - Noise:
5
- The noise level or amplitude of noise in the signal.
In short, the configuration describes a time domain signal with a frequency of
5 Hz
, no phase shift, and an amount of noise (5
). - Signal Domain:
The Fourier transform is a mathematical operation that transforms a function of time (or space) into a function of frequency. In this library, we implement the Fast Fourier Transform (FFT) algorithm, and its inverse (IFFT) using the Cooley-Tukey algorithm (Radix-2 case).
The Cooley-Tukey algorithm is a (famous) algorithm for computing the Fast Fourier Transform (FFT) efficiently.
It reduces the computational complexity of the Discrete Fourier Transform (DFT) from
Under the hood, the Cooley-Tukey algorithm uses the divide-and-conquer strategy.
It recursively breaks a DFT of size
-
$X(k)$ is the$k$ -th component of the Discrete Fourier Transform (DFT), representing the frequency domain. -
$\displaystyle\sum_{n=0}^{N-1}$ is the summation over all$N$ input samples. -
$x(n)$ is the$n$ -th sample of the input signal in the time (or spatial) domain. -
$e^{(-2\pi i k n) \div N} = \exp((-2\pi i k n) \div N)$ is the complex exponential (twiddle factor) that maps the time-domain signal to the frequency domain. It introduces a phase shift and scales the input signal.
When
- Bit-reversal permutation: reorder the input array based on the bit-reversed indices (see also Wikipedia).
- Iteratively compute FFT:
- Start with 2-point DFTs.
- Merge results into 4-point DFTs, then 8-point, and so on, up to
$N$ -point.
Each stage computes butterfly operations,
which involve two elements
Finally, for the inverse FFT, the algorithm is similar but with a few differences:
- The direction of the twiddle factor in the inverse FFT is reversed.
- After computing the inverse FFT using Cooley-Tukey, we normalize the result.
Each element is divided by
$N$ , the size of the input.
This scaling ensures that the inverse transform truly inverts the original FFT and restores the original signal.
Warning
The Cooley-Tukey algorithm implemented in this repository is the Radix-2 case. So the signal length must be a power of 2.
The Discrete Cosine Transform Type-II (DCT-II) is the most commonly used variant of the DCT, particularly in image and video compression (e.g. JPEG). It is defined as:
Where:
-
$x_n$ : input signal (spatial or time domain) -
$X_k$ : DCT coefficient (frequency domain) -
$N$ : number of input samples -
$\alpha_k$ : normalization factor:
The cosine argument determines the basis functions. It’s the heart of the transform:
-
$k$ determines the frequency of the cosine wave: higher$k$ means more oscillations. -
$(2n+1)$ causes the cosine to sample at odd intervals, which makes it suitable for even symmetry extension (fundamental for avoiding boundary discontinuities in signals/images).
In other words, the DCT basis functions are cosine waves of increasing frequency.
The summation is an inner product between the input signal and the cosine basis function of frequency
Finally, the normalization factor
-
$\alpha_0 = \sqrt{\frac{1}{N}}$ gives the DC term (average value) less weight. -
$\alpha_k = \sqrt{\frac{2}{N}}$ for$k > 0$ keeps other frequencies properly scaled.
Finally, the DCT-III, which is the inverse of the DCT-II, is defined as:
In general, the DCT is used for compression because it focuses the signal's energy into a few coefficients. This allows for efficient representation and storage. In this library, we use DCT-II and DCT-III for image compression. These methods are supported by quantization, zigzag scanning, and RLE encoding (unfortunately Huffman encoding is not implemented yet).
The 1D Haar Wavelet Transform (HWT) is one of the simplest wavelet transforms. It is very useful for signal compression (our goal), denoising, and multi-resolution analysis.
The Haar transform converts a sequence of values into averages and differences, capturing both low-frequency (smooth) and high-frequency (detail) information.
The 1D Haar transform is defined as follows. Given an input vector:
The transform produces two outputs:
- Averages (low-frequency content):
- Details (high-frequency content):
Tip
For example, for
So the Haar transform of x
is:
Our implementation is a multilevel Haar transform, which means it is recursive. We apply the transform to the average coefficients only, creating a pyramid of resolutions. This is known as the Haar wavelet decomposition, and it outputs are:
- low-frequency coefficients (averages)
- high-frequency coefficients (details)
Tip
Taking in account the example above, the first level of the Haar transform (level 1) gives:
Then, you can apply the Haar transform again on the averages from level 1:
Again, compute average and detail (level 2):
After 2 levels, we get:
Where
Level 2: a00 <-- 1 value (most compressed)
/ \
Level 1: a0 a1 <-- 2 values
/ \ / \
Input: x0 x1 x2 x3 <-- 4 values
The 2D Haar Wavelet Transform is essentially just the 1D version applied twice: first across rows, then across columns. Given a 2D matrix (e.g., a grayscale image), we apply the 1D Haar transform to:
- Each row;
- Then each column of the result.
This gives a decomposition of the image into components that describe different frequency orientations. Finally, we apply the multilevel Haar transform recursively on each block to get a compressed representation.
Tip
For example, suppose we have a 4x4 matrix
We apply the 1D Haar transform to each row:
This gives us a new matrix with:
- Left half: row averages
- Right half: row details
Now we take this result and apply the 1D Haar transform to each column. This gives us a final matrix divided into four blocks:
[ LL | HL ]
[----+----]
[ LH | HH ]
Where:
LL
: low frequency in both directions (approximation).HL
: high frequency in rows, low in columns (horizontal details).LH
: low frequency in rows, high in columns (vertical details).HH
: high frequency in both directions (diagonal details).
The examples are located in the examples folder. To build the examples, run the following command:
./build-examples.sh
The examples need the following dependencies:
The FFT examples are:
-
fourier_transform_solver is the simplest one that shows how to use the FFT solver with a one-dimensional (1D) signal.
It generates a random signal using a JSON configuration file. It computes the FFT and IFFT in parallel and sequentially. It saves each result in a CSV file and plots the results using Matplot++ to demonstrate the difference between the FFT and IFFT.
-
fourier_transform_solver_image is an example that shows how to use the FFT solver with a two-dimensional (2D) signal.
This time, instead of generating a random signal, we will demonstrate how to load an RGB image from a file and apply the FFT to it. The results show that the FFT implementation can handle two-dimensional (2D) signals. We verified its correctness by comparing the results with the original image.
Original Image (2048 x 2048) Image after FFT and IFFT (2048 x 2048) -
fourier_transform_solver_performance is a simple example that demonstrates the performance of the FFT and inverse FFT algorithms.
Run it to see the performance of the FFT and inverse FFT algorithms. It generates a random signal with 8,388,608 (8 million) samples and computes the FFT and inverse FFT.
-
fourier_transform_solver_video shows how the FFT solver works well with 3D signals, too.
It uses the OpenCV library to read a video file and apply the FFT to each frame. The input video is RGB, and the output is grayscale. This avoids a long processing time.
The example video has a resolution of 512 x 1024 and contains 266 frames at 60 fps. Since the FFT algorithm is designed to work with signals that are powers of two, the video is cropped to 256 frames. These frames are reshaped into a 3D signal with dimensions of 512 x 1024 x 256 (134,217,728 samples, or 134 million).
Note: The converted video on GitHub was compressed to reduce its size. The original video is available here: cats_fft_ifft.mp4.
Original Video (266 frames x 512 x 1024) Video after FFT and IFFT (256 frames x 512 x 1024) cats_resize.mp4
cats_fft_ifft-comp.mp4
-
fourier_transform_solver_video_rgb is similar to the previous one, but it uses the RGB color space instead of grayscale.
The FFT is applied to each frame of the video, and the result is saved as a new video file.
The input and output videos are both RGB. The example video has a resolution of 512 x 1024 and contains 266 frames at 60 fps. Since the FFT algorithm is designed to work with powers of two, the video is cropped to 256 frames. The frames are then reshaped into a 3D signal with dimensions of 512 x 1024 x 256 (134,217,728 samples, or approximately 134 million).
Note: The converted video on GitHub was compressed to reduce its size. The original video is available here: cats_fft_ifft-rgb.mp4.
Original Video (266 frames x 512 x 1024) Video after FFT and IFFT (256 frames x 512 x 1024) cats_resize.mp4
cats_fft_ifft-rgb-comp.mp4
The Haar Wavelet Transform (HWT) examples are:
-
haar_wavelet_transform_1d is a simple example that shows how to use the Haar wavelet transform (HWT) solver with a one-dimensional signal.
A similar example is haar_wavelet_transform_2d. This example shows how to use the HWT solver with a two-dimensional (2D) signal.
Both examples generate a random signal and print the original signal, compute the HWT, and print the result.
-
haar_wavelet_transform is an example that shows how the HWT solver works with a two-dimensional (2D) signal, in this case an image.
It loads a grayscale image from a file and applies the HWT to it to compress the image. Finally, it restores the original image using the inverse HWT and saves it to a file. With different levels of compression, the image quality is preserved.
Compression is modified by adjusting the threshold for detail coefficients. This threshold determines which coefficients in the Haar wavelet transform are set to zero during compression.
- A higher threshold removes more coefficients, resulting in greater compression and loss of detail in the image. This can degrade the quality of the reconstructed image.
- A lower threshold retains more coefficients, preserving more detail but reducing the compression ratio.
We conducted a comprehensive set of benchmarks to evaluate the performance and scalability of our Fast Fourier Transform (FFT) implementation. These benchmarks cover 1D, 2D, and 3D FFTs and compare sequential and parallel (OpenMP) computation modes.
-
We generated random input signals of various sizes to ensure that the total number of elements was always a power of two, as required by the radix-2 Cooley-Tukey FFT.
-
We tested a wide range of input shapes for each dimension, from small to very large, to assess performance at both small and large scales.
Using a backtracking approach, we generated input shapes and benchmarked all possible combinations of sizes for each dimension that were less than or equal to 8,388,608
$\left(2^{23}\right)$ elements. -
Each benchmark was run in both sequential and OpenMP parallel modes with different thread counts to measure parallel speedup and efficiency.
-
The benchmarks were executed using the Google Benchmark framework, which provides reliable and statistically sound timing results.
We benchmarked the performance of our FFT implementation on a Lenovo ThinkPad T430:
- CPU: Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
- Thread(s) per core: 2
- Core(s) per socket: 4
- Socket(s): 1
- Stepping: 9
- CPU(s) scaling MHz: 56%
- CPU max MHz: 3200.0000
- CPU min MHz: 1200.0000
- Caches (sum of all):
- L1d: 128 KiB (4 instances)
- L1i: 128 KiB (4 instances)
- L2: 1 MiB (4 instances)
- L3: 6 MiB (1 instance)
- RAM: 2 x 8 GB DDR3 1600 MHz
- OS: Ubuntu 24.04 LTS
To conclude, the benchmarks demonstrate that:
- 1D FFT is the most parallel-friendly on this Intel i7-3632QM, showing strong speedup and high efficiency, especially with 2-4 threads.
- 2D FFT gains from parallelism for larger inputs, but memory usage and thread overhead limit its full potential.
- 3D FFT suffers from poor cache locality and higher overhead, making it the least efficient and least scalable among the three.
We benchmarked the performance of our FFT implementation on an HP Pavilion 15-cx0xxx:
- CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
- Thread(s) per core: 2
- Core(s) per socket: 6
- Socket(s): 1
- Stepping: 10
- CPU(s) scaling MHz: 40%
- CPU max MHz: 4100.0000
- CPU min MHz: 800.0000
- Caches (sum of all):
- L1d: 192 KiB (6 instances)
- L1i: 192 KiB (6 instances)
- L2: 1.5 MiB (6 instances)
- L3: 9 MiB (1 instance)
- RAM: 2 x 8 GB DDR4 2667 MHz
- OS: Ubuntu 24.04 LTS
In conclusion, the CPU scales well with 4 to 8 threads, depending on the dimensionality of the FFT. However, 12 threads (via Hyper-Threading) yield diminishing returns, often increasing variance and lowering efficiency. However, efficiency consistently improves with input size but never reaches ideal (100%) scaling, especially in higher-dimensional FFTs.
Compared to the ThinkPad T430, the HP Pavilion 15-cx0xxx has a higher raw speed, especially for larger input sizes and 1D/2D FFTs. However, the ThinkPad is more efficient at low thread counts, while the HP offers higher parallel capacity. Finally, both devices remain inefficient with 3D FFTs, though the HP handles it slightly better.
We benchmarked the performance of our FFT implementation on a Dell Inspiron 14 Plus:
- CPU: Intel(R) Core(TM) Ultra 7 155H CPU @ 1.40GHz (up to 4.80 GHz)
- Thread(s) per core: 2
- Core(s) per socket: 11
- Socket(s): 1
- Stepping: 4
- CPU max MHz: 4800.0000
- CPU min MHz: 700.0000
- Caches (sum of all):
- L1d: 528 KiB (11 instances)
- L1i: 704 KiB (11 instances)
- L2: 22 MiB (11 instances)
- L3: 24 MiB (1 instance)
- RAM: 2 x 16 GB LPDDR5X 6.400 MT/s integrated memory
- OS: Windows 11 with WSL2 (Ubuntu 24.04 LTS)
This hardware demonstrates excellent scaling, particularly for 1D and 2D FFTs. It even achieved strong 3D FFT throughput. The best results are obtained with 8 to 16 threads, where a balance between raw speed and stability is maintained. Using 22 threads provides a significant speed boost for very large data sets, but it results in diminished per-thread efficiency and increased variability.
In conclusion, Dell leads in scalability and peak performance, HP offers a balanced middle ground, and the ThinkPad, though limited in raw computing power, excels at small, efficient workloads. Each system demonstrates the direct impact of core count, cache size, and architectural design on the success of FFT parallelization.
-
✅ Highly scalable in 1D: The Cooley-Tukey radix-2 algorithm used here parallelizes well in 1D. With proper workload size and cache locality, it can achieve near-linear speedup, particularly when each thread has sufficient independent work to avoid contention.
-
❌ Diminishing returns with thread count: As thread count grows beyond physical cores, synchronization, false sharing, and cache thrashing begin to erode efficiency, particularly visible in 3D FFTs and with 22-thread runs.
-
❌ 3D FFTs are much harder to scale: The increase in data dimensionality introduces more complex memory access and greater working set sizes.
-
❌ Sequential bottlenecks remain: Some stages inherently require coordination or are not fully parallelizable with "simple" loop-level OpenMP.
The FFT algorithm itself is inherently parallel, especially in its divide-and-conquer form like Cooley-Tukey. However, the real-world scalability depends heavily on: input size, dimensionality, hardware cache structure, and the balance between thread granularity and workload size.