This is the replication package containing code and experimental results for the paper "Abstractions for C++ Code Optimizations in Parallel High-performance Applications" submitted to the special issue of the Parallel Computing journal for The 15th International Workshop on Programming Models and Applications for Multicores and Manycores.
The paper presents a continuation of the work on the Noarr library, which can be found at https://github.com/ParaCoToUl/noarr-structures. The earlier work on this extension was presented in the PMAM 2024 workshop paper "Pure C++ Approach to Optimized Parallel Traversal of Regular Data Structures" (doi: 10.1145/3649169.3649247). The Noarr library itself was presented in the ICA3PP 2022 conference paper "Astute Approach to Handling Memory Layouts of Regular Data Structures" (doi: 10.1007/978-3-031-22677-9_27).
The abstractions extending the Noarr library are implemented in the following two repositories:
- Noarr Traversers: https://github.com/jiriklepl/noarr-structures
- Noarr Tuning: https://github.com/jiriklepl/noarr-tuning
- Overview - overview of the contents of the artifact
- Requirements - software requirements for the experiments
- Experiment reproduction - steps for reproducing the experiments
- Validation - steps for validating the implementations
- Code comparison - steps for comparing the code of the original Polybench/C benchmark with the Noarr implementation (the summarization is presented in the paper), and for comparing the code changes required to perform the tuning transformations implemented in PolybenchC-tuned on the algorithms in PolybenchC-pretune that are adjusted for the transformations
The artifact contains experimental results on various modifications of the following benchmark suites:

- Polybench/C 4.2.1
- PolyBench/GPU 1.0
The artifact uses the implementation of Noarr from https://github.com/jiriklepl/noarr-structures, which is continuously tested on various platforms using GitHub Actions. The tests cover the following compilers: GCC (10, 11, 12, 13), Clang (14, 15), NVCC (12), and MSVC (19) on Linux, macOS, and Windows.
The artifact structure:
- running-examples: Contains the examples of Noarr code presented in the paper.
- plots: Generated plots used in paper figures.
- results: CSV files with the measured wall-clock times, the code-comparison data, and archives of the raw measurements; also contains the log files with the summarized statistics of the code comparison and the autotuning results.
- relating to the Polybench/C-4.2.1 benchmark suite:
  - PolybenchC-4.2.1: The Polybench/C benchmark with a custom build script for convenience and `flatten` pragmas for consistency with Noarr.
  - PolybenchC-Noarr: Implementation of the Polybench/C benchmark using Noarr structures and Noarr Traversers.
  - PolybenchC-pretune: Modification of the algorithms from the Polybench/C benchmark that are subjected to tuning in PolybenchC-Noarr-tuned. Some of the loops are broken down into multiple loops to allow better traversal options.
  - PolybenchC-tuned: Tuned versions of four algorithms from PolybenchC-Noarr (PolybenchC-pretune). Each algorithm is taken from a different category of algorithms in the Polybench/C benchmark.
  - PolybenchC-Noarr-tuned: Implementation of the tuned algorithms from PolybenchC-tuned using Noarr Traversers.
  - PolybenchC-Noarr-tuning: Autotuning of the Polybench/C benchmarks (reimplemented with Noarr Structures and Noarr Traversers) using Noarr Tuning.
  - PolybenchC-tbb: Parallelization of four algorithms from PolybenchC-4.2.1 using Intel TBB.
  - PolybenchC-Noarr-tbb: Implementation of the parallelized algorithms from PolybenchC-tbb using Noarr Traversers.
  - PolybenchC-omp: Parallelization of four algorithms from PolybenchC-4.2.1 using OpenMP.
  - PolybenchC-Noarr-omp: Implementation of the parallelized algorithms from PolybenchC-omp using Noarr Traversers.
- relating to the PolyBench/GPU-1.0 benchmark suite:
  - PolyBenchGPU: The PolyBench/GPU benchmark with minor bug fixes (mostly relating to wrong arguments) and modified dataset sizes for the convenience of measurement.
  - PolyBenchGPU-Noarr: Implementation of the PolyBench/GPU benchmark using Noarr structures and Noarr Traversers.
- transformations.md: Presents a detailed list of transformations provided by Noarr Traversers.
- scripts: Contains the scripts used for running the experiments, validating the implementations, generating the plots presented in the paper, and producing further data from the collected measurements.
The artifact considers the following software requirements for the experiments:
- C++ compiler with support for C++20 - preferably GCC 12 or newer
- Intel TBB 2021.3 or newer
- NVCC 12 or newer
- CMake 3.10 or newer
- GNU `time` (only for autotuning)
- Python 3.6 or newer (only for autotuning)
The following software is required for the analysis of the results:
- `awk`
- R with the `ggplot2` and `dplyr` packages
- `gcc`, `clang`, `clang-format`
- standard tools: namely `wc`, `gzip`, `tar`, GNU `time`
The following software is required for the running examples presented in the paper:
- C++ compiler with support for C++20 - preferably GCC 12 or newer
- Intel TBB 2021.3 or newer
- NVCC 12 or newer
- CMake 3.10 or newer
- Python 3.6 or newer
The experiments can be reproduced using the following steps:
```sh
# clone the repository and enter it
git clone "https://github.com/jiriklepl/ParCo2024-artifact.git"
cd ParCo2024-artifact

# for the CPU experiments:
scripts/run-measurements-CPU.sh

# for the GPU experiments:
scripts/run-measurements-GPU.sh

# generate the plots presented in the paper:
scripts/generate_plots.sh

# measure and visualize the compile time comparison and autotuning results:
scripts/autotuning-test.sh
scripts/visualize-autotuning.sh

# -- optionally --
# generate additional visualizations of the measured wall-clock times and the code comparison:
scripts/more_visualizations.sh
```
In our laboratory cluster, we use the Slurm workload manager. Setting the `USE_SLURM` environment variable to `1` configures the scripts to use Slurm for running the experiments in the configuration that we used for the paper.
This configuration can be modified in the scripts invoked by scripts/run-measurements-CPU.sh, scripts/run-measurements-GPU.sh, and scripts/autotuning-test.sh.
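Inside the driver scripts, the toggle amounts to branching on the variable; a minimal sketch (the function name and messages are hypothetical, the actual scripts implement their own Slurm submission logic):

```sh
# Hypothetical sketch of gating on USE_SLURM; not the artifact's exact code.
run_measurements() {
    if [ "${USE_SLURM:-0}" = "1" ]; then
        echo "submitting via Slurm"   # e.g. wrap the benchmark jobs in sbatch
    else
        echo "running locally"
    fi
}
```

With the repository cloned, the Slurm path is selected as, e.g., `USE_SLURM=1 scripts/run-measurements-CPU.sh`.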
After running the experiments, the plots presented in the paper can be generated using scripts/generate_plots.sh and scripts/visualize-autotuning.sh. The resulting plots are stored in the plots directory. The script scripts/visualize-autotuning.sh also generates the file results/autotuning-report.log with the summarized statistics of the autotuning results presented in the paper.
This process also generates the corresponding CSV files with the measured wall-clock times in subdirectories of the results directory.
Additional visualizations of the measured wall-clock times and the code comparison can be generated by running the scripts/more_visualizations.sh script. The resulting plots are stored in the plots directory.
The validation of the implementations can be performed using the following steps:
```sh
# clone the repository and enter it
git clone "https://github.com/jiriklepl/ParCo2024-artifact.git"
cd ParCo2024-artifact

# for the CPU experiments:
scripts/validate-CPU.sh

# for more strict validation of the CPU experiments (runs ASan, UBSan, and LeakSanitizer):
scripts/sanitize-CPU.sh

# for the GPU experiments:
scripts/validate-GPU.sh
```
These scripts run the implementations of the Polybench/C benchmark (including the tuned, TBB-parallelized, and OpenMP-parallelized versions) and the PolyBench/GPU baseline algorithms and compare their outputs with their Noarr counterparts. They report any mismatches found in the outputs.
The validation scripts check whether the outputs of the implementations are identical (there is a zero threshold for the difference) on multiple datasets: SMALL, MEDIUM, LARGE, EXTRALARGE. The sanitizing script (scripts/sanitize-CPU.sh) runs the implementations on the SMALL dataset (and also compares the outputs).
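A zero threshold means two output dumps match only if they are byte-for-byte identical; such a check can be sketched with standard tools (hypothetical helper and file names, the actual scripts use their own mechanism):

```sh
# Zero-threshold comparison: succeeds (exit status 0) only if the two
# dump files are byte-for-byte identical. Hypothetical helper.
outputs_match() {
    cmp -s "$1" "$2"
}
```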
The code comparison can be performed using the following steps:

```sh
# clone the repository and enter it
git clone "https://github.com/jiriklepl/ParCo2024-artifact.git"
cd ParCo2024-artifact

mkdir -p results

# for the Polybench/C benchmark and the Noarr implementation:
scripts/code_compare.sh > "results/code_overall.log"

# for the tuning transformations related to PolybenchC-Noarr-tuned and PolybenchC-tuned:
scripts/compare_transformations.sh > "results/compare_transformations.log"
```
These scripts are designed to provide more insights into where the Noarr approach is more complex and where it saves some coding effort.
The code comparison can be performed by running scripts/code_compare.sh. It compares the code of the original Polybench/C benchmark and the Noarr implementation and outputs the differences into the file results/statistics.csv and the summarized statistics to the standard output (as shown in results/code_overall.log).
The code for hand-tuned transformations can be compared by running scripts/compare_transformations.sh. The results are shown in results/compare_transformations.log. The log file dumps the code changes required by the transformations for each algorithm and suite implementation and prints the summarized statistics for each implementation.
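The comparison relies on simple textual metrics (the analysis tools listed above include `wc` and `gzip`); conceptually, a gzip-based size metric can be sketched as follows (hypothetical helper, not the artifact's exact computation):

```sh
# Hypothetical sketch of a gzip-based code-size metric: the compressed
# size in bytes is a rough proxy for code complexity that discounts
# repetitive boilerplate.
gzipped_size() {
    gzip -c "$1" | wc -c
}
```

Comparing this metric between an original benchmark source and its Noarr counterpart gives one indication of where the Noarr approach adds or saves coding effort.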