
Union of Z-Polyhedra (UZP) Sparse Polyhedral Format

⚠️ WARNING This is a preview README for the full repository.

For now, this repository contains software related to the artifact for the paper "Modular Construction and Optimization of the UZP Sparse Format for SpMV on CPUs". In the future it will also contain the full format specification, as well as other related software and toolchain components.

PLDI'25 Artifacts

Refer to pldi25_artifact.txt

Download and installation instructions

The artifact is encapsulated in a Singularity container, included in the Zenodo package with DOI 10.5281/zenodo.15048006. The artifact has been tested with Singularity CE v4.2.2 and v4.3.0, on both Arch Linux- and Debian-based systems.

The scripts contained in the artifact compile new executables and generate files in order to function, and therefore need the container to be writable. We suggest using the following commands to start an interactive session inside the container and access the artifact folder, located in /uzp-artifact/:

$ singularity build --sandbox pldi25-uzp-artifact/ pldi25-uzp-artifact.sif
$ sudo singularity shell --writable pldi25-uzp-artifact/
Singularity> cd /uzp-artifact/

IMPORTANT: the artifact should be run as root to ensure appropriate permissions, particularly when running with numactl and hugectl.

Contents

This artifact contains several different tools and scripts to reproduce the results in the submitted paper. The main results in Sec. 5, i.e., experimentation with SpMV kernels, can be reproduced using the scripts found in the /uzp-artifact/spmv-executors/ folder, described in more detail below. Section Running the SpMV executors gives details about how to reproduce experiments. Section Experimental setup details the experimental setup used in the submitted paper. Note that it is fully expected that using a different setup (e.g., a different CPU) will result in different performance profiles.

For the reviewer wishing to experiment further with the toolchain, the artifact also includes the Z-Polyhedrator tool, which creates UZP files from MTX representations, and two example tuners that modify the layout of UZP files to, e.g., achieve optimized results for specific computational kernels. The Z-Polyhedrator tool and the tuners are found in the /uzp-artifact/z_polyhedrator/ and /uzp-artifact/uzp-tuners/ folders, respectively. Sections Customizing the pattern-mining process and Customizing the UZP tuning process describe how to customize the matrix creation and tuning process, respectively.

Finally, we have included all the results obtained and used for writing Sec. 5 in the form of Pandas DataFrames. These are described in more detail in Section Performance spreadsheets.

Folders

  • /uzp-artifact/lib/: this folder contains libraries and scripts required for the different tools in this artifact, and is not intended to be modified by the user.
  • /uzp-artifact/spmv-executors/: this folder contains the code and scripts to execute the SpMV kernel on the different formats (UZP and the baselines used in the submitted paper). A more complete description of how to replicate the experimentation in the paper is given below in Section Running the SpMV executors.
  • /uzp-artifact/z_polyhedrator/: this is the Rust tool mentioned in Sec. 4.1 of the paper that is used to generate UZP files. Refer to the README.md file inside this folder for more details and usage instructions. The tool is used automatically by the execution scripts in the /uzp-artifact/spmv-executors/ folder, searching for the same list of patterns used for the experiments in the paper. See Section Customizing the pattern-mining process below for more information.
  • /uzp-artifact/uzp-tuners/: the UZP format can be tuned for specific targets (compression, execution performance of specific kernels, etc.). This folder contains example tuners. A more complete description of them, and of how to change the tuner employed for SpMV kernel execution, is found in Section Customizing the UZP tuning process.
  • /uzp-artifact/performance-spreadsheets/: contains the spreadsheets including all the results obtained using our Experimental setup. These are detailed in Section Performance spreadsheets.

Running the SpMV executors

The /uzp-artifact/spmv-executors/ folder contains the different SpMV versions used in Section 5 of the paper, and allows experiments to be run automatically on any SuiteSparse matrix. This section explains how to invoke the SpMV kernels with specific SuiteSparse matrices. To re-execute all the experiments included in the paper, check the section Running all experiments below. For details on the experimental setup and variability considerations, check section Experimental setup.

  1. UZP GenEx: the code is contained in folder uzp-genex. In order to automatically launch an experiment:

    1. cd spmv-executors

    2. ./uzp_spmv.sh <group> <matrix> <datatype> <cache mode> <execution threads>

      • <group> and <matrix> are a Group name and Matrix name from SuiteSparse.
      • <datatype>: either "double" or "float" for double and single precision, respectively.
      • <cache mode>: either "hot" or "cold". Under cold cache settings, the SpMV kernel is run only once. Under hot cache settings, the kernel is run 100 times without cleaning the cache between repetitions.
      • <execution threads>: either "1th", "2th", or "8th" for 1, 2, or 8 threads, respectively.
    3. See important notes below about Experimental setup, including operational frequency and pinning of threads to cores in the execution machine.

    4. The script automatically downloads the required files from SuiteSparse to the /tmp folder of the execution machine, generates UZP files as needed, and launches the executor. By default, this script mines for patterns and tunes the UZP file as described in Sec. 4.3 in the paper. Refer to the subsections below about customizing UZP construction and tuning for more information.

    5. The script outputs the obtained performance in GFLOPS. It can be modified to output performance counters instead, by modifying the uzp_spmv.sh script to compile with OUTPUT_TYPE=PAPI. In this case, a papi_counters.list must be created inside folder uzp-genex, capturing the counters to measure (an illustrative example is sketched after this list). See PolyBench C/4.2.1 documentation for more details.

    6. For instance:

      ./uzp_spmv.sh QLi crashbasis double cold 1th

      obtains a performance of approximately 6.8 GFLOPS on the Intel Core 12900K machine used for the tests when fixing the operational frequency at 3.2 GHz.

  2. MKL executor: the code is contained in folder spmv-mkl. In order to automatically launch the experiment:

    1. cd spmv-executors

    2. ./mkl_spmv.sh <group> <matrix> <datatype> <cache mode> <execution threads>

      • <group> and <matrix> are a Group name and Matrix name from SuiteSparse.
      • <datatype>: either "double" or "float" for double and single precision, respectively.
      • <cache mode>: either "hot" or "cold". Under cold cache settings, the SpMV kernel is run only once. Under hot cache settings, the kernel is run 100 times without cleaning the cache between repetitions.
      • <execution threads>: either "1th", "2th", or "8th" for 1, 2, or 8 threads, respectively.
    3. See important notes below about Experimental setup, including operational frequency and pinning of threads to cores in the execution machine.

    4. The script automatically downloads the required files from SuiteSparse to the /tmp folder of the execution machine, and launches the executor. By default, this script uses the "IEhint" capabilities in MKL 2024.1. In order to disable them, comment the relevant lines in spmv-mkl/spmv{d}_mkl.c (calls to mkl_sparse_set_mv_hint() and mkl_sparse_optimize()).

    5. The script outputs the obtained performance in GFLOPS. It can be modified to output performance counters instead, by modifying the spmv-mkl/Makefile file to compile with -DPOLYBENCH_PAPI, instead of -DPOLYBENCH_GFLOPS. In this case, a papi_counters.list must be created inside folder spmv-mkl, capturing the counters to measure. See PolyBench C/4.2.1 documentation for more details.

    6. For instance:

      ./mkl_spmv.sh QLi crashbasis double cold 1th

      obtains a performance of approximately 4.8 GFLOPS on the Intel Core 12900K machine used for the tests when fixing the operational frequency at 3.2 GHz.

  3. MACVETH executor: the helper files are contained in folder spmv-macveth. The source codes are downloaded by the script from the MACVETH artifact repository on GitHub. In order to automatically launch the experiment:

    1. cd spmv-executors

    2. ./macveth_spmv.sh <group> <matrix> <datatype> <cache mode> <execution threads>

      • <group> and <matrix> are a Group name and Matrix name from SuiteSparse.
      • <datatype>: only "float" is supported by MACVETH, using single-precision floating-point operations.
      • <cache mode>: either "hot" or "cold". Under cold cache settings, the SpMV kernel is run only once. Under hot cache settings, the kernel is run 100 times without cleaning the cache between repetitions.
      • <execution threads>: only "1th" is supported by the MACVETH artifact, using single-threaded execution.
    3. See important notes below about Experimental setup, including operational frequency and pinning of threads to cores in the execution machine.

    4. The script automatically downloads the required files from SuiteSparse to the /tmp folder of the execution machine, downloads the data-specific source codes from the GitHub repository containing the artifact for MACVETH, compiles the source (this step may take a long time, depending on the input matrix and target machine), and launches the executor.

    5. The script outputs the obtained performance in GFLOPS. It can be modified to output performance counters instead, by modifying the spmv-macveth/Makefile file to compile with -DPOLYBENCH_PAPI, instead of -DPOLYBENCH_GFLOPS. In this case, a papi_counters.list must be created inside folder spmv-macveth, capturing the counters to measure. See PolyBench C/4.2.1 documentation for more details.

    6. For instance:

      ./macveth_spmv.sh QLi crashbasis float cold 1th

      obtains a performance of approximately 4.0 GFLOPS on the Intel Core 12900K machine used for the tests when fixing the operational frequency at 3.2 GHz. The compilation phase takes approximately 5 minutes.

  4. CSR5 executor: the relevant section of the CSR5 artifact used for experimentation is contained in folder spmv-csr5. In order to automatically launch the experiment:

    1. cd spmv-executors

    2. ./csr5_spmv.sh <group> <matrix> <datatype> <cache mode> <execution threads>

      • <group> and <matrix> are a Group name and Matrix name from SuiteSparse.
      • <datatype>: only "double" is supported by CSR5, using double-precision floating-point operations.
      • <cache mode>: either "hot" or "cold". Under cold cache settings, the SpMV kernel is run only once. Under hot cache settings, the kernel is run 100 times without cleaning the cache between repetitions.
      • <execution threads>: either "1th", "2th", or "8th" for 1, 2, or 8 threads, respectively.
    3. See important notes below about Experimental setup, including operational frequency and pinning of threads to cores in the execution machine.

    4. The script automatically downloads the required files from SuiteSparse to the /tmp folder of the execution machine, and launches the executor.

    5. The script outputs the obtained performance in GFLOPS. It can be modified to output performance counters instead, by modifying the spmv-csr5/CSR5_avx2/Makefile file to compile with -DPOLYBENCH_PAPI, instead of -DPOLYBENCH_GFLOPS. In this case, a papi_counters.list must be created inside folder spmv-csr5/CSR5_avx2, capturing the counters to measure. See PolyBench C/4.2.1 documentation for more details.

    6. For instance:

      ./csr5_spmv.sh QLi crashbasis double cold 1th

      obtains a performance of approximately 2.3 GFLOPS on the Intel Core 12900K machine used for the tests when fixing the operational frequency at 3.2 GHz.

  5. Vanilla CSR executor: the code is contained in folder spmv-csr. In order to automatically launch the experiment:

    1. cd spmv-executors

    2. ./csr_spmv.sh <group> <matrix> <datatype> <cache mode> <execution threads>

      • <group> and <matrix> are a Group name and Matrix name from SuiteSparse.
      • <datatype>: either "double" or "float" for double and single precision, respectively.
      • <cache mode>: either "hot" or "cold". Under cold cache settings, the SpMV kernel is run only once. Under hot cache settings, the kernel is run 100 times without cleaning the cache between repetitions.
      • <execution threads>: either "1th", "2th", or "8th" for 1, 2, or 8 threads, respectively.
    3. See important notes below about Experimental setup, including operational frequency and pinning of threads to cores in the execution machine.

    4. The script automatically downloads the required files from SuiteSparse to the /tmp folder of the execution machine, and launches the executor.

    5. The script outputs the obtained performance in GFLOPS. It can be modified to output performance counters instead, by modifying the spmv-csr/Makefile file to compile with -DPOLYBENCH_PAPI, instead of -DPOLYBENCH_GFLOPS. In this case, a papi_counters.list must be created inside folder spmv-csr, capturing the counters to measure. See PolyBench C/4.2.1 documentation for more details.

    6. For instance:

      ./csr_spmv.sh QLi crashbasis double cold 1th

      obtains a performance of approximately 2.6 GFLOPS on the Intel Core 12900K machine used for the tests when fixing the operational frequency at 3.2 GHz.
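
Several of the PAPI-related steps above require creating a papi_counters.list file next to the corresponding executor sources. The exact syntax is defined by the PolyBench/C 4.2.1 harness; under the assumption that the file is included verbatim into a C array of counter-name strings (one quoted name per line, followed by a comma), a minimal hypothetical example using counters that appear elsewhere in this README could be:

"PAPI_TOT_INS",
"PAPI_BR_MSP",
"UNHALTED_REFERENCE_CYCLES",
"MEM_INST_RETIRED:ALL_LOADS",

Please verify the expected format against the PolyBench documentation shipped with the artifact before running PAPI experiments.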

Running all experiments

The paper reports experiments on 229 matrices extracted from SuiteSparse. This artifact includes scripts to re-run the experiments in the paper as described below. The artifact also includes all the data used in the submitted version of the paper as a set of Pandas DataFrames. See also the section on the included performance spreadsheets for information on how to obtain the original experimental data used in Sec. 5 of the paper.

The script spmv-executors/run_all.py runs all experiments of a particular type. For example, to run all double-precision, cold-cache, single-threaded experiments for all relevant executor versions, type:

$ cd spmv-executors
$ ./run_all.py double cold 1th

The script generates the output in a file called spmv-executors/results_run_all_{datatype}_{cachemode}_{execution_threads}_{timestamp}.out. In addition, for convenience, the script also parses the output file and generates a CSV spreadsheet under the same file name but with the .csv extension.
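
For quick inspection of the generated spreadsheet, a minimal pandas sketch such as the following can be used (the file name below is hypothetical; substitute the actual timestamped name printed by run_all.py, and check the header for the column names the parser emits):

import pandas as pd

# Hypothetical file name; use the timestamped .csv actually produced by run_all.py.
df = pd.read_csv("results_run_all_double_cold_1th_20250101-120000.csv")

# Inspect the parsed columns before doing any aggregation.
print(df.columns.tolist())
print(df.head())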

Experimental setup

The experimentation that led to the results in Sec. 5 of the submitted paper was conducted on an Intel Core i9 12900K with 128 GiB of RAM. Using different processors may lead to different performance profiles, depending on their capabilities. In addition, some aspects of the setup can lead to experimental variability, as described below:

Operational frequency: the use of DVFS may lead to experimental variability, particularly in bursts of short runs, which may affect the relative performance of different kernels depending on CPU temperature. In order to minimize this effect, the operational frequency was fixed during our experiments using the following command:

cpupower frequency-set -d 3.2GHz -u 3.2GHz

We use here the nominal base frequency of 3.2 GHz for the Intel Core i9 12900K, which should correspond to a sustainable steady-state frequency that does not vary significantly due to thermal constraints.

Reviewers are advised to fix the frequency of the CPU running the Singularity container, if possible, to minimize variability.

Pinning of threads to cores: the Intel i9 12900K CPU includes both P- and E-cores. In our tests, we have pinned threads to cores using the following command:

numactl -C10 <command> # For single-threaded executions
numactl -C10,12 <command> # For 2-threaded executions
numactl -C10,12,14,1,2,4,6 <command> # For 8-threaded executions

These are logical P-cores in the Intel i9 12900K that are associated with different physical cores. The execution scripts in folder /uzp-artifact/spmv-executors/ employ this pinning strategy (hardcoded into each file).

Reviewers are advised to change this pinning strategy if it does not fit the processor used for tests, e.g., if there is no core number 14, or if some of the selected cores are E-cores or are logical cores associated to the same physical core.

Customizing the pattern-mining process

As detailed in Sec. 4.1 in the submitted paper, the user must provide a set of polyhedral patterns that the Z-Polyhedrator tool will mine for. By default, the tool uses the patterns enumerated in file /uzp-artifact/z_polyhedrator/data/patterns.txt. In that file, each pattern is specified as a 3-tuple:

(N,stride_y,stride_x)

where N is the number of points and the strides in y and x represent, respectively, the access stride in the rows and columns of a 2-dimensional matrix. The Z-Polyhedrator tool mines for these patterns in the order in which they are enumerated, e.g., it will try to find all possible instances of the first pattern before trying to find the second one. This behavior can be changed, as described in the tool manual.
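
For illustration only (the exact file syntax may differ; refer to the Z-Polyhedrator README), a hypothetical patterns file that mines 4-point row runs, then 4-point column runs, then 3-point diagonal segments would contain:

(4,0,1)
(4,1,0)
(3,1,1)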

If the reviewer wishes to experiment with a different set of patterns, the steps to follow are:

  1. Write the new patterns file, e.g., in /uzp-artifact/z_polyhedrator/data/new_patterns.txt.
  2. Modify the script that runs Z-Polyhedrator on the SuiteSparse matrices when running the SpMV kernels, /uzp-artifact/lib/scripts/z_polyhedrator_manager.py. The relevant line is the last one, which calls the manager.generate_uzp() function specifying the patterns file as its first parameter.
  3. Run the SpMV experiments as described in Section Running the SpMV executors.

Customizing the UZP tuning process

By default, the SpMV kernel executors employ the tuner described in Sec. 4.3 of the submitted paper. The /uzp-artifact/uzp-tuners/ folder contains an additional example tuner, which tiles a matrix using a given tile size to improve locality during the execution of the SpMV kernel. Note that this tuner, which typically obtains better performance than the standard one for large matrices, was not used in the submitted paper; it is experimental code and bugs might be present.

In order to use this tuner to generate a file, follow these steps:

$ cd /uzp-artifact/spmv-executors
$ ./uzp_spmv.sh Mittelmann rail4284 double cold 1th # Run once to download the required files from SuiteSparse and generate the basic UZP
$ cd /uzp-artifact/uzp-tuners
$ make spf_tiling_tuner
$ ./spf_tiling_tuner --tile 2048 /tmp/rail4284/rail4284.1d.uzp /tmp/rail4284/rail4284.tuned.uzp # Tune the UZP using 2048x2048 tiles
$ cd /uzp-artifact/spmv-executors
$ ./uzp_spmv.sh Mittelmann rail4284 double cold 1th # Repeat the execution, now the tiled UZP will be used automatically

In order to use this tuner by default for experiments, line 75 in file /uzp-artifact/spmv-executors/uzp_spmv.sh must be changed to invoke the tiling tuner.

Performance spreadsheets

For reference and to simplify reviewing, we have included two Pandas DataFrames containing the results obtained during our experimentation in folder /uzp-artifact/performance-spreadsheets/. This folder contains two files:

  • pldi25_uzp_results_singlethread.pickle: this file contains a Pandas DataFrame including 34 different columns with matrix statistics and performance counters, namely:
    • index: (app,version,variant,dtype,cachestate): matrix, kernel version, variant (e.g., tuner used for UZP, whether the IE hint was used for MKL, etc.), data type, and cache mode for the experiment.
    • nnz: number of nonzeros in the matrix.
    • inc: number of nonzeros incorporated inside z-polyhedral shapes in the UZP file. Consequently, nnz - inc gives the number of nonzeros not incorporated into any shape, which are stored using a CSR/COO section instead.
    • nrows: number of rows of the sparse matrix.
    • ncols: number of columns of the sparse matrix.
    • cycles: execution cycles of the corresponding SpMV kernel (UNHALTED_REFERENCE_CYCLES)
    • stalls: number of cycles with no uops executed by thread (UOPS_EXECUTED:STALL_CYCLES)
    • d1h: Retired load uops with L1 cache hits as data source (MEM_LOAD_UOPS_RETIRED:L1_HIT)
    • d1m: Retired load uops which missed the L1D (MEM_LOAD_UOPS_RETIRED:L1_MISS)
    • i1h: Number of instruction fetch tag lookups that hit in the instruction cache (L1I). Counts at 64-byte cache-line granularity (ICACHE_64B:IFTAG_HIT)
    • i1m: Number of instruction fetch tag lookups that miss in the instruction cache (L1I). Counts at 64-byte cache-line granularity (ICACHE_64B:IFTAG_MISS)
    • l2m: All requests that miss the L2 cache (L2_RQSTS:MISS)
    • l2im: L2 cache misses when fetching instructions (L2_RQSTS:CODE_RD_MISS)
    • l3m: Core-originated cacheable demand requests missed LLC - architectural event (LONGEST_LAT_CACHE:MISS)
    • brinst: Branch instructions (PAPI_BR_INS)
    • mispred: Conditional branch instructions mispredicted (PAPI_BR_MSP)
    • scalars: Number of scalar single precision floating-point arithmetic instructions (multiply by 1 to get flops) (FP_ARITH:SCALAR_SINGLE)
    • packed_128s: Number of 128-bit packed single precision floating-point arithmetic instructions (multiply by 4 to get flops) (FP_ARITH:128B_PACKED_SINGLE)
    • packed_256s: Number of 256-bit packed single precision floating-point arithmetic instructions (multiply by 8 to get flops) (FP_ARITH:256B_PACKED_SINGLE)
    • scalard: Number of scalar double precision floating-point arithmetic instructions (multiply by 1 to get flops) (FP_ARITH:SCALAR_DOUBLE)
    • packed_128d: Number of 128-bit packed double precision floating-point arithmetic instructions (multiply by 2 to get flops) (FP_ARITH:128B_PACKED_DOUBLE)
    • packed_256d: Number of 256-bit packed double precision floating-point arithmetic instructions (multiply by 4 to get flops) (FP_ARITH:256B_PACKED_DOUBLE)
    • insts: Instructions completed (PAPI_TOT_INS)
    • uops: Number of uops executed per thread in each cycle (UOPS_EXECUTED:THREAD)
    • loads: All load uops retired (MEM_INST_RETIRED:ALL_LOADS)
    • stores: All store uops retired (MEM_INST_RETIRED:ALL_STORES)
    • dtlb_lwalks: Load misses in all DTLB levels that cause page walks (DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK)
    • dtlb_swalks: Store misses in all DTLB levels that cause page walks (DTLB_STORE_MISSES:MISS_CAUSES_A_WALK)
    • itlb_walks: Misses in all ITLB levels that cause page walks (ITLB_MISSES:MISS_CAUSES_A_WALK)
    • itlb_wd: Cycles when PMH is busy with page walks (ITLB_MISSES:WALK_DURATION)
    • l2_pfm: Requests from the L1/L2/L3 hardware prefetchers or Load software prefetches that miss L2 cache (L2_RQSTS:PF_MISS)
    • l2_pfh: Requests from the L1/L2/L3 hardware prefetchers or Load software prefetches that hit L2 cache (L2_RQSTS:PF_HIT)
    • mite_uops: Number of uops delivered to Instruction Decode Queue (IDQ) from MITE path (IDQ:MITE_UOPS)
    • dsb_uops: Number of uops delivered to Instruction Decode Queue (IDQ) from Decode Stream Buffer (DSB) path (IDQ:DSB_UOPS)
    • gflops: GFLOPS, computed as 2 x nnz x 3.2 / cycles: 2 x nnz is the theoretical number of flops for this matrix; dividing by the execution cycles gives flops per cycle, and multiplying by the 3.2 GHz operational frequency gives GFLOPS (see the sketch below).
  • pldi25_uzp_results_multithread.pickle: similar to the DataFrame for single-threaded results, but contains an additional column in the index showing the number of threads used for execution. Note that performance counters were not captured for these executions, and the GFLOPS were computed using the PolyBench harness directly. All other columns described above are shown as NaNs.

Also note that some performance counters do not work on the Intel Core i9 12900K; for example, no data was registered regarding page walks.
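
As a starting point for exploring these DataFrames, the following minimal pandas sketch loads the single-threaded results and re-derives the gflops column from the formula documented above (index and column names are taken from the descriptions above; no other assumptions are made):

import pandas as pd

# Load the single-threaded results DataFrame described above.
df = pd.read_pickle("/uzp-artifact/performance-spreadsheets/pldi25_uzp_results_singlethread.pickle")

# Index levels are (app, version, variant, dtype, cachestate).
print(df.index.names)

# Re-derive GFLOPS: 2 * nnz theoretical flops, divided by execution cycles
# (flops per cycle), multiplied by the 3.2 GHz operational frequency.
derived_gflops = 2 * df["nnz"] * 3.2 / df["cycles"]
print((derived_gflops - df["gflops"]).abs().max())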
