# sfu-arch/minuet-tracer

## Requirements

- A C++17-compatible compiler (for the C++ version)

```sh
# Install miniconda and activate the environment
# https://www.anaconda.com/docs/getting-started/miniconda/install
pip3 install open3d
pip3 install pybind11
sudo apt-get install zlib1g-dev
```

Note: install pybind11 within conda. If you install it in the system Python, CMake will have trouble finding the pybind11 headers; see CI.yml for the CMake command.

This repository contains tools for tracing the memory behavior of a Minuet-style point-cloud pipeline. There are three main scripts:

- `minuet_mapping.py` and its C++ version `minuet_trace_cpp` (about 6x faster than the Python version)
- `mem_trace_reader.py` — reads the memory trace files generated by `minuet_mapping.py` and the C++ version
- `read_pcl.py` — loads a point cloud from a file, applies voxel downsampling, and visualizes the original and downsampled point clouds side by side using Open3D; the trace reader can be hooked up via pybind11

## Point cloud mapping tracer

```sh
# Python version. Feature complete.
python3 minuet_trace.py --config config.json

# C++ version. Scatter pending.
cd c++; cmake -B build; cd build; make
./minuet_trace_cpp --config ../config.json
```

Output files produced:

```sh
$ ls out/
checksums.json       kernel_map.bin       metadata.bin.gz
gather_trace.bin.gz  kernel_map.bin.gz    scatter_trace.bin.gz
gemms.bin.gz         map_trace.bin.gz
```

```sh
# To read the map trace file
python3 mem_trace_reader.py --trace-file out/map_trace.bin.gz --sizeof-addr 4

# To read the GEMM list
$ python3 gemms_trace_reader.py --trace-file ./out/gemms.bin.gz
Successfully loaded configuration from /Users/ashriram/Desktop/minuet/config.json
{'num_offsets': 1, 'gemm_M': 4, 'gemm_N': 1, 'padding': 0}
{'num_offsets': 2, 'gemm_M': 4, 'gemm_N': 2, 'padding': 2}
{'num_offsets': 2, 'gemm_M': 4, 'gemm_N': 2, 'padding': 2}

# Gather reader
python3 mem_trace_reader.py --trace-file out/gather_trace.bin.gz --sizeof-addr 8

# Scatter reader
python3 mem_trace_reader.py --trace-file out/scatter_trace.bin.gz --sizeof-addr 8
```

- `map_trace.bin.gz`: Memory access trace for the mapping phase.
- `kernel_map.bin.gz`: Kernel map data (not a trace).
- `gather_trace.bin.gz`: Memory access trace for the gather phase.
- `scatter_trace.bin.gz`: Memory access trace for the scatter phase.
- `gemms.bin.gz`: GEMM operations trace.
- `metadata.bin.gz`: Compressed metadata about the simulation, such as the number of threads, sizes of various data structures, and base addresses for tensors.

The main timing-simulation loop proceeds in phases: MAP -> GATHER -> GEMM (one line at a time) -> SCATTER.

## Config file

```json
{
  "NUM_THREADS": 8,
  "SIZE_KEY": 4,
  "SIZE_INT": 4,
  "SIZE_WEIGHT": 4,
  "I_BASE": "0x10000000",
  "QK_BASE": "0x20000000",
  "QI_BASE": "0x30000000",
  "QO_BASE": "0x40000000",
  "PIV_BASE": "0x50000000",
  "KM_BASE": "0x60000000",
  "WO_BASE": "0x80000000",
  "IV_BASE": "0x100000000",
  "WV_BASE": "0xF00000000",
  "GEMM_ALIGNMENT": 4,
  "GEMM_WT_GROUP": 2,
  "GEMM_SIZE": 4,
  "NUM_TILES_GATHER": 4,
  "TILE_FEATS_GATHER": 16,
  "BULK_FEATS_GATHER": 4,
  "N_THREADS_GATHER": 1,
  "debug": true,
  "output_dir": "./trace_out_python/",
  "NUM_PIVOTS": 2
}
```
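
For instance, a small helper (not part of the repo; shown only as a sketch) can load this config and convert the hex-string base addresses to integers:

```python
import json

def load_config(path):
    """Load config.json, converting hex-string *_BASE addresses to ints.

    Illustrative helper; the repo's scripts may parse the config differently.
    """
    with open(path) as f:
        cfg = json.load(f)
    for key, val in cfg.items():
        if isinstance(val, str) and key.endswith("_BASE"):
            cfg[key] = int(val, 16)  # "0x10000000" -> 268435456
    return cfg
```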

Description of the parameters:

- `NUM_THREADS`: Number of virtual threads to simulate for the mapping phase.
- `SIZE_KEY`: Size of a key in bytes (e.g., for coordinates).
- `SIZE_INT`: Size of an integer in bytes.
- `SIZE_WEIGHT`: Size of a weight value in bytes.
- `I_BASE`, `QK_BASE`, `QI_BASE`, `QO_BASE`, `PIV_BASE`, `KM_BASE`, `WO_BASE`, `IV_BASE`, `WV_BASE`: Base addresses (in hexadecimal) for the data structures (tensors) used in the simulation. I: input, QK: query keys, PIV: pivot keys, KM: kernel map, IV: input feature vectors, GEMM_BASE: buffers for GEMM.
- `GEMM_ALIGNMENT`: Target matrix size for GEMM; the number of inputs fused.
- `GEMM_WT_GROUP`: Maximum number of weights per group (the break-out condition for groups).

This project simulates the memory access patterns of a Minuet-like kernel map algorithm. It includes both a C++ version and a Python version. The simulation employs a fixed pool of virtual threads to model parallel processing. The primary output is a compressed binary file (`map_trace.bin.gz` for the Python version, `map_trace_cpp.bin.gz` for the C++ version) containing the recorded memory accesses, which can be used for analyzing memory behavior, locality, and potential bottlenecks.

Input format: The input to the simulation is a set of 3D coordinates representing points in space. These coordinates are quantized and packed into 12-bit integer keys. The input data is generated within the simulation itself, so no external input files are required.
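
To illustrate the quantize-and-pack idea, here is a minimal sketch that packs three coordinates into a single integer key. The bit width per axis (`BITS`) is hypothetical; the repo's actual key layout may differ:

```python
BITS = 10              # bits per axis -- illustrative, not the repo's actual width
MASK = (1 << BITS) - 1

def pack_key(x, y, z):
    """Pack quantized (x, y, z) into one integer key (illustrative layout)."""
    return ((x & MASK) << (2 * BITS)) | ((y & MASK) << BITS) | (z & MASK)

def unpack_key(k):
    """Recover the (x, y, z) coordinates from a packed key."""
    return (k >> (2 * BITS)) & MASK, (k >> BITS) & MASK, k & MASK
```

Packed keys let the pipeline sort and compare points as plain integers.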

## Output Trace Format

| Section | Field | Data Type | Description |
|---|---|---|---|
| Header | Entry Count | `uint32_t` | Total number of trace entries that follow |
| Trace Entry | phase_id | `uint8_t` | Integer ID of the computational phase during which the access occurred. Dynamically assigned in the `write_gmem_trace` function; the mapping is printed to the console |
| | thread_id | `uint8_t` | The ID of the virtual thread that performed the memory access |
| | op_id | `uint8_t` | Integer ID for the memory operation type (0 for Read 'R', 1 for Write 'W') |
| | tensor_id | `uint8_t` | Integer ID of the tensor region being accessed |
| | addr_int | `uint32_t` | The memory address that was accessed. Addresses are truncated to 32 bits in the output file |

Note: this covers only the mapping phase, so a 32-bit address space is enough for the memory trace file. If actual feature vectors are fetched, the address space will need to be 64-bit.
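
A minimal reader for this layout might look as follows. Little-endian packing is an assumption here (the format table above does not specify byte order), and the function name is hypothetical:

```python
import gzip
import struct

def read_map_trace(path, sizeof_addr=4):
    """Yield trace entries from a gzipped trace file in the layout above.

    Assumes little-endian packing. Use sizeof_addr=8 for gather/scatter traces.
    """
    fmt = "<BBBB" + ("I" if sizeof_addr == 4 else "Q")
    entry_size = struct.calcsize(fmt)
    with gzip.open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))   # header: entry count
        for _ in range(count):
            phase_id, thread_id, op_id, tensor_id, addr = struct.unpack(
                fmt, f.read(entry_size))
            yield {"phase": phase_id, "thread": thread_id,
                   "op": "R" if op_id == 0 else "W",
                   "tensor": tensor_id, "addr": addr}
```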

## Input format

- `coords`: A Python list of tuples, where each tuple `(x, y, z)` gives the integer coordinates of a point. Example: `coords = [(1,5,0),(0,1,1),(0,0,2),(0,0,3)]`
- `offsets`: A Python list of 3D offset tuples `(dx, dy, dz)`. These offsets are applied to the input coordinates to generate query points. Example: `offsets = [(dx,dy,dz) for dx in (-1,0,1) for dy in (-1,0,1) for dz in (-1,0,1)]`
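
Using the example values above, query generation amounts to shifting every coordinate by every offset, while `qii`/`qki` record which input and which offset produced each query (a sketch of the Build-Queries idea, not the repo's exact code):

```python
coords = [(1, 5, 0), (0, 1, 1), (0, 0, 2), (0, 0, 3)]
offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
                        for dy in (-1, 0, 1)
                        for dz in (-1, 0, 1)]

# Each query point is an input coordinate shifted by one offset.
queries, qii, qki = [], [], []
for i, (x, y, z) in enumerate(coords):
    for k, (dx, dy, dz) in enumerate(offsets):
        queries.append((x + dx, y + dy, z + dz))
        qii.append(i)   # query input-index
        qki.append(k)   # query kernel/offset-index
```

With 4 points and 27 offsets this yields 108 queries.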
| Parameter | Description | Default Value |
|---|---|---|
| `debug` (boolean) | If `True`, enables additional print statements for debugging. | `True` |
| `NUM_THREADS` (integer) | The number of virtual threads to simulate for parallel processing. | 4 |
| `I_TILES` (integer) | The number of input elements per tile, used by `make_tiles_and_pivots` during the backward binary search phase. | 2 |

Tensor base addresses — these hexadecimal values define the starting memory addresses for the data structures (tensors) used in the simulation:

| Parameter | Description | Default Value |
|---|---|---|
| `I_BASE` | Base address for input point coordinates. | `0x10000000` |
| `QK_BASE` | Base address for query keys. | `0x20000000` |
| `QI_BASE` | Base address for the query input-index array. | `0x30000000` |
| `QO_BASE` | Base address for the query offset-index array. | `0x40000000` |
| `PIV_BASE` | Base address for tile pivot keys. | `0x50000000` |
| `TILE_BASE` | Alias for `I_BASE`, used for tile data reads. | `0x10000000` |
| `KM_BASE` | Base address for kernel map writes. | `0x60000000` |
| `WO_BASE` | Base address for reading weight offsets. | `0x80000000` |
| `IV_BASE` | Base address for input feature vectors. Currently unused. | `0x100000000` |
| `WV_BASE` | Base address for weight values. | `0x800000000` |

Data type sizes:

| Parameter | Description | Default Value |
|---|---|---|
| `SIZE_KEY` (integer) | Size of a key in bytes (e.g., for coordinates). | 4 (32-bit) |
| `SIZE_INT` (integer) | Size of an integer in bytes. | 4 |
| `SIZE_WEIGHT` (integer) | Size of a weight value in bytes. | 4 |

## Core Simulation Phases

Simulates the following key phases:

  1. Radix Sort Simulation (Radix-Sort & Dedup):
    • Input 3D coordinates are quantized and packed into 32-bit integer keys.
    • Memory access patterns for a radix sort are simulated on these keys (reads from an input array, writes to an auxiliary array, then swap).
    • The keys are then explicitly sorted using std::sort and deduplicated to obtain unique input keys.
  2. Build Queries (Build-Queries):
    • Query keys are generated by applying a set of 3D offsets to each unique input key.
    • Associated metadata arrays are created: qii (query input-index), qki (query kernel/offset-index), and woffs (packed weight-offset keys).
  3. Sort Query Keys (Sort-QKeys):
    • This phase, mirroring the Python script's logic, currently involves assigning the previously generated query keys. No actual sorting operation is performed within this named phase in the C++ version, matching the Python script's behavior.
  4. Tile and Pivot Generation (Tile-Pivots):
    • The unique sorted input keys are partitioned into smaller collections called "tiles."
    • The first key of each tile is stored as a "pivot." Memory write accesses for these pivot keys are recorded.
  5. Lookup (Lookup):
    • This is the main multi-threaded phase where query keys are matched against the tiled input data.
    • The workload (queries) is divided among NUM_THREADS worker threads.
    • Each thread processes its assigned queries:
      • It performs a backward search on the global pivot keys to efficiently identify a relevant tile for the current query key.
      • It then performs a forward scan within that specific tile to find an exact match for the query key.
      • If a match is found, an entry detailing the match (target coordinates, original input coordinates, offset coordinates) is added to a shared KernelMap data structure.
    • Memory accesses (reads for query keys, pivot keys, tile data; writes for kernel map entries) are recorded by each thread into a thread-local trace buffer. These local traces are then merged into the global memory trace vector, protected by a mutex. The KernelMap is also protected by a separate mutex.
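
The lookup phase (backward search on pivots, then a forward scan within the selected tile) can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names, not the repo's implementation:

```python
from bisect import bisect_right

def lookup(query_keys, pivots, tiles):
    """Match each query key against tiled, sorted input keys.

    pivots[t] is the first key of tiles[t]; both are sorted ascending.
    Returns a map from matched query key to the tile index it was found in.
    """
    kernel_map = {}
    for q in query_keys:
        # Backward search: the last pivot <= q identifies the candidate tile.
        t = bisect_right(pivots, q) - 1
        if t < 0:
            continue  # q precedes all pivots; no tile can contain it
        # Forward scan within that tile for an exact match.
        for key in tiles[t]:
            if key == q:
                kernel_map.setdefault(q, []).append(t)
                break
    return kernel_map
```

In the real simulation each thread handles a slice of the queries, records its accesses in a thread-local buffer, and the results are merged under mutexes as described above.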

## Building the C++ Project

### Prerequisites

- A C++17-compliant compiler (e.g., GCC, Clang, MSVC).
- CMake (version 3.10 or higher).
- Zlib development libraries (for gzip compression).
  - On Debian/Ubuntu: `sudo apt-get install zlib1g-dev`
  - On macOS (using Homebrew): `brew install zlib`

### Build Steps

1. Clone or download the project source code.
2. Create a build directory and navigate into it:
   ```sh
   mkdir build
   cd build
   ```
3. Run CMake to configure the project:
   ```sh
   cmake ..
   ```
   If Zlib is installed in a non-standard location, you might need to help CMake find it (e.g., `cmake .. -DCMAKE_PREFIX_PATH=/path/to/zlib_install_dir`).
4. Compile the project:
   ```sh
   make
   ```

## Running the Simulation

After a successful build, an executable named `minuet_trace_cpp` is produced in the build directory.

To run the simulation:

```sh
./minuet_trace_cpp --config ../config.json
```

The program will:

- Print information about each phase to the console.
- If `debug` is enabled (the default), print detailed outputs such as sorted unique keys, segmented query arrays, and the final kernel map.
- Print all recorded memory trace entries to the console.
- Generate the `map_trace_cpp.bin.gz` file in the current working directory (the build directory, if run from there).

## Project Structure

```
minuet_trace_cpp/
├── CMakeLists.txt          # CMake build script
├── include/
│   └── minuet_trace.hpp    # Header file with declarations and template definitions
└── src/
    ├── main.cpp            # Main application logic, orchestrates simulation phases
    └── minuet_trace.cpp    # Implementation of core simulation functions
```

## Trace Reader

This script reads the memory trace file generated by the Minuet-like kernel map simulation. It provides functionality to filter the trace by computational phase or memory operation type (read/write) and visualize the memory access patterns using Open3D.

```sh
# How to run (--sizeof-addr 4 for the map trace, 8 for gather/scatter traces)
python mem_trace_reader.py map_trace.bin.gz --sizeof-addr 4

# Filter trace by phase
python mem_trace_reader.py map_trace.bin.gz --filter-phase "Lookup"

# Filter trace by operation
python mem_trace_reader.py map_trace.bin.gz --filter-op "W"

# Generate memory access pattern plots
python mem_trace_reader.py map_trace.bin.gz --plot

# Save plot to file
python mem_trace_reader.py map_trace.bin.gz --plot --plot-file memory_access.png
```
