This is the artifact for our paper "TAIDL: Tensor Accelerator ISA Definition Language with Auto-generation of Scalable Test Oracles". In our paper, we present an ISA specification language for tensor accelerators and auto-generated test oracles (a.k.a. functional simulators).
This artifact consists of the TAIDL source code and the scripts needed to reproduce the evaluation results. To facilitate artifact evaluation, we have automated the entire environment setup and experimental workflow as part of Docker images. Our evaluation results were collected using an Intel Xeon Platinum 8358 CPU and an NVIDIA A100 GPU. We recommend using a machine with an Intel CPU and an NVIDIA GPU to benchmark TAIDL-TO and its baselines. Reproducing all simulation statistics takes approximately 30-45 minutes.
We use Docker images for environment setup of TAIDL and baselines. To run the TAIDL artifact, install Docker using the installation guide.
All experimental workflows are encapsulated as bash scripts located in the scripts/ directory.
These scripts automatically pull and use the appropriate Docker images:
devanshdvj/taidl-micro25-artifact:amd64
- TAIDL environment for amd64/x86-64
devanshdvj/taidl-micro25-artifact:arm64
- TAIDL environment for arm64
devanshdvj/taidl-micro25-artifact:baseline-amd64
- Baseline environment with Gemmini Spike and Intel SDE to generate data and log simulation times
Architecture Support:
- amd64/x86_64: Full support for TAIDL including GPU acceleration and baselines.
- arm64: CPU-only support for TAIDL (no GPU support). The full.sh script cannot be run on arm64 since baselines are not supported on this architecture.
GPU Support: For NVIDIA GPU usage on amd64 systems, install the NVIDIA Container Toolkit using the installation guide.
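The architecture rules above can be summarized in a short Python sketch. It is illustrative only and not part of the artifact scripts: the helper names taidl_image and can_run_full are hypothetical, and the mapping from platform.machine() strings to image tags is an assumption. Only the image names themselves come from the list above.

```python
import platform

# Image repository and tags as listed above.
IMAGE_REPO = "devanshdvj/taidl-micro25-artifact"

# Assumed mapping from common machine strings to the image tags.
ARCH_TAGS = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def taidl_image(machine=None):
    """Return the TAIDL image name for a machine string (default: this host)."""
    machine = (machine or platform.machine()).lower()
    if machine not in ARCH_TAGS:
        raise ValueError(f"no TAIDL image for architecture {machine!r}")
    return f"{IMAGE_REPO}:{ARCH_TAGS[machine]}"


def can_run_full(machine=None):
    """full.sh needs the baseline image, which is only provided for amd64."""
    machine = (machine or platform.machine()).lower()
    return ARCH_TAGS.get(machine) == "amd64"


if __name__ == "__main__":
    print(taidl_image("x86_64"))
    print(can_run_full("aarch64"))
```

The scripts in scripts/ already pull the right image, so this is only meant to make the support matrix explicit.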
This script uses the paper's benchmarking data (plots/saved/) to quickly generate all figures without running any experiments.
These statistics were collected using an Intel Xeon Platinum 8358 CPU and an NVIDIA A100 GPU.
./scripts/kick-tires.sh
The resulting figures can be found in plots/saved/.
figure-16-gemmini-tiled-matmul.pdf
- Comparing simulation times of TAIDL-TO and Gemmini Spike
figure-17-oneDNN.pdf
- Comparing simulation times of TAIDL-TO and Intel SDE
figure-18-gemmini-exo.pdf
- Benchmarking TAIDL-TO for Exo-generated Gemmini kernels
This script uses pre-generated inputs and golden outputs from Gemmini Spike and Intel SDE to benchmark TAIDL-TO. It does not regenerate any data or run the baselines, which makes it useful for quickly verifying TAIDL-TO's correctness and performance. It takes around 2-5 minutes to run.
Run using:
./scripts/lite.sh
The resulting figures can be found in plots/pdf/.
More detailed statistics can be found in plots/csv/.
figure-16-gemmini-tiled-matmul.pdf
- Benchmarking TAIDL-TO for Gemmini's tiled matrix multiplication kernels
figure-17-oneDNN.pdf
- Benchmarking TAIDL-TO for oneDNN's Intel AMX kernels
figure-18-gemmini-exo.pdf
- Benchmarking TAIDL-TO for Exo-generated Gemmini kernels
This script benchmarks TAIDL-TO along with the baselines, Gemmini Spike and Intel SDE. It also generates new data files containing inputs and outputs from these tools, which are used to verify TAIDL-TO's output. It takes around 30-45 minutes to run.
Run using:
./scripts/full.sh
The resulting figures can be found in plots/pdf/.
More detailed statistics are available in plots/csv/.
figure-16-gemmini-tiled-matmul.pdf
- Comparing simulation times of TAIDL-TO and Gemmini Spike
figure-17-oneDNN.pdf
- Comparing simulation times of TAIDL-TO and Intel SDE
figure-18-gemmini-exo.pdf
- Benchmarking TAIDL-TO for Exo-generated Gemmini kernels
accelerators/
- TAIDL accelerator implementations
  */
  - Accelerator implementation directory
    TAIDL_*.py
    - ISA definition using TAIDL
    sim/
    - Generated simulation code (API, decorator, utils)
    tests/
    - Kernel implementations and test runner
artifact-baseline/
- Reference implementations for comparison
  amx/
  - Intel AMX baseline kernels and benchmarking scripts
  gemmini/
  - Gemmini baseline with Spike simulator integration
artifact-taidl/
- TAIDL Docker environment for multi-architecture support
  xla-debug/
  - C++ XLA custom call for debugging tensor data
idl/
- TAIDL language infrastructure for generating simulation code
plots/
- Visualization scripts and output data
  csv/
  - Benchmarking and verification data files
  pdf/
  - Generated comparison plots
  saved/
  - Paper's benchmarking data for quick plot generation
scripts/
- Automation scripts
  kick-tires.sh
  - Quick plot generation from saved data
  lite.sh
  - Run tests with subset of data
  full.sh
  - Complete test suite with data regeneration
  launch.sh
  - Launch TAIDL Docker environment
Here is a simple example of a TAIDL workflow.
First, launch the provided Docker environment using
./scripts/launch.sh
The TAIDL environment is located at /taidl/ inside the Docker container.
Create a new toy/ directory in accelerators/ and define your ISA in TAIDL_toy.py:
import os, sys
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
target_dir = os.path.join(os.path.dirname(base_dir), "idl")
sys.path.append(target_dir)
from accelerator import Accelerator
acc = Accelerator("Toy")
# Define data model (memory space)
acc.add_data_model("regs", "32", "16xs8") # 32 registers with 16 elements each
# s8 indicates 8-bit signed integers
# Define instruction: load from HBM to register
instr = acc.add_instruction("load", ["dst", "addr"])
instr.add_semantics("""
%data:16xs8 <- hbm[@a.addr:@a.addr + 16];
%reshaped:1x16xs8 = reshape(%data);
%reshaped:1x16xs8 -> regs[@a.dst, 0];
""")
# Define instruction: store from register to HBM
instr = acc.add_instruction("store", ["src", "addr"])
instr.add_semantics("""
%data:1x16xs8 <- regs[@a.src:@a.src+1, 0:16];
%flattened:16xs8 = reshape(%data);
%flattened:16xs8 -> hbm[@a.addr];
""")
# Define instruction: add two registers
instr = acc.add_instruction("add", ["dst", "src1", "src2"])
instr.add_semantics("""
%a:1x16xs8 <- regs[@a.src1:@a.src1+1, 0:16];
%b:1x16xs8 <- regs[@a.src2:@a.src2+1, 0:16];
%c:1x16xs8 = add(%a, %b);
%c:1x16xs8 -> regs[@a.dst, 0];
""")
acc.generate_api()
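To make the intended semantics of the three Toy instructions concrete, here is a plain-Python reference model. It is a sketch written for this README, not code generated by TAIDL: the class ToyModel and the helper wrap_s8 are hypothetical names, and wrapping on signed 8-bit overflow is an assumption, since the definition above does not state the overflow behavior of add.

```python
def wrap_s8(x):
    """Wrap an integer into the signed 8-bit range (assumed overflow behavior)."""
    return (x + 128) % 256 - 128


class ToyModel:
    """Plain-Python model of the Toy ISA: 32 registers of 16 s8 lanes, flat HBM."""

    LANES = 16

    def __init__(self, hbm_size=1024):
        self.hbm = [0] * hbm_size
        self.regs = [[0] * self.LANES for _ in range(32)]

    def load(self, dst, addr):
        # Copy 16 consecutive HBM elements into register dst.
        self.regs[dst] = self.hbm[addr:addr + self.LANES]

    def store(self, src, addr):
        # Copy register src back to 16 consecutive HBM elements.
        self.hbm[addr:addr + self.LANES] = self.regs[src]

    def add(self, dst, src1, src2):
        # Lane-wise addition of two registers.
        self.regs[dst] = [
            wrap_s8(a + b) for a, b in zip(self.regs[src1], self.regs[src2])
        ]


model = ToyModel()
model.hbm[0:16] = list(range(16))
model.hbm[16:32] = [10] * 16
model.load(dst=0, addr=0)
model.load(dst=1, addr=16)
model.add(dst=2, src1=0, src2=1)
model.store(src=2, addr=32)
result = model.hbm[32:48]
print(result)
```

A model like this can serve as a quick sanity check when writing the first kernels for a new ISA definition.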
Run your TAIDL definition to generate the simulation environment:
cd /taidl/accelerators/toy
python3 TAIDL_toy.py
This creates the sim/ directory with:
api.py
- Operation APIs for your ISA
decorator.py
- Kernel compilation framework
utils.py
- Helper functions
After completing the steps above, your accelerators/toy/ directory should look like:
accelerators/toy/
├── TAIDL_toy.py          # ISA definition (step 1)
├── sim/                  # Generated simulation code (step 2)
│   ├── api.py
│   ├── decorator.py
│   └── utils.py
└── tests/                # Your kernel implementations (steps 3-6)
    ├── kernels.py        # Kernel definitions
    └── main.py           # Test runner
Create tests/kernels.py to define kernels using your generated API:
# Import the generated TAIDL-TO API
import os, sys
base_dir = os.path.dirname(os.path.abspath(__file__))
target_dir = os.path.join(os.path.dirname(base_dir), "sim")
sys.path.append(target_dir)
from decorator import kernel
import api
import numpy as np
@kernel(hbm=1024,
        input=[
            {'addr': 0, 'shape': (16,), 'dtype': np.int8},
            {'addr': 16, 'shape': (16,), 'dtype': np.int8},
        ],
        output=[
            {'addr': 32, 'shape': (16,), 'dtype': np.int8},
        ])
def my_kernel():
    api.load(dst=0, addr=0)           # regs[0] <- first input vector
    api.load(dst=1, addr=16)          # regs[1] <- second input vector
    api.add(dst=2, src1=0, src2=1)    # regs[2] <- regs[0] + regs[1]
    api.store(src=2, addr=32)         # write the sum to the output region
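When laying out input and output regions by hand, it is easy to make two regions overlap or run past the end of HBM. The following sketch checks a list of region descriptors before they are passed to the decorator. The helper check_layout is hypothetical and not part of the generated API, and it assumes 1-byte elements (np.int8) and that hbm and addr are counted in the same units.

```python
def check_layout(hbm, regions):
    """Return sorted (start, end) spans; raise ValueError on a bad layout.

    Assumes 1-byte elements and that 'addr' and 'hbm' use the same units.
    """
    spans = []
    for region in regions:
        size = 1
        for dim in region["shape"]:
            size *= dim
        start, end = region["addr"], region["addr"] + size
        if start < 0 or end > hbm:
            raise ValueError(f"region [{start}, {end}) exceeds hbm={hbm}")
        spans.append((start, end))
    spans.sort()
    # After sorting, only neighbouring spans can overlap.
    for (s0, e0), (s1, e1) in zip(spans, spans[1:]):
        if s1 < e0:
            raise ValueError(f"regions [{s0}, {e0}) and [{s1}, {e1}) overlap")
    return spans


# The regions used by my_kernel: two inputs and one output of 16 elements.
layout = check_layout(1024, [
    {"addr": 0, "shape": (16,)},
    {"addr": 16, "shape": (16,)},
    {"addr": 32, "shape": (16,)},
])
print(layout)
```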
Create tests/main.py to run and verify your kernels:
from kernels import my_kernel
from decorator import set_simulation_backend, verifier
import numpy as np
# Generate random input data
a = np.random.randint(-10, 10, size=16, dtype=np.int8)
b = np.random.randint(-10, 10, size=16, dtype=np.int8)
print("Input A:", a)
print("Input B:", b)
set_simulation_backend("CPU")
_, compile_time = my_kernel("fsim-compile")()
outputs, runtime = my_kernel("fsim")(a, b)
print("Sum: \t", outputs[0])
Run the test with:
cd /taidl/accelerators/toy/tests
python3 main.py
Modify tests/kernels.py to use api.debug() to inspect register and memory contents during execution:
@kernel(hbm=1024, input=[...], output=[...])
def debug_kernel():
    api.load(dst=0, addr=0)
    api.load(dst=1, addr=16)
    api.add(dst=2, src1=0, src2=1)
    # Debug register contents
    api.debug(prefix="reg0", data="regs[0]")
    api.debug(prefix="result(reg2)", data="regs[2]")
    api.store(src=2, addr=32)
Modify tests/kernels.py to use api.start_loop("loop_var", start, end) and api.end_loop() instead of native Python loops for faster compilation:
@kernel(hbm=1024,
        input=[   # 4 vectors of 16 elements
            {'addr': 0, 'shape': (4, 16), 'dtype': np.int8},
        ],
        output=[  # Sum of the 4 vectors
            {'addr': 256, 'shape': (16,), 'dtype': np.int8},
        ])
def loop_kernel():
    api.load(dst=0, addr=0)                    # Load first vector to initialize regs[0]
    api.start_loop("i", 1, 4)                  # (End value is exclusive)
    api.load(dst=1, addr=f"16 * %i + 0")       # Load vector i
    api.add(dst=0, src1=0, src2=1)             # Accumulate into dst=0
    api.end_loop()
    api.store(src=0, addr=256)                 # Store final accumulated result
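To check the result of loop_kernel, it helps to compute the expected output independently. The NumPy sketch below mirrors the kernel's address arithmetic without using the generated API. It assumes the (4, 16) input is stored row-major, so that vector i starts at address 16 * i, which matches the address expression used in the kernel. Input values are kept small so that the int8 sum cannot overflow.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four vectors of 16 small s8 values; the sum of four stays within int8.
vectors = rng.integers(-10, 10, size=(4, 16), dtype=np.int8)

# Assumed row-major layout: vector i occupies addresses [16*i, 16*i + 16).
flat = vectors.reshape(-1)

acc = flat[0:16].copy()                   # api.load(dst=0, addr=0)
for i in range(1, 4):                     # start_loop("i", 1, 4), end exclusive
    addr = 16 * i + 0                     # same expression as in the kernel
    acc = acc + flat[addr:addr + 16]      # load vector i, accumulate

# Reference: column-wise sum over the four vectors.
expected = vectors.sum(axis=0, dtype=np.int8)
print(np.array_equal(acc, expected))
```

The array expected can then be compared against the kernel's first output with np.array_equal.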
Arithmetic Operations:
add(A, B)
- Element-wise addition
subtract(A, B)
- Element-wise subtraction
multiply(A, B)
- Element-wise multiplication
divide(A, B)
- Element-wise division
Math Functions:
exp(A)
- Element-wise exponential
tanh(A)
- Element-wise hyperbolic tangent
maximum(A, B)
- Element-wise maximum
minimum(A, B)
- Element-wise minimum
Logic Operations:
xor(A, B)
- Bitwise XOR
Shape Operations:
reshape(A)
- Reshape tensor
transpose(A, dimensions={...})
- Transpose tensor
concatenate(A)
- Concatenate tensors
slice(A, slice={...})
- Extract slice
dynamic_update_slice(A, B, dims)
- Update slice
Data Type Operations:
convert(A)
- Convert data type
bitcast_convert(A)
- Bitcast conversion
Linear Algebra:
dot(A, B, lhs_batch_dims={...}, lhs_contracting_dims={...}, rhs_batch_dims={...}, rhs_contracting_dims={...})
- Matrix multiplication
Broadcast & Constants:
broadcast(A)
- Broadcast tensor
broadcast_type(A)
- Type-aware broadcast
constant(value)
- Create constant tensor
Reduction:
reduce(A, B, dims, operation)
- Reduce along dimensions. Currently, the only supported values for operation are add_f32 and max_f32.
Conditionals:
select_lt(A, B, C, D)
- Select based on less-than comparison
clamp(min, A, max)
- Clamp values to range
Conditionals:
IF(condition)
{
// statements
}
Loops*:
REPEAT(variable, range)
{
// statements using @l.variable
}
* While REPEAT blocks are supported, we strongly recommend restructuring your tensor shapes and operations so that a REPEAT is not necessary, since this speeds up compilation. TAIDL_AMX.py contains an example: it defines two versions of the tdpbusd instruction, one with REPEAT and one without.
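The idea of replacing a loop with a single whole-tensor operation can be illustrated in NumPy. This is an analogy only: it is not TAIDL syntax and does not reproduce the tdpbusd definitions in TAIDL_AMX.py. The loop form computes one row of a matrix product per iteration, much as a REPEAT block would, while the whole-tensor form expresses the same result as one contraction.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=(16, 8)).astype(np.int32)
b = rng.integers(-4, 4, size=(8, 16)).astype(np.int32)

# Loop form: one output row per iteration (analogous to a REPEAT block).
looped = np.zeros((16, 16), dtype=np.int32)
for row in range(16):
    looped[row, :] = a[row, :] @ b

# Whole-tensor form: a single contraction over the shared dimension.
batched = a @ b

print(np.array_equal(looped, batched))
```

In TAIDL terms, the whole-tensor form corresponds to operating on the full tensor shape with a single dot instead of iterating over slices.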