A high-performance DSL compiler for stencil computations, optimized for ARM architectures. This repository contains the source code for *Efficient Locality-aware Instruction Stream Scheduling for Stencil Computation on ARM Processors* (ICS '25).
- 2D/3D stencil computation optimization
- Locality-aware memory allocation and thread binding
- Independent instruction stream scheduling via the Serial-FMA to Tree-Based Reduction (SFTBR) method (see the conceptual sketch after this list)
- Automatic kernel tuning using genetic algorithms
- Multi-threaded execution with pthreads
- ARM NEON SIMD vectorization support
- Domain Specific Language (DSL) for stencil definition
- Cross-platform support (tested on Phytium and Kunpeng ARM processors)
- More details are available in our ICS '25 paper (see the citation below)
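The SFTBR idea can be illustrated with a toy reduction. This is a conceptual sketch only: the real transformation reorders NEON FMA instruction streams inside the generated C kernels, not Python arithmetic, and the function names here are purely illustrative.

```python
def serial_fma(coeffs, points):
    # Serial accumulation: every fused multiply-add depends on the previous
    # result, forming one long dependency chain.
    acc = 0.0
    for c, p in zip(coeffs, points):
        acc = acc + c * p
    return acc

def tree_reduction(coeffs, points):
    # Tree-based reduction: products are combined pairwise, so independent
    # partial sums can be issued in parallel by the core.
    terms = [c * p for c, p in zip(coeffs, points)]
    while len(terms) > 1:
        terms = [terms[i] + terms[i + 1] if i + 1 < len(terms) else terms[i]
                 for i in range(0, len(terms), 2)]
    return terms[0]

# Both orderings compute the same stencil tap sum (up to rounding),
# but the tree form shortens the critical path of dependent FMAs.
coeffs = [0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1]  # 9-point box weights
points = list(range(9))                                  # neighbouring grid values
assert abs(serial_fma(coeffs, points) - tree_reduction(coeffs, points)) < 1e-9
```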
Install the system dependencies, then install the package in editable mode:

```bash
# Ubuntu/Debian
sudo apt-get install gcc libnuma-dev

cd AOStencil
pip install -e .
```
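A quick post-install check is to import the package; the module name `aostencil` is the one used throughout the examples below.

```python
# Minimal post-install check: if this import succeeds, the package is on the path.
import aostencil
print("aostencil imported from", aostencil.__file__)
```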
The example below builds and tunes a 2D 9-point box stencil:

```python
import numpy as np

from aostencil import Stencil2dIR, kernel_tune_stencil_2d, gen_stencil_2d

# Create coefficient matrix for a 9-point box stencil
coefficients = np.array([
    [0.1, 0.1, 0.1],
    [0.1, 0.2, 0.1],
    [0.1, 0.1, 0.1]
])

# A star-shaped stencil simply uses zero coefficients off the cross, e.g.:
# coefficients = np.array([
#     [0,  0,  .1, 0,  0],
#     [0,  0,  .2, 0,  0],
#     [.1, .2, .4, .2, .1],
#     [0,  0,  .2, 0,  0],
#     [0,  0,  .1, 0,  0]])

# Initialize a 2D stencil with parameters:
#   grid dimensions: row=8192, col=8194
#   coefficients, edge_length=1 (halo width of the 3x3 kernel)
#   datatype='float' (single precision)
stencil = Stencil2dIR(
    8192, 8194, coefficients, 1, 'float'
)
stencil.set_name("2d9pt_box")
# Configure NUMA topology (16 nodes, 8 cores each)
stencil.set_numa_config(16, 8)
stencil.set_run_config(16, 8)
# Auto-tune for optimal parameters
optimized_stencil = kernel_tune_stencil_2d(stencil)
# Generate optimized C code
kernel_code = gen_stencil_2d(optimized_stencil)
# Save generated kernel
with open('optimized_kernel.c', 'w') as f:
    f.write(kernel_code)
```
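The generated C file is built with an ordinary C toolchain. The snippet below is only a sketch of one way to drive that step from Python; the compiler and flags are assumptions and should be adapted to your system (the dependencies above suggest linking the final binary against libnuma and pthreads).

```python
import subprocess

# Hypothetical compile step for the generated kernel; adjust the compiler and
# flags for your toolchain. -march=armv8-a is only a placeholder, and the final
# link typically needs -lnuma -lpthread for the libnuma/pthreads dependencies.
subprocess.run(
    ["gcc", "-O3", "-march=armv8-a", "-c", "optimized_kernel.c",
     "-o", "optimized_kernel.o"],
    check=True,
)
```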
You can find more examples in the `benchmark/` directory.
Stencils can also be defined in the DSL and loaded directly:

```python
from aostencil import from_dsl_load_stencil, kernel_tune_stencil_2d

# Load stencil from DSL
with open("stencil.dsl") as f:
    stencil = from_dsl_load_stencil(f.read())

# Configure hardware topology (NUMA nodes x cores per node)
stencil.set_numa_config(16, 8)  # Example for Phytium FT-2000+
stencil.set_run_config(16, 8)   # 16 NUMA nodes, 8 cores per node

# Auto-tune the stencil
optimized_stencil = kernel_tune_stencil_2d(stencil)
```
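From here, code generation works exactly as in the NumPy-based example above (the output filename is just an example):

```python
from aostencil import gen_stencil_2d

# Lower the tuned stencil to C source, as in the quick-start example.
with open("2d9pt_box_kernel.c", "w") as f:
    f.write(gen_stencil_2d(optimized_stencil))
```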
For advanced optimization control, directly use the tuning classes:
```python
from aostencil import Stencil2dIR, stencil2d_opt_search

# Create a search instance with custom genetic-algorithm parameters
tuner = stencil2d_opt_search(
    stencil2d=Stencil2dIR(...),   # Your stencil configuration
    cache_path='tuning_cache',    # Cache directory for generated kernels
    population_size=200,          # Larger population
    mutation_rate=0.2,            # Balance exploration/exploitation
    test_time_per_iter=50         # Timing tests per tuning iteration
)

# Run the optimization (returns the best configuration found)
optimized_stencil, exec_time = tuner.search(record_log=True)
```
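Because each search instance is self-contained, the genetic-algorithm settings can be swept in a plain loop. The sketch below is illustrative only: the parameter values are arbitrary, and `stencil` is assumed to be a configured `Stencil2dIR`, as in the quick-start example.

```python
# Illustrative sweep over mutation rates; keep whichever candidate runs fastest.
best_stencil, best_time = None, float("inf")
for rate in (0.1, 0.2, 0.3):
    tuner = stencil2d_opt_search(
        stencil2d=stencil,          # a configured Stencil2dIR (see above)
        cache_path='tuning_cache',
        population_size=200,
        mutation_rate=rate,
        test_time_per_iter=50,
    )
    candidate, exec_time = tuner.search(record_log=True)
    if exec_time < best_time:
        best_stencil, best_time = candidate, exec_time
```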
The benchmark directory contains pre-configured stencil implementations for different ARM architectures:
```text
benchmark/
├── dsl/              # DSL definition examples
│   ├── 2d9pt_box.dsl     # 2D 9-point box stencil DSL
│   ├── 3d7pt_star.dsl    # 3D 7-point star stencil DSL
│   └── dsl_example.py    # DSL loading examples
│
├── phytium/          # Phytium FT-2000+ optimized benchmarks
│   ├── 2d_*.py           # 2D stencils
│   ├── 3d_*.py           # 3D stencils
│   ├── run.sh            # Batch execution script
│   └── str_main.py       # Verification kernels
│
├── kunpeng/          # Kunpeng 920 optimized benchmarks
│   ├── 2d_*.py
│   ├── 3d_*.py
│   └── ...               # Same structure as phytium
│
├── phytium_f64/      # Phytium float64 benchmarks
└── kunpeng_f64/      # Kunpeng float64 benchmarks
```
This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
If you use this project in your research, please cite our paper:
```bibtex
@inproceedings{liu2025aostencil,
  title     = {Towards Efficient Instruction Stream Scheduling for Stencil Computation on ARM Processors},
  author    = {Shanghao Liu and Hailong Yang and Xin You and Zhongzhi Luan and Yi Liu and Depei Qian},
  booktitle = {2025 International Conference on Supercomputing (ICS '25), June 8--11, 2025, Salt Lake City, UT, USA},
  year      = {2025}
}
```