CUDA: Triton cache significantly improves startup performance
ROCm: Triton cache significantly improves startup performance
This benchmark compares GPU memory usage and startup performance of Triton kernels in two scenarios:
- With Triton cache pre-loaded - Cache exists from previous run
- Without Triton cache - Clean cache state
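For concreteness, here is a minimal sketch of the kind of Triton workload such a benchmark might launch. The kernel, tensor sizes, and file name are illustrative assumptions, not the actual benchmarked script: the first run JIT-compiles the kernel (cold start), while later runs can reuse the compiled artifacts under TRITON_CACHE_DIR (warm start).

```python
# workload.py (hypothetical name) - a tiny Triton kernel whose first launch
# triggers JIT compilation and populates the Triton cache.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def run_once(n=1 << 20):
    # "cuda" also covers ROCm builds of PyTorch, where it maps to HIP devices.
    x = torch.rand(n, device="cuda")
    y = torch.rand(n, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    torch.cuda.synchronize()

if __name__ == "__main__":
    run_once()
```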
Key findings:
- Triton cache significantly reduces startup time
- More consistent memory usage patterns with cached kernels
- Improved resource utilization during initial model loading
Requirements:
- NVIDIA GPU (CUDA) or AMD GPU (ROCm)
Usage:

```bash
./benchmark.sh --arch [cuda|rocm]
```
```bash
# Custom cache location and script
./benchmark.sh \
  --arch cuda \
  --triton-cache-dir ~/alternate_cache \
  --script ./custom_script.py
```
Output files:
- gpu_usage_log.csv - Time-series memory data
- gpu_memory_usage_comparison.png - Visualization plot
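As an illustration of how gpu_usage_log.csv might be turned into gpu_memory_usage_comparison.png, here is a hedged post-processing sketch. The column names (run, timestamp_s, memory_mib) are assumptions about the log layout rather than the benchmark's actual format.

```python
# Illustrative plotting sketch - assumes gpu_usage_log.csv has the columns
# run ("cold"|"warm"), timestamp_s, and memory_mib; adjust to the real layout.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("gpu_usage_log.csv")
fig, ax = plt.subplots()
for run, trace in df.groupby("run"):
    ax.plot(trace["timestamp_s"], trace["memory_mib"], label=run)
ax.set_xlabel("Time since launch (s)")
ax.set_ylabel("GPU memory used (MiB)")
ax.set_title("Triton cache: cold vs. warm startup")
ax.legend()
fig.savefig("gpu_memory_usage_comparison.png", dpi=150)
```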
- Cold Start (no cache):
  - Purge the existing Triton cache
  - Run the script
  - Log GPU memory usage at 1 Hz
- Warm Start (with cache):
  - Reuse the kernels generated during the cold run
  - Run the identical script
  - Compare memory and time metrics (a driver sketch follows below)
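The cold/warm sequence above could be reproduced with a small driver along these lines. The helper names, the workload.py file name, and the CSV layout are assumptions for illustration, not the internals of benchmark.sh; GPU memory is polled once per second via nvidia-smi (swap in the equivalent rocm-smi query on AMD).

```python
# Illustrative cold/warm driver sketch (not the actual benchmark.sh logic).
import csv
import os
import shutil
import subprocess
import time

CACHE_DIR = os.path.expanduser(os.environ.get("TRITON_CACHE_DIR", "~/.triton/cache"))

def gpu_memory_mib():
    """Used GPU memory in MiB for GPU 0, queried via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.splitlines()[0].strip())

def run_and_log(label, script, writer):
    """Launch the workload and log memory at ~1 Hz until it exits."""
    start = time.time()
    proc = subprocess.Popen(["python", script])
    while proc.poll() is None:
        writer.writerow([label, round(time.time() - start, 1), gpu_memory_mib()])
        time.sleep(1.0)
    return time.time() - start

with open("gpu_usage_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run", "timestamp_s", "memory_mib"])
    shutil.rmtree(CACHE_DIR, ignore_errors=True)          # cold start: purge cache
    cold_s = run_and_log("cold", "workload.py", writer)
    warm_s = run_and_log("warm", "workload.py", writer)   # warm start: cache populated
print(f"cold start: {cold_s:.1f}s, warm start: {warm_s:.1f}s")
```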
```bash
export TRITON_CACHE_DIR="$HOME/.triton/cache"  # Default cache location
```
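To confirm which cache directory a run will use, and whether it is already warm, a quick check along these lines can help; it merely inspects the directory, assuming the default location when the variable is unset.

```python
# Quick check: which Triton cache directory is active and how many entries it holds.
import os

cache_dir = os.path.expanduser(os.environ.get("TRITON_CACHE_DIR", "~/.triton/cache"))
entries = os.listdir(cache_dir) if os.path.isdir(cache_dir) else []
print(f"cache dir: {cache_dir} ({len(entries)} cached entries)")
```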
Licensed under the Apache License 2.0 (see the LICENSE file).