fulvius31/triton-cache-comparison

Triton Cache Performance Comparison

[Figure: performance plot] CUDA: Triton cache significantly improves startup performance

[Figure: performance plot] ROCm: Triton cache significantly improves startup performance

Proof of Concept

This benchmark compares GPU memory usage and startup performance of Triton kernels in two scenarios:

  1. With Triton cache pre-loaded: the cache exists from a previous run
  2. Without Triton cache: the cache directory starts empty

Key findings:

  • Triton cache significantly reduces startup time
  • More consistent memory usage patterns with cached kernels
  • Improved resource utilization during initial model loading

Prerequisites

Hardware Requirements

  • NVIDIA GPU (CUDA) or AMD GPU (ROCm)

Usage

Basic Benchmark

./benchmark.sh --arch [cuda|rocm]

Advanced Options

# Custom cache location and script
./benchmark.sh \
  --arch cuda \
  --triton-cache-dir ~/alternate_cache \
  --script ./custom_script.py

Expected Output

  1. gpu_usage_log.csv - Time-series memory data
  2. gpu_memory_usage_comparison.png - Visualization plot
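
The CSV log can also be summarized without the plot. The following is a minimal sketch, not the repository's own tooling: the column names `run` and `memory_mib` are assumptions, since the actual header of gpu_usage_log.csv is not documented here, and `summarize` is a hypothetical helper.

```python
import csv
from collections import defaultdict
from typing import Dict, TextIO


def summarize(log: TextIO) -> Dict[str, Dict[str, float]]:
    """Return peak memory and sample count for each run label.

    Assumes hypothetical columns `run` and `memory_mib`; adjust both
    names to match the real header of gpu_usage_log.csv.
    """
    peaks: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(log):
        label = row["run"]
        peaks[label] = max(peaks[label], float(row["memory_mib"]))
        counts[label] += 1
    # With one sample per second, the sample count approximates run time in seconds.
    return {k: {"peak_mib": peaks[k], "samples": counts[k]} for k in peaks}
```

To use it, open the log with `open("gpu_usage_log.csv", newline="")` and pass the file object to `summarize`.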

Technical Details

Benchmark Process

  1. Cold Start (no cache):

    • Purge existing Triton cache
    • Run script
    • Log GPU memory usage at 1 Hz (one sample per second)
  2. Warm Start (with cache):

    • Reuse the kernels compiled during the cold start
    • Run identical script
    • Compare memory/time metrics
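
The cold and warm phases above can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of benchmark.sh: the helper names `purge_cache`, `timed_run`, and `cold_then_warm` are hypothetical, and GPU memory logging is left out so the sketch runs without a GPU.

```python
import os
import shutil
import subprocess
import time
from pathlib import Path
from typing import Sequence, Tuple


def purge_cache(cache_dir: Path) -> None:
    """Delete the cache directory so the next run starts cold."""
    shutil.rmtree(cache_dir, ignore_errors=True)


def timed_run(cmd: Sequence[str], cache_dir: Path) -> float:
    """Run cmd with TRITON_CACHE_DIR set to cache_dir; return wall time in seconds."""
    env = dict(os.environ, TRITON_CACHE_DIR=str(cache_dir))
    start = time.perf_counter()
    subprocess.run(list(cmd), env=env, check=True)
    return time.perf_counter() - start


def cold_then_warm(cmd: Sequence[str], cache_dir: Path) -> Tuple[float, float]:
    """Time a cold run (cache purged first), then a warm run that reuses the cache."""
    purge_cache(cache_dir)
    cold = timed_run(cmd, cache_dir)
    warm = timed_run(cmd, cache_dir)  # same command, cache left in place
    return cold, warm
```

A call such as `cold_then_warm([sys.executable, "custom_script.py"], Path.home() / ".triton" / "cache")` would return the two wall-clock times; note that it deletes the given cache directory first.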

Key Configuration

export TRITON_CACHE_DIR="$HOME/.triton/cache"  # Default cache location ($HOME, because ~ is not expanded inside quotes)
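
A script that reads this variable may receive a literal `~` if the value was quoted in the shell, so expanding it explicitly is a safe habit. A minimal sketch, assuming a hypothetical helper `resolve_cache_dir` that is not part of Triton or of this repository:

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location named in this README; assumed here, not queried from Triton.
DEFAULT_CACHE_DIR = "~/.triton/cache"


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache directory from TRITON_CACHE_DIR, expanding a leading `~`.

    Hypothetical helper for illustration; falls back to DEFAULT_CACHE_DIR
    when the variable is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("TRITON_CACHE_DIR") or DEFAULT_CACHE_DIR
    return Path(raw).expanduser()
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check in isolation, for example `resolve_cache_dir({"TRITON_CACHE_DIR": "~/alternate_cache"})`.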

License

Apache 2.0 (see LICENSE)
