02. Usage Guide


This guide walks you through the complete TritonParse workflow for analyzing the Triton kernel compilation process.

📋 Overview

The TritonParse workflow consists of three main steps:

  1. Generate Traces - Capture Triton compilation events
  2. Parse Traces - Process raw logs into a structured format
  3. Analyze Results - Visualize and explore using the web interface

🚀 Step 1: Generate Triton Trace Files

Basic Setup

First, integrate TritonParse into your Triton/PyTorch code:

import torch
import triton
import triton.language as tl

# === TritonParse initialization ===
import tritonparse.structured_logging

# Initialize structured logging to capture Triton compilation events
log_path = "./logs/"
tritonparse.structured_logging.init(log_path)
# === End TritonParse initialization ===

# Your original Triton/PyTorch code below...

Example: Complete Triton Kernel

Here's a complete example showing how to instrument a Triton kernel:

import torch
import triton
import triton.language as tl
import tritonparse.structured_logging
import tritonparse.utils

# Initialize logging
log_path = "./logs/"
tritonparse.structured_logging.init(log_path)

@triton.jit
def add_kernel(
    a_ptr,
    b_ptr,
    c_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    a = tl.load(a_ptr + offsets, mask=mask)
    b = tl.load(b_ptr + offsets, mask=mask)
    c = a + b
    tl.store(c_ptr + offsets, c, mask=mask)

def tensor_add(a, b):
    n_elements = a.numel()
    c = torch.empty_like(a)
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)
    return c

# Example usage
if __name__ == "__main__":
    # Create test tensors
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
    
    # Execute kernel (this will be traced)
    c = tensor_add(a, b)
    
    # Parse the generated logs
    tritonparse.utils.unified_parse(
        source=log_path, 
        out="./parsed_output", 
        overwrite=True
    )

PyTorch 2.0 Integration

For PyTorch 2.0 compiled functions:

import torch
import tritonparse.structured_logging
import tritonparse.utils

# Initialize logging
log_path = "./logs/"
tritonparse.structured_logging.init(log_path)

def simple_add(a, b):
    return a + b

# Test with torch.compile
compiled_add = torch.compile(simple_add)

# Create test data
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)

# Execute compiled function (this will be traced)
result = compiled_add(a, b)

# Parse the generated logs
tritonparse.utils.unified_parse(
    source=log_path, 
    out="./parsed_output", 
    overwrite=True
)

Important Environment Variables

Set these before running your code:

# Disable FX graph cache to ensure compilation happens every time
export TORCHINDUCTOR_FX_GRAPH_CACHE=0

# Enable debug logging (optional)
export TRITONPARSE_DEBUG=1

# Enable NDJSON output (default)
export TRITONPARSE_NDJSON=1

# Enable gzip compression for trace files (optional)
export TRITON_TRACE_GZIP=1
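
The same variables can be set from Python instead of the shell. A minimal sketch, assuming they are read when torch and tritonparse are imported and when structured logging is initialized:

import os

# Assumption: these variables are read at import/init time, so set them
# before importing torch/tritonparse and before calling init().
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "0"  # force recompilation
os.environ["TRITONPARSE_DEBUG"] = "1"             # optional debug logging
os.environ["TRITONPARSE_NDJSON"] = "1"            # NDJSON output (default)
os.environ["TRITON_TRACE_GZIP"] = "1"             # optional gzip compression

import tritonparse.structured_logging
tritonparse.structured_logging.init("./logs/")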

Running the Code

# Run your instrumented code
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python your_script.py

Expected Output:

Triton kernel executed successfully
Torch compiled function executed successfully
WARNING:SourceMapping:No frame_id or frame_compile_id found in the payload.
WARNING:SourceMapping:No frame_id or frame_compile_id found in the payload.
tritonparse log file list: /tmp/tmpXXXXXX/log_file_list.json

🔧 Step 2: Parse Trace Files

Using unified_parse

The unified_parse function processes raw logs into a structured format:

import tritonparse.utils

# Parse logs from directory
tritonparse.utils.unified_parse(
    source="./logs/",           # Input directory with raw logs
    out="./parsed_output",      # Output directory for processed files
    overwrite=True              # Overwrite existing output directory
)

Advanced Parsing Options

# Parse with additional options
tritonparse.utils.unified_parse(
    source="./logs/",
    out="./parsed_output",
    overwrite=True,
    rank=0,                     # Analyze specific rank (for multi-GPU)
    all_ranks=False,            # Analyze all ranks
    verbose=True                # Enable verbose logging
)

Understanding the Output

After parsing, you'll have:

parsed_output/
├── kernel_1_hash.gz          # Compressed kernel trace
├── kernel_2_hash.gz          # Another kernel trace
├── ...
└── log_file_list.json        # Index of all generated files

Each .gz file contains:

  • Kernel metadata (grid size, block size, etc.)
  • All IR stages (TTGIR, TTIR, LLIR, PTX, AMDGCN)
  • Source mappings between IR stages
  • Compilation stack traces
  • Performance metrics
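
You can also inspect a parsed trace outside the web interface. A minimal sketch, assuming each .gz file is gzip-compressed NDJSON (one JSON event per line), which matches the zcat tip in the Troubleshooting section below:

import glob
import gzip
import json

# Count the events in each parsed trace file.
# Assumption: each .gz file is gzip-compressed NDJSON (one JSON object per line).
for path in glob.glob("./parsed_output/*.gz"):
    with gzip.open(path, "rt") as f:
        events = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(events)} events")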

Command Line Usage

You can also use the command line interface:

# Basic usage
python -m tritonparse.utils ./logs/ -o ./parsed_output

# With options
python -m tritonparse.utils ./logs/ -o ./parsed_output --overwrite --verbose

# Parse specific rank
python -m tritonparse.utils ./logs/ -o ./parsed_output --rank 0

# Parse all ranks
python -m tritonparse.utils ./logs/ -o ./parsed_output --all-ranks

🌐 Step 3: Analyze with Web Interface

Option A: Online Interface (Recommended)

  1. Visit the live tool: https://pytorch-labs.github.io/tritonparse/

  2. Load your trace files:

    • Click "Browse Files" or drag-and-drop
    • Select .gz files from your parsed_output directory
    • Or select .ndjson files from your logs directory
  3. Explore the visualization:

    • Overview Tab: Kernel metadata, call stack, IR links
    • Comparison Tab: Side-by-side IR comparison with line mapping

Option B: Local Development Interface

For contributors or custom deployments:

cd website
npm install
npm run dev

Access at http://localhost:5173

Supported File Formats

Format    Description                Source Mapping   Recommended
.gz       Compressed parsed traces   ✅ Yes           ✅ Yes
.ndjson   Raw trace logs             ❌ No            ⚠️ Basic use only

Note: .ndjson files don't contain source code mappings between IR stages. Always use .gz files for full functionality.

📊 Understanding the Results

Kernel Overview

The overview page shows:

  • Kernel Information: Name, hash, grid/block sizes
  • Compilation Metadata: Device, compile time, memory usage
  • Call Stack: Python source code that triggered compilation
  • IR Navigation: Links to different IR representations

Code Comparison

The comparison view offers:

  • Side-by-side IR viewing: Compare different compilation stages
  • Synchronized highlighting: Click a line to see corresponding lines in other IRs
  • Source mapping: Trace transformations across compilation pipeline

IR Stages Explained

Stage    Description                               When Generated
TTIR     Triton IR - language-level operations     After the frontend parses the kernel
TTGIR    Triton GPU IR - GPU-specific operations   After lowering TTIR for the target GPU
LLIR     LLVM IR - low-level operations            After LLVM conversion
PTX      NVIDIA PTX assembly                       For NVIDIA GPUs
AMDGCN   AMD GPU assembly                          For AMD GPUs

🎯 Common Use Cases

1. Debugging Compilation Issues

# Enable debug logging
import os
os.environ['TRITONPARSE_DEBUG'] = '1'

# Your problematic kernel code
@triton.jit
def problematic_kernel(...):
    # ... kernel code that fails
    pass

# Analyze the compilation trace to identify issues
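
To keep a trace even when compilation fails, you can parse the logs in a finally block. A hedged sketch; the kernel signature, launch arguments, and grid below are placeholders for your failing kernel:

import torch
import tritonparse.utils

# Hedged sketch: launch the failing kernel, then parse whatever trace was
# captured. The arguments and grid are placeholders, not a real signature.
x = torch.randn(1024, device="cuda")
try:
    problematic_kernel[(1,)](x, x.numel(), BLOCK_SIZE=1024)  # hypothetical launch
except Exception as e:
    print(f"Compilation or launch failed: {e}")
finally:
    tritonparse.utils.unified_parse(
        source="./logs/", out="./parsed_output", overwrite=True
    )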

2. Performance Analysis

# Trace multiple kernel variants
variants = [
    ('baseline', baseline_kernel),
    ('optimized', optimized_kernel),
]

for name, kernel in variants:
    # Initialize separate logging for each variant
    log_path = f"./logs_{name}/"
    tritonparse.structured_logging.init(log_path)
    
    # Run kernel (grid and arguments are placeholders for your variant)
    result = kernel[grid](...)
    
    # Parse logs
    tritonparse.utils.unified_parse(
        source=log_path,
        out=f"./parsed_output_{name}",
        overwrite=True
    )

3. Understanding Compilation Pipeline

# Trace a simple kernel to understand compilation stages
@triton.jit
def simple_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    
    x = tl.load(x_ptr + offsets, mask=mask)
    y = x * 2.0  # Simple operation
    tl.store(y_ptr + offsets, y, mask=mask)

# Trace and analyze each compilation stage

🔍 Advanced Features

Filtering Kernels

Set a kernel allowlist to trace only specific kernels:

# Only trace kernels matching these patterns
export TRITONPARSE_KERNEL_ALLOWLIST="my_kernel*,important_*"
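
The allowlist can also be set from Python. A minimal sketch, assuming the variable is read when structured logging is initialized:

import os

# Assumption: the allowlist is read when structured logging is initialized,
# so set it before calling init().
os.environ["TRITONPARSE_KERNEL_ALLOWLIST"] = "my_kernel*,important_*"

import tritonparse.structured_logging
tritonparse.structured_logging.init("./logs/")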

Multi-GPU Analysis

For multi-GPU setups:

# Parse all ranks
tritonparse.utils.unified_parse(
    source="./logs/",
    out="./parsed_output",
    all_ranks=True  # Analyze all GPU ranks
)

# Or parse specific rank
tritonparse.utils.unified_parse(
    source="./logs/",
    out="./parsed_output",
    rank=1  # Analyze GPU rank 1
)

Launch Tracing

Enable launch metadata tracing:

# Enable launch tracing (experimental)
export TRITON_TRACE_LAUNCH=1

🐛 Troubleshooting

Common Issues

1. No Kernels Found

Error: No kernels found in the processed data

Solutions:

  • Ensure TORCHINDUCTOR_FX_GRAPH_CACHE=0 is set
  • Check that your kernel actually executes
  • Verify Triton is properly installed

2. Empty Log Files

Warning: Empty log directory

Solutions:

  • Ensure tritonparse.structured_logging.init() is called before kernel execution
  • Check that your code path actually executes Triton kernels
  • Verify log directory permissions

3. Source Mapping Warnings

WARNING:SourceMapping:No frame_id or frame_compile_id found in the payload.

Solutions:

  • This is often normal for PyTorch 2.0 compiled functions
  • Use .gz files instead of .ndjson for full source mapping
  • Check that parsing completed successfully

4. Web Interface Issues

Error: Failed to load trace file

Solutions:

  • Ensure you're using .gz files from parsed_output
  • Check file size limits (browser dependent)
  • Try with a smaller trace file first

Debug Tips

  1. Enable debug logging:

    export TRITONPARSE_DEBUG=1
  2. Check log file contents:

    ls -la ./logs/
    head -n 5 ./logs/*.ndjson
  3. Verify parsing output:

    ls -la ./parsed_output/
    zcat ./parsed_output/*.gz | head -n 10
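
Beyond these tips, a quick script can confirm that the raw logs actually contain events before you parse them. A minimal sketch, assuming NDJSON logs (one JSON object per line, the default):

import glob

# Count events per raw log file. Assumption: logs are NDJSON
# (one JSON object per line), the default TRITONPARSE_NDJSON=1 format.
for path in glob.glob("./logs/*.ndjson"):
    with open(path) as f:
        n = sum(1 for line in f if line.strip())
    print(f"{path}: {n} events")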

🔗 Next Steps

After successfully generating and analyzing traces:

  1. Learn the Web Interface: Read the Web Interface Guide
  2. Explore Advanced Features: Check Advanced Examples
  3. Understand File Formats: See File Formats documentation
  4. Get Help: Visit our FAQ or GitHub Discussions
