-
Notifications
You must be signed in to change notification settings - Fork 296
Add Claude MD file #2311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
drisspg
wants to merge
1
commit into
main
Choose a base branch
from
drisspg/stack/66
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Add Claude MD file #2311
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,322 @@ | ||
# CLAUDE.md | ||
|
||
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | ||
|
||
## Project Overview | ||
|
||
torchao is an Architecture Optimization library that accelerates PyTorch models through quantization and sparsification techniques. It provides optimization for weights, gradients, activations, and more for both inference and training with minimal code changes. | ||
|
||
## Prerequisites | ||
|
||
### Required Dependencies | ||
```bash | ||
# Install PyTorch (required before torchao installation) | ||
pip install torch torchvision torchaudio | ||
|
||
# For development, you may need specific PyTorch versions | ||
# Check requirements.txt or setup.py for version constraints | ||
``` | ||
|
||
## Development Commands | ||
|
||
### Installation & Build | ||
```bash | ||
# Development install (Python-only mode, fastest for development) | ||
USE_CPP=0 python setup.py develop | ||
|
||
# Full build with C++/CUDA extensions | ||
python setup.py develop | ||
|
||
# Install specific version of ruff for linting | ||
pip install ruff==0.11.6 | ||
``` | ||
|
||
### Testing | ||
```bash | ||
# Run specific test files | ||
pytest test/float8/test_base.py | ||
pytest test/quantization/test_quant_api.py | ||
pytest test/dtypes/test_affine_quantized.py | ||
|
||
# Run comprehensive float8 tests | ||
./test/float8/test_everything.sh | ||
|
||
# Run all tutorials | ||
./tutorials/run_all.sh | ||
``` | ||
|
||
### Linting & Formatting | ||
```bash | ||
# Install pre-commit hooks (one-time setup) | ||
pre-commit install | ||
|
||
# Run all pre-commit checks | ||
pre-commit run --all-files | ||
|
||
# Run pre-commit on staged files only | ||
pre-commit run | ||
``` | ||
|
||
## Architecture Overview | ||
|
||
### Workflow-Based Structure (2025H1 Refresh) | ||
|
||
TorchAO is transitioning from an AQT-centered structure to a **workflow-based organization** that embraces vertical workflows for optimal performance and maintainability. | ||
|
||
### Current Structure | ||
|
||
**torchao/quantization/** - User-facing APIs | ||
- `quantize_()` - Main quantization function with workflow-specific configs | ||
- `autoquant.py` - Automatic quantization selection | ||
- Configuration classes for different workflows | ||
|
||
**torchao/sparsity/** - User-facing sparsity APIs | ||
- `sparsify_()` - Main sparsification function | ||
- Sparsity configuration classes | ||
|
||
**Vertical Workflows** (in transition): | ||
- **int8_weight_only** - INT8 weight-only quantization workflows | ||
- **float8** - High-performance float8 training (1.5x speedup vs FP16) | ||
- **nf4** - NF4 (4-bit normal float) for QLoRA | ||
- **pt2e** - PT2E graph mode quantization (migrating from PyTorch Core) | ||
- **executorch** - ExecutorTorch workflows (moving from experimental) | ||
|
||
**torchao/csrc/** - Custom kernels | ||
- CUTLASS-based implementations for maximum performance | ||
- ROCm support for AMD GPUs | ||
- CPU kernels with AVX512 optimizations | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @metascroy maybe add details on experimental folder including mps kernels. |
||
|
||
**torchao/experimental/** - Experimental features | ||
- MPS acceleration for Apple Silicon | ||
- Low-bit quantization research (1-7 bit weights) | ||
- Prototype workflows before graduation | ||
|
||
### Design Philosophy | ||
|
||
**Vertical Workflows Over Horizontal Abstractions:** | ||
- Self-contained workflows that can move fast on SOTA performance/accuracy | ||
- Workflows choose abstractions that fit their needs rather than forced repo-wide patterns | ||
- Well-fitting abstractions > no abstractions > poorly fitting abstractions | ||
- Duplicated easy-to-understand code preferred over highly abstracted hard-to-understand code | ||
|
||
|
||
## Build Configuration | ||
|
||
The build system uses environment variables for configuration: | ||
|
||
**Core Controls:** | ||
- `USE_CPP=0|1` - Skip C++/CUDA extensions (default: 1, set to 0 for fastest dev setup) | ||
- `USE_CPU_KERNELS=0|1` - Enable optimized CPU kernels (Linux only, default: 0) | ||
- `DEBUG=0|1` - Debug build mode | ||
|
||
**Experimental Features:** | ||
- `BUILD_TORCHAO_EXPERIMENTAL=1` - Enable experimental cmake builds | ||
- `TORCHAO_BUILD_CPU_AARCH64=1` - ARM64 CPU kernels (auto-enabled on Apple Silicon) | ||
- `TORCHAO_BUILD_KLEIDIAI=1` - Kleidi AI library integration (experimental, accuracy issues) | ||
- `TORCHAO_BUILD_EXPERIMENTAL_MPS=1` - MPS acceleration (macOS only, disabled by default) | ||
- `USE_AVX512=1` - Enable AVX512 optimizations for x86 CPUs (default on Linux) | ||
|
||
## Experimental Features (Alpha) | ||
|
||
### Stability Levels | ||
**Alpha Features**: Early development stage requiring further refinement. These are prototypes due to: | ||
- Ongoing hardware support development | ||
- Non-compelling memory benchmarks | ||
- Need for compiler/kernel investment | ||
|
||
### Current Experimental Features | ||
|
||
**MX Training and Inference:** | ||
- **Status**: Prototype (hardware support not yet available) | ||
- **Description**: Tensors based on OCP MX specification for group-wise scaled float8/float6/float4/int8 | ||
- **Usage**: Group-wise quantization with MX data types | ||
|
||
**Int8 Quantized Training:** | ||
- **Status**: Prototype (memory benchmarks not yet compelling) | ||
- **Description**: Full int8 training support | ||
- **Usage**: `quantize_(model, int8_weight_only_quantized_training())` | ||
|
||
**IntX (Low-bit integers):** | ||
- **Status**: Prototype (needs compiler/kernel investment) | ||
- **Description**: Various integer types through bitpacking in pure PyTorch | ||
- **Note**: Int4 remains more compelling than smaller data types currently | ||
|
||
**Bitnet:** | ||
- **Status**: Experimental (dependent on better hardware/kernel support) | ||
- **Description**: Bitnet quantization technique | ||
- **Limitation**: Usefulness highly dependent on hardware improvements | ||
|
||
### Hardware-Specific Experimental Features | ||
|
||
**MPS Kernels (Apple Silicon):** | ||
- **Status**: Experimental (disabled by default) | ||
- **Requirements**: macOS with ARM64 architecture and MPS available | ||
- **Build**: `export TORCHAO_BUILD_EXPERIMENTAL_MPS=1` | ||
- **Features**: Metal shaders for int1mm, int2mm, int3mm, int4mm, int5mm, int6mm, int7mm | ||
|
||
**ARM64/AArch64 CPU Kernels:** | ||
- **Status**: Auto-enabled on ARM64 Macs, manual enable elsewhere | ||
- **Build**: `export TORCHAO_BUILD_CPU_AARCH64=1` | ||
- **Features**: | ||
- Quantized matrix operations with NEON intrinsics | ||
- Bit-packing operations for low-bit quantization | ||
- Lookup table (LUT) operations for weight compression | ||
- Kleidi AI integration (experimental, accuracy issues in CI) | ||
|
||
**Kleidi AI Integration:** | ||
- **Status**: Experimental (disabled by default) | ||
- **Build**: `export TORCHAO_BUILD_KLEIDIAI=1` | ||
- **Requirements**: ARM64 architecture | ||
- **Note**: Increases build time, has shown BF16 accuracy issues in CI tests | ||
|
||
|
||
## Development Patterns and Workflows | ||
|
||
### Common Development Patterns | ||
|
||
**One-line optimizations:** Use `quantize_(model, config)` and `sparsify_(model, config)` for quick model optimization | ||
- `quantize_(m, Int4WeightOnlyConfig())` applies 4-bit weight-only quantization | ||
- `sparsify_(model, BlockSparseWeightConfig())` applies block-sparse weight configuration | ||
|
||
**Model-specific optimizations:** Specialized patterns for different model types | ||
- **SAM2**: Use "Fast Mode" with `torch.compile` and "Furious Mode" with FP16 precision | ||
- **LLMs**: Common patterns include KV cache quantization and Sparse-Marlin integration | ||
|
||
**Composability focus:** Design optimizations to work with `torch.compile()` and FSDP2 without graph breaks | ||
|
||
### Workflow Development Approach | ||
|
||
**Workflow-First Development:** | ||
- Focus on vertical workflows rather than horizontal tensor subclass abstractions | ||
- Each workflow is self-contained and optimized for its specific use case | ||
- Workflows can choose their own abstractions and implementation patterns | ||
|
||
**Workflow Implementation Patterns:** | ||
- **Config-based**: Each workflow provides configuration classes for `quantize_()` | ||
- **Kernel integration**: Workflows integrate with `torchao/csrc/` kernels as needed | ||
- **Composability**: Workflows maintain compatibility with `torch.compile` and FSDP2 | ||
- **Independence**: Workflows avoid dependencies on repo-wide abstractions unless beneficial | ||
|
||
**Abstraction Selection:** | ||
- Workflows choose abstractions that make their implementation cleaner and more maintainable | ||
- No enforcement of repo-wide abstractions without clear benefits | ||
- Many-to-many mapping between abstractions and workflows is acceptable | ||
|
||
### Development Tasks | ||
|
||
**Adding a new workflow:** | ||
1. Create a new workflow directory in the appropriate location | ||
2. Implement a configuration class for the workflow | ||
3. Add the config to `torchao/quantization/__init__.py` for `quantize_()` integration | ||
4. Implement the workflow using patterns that fit your use case (tensor subclass, module swap, etc.) | ||
5. Add any required kernels to `torchao/csrc/` or `torchao/kernel/` | ||
6. Choose helpful abstractions from common utilities as needed | ||
|
||
**Current workflow examples:** | ||
- `int8_weight_only` - Uses AQT patterns where beneficial | ||
- `float8` - Uses `Float8Tensor` and specialized training patterns | ||
- `nf4` - Uses NF4-specific tensor subclass for QLoRA | ||
- `pt2e` - Uses graph mode quantization patterns | ||
|
||
**Performance optimization:** | ||
1. Custom kernels go in `torchao/csrc/` with architecture-specific builds (SM90a, SM100a) | ||
2. Use `opcheck()` in tests to ensure `torch.compile` compatibility | ||
3. Implement fallback paths for unsupported configurations | ||
|
||
**Testing:** | ||
1. Follow patterns in `test/` directory, use `pytest` for individual tests | ||
2. Use `TorchAOBasicTestCase` and `TorchAOCompileTestCase` for tensor subclass tests | ||
3. Include SQNR assertions for quantization accuracy verification | ||
|
||
**Experimental workflows:** | ||
1. Develop in `torchao/experimental/` or `torchao/prototype/` as appropriate | ||
2. Use `_check_torchao_ops_loaded()` to verify experimental kernels are loaded | ||
3. Follow Alpha feature guidelines for prototype development | ||
4. Graduate to main workflow structure when ready for production use | ||
5. Focus on vertical workflow patterns rather than forcing horizontal abstractions | ||
|
||
## Common Issues and Debugging | ||
|
||
### Frequent Issues and Solutions | ||
|
||
**torch.compile() graph breaks:** | ||
- **Issue**: Custom kernels causing graph breaks when used with `torch.compile()` | ||
- **Debug**: Run with `fullgraph=True` and `TORCH_LOGS="output_code"` to inspect generated code | ||
- **Solution**: Ensure tensor subclasses implement `__tensor_flatten__` and `__tensor_unflatten__` | ||
|
||
**Device and data type compatibility:** | ||
- **Issue**: Some experimental features only support specific devices (e.g., CPU-only embedding quantization) | ||
- **Solution**: Check feature documentation for supported devices and data types | ||
- **Example**: MPS quantization requires macOS with ARM64 architecture | ||
|
||
**Performance analysis:** | ||
- **Issue**: Need to benchmark optimized models | ||
- **Tools**: Use `print_op_and_shapes.py` to identify relevant shapes for microbenchmarking | ||
- **Profiling**: Add `--profile=profile_path` to benchmark scripts for Chrome traces | ||
|
||
**Accuracy degradation:** | ||
- **Issue**: Quantization/sparsity causing accuracy loss | ||
- **Analysis**: Check scale/zero_point for quantization, mask for sparsity | ||
- **Solution**: Consider Quantization-Aware Training (QAT) for accuracy recovery | ||
|
||
### Common Debugging Commands | ||
|
||
```bash | ||
# Check CUDA availability and version | ||
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, version: {torch.version.cuda}')" | ||
|
||
# Check build configuration | ||
python -c "import torchao; print(torchao.__file__)" | ||
|
||
# Debug torch.compile issues | ||
TORCH_LOGS="output_code" python your_script.py | ||
|
||
# Run specific test with verbose output | ||
pytest -v -s test/quantization/test_quant_api.py::test_specific_function | ||
|
||
# Check for CUDA kernel compilation issues | ||
USE_CPP=1 python setup.py develop --verbose | ||
|
||
# Verify experimental kernels are loaded | ||
python -c "from torchao.experimental import _check_torchao_ops_loaded; _check_torchao_ops_loaded()" | ||
|
||
# Profile model performance | ||
python benchmark_script.py --profile=profile_output.json | ||
``` | ||
|
||
## Testing and Benchmarking | ||
|
||
### Testing Infrastructure | ||
|
||
**Test organization:** | ||
- Unit tests: `test_base.py` for core components like `Float8Tensor` | ||
- Integration tests: `test_integration.py` for AOTI compilation with tensor subclasses | ||
- Numerical accuracy: `test_numerics_integration.py` for Float8 operations | ||
|
||
**Test utilities:** | ||
- `TorchAOBasicTestCase`: Basic tensor subclass testing | ||
- `TorchAOCompileTestCase`: `torch.compile` compatibility testing | ||
- SQNR assertions for minimum signal-to-quantization noise ratio | ||
|
||
### Performance Benchmarking | ||
|
||
**Microbenchmarks:** | ||
- `bench_matmul.py`: Benchmark `torch._scaled_mm` function | ||
- `bench_linear_float8.py`: Benchmark `nn.Linear` vs `Float8Linear` | ||
- `benchmark_aq.py`: Benchmark various quantized tensor subclasses | ||
|
||
**Model-level benchmarks:** | ||
- **Llama**: `generate.py` for generation performance, `eval.py` for evaluation | ||
- **SAM**: `eval_combo.py` for SAM model benchmarking | ||
- Enable profiling with `generate_model_profile` for detailed analysis | ||
|
||
**Continuous Integration:** | ||
- `dashboard_perf_test.yml`: Nightly A100 benchmarks with dashboard visualization | ||
- `torchao_experimental_test.yml`: Experimental feature validation | ||
|
||
## Important Notes | ||
|
||
- Always run `pre-commit run --all-files` before committing | ||
- Use `USE_CPP=0` for faster iteration during Python-only development | ||
- CUTLASS kernels have architecture-specific builds (SM90a, SM100a) based on CUDA version | ||
- Git submodules (CUTLASS) are automatically initialized during build |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this install all the pre-reqs? I found that torch wasnt one of those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stamping but this might be good to fix