| Blog | Documentation | Slack | Discussion Forum |
This repository is a fork of FlashInfer and will be updated when new release versions of FlashInfer are published, until the FlashInfer team decides to support Windows officially (see flashinfer-ai#964).
Don't open a new Issue to request a build of a specific commit. Wait for a new stable release.
Don't open Issues for general FlashInfer questions or problems unrelated to Windows; only Windows-specific issues belong here. Any Issue that is not Windows-specific will be closed automatically.
Don't request a wheel for your specific environment. Currently, the only wheels I will publish are for Python 3.12 + CUDA 12.4 + torch 2.6.0. If you use other versions, build your own wheel from source by following the instructions below.
- Ensure that the wheel matches your Python and CUDA versions. The wheel's Python and CUDA versions are specified in the release version
- Download the wheel from the release version of your choice
- Install it with
pip install DOWNLOADED_WHEEL_PATH
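If you are unsure which wheel matches your interpreter, the snippet below prints the CPython tag that appears in the wheel filename (e.g. cp312 for Python 3.12); this is just a convenience check, not part of the install:

```python
import sys

# The wheel's interpreter tag (the cpXY part of its filename, e.g. cp312)
# must match the Python you install it into.
py_tag = "cp{}{}".format(*sys.version_info[:2])
print(py_tag)  # e.g. cp312 on Python 3.12
```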
Because the standard console (cmd.exe) limits commands to 8192 characters, PowerShell is required for the build.
Visual Studio 2019 or newer is required to launch the x64 compiler environment. The installation path is referred to in the instructions as VISUAL_STUDIO_INSTALL_PATH. For example, for a default Visual Studio 2022 installation, replace VISUAL_STUDIO_INSTALL_PATH with C:\Program Files\Microsoft Visual Studio\2022\Community
The CUDA path will be found automatically if its bin folder is in your PATH, or if the CUDA installation path is set in a well-known environment variable such as CUDA_ROOT, CUDA_HOME or CUDA_PATH.
If none of these are present, set the environment variable before starting the build (PowerShell syntax): $env:CUDA_ROOT="CUDA_INSTALLATION_PATH"
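That environment-variable lookup can be illustrated in Python (the set of variables is the one named above; the exact order the build scripts probe them in is an assumption):

```python
import os

# Probe the well-known CUDA environment variables mentioned above.
# The actual build scripts may check them in a different order.
cuda_root = next(
    (os.environ[v] for v in ("CUDA_ROOT", "CUDA_HOME", "CUDA_PATH")
     if v in os.environ),
    None,
)
print(cuda_root or "No CUDA env var set; relying on PATH or set one manually")
```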
- Open a PowerShell (powershell.exe)
- Clone the FlashInfer repository:
cd C:\; git clone https://github.com/SystemPanic/flashinfer-windows.git
- Execute (in PowerShell)
Import-Module "VISUAL_STUDIO_INSTALL_PATH\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
Enter-VsDevShell -VsInstallPath "VISUAL_STUDIO_INSTALL_PATH" -DevCmdArguments '-arch=x64'
- Change the working directory to the cloned repository path, for example:
cd C:\flashinfer-windows
- Set the following environment variables:
$env:DISTUTILS_USE_SDK=1;
$env:FLASHINFER_ENABLE_AOT=1;
#(replace 10 with the number of CPU threads to use in parallel, to speed up compilation)
$env:MAX_JOBS=10;
#Optional environment variables:
#To build only against your specific GPU CUDA arch (to speed up compilation), replace YOUR_CUDA_ARCH with your CUDA arch number. For example, for RTX 4090: $env:TORCH_CUDA_ARCH_LIST="8.9";
$env:TORCH_CUDA_ARCH_LIST="YOUR_CUDA_ARCH";
#To force the usage of your installed Pytorch (for nightly / custom builds)
$env:FLASHINFER_USE_CURRENT_TORCH=1;
- Build & install:
pip install . --no-build-isolation
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance implementations of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling, and more. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
Check our v0.2 release blog for new features!
The core features of FlashInfer include:
- Efficient Sparse/Dense Attention Kernels: Efficient single/batch attention for sparse (paged)/dense KV-storage on CUDA Cores and Tensor Cores (both FA2 & FA3 templates). The vector-sparse attention can achieve 90% of the bandwidth of dense kernels with the same problem size.
- Load-Balanced Scheduling: FlashInfer decouples the plan/run stages of attention computation, scheduling the computation of variable-length inputs in the plan stage to alleviate load-imbalance issues.
- Memory Efficiency: FlashInfer offers Cascade Attention for hierarchical KV-Cache, implements Head-Query fusion for accelerating Grouped-Query Attention, and provides efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
- Customizable Attention: Bring your own attention variants through JIT-compilation.
- CUDAGraph and torch.compile Compatibility: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
- Efficient LLM-specific Operators: High-performance fused kernels for Top-P, Top-K/Min-P sampling without the need for sorting.
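To illustrate the idea behind sorting-free sampling, here is a minimal pure-Python sketch of rejection-based top-p (nucleus) sampling. It is a conceptual model only, not FlashInfer's kernel algorithm:

```python
import random

def top_p_sample(probs, top_p, rng):
    """Rejection-based top-p sampling without sorting (conceptual sketch).

    Draw a token from the distribution restricted to probabilities above
    a pivot; accept when the mass of strictly larger probabilities is
    below top_p (i.e. the token lies inside the nucleus), otherwise raise
    the pivot and retry. Each rejection shrinks the candidate set.
    """
    pivot = 0.0
    while True:
        # Candidates are tokens whose probability exceeds the pivot.
        cand = [(i, p) for i, p in enumerate(probs) if p > pivot]
        total = sum(p for _, p in cand)
        r = rng.random() * total
        acc = 0.0
        for tok, p in cand:
            acc += p
            if r <= acc:
                break
        # tok is in the nucleus iff the tokens with strictly larger
        # probability do not already cover top_p of the mass.
        if sum(q for q in probs if q > p) < top_p:
            return tok
        pivot = p

print(top_p_sample([0.5, 0.3, 0.15, 0.05], 0.8, random.Random(0)))
```

Note that no full sort of the vocabulary is ever performed; only sums and comparisons against the current pivot are needed, which is what makes this style of sampling GPU-friendly.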
FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
- [Mar 10, 2025] Blog Post Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.
- [Mar 1, 2025] Check out FlashInfer's intra-kernel profiler for visualizing the timeline of each threadblock in GPU kernels.
- [Dec 16, 2024] Blog Post FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
- [Sept 2024] We've launched a Slack workspace for FlashInfer users and developers. Join us for timely support, discussions, updates and knowledge sharing!
- [Jan 31, 2024] Blog Post Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
- [Jan 31, 2024] Blog Post Accelerating Self-Attentions for LLM Serving with FlashInfer
Using our PyTorch API is the easiest way to get started:
We provide prebuilt Python wheels for Linux. Install FlashInfer with the following command:
# For CUDA 12.6 & torch 2.6
pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6
# For other CUDA & torch versions, check https://docs.flashinfer.ai/installation.html
To try the latest features from the main branch, use our nightly-built wheels:
pip install flashinfer-python -i https://flashinfer.ai/whl/nightly/cu126/torch2.6
For a JIT version (which compiles every kernel from scratch; NVCC is required), install from PyPI:
pip install flashinfer-python
Alternatively, build FlashInfer from source:
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install -e . -v
To pre-compile essential kernels, set the environment variable FLASHINFER_ENABLE_AOT=1
before running the installation command:
FLASHINFER_ENABLE_AOT=1 pip install -e . -v
For more details, refer to the Install from Source documentation.
Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:
import torch
import flashinfer
kv_len = 2048
num_kv_heads = 32
head_dim = 128
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
# decode attention
num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly
# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask
# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0) # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask
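For reference, the math a single-request decode call performs can be sketched in plain Python. This is an illustrative model (assuming num_qo_heads == num_kv_heads and no positional encoding); the real kernel fuses and parallelizes this on the GPU:

```python
import math

def decode_attention(q, k, v):
    """Per-head single-query attention: o[h] = softmax(q[h]K[:,h]^T / sqrt(d)) V[:,h].

    q: [num_heads][head_dim]; k, v: [kv_len][num_heads][head_dim].
    """
    num_heads, d = len(q), len(q[0])
    kv_len = len(k)
    out = []
    for h in range(num_heads):
        # Scaled dot-product scores of the single query against all KV tokens.
        scores = [sum(q[h][x] * k[t][h][x] for x in range(d)) / math.sqrt(d)
                  for t in range(kv_len)]
        m = max(scores)                       # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        # Weighted average of the value vectors.
        out.append([sum(w[t] * v[t][h][x] for t in range(kv_len)) / z
                    for x in range(d)])
    return out  # [num_heads][head_dim]

o_ref = decode_attention([[1.0, 2.0]], [[[0.5, 0.5]]], [[[3.0, 4.0]]])
print(o_ref)
```

With a single KV token the softmax weight is 1, so the output is exactly that token's value vector; with more tokens it becomes a softmax-weighted average over the cache.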
Check out documentation for usage of batch decode/append/prefill kernels and shared-prefix cascading kernels.
Starting from FlashInfer v0.2, users can customize their own attention variants with additional parameters. For more details, refer to our JIT examples.
We profile FlashInfer kernel performance with nvbench. You can compile and run the benchmarks with the following commands:
mkdir build
cp cmake/config.cmake build # you can modify the config.cmake to enable/disable benchmarks and change CUDA architectures
cd build
cmake ..
make -j12
You can run ./bench_{single/batch}_{prefill/decode} to benchmark the performance (e.g. ./bench_single_prefill for single-request prefill attention).
./bench_{single/batch}_{prefill/decode} --help will show the available options.
FlashInfer also provides C++ APIs and TVM bindings; please refer to the documentation for more details.
We are thrilled to share that FlashInfer is being adopted by many cutting-edge projects, including but not limited to:
FlashInfer is inspired by FlashAttention 1&2, vLLM, stream-K, cutlass and AITemplate projects.
If you find FlashInfer helpful in your project or research, please consider citing our paper:
@article{ye2025flashinfer,
title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
author = {
Ye, Zihao and
Chen, Lequn and
Lai, Ruihang and
Lin, Wuwei and
Zhang, Yineng and
Wang, Stephanie and
Chen, Tianqi and
Kasikci, Baris and
Grover, Vinod and
Krishnamurthy, Arvind and
Ceze, Luis
},
journal = {arXiv preprint arXiv:2501.01005},
year = {2025},
url = {https://arxiv.org/abs/2501.01005}
}