Releases: pytorch/FBGEMM
FBGEMM v1.2.0 Release Notes
Highlights
TBE GPU
- Added support for int64_t table indices and offsets in TBE inference
- Improved TBE benchmark utilities with the introduction of the Embeddings Estimator and Generator (EEG)
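The sketch below illustrates the new index types against the inference TBE module. It is a minimal example assuming the fbgemm_gpu inference API (module name, embedding_specs tuple layout, and the device/fill_random_weights arguments), which may vary slightly across versions:
import torch
from fbgemm_gpu.split_embedding_configs import SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_inference import (
    IntNBitTableBatchedEmbeddingBagsCodegen,
)
# One table: 1000 rows, dim 64, INT8-quantized weights on the GPU
tbe = IntNBitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[("t0", 1000, 64, SparseType.INT8, EmbeddingLocation.DEVICE)],
    device="cuda",
)
tbe.fill_random_weights()
# Indices and offsets may now be int64 as well as int32
indices = torch.tensor([1, 3, 5], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 3], dtype=torch.int64, device="cuda")  # one bag of 3 ids
print(tbe(indices, offsets).shape)  # (1, 64)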
TBE CPU
- Added Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf operator
- Made FloatToFloat16 conversion 75x faster using SVE2 instructions
- Added FP32 GEMM kernels
TBE SSD
- Fixed OOM issues during initialization
- Improvements to L1 and L2 flush
Gen AI Ops
- GenAI ops are now packaged separately into the FBGEMM GenAI package for easier build and installation
- Various FP8 grouped GEMM optimizations
- BF16I4 preshuffled grouped GEMM
- BF16 stacked grouped GEMM
- F8I4 grouped GEMM optimizations
- Added nccl_alltoall function
ROCm
- Added preliminary ROCm OSS build support for GenAI ops
Better Engineering
- Added build support for CUDA 12.8
- Introduced a set of utilities to harden CUDA kernel launches against common runtime errors
Software Requirements
FBGEMM_GPU v1.2.0 has been tested and known to work on the following setups:
- PyTorch: v2.7
- CUDA: v11.8, 12.6, 12.8
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU and FBGEMM GenAI.
Availability
FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:
# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.2.0
# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.2.0
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu128/
# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0 --index-url https://download.pytorch.org/whl/cpu
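A quick way to verify an installation is to import the package, which registers the operators under torch.ops.fbgemm (a minimal check, assuming a working PyTorch environment):
import torch
import fbgemm_gpu  # importing registers the fbgemm operators with PyTorch
print(torch.__version__)
print(torch.ops.fbgemm.asynchronous_complete_cumsum)  # a registered op handle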
Changes
CPU
GEMM
- [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 5%-15% (#3860)
- [New] Use enum to select floating point format in FbgemmEmbedding APIs (#3842)
- [New] Add generic IEEE754 truncation code (#3820)
- [New] Enable KleidiAI for FP32 (#3818)
- [Improvement] Move float conversion functions from Types.h into new FloatConversion.h (#3760)
- [Fix] Use kleidiAI on static builds (#3806)
- [Fix] Fix KleidiAI FP16 (#3769)
- [Improvement] Pull ARM's matrix transpose PR (#3660)
- [New] Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#3707)
- [Improvement] avoid extra copy in PackedGemmMatrixB constructor (#3691)
- [Improvement] Remove FENV pragma (#3629)
- [Improvement] Make FloatToFloat16 conversion 75x faster using SVE2 instructions (#3626)
- [New] add a new constructor to PackedGemmMatrixB (#3598)
- [New] Move FP32 kernels to OSS (#3568)
GenAI
GenAI Ops
- [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790) (#3942)
- [New] Add harness for comms benchmark (#3936)
- [Improvement] Refactoring of NoPE (#3840)
- [Improvement] support fp16 dtypes for input weight and bias (#3931)
- [Fix] fix fp8 kv cache dequantize kernels (#3896)
- [Improvement] scatter_add 0 size support (#3861)
- [Improvement] Retuned CK GMM fp8/bf16 with perf fixes (#3851)
- [Improvement] Enable groupwise scales for F8I4 Grouped Gemm (#3884)
- [Fix] Fix empty input view. (#3880)
- [New] FP8 Rowwise Dequant Kernel (#3873)
- [New] torch.ops.fbgemm.gather_scale_dense_tokens for oss (#3855)
- [Improvement] Replace rms_norm as norm (#3841)
- [Improvement] Move DeepGemm scale transpose to quantize (#3834)
- [Improvement] follow up to reflect rowwise scale inputs for x, w in quantize_ops scripts (#3839)
- [New] add rowwise scaling support (#3822)
- [Improvement] update to tune for small Ms and quantized gemv (#3712)
- [New] Add Preshuffled FP8 x INT4 Grouped Gemm Kernel (#3800)
- [New] FBGEMM Add Columnwise Weight Scaling to F8I4 GEMM (#3766)
- [Improvement] update the sorting kernel for bf16 ck fmoe kernel (#3817)
- [Fix] fix volatile synchronization with acquire/relax (#3728)
- [Improvement] Force determinism by unswizzle (#3727)
- [New] add fp8 kv nope (#3786)
- [Improvement] move common op to vector utils (#3759)
- [Improvement] Gather/Scatter. (#3743)
- [Improvement] reduce scatter supports last dim (#3726)
- [Improvement] Add custom reduce scatter to llama_comms (#3730)
- [New] Adds shapes information to enable torch.compile. (#3724)
- [Improvement] avoid propagation of NaN (#3723)
- [New] torch.ops.fbgemm.scatter_add_along_first_dim (#3720)
- [New] torch.ops.fbgemm.gather_along_first_dim (#3719) (see the example after this list)
- [New] Paged Attention Support (#3698)
- [New] custom reduce scatter (#3686)
- [Fix] Recover custom collective test (#3687)
- [Improvement] update sweep_utils.py to test more precision gemv kernel (#3678)
- [New] add fp8fp8 fast_gemv_quantized (#3677)
- [New] add mixed precision fp8 fast_gemv_quantized kernel (#3675)
- [Improvement] adjust interface (#3669)
- [Improvement] CK MoE: cherry-pick #1808 (#3609)
- [Improvement] fix llm shapes in quantize bench and add ldm shapes (#3611)
- [Improvement] Return if no data to allreduce (#3586)
- [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
- [Improvement] Make zero_start_index_M optional for dynamic BF16 Grouped Gemm (#3553)
- [New] Add nccl_alltoall function (#3551)
- [New] Add fused_moe kernel to ck_extension (#3518)
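The gather/scatter ops above are exposed through the torch.ops.fbgemm namespace. A minimal sketch of gather_along_first_dim, assuming it takes a source tensor and a 1-D int64 index tensor and returns the gathered rows (the exact schema may differ):
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (loads the GenAI op registrations)
src = torch.randn(8, 16, device="cuda")
index = torch.tensor([3, 0, 5], dtype=torch.int64, device="cuda")
out = torch.ops.fbgemm.gather_along_first_dim(src, index)  # assumed (src, index) signature
print(out.shape)  # torch.Size([3, 16])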
GEMM
- [Improvement] Update cutlass version to 3.8V2 (#3772)
- [Improvement] Update Cutlass to V3.8-2 (#3767)
- [Improvement] fp8_gemm (non_persistent): adding optimal configs for 8k & 16k shapes (#3764)
- [New] new tuning for fp8 rowwise (#3756)
- [Improvement] Add DeepGEMM blockwise GEMM in quantize bench (#3746)
- [Improvement] Enable DeepGEMM in quantize bench (#3745)
- [Improvement] reduce overhead for f8f8bf16_rowwise_grouped_dynamic on amd (#3742)
- [Improvement] Performance Optimization: Optimized TileShape Configuration for f8 (#3617) (#3735)
- [Improvement] Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) (#3710)
- [Improvement] adding an option to skip zeroing output tensor for f8f8bf16_rowwise_grouped_dynamic (#3685)
- [Improvement] Update CK (#3701)
- [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu +10 (#3844)
- [New] Make F8I4 grouped GEMM process M_sizes with INT32 (#3853)
- [Improvement] Skip empty groups in FP8 Stacked Gemm (#3862)
- [New] Enable preshuffled mixed dtype Cutlass Gemm (#3722)
- [Improvement] [CUTLASS] Minor Cutlass change to fix CI (#3779)
- [Improvement] Clean up cutlass FP8 Grouped Gemm Kernel Setup (#3864)
- [New] Modernize bf16 cutlass grouped gemm (#3889)
- [Improvement] [CUTLASS] Include new cutlass support for groupwise mixed dtype grouped gemm. (#3885)
- [New] Add DEEPGEMM Masked API. (#3949)
- [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
- [Improvement] Add correctness testing for shuffled mixed dtype GEMMs. (#3932)
- [New] BF16I4 Preshuffled Grouped Gemm (#3917)
- [New] Preshuffled BF16I4 Gemm Kernel (#3913)
- [New] Enable rowwise scaling for DeepGemm (#3874)
- [New] bf16 stacked group gemm (#3888)
- [New] F8I4 Grouped Gemm Optimization for Sparse M (#3854)
FP8
- [Fix] FBGEMM fp8 ck GEMM fix for irregular GEMM shapes (#3894)
- [Fix] fix stacked version fp8 rowwise group gemm registration in quantize_bench (#3902)
- [Fix] A hotfix for FBGEMM fp8 rowwise with irregular gemm sizes (#3883)
- [Improvement] Transpose FP8 GEMM inputs for better tuning (#3866)
- [New] Enable FP8 Triton dequantized block-wise kernel (#3788)
- [Improvement] Refactor stacked version of FP8 Grouped Gemm for reduced overhead (#3699)
- [Improvement] changing config for fp8 gemm (#3668)
- [Improvement] Add option to disable fast_accumulation for fp8 gemm. (#3714)
- [New] Add cublas FP8 tensorwise GEMM in fbgemm quantize bench (#3693)
- [Improvement] write_k_back for fp8 ROPE (#3679)
- [Improvement] Moves utility functions into a standalone file. (#3671)
- [Fix] Fix f8f8bf16_lite quantize op input in quantize_and_compute (#3667)
- [Improvement] Optimize zero fill (#3666)
- [Improvement] FP8 Grouped Gemm Optimization (#3655)
- [New] Add sweep_utils.py script to tune heuristics (#3656)
- [Improvement] loosen unit test atol/rtol tolerance to eliminate ut flakiness (#3664)
- [New] Port oss f16_fast_gemv into fbcode (#3610)
- [New] fp8 rowwise regular gemm tuning for llm new shapes (#3654)
- [Improvement] k_norm in rope for fp8 kv cache (#3633)
- [Improvement] Fix zero_start_index_M argument for triton rowwise quantize (#3639)
- [Fix] Fix handling of dynamic FP8 grouped gemm on Nvidia (#3616)
- [Improvement] Improve FP8 grouped GEMM perf via tileshape and cooperative (#3653)
- [Improvement] Refactor FP8 grouped GEMM with dynamic and static versions (#3561)
- [New] Support FP8 grouped GEMM with rowwise scaling (#3560)
- [Fix] [CUTLASS] Use custom copy of cutlass to enable FP8 Grouped Gemm. (#3649)
- [Fix] kv_dq zero initialization to avoid NaNs from FA3 (#3632)
- [Improvement] amd fp8 rowwise batched gemm tuning (#3624)
- [Improvement] Improve handling for FP8 grouped gemm without zero_start_index_M (#3615)
- [New] amd fp8 rowwise gemm prefill shape tuning (#36...
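To make the rowwise FP8 naming above concrete: each row of the activations and weights is scaled into FP8 E4M3 range, and the f8f8bf16_rowwise op named throughout these notes multiplies the quantized operands and rescales to BF16. The quantization helper below is hand-rolled for illustration, and the op's argument order is an assumption:
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (loads the GenAI op registrations)
def rowwise_fp8_quantize(t):
    # Scale each row so its absolute max maps to the FP8 E4M3 max (448.0)
    scale = t.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12) / 448.0
    return (t / scale).to(torch.float8_e4m3fn), scale.squeeze(1)
x = torch.randn(16, 64, device="cuda", dtype=torch.bfloat16)  # activations (M x K)
w = torch.randn(32, 64, device="cuda", dtype=torch.bfloat16)  # weights (N x K)
xq, x_scale = rowwise_fp8_quantize(x)
wq, w_scale = rowwise_fp8_quantize(w)
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)  # assumed argument order
print(y.shape)  # torch.Size([16, 32])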
FBGEMM_GPU v1.1.0 Release Notes
Highlights
TBE GPU
- Introduced support for int32_t indices in TBE training
- Extended TBE support for larger embedding dimensions
- Made the learning rate a tensor value
- Improvements on indices bounds checking
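A sketch tying the first three highlights together: a training TBE module constructed with a scalar learning rate (now carried internally as a tensor) and driven with int32 indices. This is a minimal example assuming the fbgemm_gpu training API; argument names may vary across versions:
import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, placement, compute device)
        (10_000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    learning_rate=0.01,  # stored as a tensor value per this release
)
# int32 indices are now accepted in training (int64 remains supported)
indices = torch.tensor([1, 5, 7, 42], dtype=torch.int32, device="cuda")
offsets = torch.tensor([0, 2, 4], dtype=torch.int32, device="cuda")  # two bags
out = tbe(indices=indices, offsets=offsets)  # shape: (2, 128)
out.sum().backward()  # the fused optimizer applies the update during backward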
TBE CPU
- Improved ARM support with SVE implementations for matrix multiplication and float matrix transpose
- Improved the EmbeddingSpMDMAutovec API
- Migrated FP32 ops to OSS
TBE SSD
- Enabled VBE in SSD-TBE
- Async initialization of RocksDB SSD tensors and padding before writing to RocksDB
- Improvements on indices bounds and other constraints checking
Gen AI Ops
- Custom allgather support multiple dtypes, with dtype checking to prevent silent failures
ROCm
- Add CK FP8 Batched GEMM and Rowwise GEMM kernels along with heuristic tuning
- Fixed CK FP8 rowwise quantization for some GEMM shapes
- Introduced HIP-specific optimizations to the TBE forward and backward passes
SLL ops
- Migrated Sequence Learning Library (SLL) ops to OSS
Better Engineering
- Restructured the build to produce multiple smaller shared libraries instead of a single large binary
- New and improved tests and benchmarks
- Improved ROCm build variant support
- Add build support for CUDA 12.6 and Python 3.13
Software Requirements
FBGEMM_GPU v1.1.0 has been tested and known to work on the following setups:
- PyTorch: v2.6
- CUDA: v11.8, 12.4, 12.6
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.1.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.1.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu124/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu126/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table Batched Embedding (TBE) operators
For GPU
- [New] Add support for int32_t indices in TBE training (#3377, #3375, #3374, #3372, #3371, #3324, #3267, #3264, #3263, #3257)
- [New] Add support for int64_t indices and offsets in TBE inference (#3254, #3233)
- [New] Extend TBE support for larger embedding dimensions (#3462, #3467)
- [New] Make learning rate a tensor (Backend) (#3287, #3310, #3332)
- [New] Add PTA checks to embedding_bounds_check kernels (#3318)
- [Fix] Fix PackedTensorAccessor for batch_index_select (#3281)
- [Fix] Set cache_precision = weights_precision in TBE if it is not explicitly set (#3370)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3238)
- [Fix] Fix PT2 compliant opcheck tests (#3404)
- [Fix] Fix FBGEMM_GPU_MEMCHECK in Split optimizers (#3416)
- [Fix] Fix learning rate as tensor for PT2 compile (#3407)
- [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
- [New] Add new optimizer state row_counter for Adam [Backend] (#3342)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [7C/N]" (#3258)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [8/N]" (#3255)
- [Fix] Fix global weight decay Faketensor test (#3341)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3237)
- [Fix] Fix ""Cannot call numel() on tensor with symbolic sizes/strides"" (#3368)
- [Fix] Fix grid size overflow in generate_vbe_metadata (#3484)
- [Fix] Fix an integer overflow in permute_multi_embedding() (#3465)
- [Fix] Fix the sync point caused by iter_cpu.item() (#3401)
- [Fix] Hot fix to skip VBE CPU reshaping for MTIA (#3466)
- [Fix] Address memory overuse during flushing (#3460)
- [Improvement] Add iter singular value into TBE optimizer state (#3228)
- [Improvement] V2 fwd modified warps (#3570)
- [Improvement] Add enable_async_update into tbe signature and config (#3431, #3461)
- [Improvement] Adjust kNumThreads for bounds_check_indices_kernel (#3299)
- [Improvement] Reduce registers in bounds_check_indices (#3298)
- [Improvement] Mark unified autograd function traceable (#3378)
- [Improvement] Improve bounds_check_indices for VBE (#3388, #3386)
- [Improvement] Do not call scalar_type (#3394)
- [Improvement] optimizer 1d -- EMA in place (fbgemm part) (#3402)
- [Improvement] Clean up nbit_forward tests (#3286)
- [Improvement] Remove unused-variable in some generated code (#3327)
- [Improvement] Limit grid size of bounds_check_indices (#3282)
- [Improvement] Support config based bound check version via extended modes (#3418)
- [Improvement] Use int64_t index for SplitOptimizer grad (#3447)
- [Improvement] Remove unused arg from generate_vbe_metadata frontend (#3453)
- [Improvement] Add generate_vbe_metadata test (#3483)
- [Improvement] Support config based bound check version via extended modes (#3454)
- [Improvement] make iter PT2 compatible (#3253)
- [Improvement] Add meta function for PT2 wrappers (#3240)
- [Improvement] Nesterov (#3232)
For CPU
- [New] Introduce SVE function for matrix multiplication (#3348)
- [New] Add sve implementation for float matrix transpose (#3421)
- [New] autovec specialization framework (#3393)
- [New] Move FP32 kernels to OSS (#3568)
- [Improvement] Pull in PR for Kleidi-based FP16 kernel (#3507)
- [Improvement] Use local buffer where possible (#3304)
- [Improvement] Refactor GenerateEmbeddingXXX functions (#3307)
- [Improvement] Increase local_storage size to 512 floats (#3357)
- [Improvement] Adjust EmbeddingSpMDMAutovec API (#3366)
- [Improvement] Split loops to work around loop vectorizer weakness (#3406)
- [Improvement] Do an early check that data_size is not negative (#3305)
- [Improvement] Fix strict aliasing violation, code cleanup (#3306)
SSD TBE Operators
- [New] Enable VBE in SSD-TBE (#3247)
- [Improvement] put KVTensorWrapper in its own header (#3575)
- [Improvement] Moving KVTensorWrapper to a header file to be used in ModelStore checkpointing code (#3276)
- [Improvement] Async initialization of RocksDB SSD tensors (#3520)
- [Improvement] pad before writing to rocksDB (#3245)
- [Improvement] use RocksDB iterator to read key range from ssd embedding (#3495)
- [Improvement] Log total duration spent prefetching (#3487)
- [Improvement] Address memory overuse during flushing (#3460)
- [Improvement] Create move TBE to right device, and set Cache Load in TBE class (#3438)
- [Improvement] Unit test for new move tbe from device/cache_load method (#3437)
- [Improvement] make L2/rocksdb update async optional (#3429)
- [Improvement] Drop RoPE when filling KV cache (#3346)
- [Improvement] Remove setting total_cache_hash_size as buffer (#3441)
- [Improvement] Add meta registrations for kv_cache operators (#3442)
- [Improvement] remove output dtype restriction in SSD TBE (#3524)
- [Improvement] change pmt require grad to false when detached (#3525)
- [Improvement] add more attributes to PartiallyMaterializedTensor (#3300)
- [Improvement] skip broken inference test that uses ssd TBE (#3494)
- [Improvement] "coro => fut" (#3430)
- [Improvement] Reland of D65489998 Optimize sharding performance of embeddings (#3549)
- [Improvement] Remove torch.jit.script (#3562)
GenAI Support and Operators
- [New] Add nccl_alltoall function (#3551)
- [New] custom allgather support multiple dtypes (#3498)
- [Improvement] Make sure fake tensor functions return on proper device (#3258)
- [Improvement] Add CPU registrations to custom operators (#3262)
- [Improvement] Check src & dst dtypes in allgather to prevent silent failures (#3523)
- [Improvement] Better shape function registration (#3237, #3340)
- [Improvement] Package re-organization improvements (#3546, #3251, #3419, #3268, #3512)
FP8 and other Quantization support
- [New] New autotune config for M=4 (#3277)
- [New] MoE FP8 grouped GEMM (#3321)
- [New] Add shape check on GroupedGEMM kernel (#3449)
- [New] Tuning for fp8 gemm with emu1.7 shapes (#3436)
- [Improvement] more fp8 tuning for decode and not need to pad (#3576)
- [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
- [Improvement] Split FP8 Grouped Gemm into dynamic and static version (#3543)
- [Improvement] Warp-specialized FP8 rowwise GEMM kernel (#3532)
- [Improvement] Add Cutlass FP8 Grouped Gemm to Quantize Bench (#3530)
- [Improvement] Fixed FBGEMM fp8 rowwise for irregular shapes (#3491)
- [Improvement] Properly define preallocated output as mutable in fp8 rowwise gemm (#3476)
- [Improvement] Fix FP8 Rowwise Gemm Compilation with Auto-functionalize V2 (#3457)
- [Improvement] Support zero-size inputs in FP8 cuda quantize kernel (#3448)
- [Improvement] update FP8 GEMM tuning for emu1.7 7B shapes (#3391)
- [Improvement] Customize FP8 grouped GEMM for non-zero calculation for token choice MoE (#3383)
- [Improvement] Support FP8 grouped GEMM with cudagraph (#3373)
- [Improvement] Refactor FP8 grouped GEMM to prepare cudagraph support (#3369)
- [Improvement] Improve FP8 BMM heuristic for large shapes and MoE E2E performance (#3344)
- [Improvement] retune some of the EMU1.6 7B FP8 GEMM shapes (#3328)
- [Improvement] Make FP8 BMM output contiguous (#3270)
- [Improvement] Tune FP8 rowwise bmm tile heuristic (#3256)
- [Improvement] more FP8 GEMM tuning for LDM shapes (#3414)
- [Improvement] Split up f8f8bf16_rowwise_batched.cu (#3381)
- [Improvement] use sym int in quantize.cpp for f8f8bf16_rowwise_meta (#3410)
- [Improvement] Remove triton.ops dependency from fbgemm (#3329)...
FBGEMM_GPU v1.0.0 Release Notes
Stable API
We provide the stable API support starting from FBGEMM_GPU v1.0.0. This includes Table batched embedding (TBE) modules, Pooled embedding operators and modules, Sparse operators, Jagged tensor operators and Quantization operators.
- API backward compatibility guarantees via thorough testing. We guarantee that our stable APIs will be backward compatible within a major version, meaning that the stable APIs for v1.0.0 will be compatible with every future release unless explicitly announced in advance.
- Enhanced documentation, ensuring that every stable API has comprehensive and up-to-date documentation.
- Functionality guarantees are only provided through our unit testing framework. We do NOT guarantee any functionalities that are NOT explicitly tested and documented in our unit tests.
- No performance guarantees. However, we are committed to providing support on a best-effort basis.
More details can be found in the stable API documentation.
Highlights
Table Batched Embedding (TBE)
- New optimizer support for TBE Training
- Enhanced Global weight decay support in TBE
- Improvement and bug fixes for TBE training and inference modules and sparse operators
For SSD
- New pipeline prefetching enabled
- New cache and indices related ops
- Integration of L2 cache to TBE operators
- Many improvements to kernel and logging
For CPU
- New type support for CPU Sequence TBE
- Kernel improvements and bug fixes
Generative AI
- Gen AI Ops support and improvement
- Improvements to Triton-based and CUTLASS-based operators
- New and optimized FP8 GEMM and quantization operators
Others
- Optimized MX4 quantization operators
- New dequantization operator
- Removal of Python 3.8 support
Better engineering
- Code refactoring and reorganization for faster builds
- New and improved tests and benchmarks
- Improved AMD support
Software Requirements
FBGEMM_GPU v1.0.0 has been tested and known to work on the following setups:
- PyTorch: v2.5
- CUDA: v11.8, 12.1, 12.4
- Python: v3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.0.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.0.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
For GPU
- [New] Ensemble adagrad optimizer (#3197, #2955, #2954, #3161, #3091, #2981, #2889, #3180, #3158)
- [New] Bounds check in prefetch in TBE training (#3015)
- [New] Method to update internal hyperparameters for FBGEMM TBE (#3025)
- [Improvement] Enhanced Global Weight Decay and state tracking (#2904, #2897, #2882, #2896, #2890, #2884, #2883)
- [Improvement] masked_index_* values index type fix (#2979)
- [Improvement] generate_vbe_metadata fixes (#3095, #3087)
- [Improvement] Fixes on the efficiency of VBE TBE forward due to blocking D2H copy (#2862)
- [Improvement] Work around offsets and indices type mismatch in TBE training (#3037)
- [Improvement] Add a host map option for a UVM tensor alloc (#3073)
- [Improvement] uvm_to_device expose device as interface (#3030)
- [Improvement] Add Meta backend/dispatcher for new_unified_tensor (#3005)
- [Improvement] General TBE enhancements and bug fixes (#2892, #3114, #3022, #2958)
- [Improvement] Consolidate repeat code in TBE inference (#3028)
For CPU
- [New] Add int4 to int4 CPU Sequence TBE kernel (#2996, #2994)
- [New] Use auto-vec kernel in CPU sequential embedding lookup for int8 tables (#2863, #2878)
- [Improvement] Work around OMP barrier issue with MSVC and unused var error (#2918, #3084)
SSD Table batched embedding (TBE) operators
- [New] Enable pipeline prefetching (#2963)
- [New] Enable cache line locking support in SSD kernel (#2949)
- [New] Add L2 flush (#3110)
- [New] Added SSD ODS and IO/mem stats (#2906, #2913, #3035)
- [New] Add SSDScratchPadIndicesQueue (#2911, #2948)
- [New] Integrate l2 cache to TBE operator (#2959, #3032, #3031)
- [New] Add ssd_update_row_addrs (#2953)
- [New] Add bounds check in SSD-TBE (#3013)
- [New] Add 32-bit index support in SSD kernels (#3064)
- [New] Add kv cache related ops (#3001, #2968)
- [New] Add compact_indices op (#3075)
- [New] Create embedding cache interface and impl RocksDB cache (#2858)
- [New] Reduce prefetch SM usage when using pipeline prefetching (#2991)
- [New] Add a host map option for a UVM tensor alloc (#3003)
- [New] Add masked_index_select and refactor masked_index_put (#2910)
- [Improvement] Add parallelism on cache update (#3062)
- [Improvement] add parameter server attributes (#2947)
- [Improvement] Make the scratch pad tensor UVA (#2844)
- [Improvement] Use less thread blocks for find_uncached kernel (#3101)
- [Improvement] Fix stream sync for scratch pad eviction (#2843)
- [Improvement] Make indices related to cache eviction UVA tensors (#3077)
- [Improvement] Split cachelib cache into header and src (#3063)
- [Improvement] Record more functions and logging in SSD TBE (#2854, #2867, #2975)
- [Improvement] Attach eviction filling logic to set_cache (#3034)
- [Improvement] Move set_cache and set_async to background thread (#3033)
- [Improvement] Refactoring vec copy in masked_index_put_kernel (#2861, #2908)
- [Improvement] Increase memcpy and compute overlap (#2860)
- [Improvement] Add set_async in background thread (#3036)
- [Improvement] Make evicted_rows a UVA buffer (#3079)
- [Improvement] General enhancement and bug fixes (#2937, #2993, #3151, #3089, #2898, #2930)
GenAI Support and Operators
- [New] Decode and Prefill support (#3009)
- [New] Support rope with block tables (#3146)
- [New] EP support (#3071)
- [New] Implement SDPA kernel wrapper to use run_kernel flow for perf (#2820)
- [Improvement] Move mqa code (#3011)
- [Improvement] BE improvements to init_comms (#3103)
Triton GEMM support
- [New] Enable torch.compile compatibility for triton fp8 rowwise gemm (#2978)
- [New] Add 3D+ input support for fp8 rowwise GEMM (#2845)
- [New] GEMM custom op enablement (#3046)
- [Improvement] Add fused bias to Triton FP8 Rowwise Kernels (#2852)
- [Improvement] Triton dependency (#3027)
- [Improvement] Fix triton fp8 handling of non-contiguous inputs (#2919)
- [Improvement] More autotune configs and bug fixes in TMA kernel (#3078, #3066, #3072)
- [Improvement] Fp8 gemm tweak for 405B Decoding (#3104)
FP8 and other Quantization support
- [New] CK FP8 Optimizations and fixes (#2940, #2912, #2987, #3017, #2893)
- [New] FP8 kernel development and enablement (#2866)
- [New] GenAI CK Version update and integration (#2865, #2971)
- [Improvement] Also hipify the fp8 related cuda functions (#2834)
- [Improvement] Auto-generation of CUTLASS Extension Kernel Templates (#2932)
- [Improvement] Marlin Mixed Input Kernel Productionization (#3008)
- [Improvement] Remove redundant torch.abs (#3020, #2822)
- [Improvement] Tuning for 405B/70B Prefill with small seqlen (#3042)
- [Improvement] Added new instances for 405B decoding (#2936)
Permute and Pooled Embeddings Ops
- [New] Implementation of permute_multi_embedding (#2833)
- [Improvement] Clean up and removal of unused exception (#2832, #2891)
- [Improvement] Use at::parallel_for in cpu kernel (#2817)
- [Improvement] Add dispatch_to_cpu for the operators (#2874, #2881)
- [Improvement] Print the exact variable values triggering the alert in Merge Pooled Embedding (#3038)
Sparse Operators
- [New] Support original indices for FBGEMM block bucketization flag (#2999, #2925)
- [Improvement] Fix pack_segments backward when grad is non-contig (#3006)
- [Improvement] Fix FBGEMM_GPU_MEMCHECK in sparse_ops_cuda (#2943)
- [Improvement] Update sparse_ops.py to use generic gpu target fbgemm_gpu:input_combine to support both Nvidia and AMD (#2905)
- [Improvement] Add abstract impl and functions (#2962, #2983, #3000)
- [Improvement] Use guard_size_oblivious in tbe_input_combine_abstract fake kernel (#2923)
- [Improvement] Out variant for asynchronous_exclusive_cumsum_cpu + some more static dispatch kernels (#3090)
Quantize ops
- [New] Add a CPU nbit to float dequantization op that supports torch.quintMxN type (#2995)
MX4 Ops
- [New] Optimize FBGEMM Triton MX4 Quantize-Dequantize (#2838, #2837)
- [New] Rounding Mode Support (#2821, #2816, #2933, #2859)
- [New] FBGEMM/TorchRec MX4 padding support (#3055, #3047, #3010)
- [New] Add Stochastic downcasting to MX4 Quantization (#2899)
- [New] Support for other MX4 formats in Triton kernels (#2900)
- [Improvement] Refactor MX4 Kernel to operate on flat tensors (#2836)
- [Improvement] Optimize MX4 padding to minimize need for tuning (#3040)
Benchmarks / Tests
- [New] Add schema compatibility test (#3130)
- [New] Add SSD/UVM caching in TBE device benchmark (#3076)
- [New] Add EmbeddingSpMDM8BitBenchmarkOutTypeFloat16 (#2952 )
- [New] Add benchmark EmbeddingSpMDMNBitBenchmarkOutTypeFloat16 (#2901 )
- [New] Add unit test for int4 to int4 sequence CPU TBE (#2997)
- [New] Add rocm support for fp8 benchmarks (#2965)
- [New] Add rotating buffer feature to quantize_bench (#2857)
- [New] Benchmark of fbge...
FBGEMM_GPU v0.8.0 Release Notes
Release Note
Highlights
Table Batched Embedding
For GPU
- New Table Batched Embedding (TBE) operators and momentum type support
- New Intraining Embedding Pruning (ITEP) operators
- VBE support for Dense TBE
- Global weight decay support in TBE
- New type support and improvement to SSD TBE
- Improvement and bug fixes for TBE training and inference modules and sparse operators
For MTIA
- MTIA support for DenseTBE
Generative AI
- Gen AI Ops integration
- Support for Triton-based and CUTLASS-based operators (#2552, #2537)
- New FP8 GEMM and quantization operators
- New query attention operators
- New Car and All-To-All (NCCL-based) communication operators
- AMD Support for FP8
Others
- New MX4 quantization operators
- Support for CUDA 12.4
Better engineering
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- Improved AMD support
Software Requirements
FBGEMM_GPU v0.8.0 has been tested and known to work on the following setups:
- PyTorch: v2.4
- CUDA: v11.8, 12.1, 12.4
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.8.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.8.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
For GPU
- [New] VBE support for Dense TBE (#2628, #2620, #2641)
- [New] BF16 momentum support in PARTIAL_ROWWISE_ADAM (#2524, #2522, #2518)
- [New] Global weight decay support (#2516, #2507, #2506)
- [New] Multi-pass prefetch for memory efficiency (#2566)
- [Improvement] Work around masked_select for numel > MAX_INT (#2648)
- [Improvement] Fused optim in backward capability with aot_autograd (#2651)
- [Improvement] Weights mutations declaration in TBE backward ops schemas (#2698)
- [Improvement] Helper ops to support cache conflict misses (#2571)
- [Improvement] Fixed the hang issue in some TBE GPU optimizers (#2509)
- [Improvement] Misc TBE fixes and refactoring (#2583, #2597, #2529)
- [Improvement] Cache prefetch and conflict miss improvements (#2596, #2514)
For MTIA
- [New] Support MTIA in DenseTableBatchedEmbeddingBagsCodegen (#2680)
SSD Table batched embedding (TBE) operators
- [New] Add FP16 weight and output support to SSD TBE (#2638)
- [New] Implementation of PS KV DB for FBGEMM TBE operator (#2664, #2642)
- [Improvement] Removal of D->H sync when calling lxu_cache_lookup (#2672)
- [Improvement] Recording of functions in SSD TBE (#2670)
- [Improvement] Added options, assertions and logs for training and inference SSD TBE (#2689, #2657)
- [Improvement] SSD TBE backend fixes (#2645, #2671)
New Operator Groups
- [New] Intraining Embedding Pruning (ITEP) ops (#2700, #2690, #2682)
- [New] Populate bucketize permute kernel (#2533)
- [New] MX4 quantization support (#2709, #2703, #2696, #2675, #2659)
GenAI FP8 Operators
- [New] FP8 enablement (#2615, #2637)
- [New] CK FP8 GEMM kernels (#2630)
- [New] FP8 Rowwise GEMM (#2585, #2622)
- [New] FP8 quantization and conversions to FP32/FP16 (#2686, #2681, #2593, #2540, #2677)
- [New] FP8 blockwise GEMM (#2676, #2600)
- [New] Triton-based FP8 GEMM and quantization support (#2701, #2688, #2643)
- [New] AMD support for FP8 (#2582, #2658, #2611)
GenAI Support and Operators
- [New] Integrated Gen AI ops into the build (#2512)
- [New] Support for Triton-based operators (#2570, #2618)
- [New] Support for CUTLASS-based operators (#2552, #2537)
- [New] Car and All-To-All (NCCL-based) communication ops (#2606, #2667, #2631, #2624)
- [New] Grouped query attention ops (#2673, #2504)
- [New] CK BF16 GEMM (#2617)
- [New] W4A8 GEMM kernels (#2558, #2607)
Pooled Embeddings
- [Improvement] Clean up unused pooled embedding ops (#2626)
- [Improvement] PyTorch compatibility fixes (#2619, #2629)
Sparse Operators
- [Improvement] Increased dynamic shared memory size to support larger bucket sizes (#2500)
- [Improvement] UINT8 support for reorder sequence embedding operator (#2531)
- [Improvement] Fixed CPU blocking D2H in JaggedIndexSelect2dOp backward (#2510)
Benchmarks / Tests
- [New] Unified benchmarks and unit tests for FP8 (#2609, #2699, #2666)
- [Improvement] SSD TBE benchmarks (#2579, #2580)
- [Improvement] SSD TBE tests (#2665, #2647)
- [Improvement] Fixes for TBE tests and benchmarks (#2632)
- [Improvement] nbit_cache benchmark bandwidth calculation (#2511)
Build / CI improvements and Fixes
- [New] Support for CUDA 12.4 (#2565)
- [Improvement] Improved AMD support (#2541, #2679)
- [Improvement] Strengthened artifact installation process (#2491)
- [Improvement] Memcheck added across operators (#2576, #2574, #2572, #2612, #2594, #2589, #2578)
- [Improvement] Refactoring of large header files (#2650)
- [Improvement] Improved build scripts to support debug flags and custom (i.e. GenAI) variants (#2702)
FBGEMM_GPU v0.7.0
Release Note
Highlights
- New optimizer and output type supports for Table Batched Embedding (TBE) training
- Improvement and bug fixes for TBE variable batch size
- Enhanced TBE pipeline prefetching for UVM caching
- Many improvements on TBE CPU kernels
- New and enhanced low-precision operators
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- PyTorch 2 support for various operators
- Clang compilation support
Software Requirements
FBGEMM_GPU v0.7.0 has been tested and known to work on the following setups:
- PyTorch: v2.3
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.7.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.7.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [New] Added BF16 output support in TBE training (#2382)
- [New] Added support for int8 output for sequence embeddings (#2316)
- [New] Added an auto-vectorization implementation for CPU TBE-NBit kernel with user selection (#2182, #2299)
- [New] Added CowClip optimizer (#2226, #2243)
- [Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
- [Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
- [Improvement] Caching and cache lookup for pipeline prefetching fixes and enhancements (#2164, #2309, #2287, #2308)
- [Improvement] Built hip rules by default (#2380)
- [New] Added a method to TBE module to recompute buffers (#2338)
- [New] Added meta functions for PyTorch 2 support (#2347)
- [New] Added support for MTIA in TBE modules (#2273, #2286)
- [Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
- [Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
- [Deprecation] Optimizers deprecated (#2253, #2252)
- [Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
- [Deprecation] Removed INT8 weight/output support from TBE GPU training
Jagged Tensor Operators
- [Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
- [Improvement] Fixed half->int build error (#2240)
Index Select Operators
- [Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)
Low-precision operators
- [New] CPU implementation of per-channel quantize operator (#2341)
- [New] CPU implementation for qlinear_channelwise operator (#2343)
- [New] Enabled CPU int8 output to dequantization to bf16 on CUDA (#2242)
- [New] Enabled dequantization for bf16 (#2241)
Pooled Embedding
- [Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)
Misc
- [New] Implementation of CPU version of all_to_one_device (#2251)
- [Improvement] Performance improvement of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
- [New] Created cumem_utils_cpu and added to all_deps_cpu (#2215)
- [New] Added float support to asynchronous_complete_cumsum_cpu (#2383) (see the example after this list)
- [Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
- [New] STBE GPU coalescing kernel (#2275)
- [Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
- [New] GPU timing and basic reporting framework (#2314)
- [Improvement] Fixes and FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)
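For reference, asynchronous_complete_cumsum (extended above with float support on CPU) turns a lengths tensor into a complete offsets tensor, i.e. a cumulative sum with a prepended zero. A minimal sketch:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
lengths = torch.tensor([2, 0, 3], dtype=torch.int64)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 2, 2, 5])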
Benchmarks / Tests
- [New] Added dynamic quantize GEMM benchmark (#2297, #2295, #2271)
- [New] Added a new CPU nbit-TBE benchmark that tries to reduce CPU frequency noise (#2306)
- [New] Added unit test for stochastic rounding for UVM caching (#2324)
- [New] Added unit test AsyncSeriesTimer (#2364)
- [New] Added int32 overflow unit test for TBE UVM caching (#2303)
- [Improvement] Disabled dynamo testing in TBE (#2381)
- [Improvement] Refactored and re-organized tests (#2305, #2292, #2291, #2284, #2281, #2274, #2272, #2266, #2263, #2260, #2407, #2406, #2402, #2304, #2399, #2393)
- [Improvement] General fixes for tests and benchmarks (#2301, #2300, #2298, #2255, #2205, #2296)
Build / CI improvements and Fixes
- [Improvement] Optimized EmbeddingSpMDMNBit_autovec (#2267)
- [Improvement] Switched between hip and cuda c++ lib so load (#2236)
- [Improvement] Fixed bf16 support issues (#2238)
- [New] Enabled Clang compilation in OSS for fbgemm_gpu (CPU and CUDA) (#2334, #2345, #2330, #2323)
- [New] Upgraded ROCm version (#2405)
- [Improvement] Enabled -Winfinite-recursion in deeplearning/PACKAGE (#2329)
- [Improvement] Fixed shadowed variable in deeplearning/fbgemm/src/GroupwiseConv.cc (#2268)
- [Improvement] General CI and build system enhancement (#2489, #2430, #2427, #2423, #2356, #2348, #2342, #2328, #2307, #2211, #2219, #2220, #2228, #2233)
- [Improvement] Documentation enhancement (#2294, #2278, #2258, #2249, #2227, #2232, #2244, #2239, #2237)
FBGEMM_GPU v0.6.0
Release Note
Highlights
- Improvement and bug fixes for TBE variable batch size
- Many TBE extensions and benchmarks
- Enhanced TBE pipeline prefetching for UVM caching
- Code refactoring and reorganization for faster builds
- Many improvements and new sparse ops added
- Improved low precision ops
- Support for Python 3.12
- PyTorch 2 support for various operators
Software Requirements
FBGEMM_GPU v0.6.0 has been tested and known to work on the following setups:
- PyTorch: v2.2
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.6.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
- [Improvement] caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
- [New] Support MTIA device type in FBGEMM TBE training (#1994)
- [New] Enable sequence TBE CPU via AVX (#2195)
- [New] Enable subwarp only for unweighted (#2051)
- [New] Add meta functions (#2094, #2102)
- [New] Add reverse qparam option for MTIA (#2109)
- [New] uvm_cache_stats for direct mapped (#1951, #1952)
- [Improvement] use memcpy for cpu emb inplace update (#2166)
- [Improvement] Remove indices and offsets copying from prefetch (#2186)
- [Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
- [Improvement] General fixes and enhancements (#2030, #2009)
Jagged Tensor Operators
- [Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
- [Improvement] Fix non-contiguous tensor problem in jagged_index_select (#2060, #2061)
Index Select Operators
- [Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)
Low-precision operators
- [New] Add BF16 in padded FP8 quantize ops (#2010)
- [Improvement] Improve quantize_comm error message (#2018)
- [Improvement] Fix illegal memory access error and initialize empty values on fp8 quantize kernel (#2131, #2176)
Pooled Embedding
- [New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
- [Improvement] Use PyTorch's p2p access enable function (#2000)
- [New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
- [Improvement] Improve all_to_one error message (#2019)
- [New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
- [New] Add variable batch per feature support to EBC (tw/cw only) (#1986)
Misc
- [New] Add meta backend for new_managed_tensor and sparse ops (#1990, #2028, #2029, #2072)
- [New] Use 4k page instead of 2M for managed tensor (#2058)
- [New] Add BF16 support for reorder_batched_ad_indices (#2116)
- [New] SymInts for sparse ops (#2017, #2089)
- [New] Support for CPU/GPU compilation (#2040)
- [New] Add impl_abstract (#2084, #2087, #2090, #2097, #2098, #2129, #2132)
- [Improvement] Make FBGEMM PT2 compliant (#2174, #2172, #2170, #2180, #2181, #2201, #2198)
- [Improvement] Fix invalid CUDA configuration error for the empty input (#1993)
Benchmarks / Tests
- [New] Benchmark block_bucketize_sparse_features uneven sharding (#2140, #2169)
- [New] Add unit test for unique cache lookup (#2160)
- [New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
- [New] Add test for fbgemm ops. (#2136, #2082)
- [Improvement] Modified TBE testbench to use FBGEMM generate_requests function to generate indices and offsets (#1882)
- [Improvement] Remove FP64 from TBE CPU tests (#2049)
- [Improvement] Add warmup_runs to TBE benchmarks and run at least 1 warmup iter (#2163)
- [Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
- [Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)
Build / CI improvements and Fixes
- [Improvement] General CI and build system enhancement (#2065, #2071, #2078, #2149, #2189, #2203, #2204, #2209, #2047)
- [Improvement] Reorganized code to enable faster builds (#1881, #2083, #2085, #2095, #2141, #2112, #2133, #2145, #2196, #2100, #2103)
- [New] Add support for Python 3.12 (#2194)
- [New] Updates for ROCm 5.6, 5.7 and 6.0 support and Hip.cmake changes (#2066, #2088, #2106)
- [New] Add debug flags for HIP runs (#2206)
- [Improvement] unknown c++ flag detection in CMake (#2057)
- [Improvement] Fix inconsistent dll linkage warning (#2059, #2064)
- [Improvement] Fix heap-buffer-overflow in radix_sort_parallel (#2075)
- [Improvement] Update AVX2 and AVX512 flags (#2167)
FBGEMM_GPU v0.5.0
Release Notes
Highlights
- TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
- Many TBE extensions including defused TBE backward-optimizer, variable batch size support, pipeline prefetching support for UVM caching
- Many improvements and new sparse ops added
- ARM support
- SM 9.0 support for CUDA 12.1 for H100 GPUs
- PyTorch 2 support for various operators, i.e., jagged tensor, pooled embedding ops
Software Requirements
FBGEMM_GPU v0.5.0 has been tested and known to work on the following setups:
- PyTorch: v2.1
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [Improvement] TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement) (#1641, #1804, #1787, #1904)
- [New] Variable batch size support to TBE training (#1653, #1752, #1633, #1634, #1713, #1717, #1943)
- [New] BFloat16 support for TBE CPU (#1839, #1851)
- [New] Defused TBE backward-optimizer and SplitTBE optimizer (#1819, #1820, #1821)
- [New] Max norm support for rowwise_adagrad (#1781)
- [New] Support for 1024-2048 embedding dimension in TBE inference (#1656)
- [Improvement] Backends via PyTorch dispatcher (#1948, #1976)
- [Improvement] Deprecate many TBE optimizers (#1766, #1767, #1771, #1796, #1774, #1773, #1775, #1791, #1793)
- [New] TBE UVM cache pipeline prefetching (#1883, #1893)
Jagged Tensor Operators
- [New] New jagged tensor operators (#1690)
- [New] Backends (Meta) (#1880, #1960)
- [Improvement] Jagged operator optimizations (#1643, #1646, #1644, #1661, #1662, #1691, #1692, #1777)
- [Improvement] Symbolic shape tracing on jagged operators for PyTorch 2 (#1758)
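As background for the operators above: a jagged tensor is a flat values tensor plus an offsets tensor marking row boundaries, and ops such as jagged_to_padded_dense convert it to a padded dense tensor. A minimal sketch, assuming the (values, offsets, max_lengths, padding_value) schema:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
values = torch.arange(6, dtype=torch.float32).unsqueeze(1)  # (6, 1): inner dim of 1
offsets = torch.tensor([0, 2, 3, 6])  # three rows with lengths 2, 1, 3
dense = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [3], 0.0)
print(dense.shape)  # torch.Size([3, 3, 1])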
Index Select Operators
- [New] batch_index_select_dim0 with TBE backend (#1897)
- [New] Variable input sizes support for group_index_select_dim0 (#1968)
- [Improvement] Improve group_index_select (#1764, #1884) (see the example after this list)
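A minimal sketch of group_index_select_dim0, assuming it takes a list of input tensors and a matching list of index tensors and returns the per-tensor row selections:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
inputs = [torch.randn(10, 4, device="cuda"), torch.randn(20, 8, device="cuda")]
indices = [
    torch.tensor([0, 9], dtype=torch.int64, device="cuda"),
    torch.tensor([5], dtype=torch.int64, device="cuda"),
]
outs = torch.ops.fbgemm.group_index_select_dim0(inputs, indices)
print([tuple(o.shape) for o in outs])  # [(2, 4), (1, 8)]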
Low-precision operators
- [New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
- [New] Column-wise parallel quantization/dequantization (#1743)
- [New] BF16 Support in FP8 quantize ops (#1961)
- [Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)
Pooled Embedding
- [New] reduce_to_one (#1571)
- [New] permute_duplicate_pooled_embeddings op (#1912)
- [New] BF16 support for permute_pooled_embeddings op (#1937)
- [New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
- [New] Backends (Meta) (#1853)
- [Improvement] multi-gpu all_to_one enhancements (#1674, #1962)
Misc
- [New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
- [New] Backends (Meta) (#1709, #1905, #1970, #1971)
- [New] BF16 support in permute_indices_weights_kernel_2 (#1852)
- [New] FP16 and BF16 support in pack_segments (#1708) (see the example after this list)
- [New] BF16 support for HBC ops (#1744)
- [New] BFloat16 support (#1832, #1865)
- [Improvement] Speedup reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)
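A minimal sketch of pack_segments (which gained FP16/BF16 support above), assuming the (t_in, lengths, max_length) schema that packs a flat batch into a padded (num_segments, max_length, ...) tensor:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
t_in = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # (10, 1)
lengths = torch.tensor([3, 2, 5])  # three segments covering all 10 rows
packed = torch.ops.fbgemm.pack_segments(t_in, lengths, 5)  # max_length = 5
print(packed.shape)  # torch.Size([3, 5, 1])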
Benchmarks / Tests
- [New] CLI support to GEMMsBenchmark (#1721, #1725)
- [New] Benchmark for variable batch on TBE (#1559)
- [New] BF16 output test coverage (#1835, #1838)
- [New] Benchmark for reorder_batched_ad_indices (#1895)
- [New] CPU support (#1874, #1926)
- [Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
- [Improvement] Add nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)
Build / CI improvements and Fixes
- [New] C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (#1652)
- [New] ARM Support in OSS CI (#1813)
- [New] SM 9.0 Support for CUDA 12.1 (#1825, #2002)
- [Improvement] General CI and build system enhancement (#1658, #1695, #1697, #1702, #1719, #1751, #1784, #1795, #1836, #1958, #2020, #2024)
- [Improvement] Reorganized code to enable faster builds (#1843, #1849, #1856, #1860, #1863, #1864, #1866, #1886, #1694, #1705, #1710, #1723, #1757, #1783, #1871, #1873, #1879, #1944, #1816, #1753)
FBGEMM_GPU v0.4.1
Release Notes
Software Requirements
FBGEMM_GPU v0.4.1 has been tested and known to work on the following setups:
- PyTorch: v2.0
- CUDA: v11.7, 11.8
- Python: v3.8, 3.9, 3.10, 3.11
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU may be fetched directly from PyPI:
# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.1
# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.1
Changes
This is a minor release whose main purpose is to deliver Python 3.11 support.
FBGEMM_GPU v0.4.0
Release Notes
Software Requirements
FBGEMM_GPU v0.4.0 has been tested and known to work on the following setups:
- PyTorch: v2.0
- CUDA: v11.7, 11.8
- Python: v3.8, 3.9, 3.10 (3.11 not supported yet)
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU may be fetched directly from PyPI:
# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0
# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0
Changes
Table batched embedding (TBE) operators
- [New] SSD for inference TBE (#1473, #1479, #1485, #1517, #1533, #1535)
- [New] Inplace TBE update (#1480, #1482, #1492, #1529)
- [New] BF16 support for inference TBE (#1498, #1503)
- [New] BF16 support for TBE on CPU (#1540, #1583)
- [Improvement] Training TBE backward performance improvement (#1563)
UVM cache improvement
- [New] Delta in-place update (#1436)
- [New] UVM caching stats report (#1623, #1462, #1433, #1570)
- [Improvement] [lfu|lru]_cache_insert_byte_kernel vectorization (#1475)
Jagged Tensor Operators
- [New] Backends (Meta and Autograd) (#1461, #1466, #1467, #1469, #1468, #1477, #1556)
- [New] BF16 support (#1472, #1560)
- [New] FP32 + BF16 hybrid support for jagged_dense_dense_elementwise_add_jagged (#1487)
- [New] Jagged tensors with no inner dense dimension support (#1267)
- [New] New jagged tensor operators (#1557, #1577, #1578, #1579, #1594, #1595)
Index Select Operators
- [New] group_index_select (#1421, #1592)
- [New] index_select for selecting KeyJaggedTensor dim 1 (previously only dim 0 was supported) (#1429)
- [New] jagged_index_select for CPU (#1586)
Low-precision operators
- [New] FP8 rowwise quantized communication (#1423)
Misc
- Support 2D inputs for asynchronous_complete_cumsum (#1573)
Benchmarks / Tests
- [New] nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
- [New] Variable bag sizes for TBE benchmark (#1450)
- [Improvement] Parallel bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)
Build / CI improvements and Fixes
v0.3.2
Minor release