Releases: pytorch/FBGEMM
FBGEMM v1.2.0 Release Notes
Highlights
TBE GPU
- Added support for int64_t table indices and offsets in TBE inference
- Improved TBE benchmark utilities with the introduction of the Embeddings Estimator and Generator (EEG)
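The sketch below illustrates the new index types against the inference TBE module. It is a minimal example assuming the fbgemm_gpu inference API (module name, embedding_specs tuple layout, and the device/fill_random_weights arguments), which may vary slightly across versions:
import torch
from fbgemm_gpu.split_embedding_configs import SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_inference import (
    IntNBitTableBatchedEmbeddingBagsCodegen,
)
# One table: 1000 rows, dim 64, INT8-quantized weights on the GPU
tbe = IntNBitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[("t0", 1000, 64, SparseType.INT8, EmbeddingLocation.DEVICE)],
    device="cuda",
)
tbe.fill_random_weights()
# Indices and offsets may now be int64 as well as int32
indices = torch.tensor([1, 3, 5], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 3], dtype=torch.int64, device="cuda")  # one bag of 3 ids
print(tbe(indices, offsets).shape)  # (1, 64)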
TBE CPU
- Added Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf operator
- Made FloatToFloat16 conversion 75x faster using SVE2 instructions
- Added FP32 GEMM kernels
TBE SSD
- Fixed OOM issues during initialization
- Improvements to L1 and L2 flush
Gen AI Ops
- GenAI ops are now packaged separately into the FBGEMM GenAI package for easier build and installation
- Various FP8 grouped GEMM optimizations
- BF16I4 preshuffled grouped GEMM
- BF16 stacked grouped GEMM
- F8I4 grouped GEMM optimizations
- Added nccl_alltoall function
ROCm
- Added preliminary ROCm OSS build support for GenAI ops
Better Engineering
- Added build support for CUDA 12.8
- Introduced a set of utilities to harden CUDA kernel launches against common runtime errors
Software Requirements
FBGEMM_GPU v1.2.0 has been tested and known to work on the following setups:
- PyTorch: v2.7
- CUDA: v11.8, 12.6, 12.8
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment, such as Conda and/or Docker, for installing and running FBGEMM_GPU and FBGEMM GenAI.
Availability
FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:
# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.2.0
# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.2.0
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu128/
# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0 --index-url https://download.pytorch.org/whl/cpu
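A quick way to verify an installation is to import the package, which registers the operators under torch.ops.fbgemm (a minimal check, assuming a working PyTorch environment):
import torch
import fbgemm_gpu  # importing registers the fbgemm operators with PyTorch
print(torch.__version__)
print(torch.ops.fbgemm.asynchronous_complete_cumsum)  # a registered op handle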
Changes
CPU
GEMM
- [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 5%-15% (#3860)
- [New] Use enum to select floating point format in FbgemmEmbedding APIs (#3842)
- [New] Add generic IEEE754 truncation code (#3820)
- [New] Enable KleidiAI for FP32 (#3818)
- [Improvement] Move float conversion functions from Types.h into new FloatConversion.h (#3760)
- [Fix] Use kleidiAI on static builds (#3806)
- [Fix] Fix KleidiAI FP16 (#3769)
- [Improvement] Pull ARM's matrix transpose PR (#3660)
- [New] Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#3707)
- [Improvement] avoid extra copy in PackedGemmMatrixB constructor (#3691)
- [Improvement] Remove FENV pragma (#3629)
- [Improvement] Make FloatToFloat16 conversion 75x faster using SVE2 instructions (#3626)
- [New] add a new constructor to PackedGemmMatrixB (#3598)
- [New] Move FP32 kernels to OSS (#3568)
GenAI
GenAI Ops
- [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790) (#3942)
- [New] Add harness for comms benchmark (#3936)
- [Improvement] Refactoring of NoPE (#3840)
- [Improvement] support fp16 dtypes for input weight and bias (#3931)
- [Fix] fix fp8 kv cache dequantize kernels (#3896)
- [Improvement] scatter_add 0 size support (#3861)
- [Improvement] Retuned CK GMM fp8/bf16 with perf fixes (#3851)
- [Improvement] Enable groupwise scales for F8I4 Grouped Gemm (#3884)
- [Fix] Fix empty input view. (#3880)
- [New] FP8 Rowwise Dequant Kernel (#3873)
- [New] torch.ops.fbgemm.gather_scale_dense_tokens for oss (#3855)
- [Improvement] Replace rms_norm as norm (#3841)
- [Improvement] Move DeepGemm scale transpose to quantize (#3834)
- [Improvement] follow up to reflect rowwise scale inputs for x, w in quantize_ops scripts (#3839)
- [New] add rowwise scaling support (#3822)
- [Improvement] update to tune for small Ms and quantized gemv (#3712)
- [New] Add Preshuffled FP8 x INT4 Grouped Gemm Kernel (#3800)
- [New] FBGEMM Add Columnwise Weight Scaling to F8I4 GEMM (#3766)
- [Improvement] update the sorting kernel for bf16 ck fmoe kernel (#3817)
- [Fix] fix volatile synchronization with acquire/relax (#3728)
- [Improvement] Force determinism by unswizzle (#3727)
- [New] add fp8 kv nope (#3786)
- [Improvement] move common op to vector utils (#3759)
- [Improvement] Gather/Scatter. (#3743)
- [Improvement] reduce scatter supports last dim (#3726)
- [Improvement] Add custom reduce scatter to llama_comms (#3730)
- [New] Adds shapes information to enable torch.compile. (#3724)
- [Improvement] avoid propagation of NaN (#3723)
- [New] torch.ops.fbgemm.scatter_add_along_first_dim (#3720)
- [New] torch.ops.fbgemm.gather_along_first_dim (#3719) (see the example after this list)
- [New] Paged Attention Support (#3698)
- [New] custom reduce scatter (#3686)
- [Fix] Recover custom collective test (#3687)
- [Improvement] update sweep_utils.py to test more precision gemv kernel (#3678)
- [New] add fp8fp8 fast_gemv_quantized (#3677)
- [New] add mixed precision fp8 fast_gemv_quantized kernel (#3675)
- [Improvement] adjust interface (#3669)
- [Improvement] CK MoE: cherry-pick #1808 (#3609)
- [Improvement] fix llm shapes in quantize bench and add ldm shapes (#3611)
- [Improvement] Return if no data to allreduce (#3586)
- [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
- [Improvement] Make zero_start_index_M optional for dynamic BF16 Grouped Gemm (#3553)
- [New] Add nccl_alltoall function (#3551)
- [New] Add fused_moe kernel to ck_extension (#3518)
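The gather/scatter ops above are exposed through the torch.ops.fbgemm namespace. A minimal sketch of gather_along_first_dim, assuming it takes a source tensor and a 1-D int64 index tensor and returns the gathered rows (the exact schema may differ):
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (loads the GenAI op registrations)
src = torch.randn(8, 16, device="cuda")
index = torch.tensor([3, 0, 5], dtype=torch.int64, device="cuda")
out = torch.ops.fbgemm.gather_along_first_dim(src, index)  # assumed (src, index) signature
print(out.shape)  # torch.Size([3, 16])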
GEMM
- [Improvement] Update cutlass version to 3.8V2 (#3772)
- [Improvement] Update Cutlass to V3.8-2 (#3767)
- [Improvement] fp8_gemm (non_persistent): adding optimal configs for 8k & 16k shapes (#3764)
- [New] new tuning for fp8 rowwise (#3756)
- [Improvement] Add DeepGEMM blockwise GEMM in quantize bench (#3746)
- [Improvement] Enable DeepGEMM in quantize bench (#3745)
- [Improvement] reduce overhead for f8f8bf16_rowwise_grouped_dynamic on amd (#3742)
- [Improvement] Performance Optimization: Optimized TileShape Configuration for f8 (#3617) (#3735)
- [Improvement] Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) (#3710)
- [Improvement] adding an option to skip zeroing output tensor for f8f8bf16_rowwise_grouped_dynamic (#3685)
- [Improvement] Update CK (#3701)
- [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu +10 (#3844)
- [New] Make F8I4 grouped GEMM process M_sizes with INT32 (#3853)
- [Improvement] Skip empty groups in FP8 Stacked Gemm (#3862)
- [New] Enable preshuffled mixed dtype Cutlass Gemm (#3722)
- [Improvement] [CUTLASS] Minor Cutlass change to fix CI (#3779)
- [Improvement] Clean up cutlass FP8 Grouped Gemm Kernel Setup (#3864)
- [New] Modernize bf16 cutlass grouped gemm (#3889)
- [Improvement] [CUTLASS] Include new cutlass support for groupwise mixed dtype grouped gemm. (#3885)
- [New] Add DEEPGEMM Masked API. (#3949)
- [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
- [Improvement] Add correctness testing for shuffled mixed dtype GEMMs. (#3932)
- [New] BF16I4 Preshuffled Grouped Gemm (#3917)
- [New] Preshuffled BF16I4 Gemm Kernel (#3913)
- [New] Enable rowwise scaling for DeepGemm (#3874)
- [New] bf16 stacked group gemm (#3888)
- [New] F8I4 Grouped Gemm Optimization for Sparse M (#3854)
FP8
- [Fix] FBGEMM fp8 ck GEMM fix for irregular GEMM shapes (#3894)
- [Fix] fix stacked version fp8 rowwise group gemm registration in quantize_bench (#3902)
- [Fix] A hotfix for FBGEMM fp8 rowwise with irregular gemm sizes (#3883)
- [Improvement] Transpose FP8 GEMM inputs for better tuning (#3866)
- [New] Enable FP8 Triton dequantized block-wise kernel (#3788)
- [Improvement] Refactor stacked version of FP8 Grouped Gemm for reduced overhead (#3699)
- [Improvement] changing config for fp8 gemm (#3668)
- [Improvement] Add option to disable fast_accumulation for fp8 gemm. (#3714)
- [New] Add cublas FP8 tensorwise GEMM in fbgemm quantize bench (#3693)
- [Improvement] write_k_back for fp8 ROPE (#3679)
- [Improvement] Moves utility functions into a standalone file. (#3671)
- [Fix] Fix f8f8bf16_lite quantize op input in quantize_and_compute (#3667)
- [Improvement] Optimize zero fill (#3666)
- [Improvement] FP8 Grouped Gemm Optimization (#3655)
- [New] Add sweep_utils.py script to tune heuristics (#3656)
- [Improvement] loosen unit test atol/rtol tolerance to eliminate ut flakiness (#3664)
- [New] Port oss f16_fast_gemv into fbcode (#3610)
- [New] fp8 rowwise regular gemm tuning for llm new shapes (#3654)
- [Improvement] k_norm in rope for fp8 kv cache (#3633)
- [Improvement] Fix zero_start_index_M argument for triton rowwise quantize (#3639)
- [Fix] Fix handling of dynamic FP8 grouped gemm on Nvidia (#3616)
- [Improvement] Improve FP8 grouped GEMM perf via tileshape and cooperative (#3653)
- [Improvement] Refactor FP8 grouped GEMM with dynamic and static versions (#3561)
- [New] Support FP8 grouped GEMM with rowwise scaling (#3560)
- [Fix] [CUTLASS] Use custom copy of cutlass to enable FP8 Grouped Gemm. (#3649)
- [Fix] kv_dq zero initialization to avoid NaNs from FA3 (#3632)
- [Improvement] amd fp8 rowwise batched gemm tuning (#3624)
- [Improvement] Improve handling for FP8 grouped gemm without zero_start_index_M (#3615)
- [New] amd fp8 rowwise gemm prefill shape tuning (#36...
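To make the rowwise FP8 naming above concrete: each row of the activations and weights is scaled into FP8 E4M3 range, and the f8f8bf16_rowwise op named throughout these notes multiplies the quantized operands and rescales to BF16. The quantization helper below is hand-rolled for illustration, and the op's argument order is an assumption:
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (loads the GenAI op registrations)
def rowwise_fp8_quantize(t):
    # Scale each row so its absolute max maps to the FP8 E4M3 max (448.0)
    scale = t.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12) / 448.0
    return (t / scale).to(torch.float8_e4m3fn), scale.squeeze(1)
x = torch.randn(16, 64, device="cuda", dtype=torch.bfloat16)  # activations (M x K)
w = torch.randn(32, 64, device="cuda", dtype=torch.bfloat16)  # weights (N x K)
xq, x_scale = rowwise_fp8_quantize(x)
wq, w_scale = rowwise_fp8_quantize(w)
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)  # assumed argument order
print(y.shape)  # torch.Size([16, 32])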
FBGEMM_GPU v1.1.0 Release Notes
Highlights
TBE GPU
- Introduced support for int32_t indices in TBE training
- Extended TBE support for larger embedding dimensions
- Made the learning rate a tensor value
- Improvements on indices bounds checking
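A sketch tying the first three highlights together: a training TBE module constructed with a scalar learning rate (now carried internally as a tensor) and driven with int32 indices. This is a minimal example assuming the fbgemm_gpu training API; argument names may vary across versions:
import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, placement, compute device)
        (10_000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    learning_rate=0.01,  # stored as a tensor value per this release
)
# int32 indices are now accepted in training (int64 remains supported)
indices = torch.tensor([1, 5, 7, 42], dtype=torch.int32, device="cuda")
offsets = torch.tensor([0, 2, 4], dtype=torch.int32, device="cuda")  # two bags
out = tbe(indices=indices, offsets=offsets)  # shape: (2, 128)
out.sum().backward()  # the fused optimizer applies the update during backward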
TBE CPU
- Improved ARM support with SVE implementations for matrix multiplication and float matrix transpose
- Improved the EmbeddingSpMDMAutovec API
- Migrated FP32 ops to OSS
TBE SSD
- Enabled VBE in SSD-TBE
- Async initialization of RocksDB SSD tensors and padding before writing to RocksDB
- Improvements on indices bounds and other constraints checking
Gen AI Ops
- Custom allgather support multiple dtypes, with dtype checking to prevent silent failures
ROCm
- Add CK FP8 Batched GEMM and Rowwise GEMM kernels along with heuristic tuning
- Fixed CK FP8 rowwise quantization for some GEMM shapes
- Introduced HIP-specific optimizations to the TBE forward and backward passes
SLL ops
- Migrated Sequence Learning Library (SLL) ops to OSS
Better Engineering
- Restructured the build to produce multiple smaller shared libraries instead of a single large binary
- New and improved tests and benchmarks
- Improved ROCm build variant support
- Add build support for CUDA 12.6 and Python 3.13
Software Requirements
FBGEMM_GPU v1.1.0 has been tested and known to work on the following setups:
- PyTorch: v2.6
- CUDA: v11.8, 12.4, 12.6
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.1.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.1.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu124/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu126/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table Batched Embedding (TBE) operators
For GPU
- [New] Add support for int32_t indices in TBE training (#3377, #3375, #3374, #3372, #3371, #3324, #3267, #3264, #3263, #3257)
- [New] Add support for int64_t indices and offsets in TBE inference (#3254, #3233)
- [New] Extend TBE support for larger embedding dimensions (#3462, #3467)
- [New] Make learning rate a tensor (Backend) (#3287, #3310, #3332)
- [New] Add PTA checks to embedding_bounds_check kernels (#3318)
- [Fix] Fix PackedTensorAccessor for batch_index_select (#3281)
- [Fix] Set cache_precision = weights_precision in TBE if it is not explicitly set (#3370)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3238)
- [Fix] Fix PT2 compliant opcheck tests (#3404)
- [Fix] Fix FBGEMM_GPU_MEMCHECK in Split optimizers (#3416)
- [Fix] Fix learning rate as tensor for PT2 compile (#3407)
- [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
- [New] Add new optimizer state row_counter for Adam [Backend] (#3342)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [7C/N]" (#3258)
- [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [8/N]" (#3255)
- [Fix] Fix global weight decay Faketensor test (#3341)
- [Fix] Fix pt2_wrapper registration for unified TBE interface (#3237)
- [Fix] Fix ""Cannot call numel() on tensor with symbolic sizes/strides"" (#3368)
- [Fix] Fix grid size overflow in generate_vbe_metadata (#3484)
- [Fix] Fix an integer overflow in permute_multi_embedding() (#3465)
- [Fix] Fix the sync point caused by iter_cpu.item() (#3401)
- [Fix] Hot fix to skip VBE CPU reshaping for MTIA (#3466)
- [Fix] Address memory overuse during flushing (#3460)
- [Improvement] Add iter singular value into TBE optimizer state (#3228)
- [Improvement] V2 fwd modified warps (#3570)
- [Improvement] Add enable_async_update into tbe signature and config (#3431, #3461)
- [Improvement] Adjust kNumThreads for bounds_check_indices_kernel (#3299)
- [Improvement] Reduce registers in bounds_check_indices (#3298)
- [Improvement] Mark unified autograd function traceable (#3378)
- [Improvement] Improve bounds_check_indices for VBE (#3388, #3386)
- [Improvement] Do not call scalar_type (#3394)
- [Improvement] optimizer 1d -- EMA in place (fbgemm part) (#3402)
- [Improvement] Clean up nbit_forward tests (#3286)
- [Improvement] Remove unused-variable in some generated code (#3327)
- [Improvement] Limit grid size of bounds_check_indices (#3282)
- [Improvement] Support config based bound check version via extended modes (#3418)
- [Improvement] Use int64_t index for SplitOptimizer grad (#3447)
- [Improvement] Remove unused arg from generate_vbe_metadata frontend (#3453)
- [Improvement] Add generate_vbe_metadata test (#3483)
- [Improvement] Support config based bound check version via extended modes (#3454)
- [Improvement] make iter PT2 compatible (#3253)
- [Improvement] Add meta function for PT2 wrappers (#3240)
- [Improvement] Nesterov (#3232)
For CPU
- [New] Introduce SVE function for matrix multiplication (#3348)
- [New] Add sve implementation for float matrix transpose (#3421)
- [New] autovec specialization framework (#3393)
- [New] Move FP32 kernels to OSS (#3568)
- [Improvement] Pull in PR for Kleidi-based FP16 kernel (#3507)
- [Improvement] Use local buffer where possible (#3304)
- [Improvement] Refactor GenerateEmbeddingXXX functions (#3307)
- [Improvement] Increase local_storage size to 512 floats (#3357)
- [Improvement] Adjust EmbeddingSpMDMAutovec API (#3366)
- [Improvement] Split loops to work around loop vectorizer weakness (#3406)
- [Improvement] Do an early check that data_size is not negative (#3305)
- [Improvement] Fix strict aliasing violation, code cleanup (#3306)
SSD TBE Operators
- [New] Enable VBE in SSD-TBE (#3247)
- [Improvement] put KVTensorWrapper in its own header (#3575)
- [Improvement] Moving KVTensorWrapper to a header file to be used in ModelStore checkpointing code (#3276)
- [Improvement] Async initialization of RocksDB SSD tensors (#3520)
- [Improvement] pad before writing to rocksDB (#3245)
- [Improvement] use RocksDB iterator to read key range from ssd embedding (#3495)
- [Improvement] Log total duration spent prefetching (#3487)
- [Improvement] Address memory overuse during flushing (#3460)
- [Improvement] Create move TBE to right device, and set Cache Load in TBE class (#3438)
- [Improvement] Unit test for new move tbe from device/cache_load method (#3437)
- [Improvement] make L2/rocksdb update async optional (#3429)
- [Improvement] Drop RoPE when filling KV cache (#3346)
- [Improvement] Remove setting total_cache_hash_size as buffer (#3441)
- [Improvement] Add meta registrations for kv_cache operators (#3442)
- [Improvement] remove output dtype restriction in SSD TBE (#3524)
- [Improvement] change pmt require grad to false when detached (#3525)
- [Improvement] add more attributes to PartiallyMaterializedTensor (#3300)
- [Improvement] skip broken inference test that uses ssd TBE (#3494)
- [Improvement] "coro => fut" (#3430)
- [Improvement] Reland of D65489998 Optimize sharding performance of embeddings (#3549)
- [Improvement] Remove torch.jit.script (#3562)
GenAI Support and Operators
- [New] Add nccl_alltoall function (#3551)
- [New] custom allgather support multiple dtypes (#3498)
- [Improvement] Make sure fake tensor functions return on proper device (#3258)
- [Improvement] Add CPU registrations to custom operators (#3262)
- [Improvement] Check src & dst dtypes in allgather to prevent silent failures (#3523)
- [Improvement] Better shape function registration (#3237, #3340)
- [Improvement] Package re-organization improvements (#3546, #3251, #3419, #3268, #3512)
FP8 and other Quantization support
- [New] New autotune config for M=4 (#3277)
- [New] MoE FP8 grouped GEMM (#3321)
- [New] Add shape check on GroupedGEMM kernel (#3449)
- [New] Tuning for fp8 gemm with emu1.7 shapes (#3436)
- [Improvement] more fp8 tuning for decode and not need to pad (#3576)
- [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
- [Improvement] Split FP8 Grouped Gemm into dynamic and static version (#3543)
- [Improvement] Warp-specialized FP8 rowwise GEMM kernel (#3532)
- [Improvement] Add Cutlass FP8 Grouped Gemm to Quantize Bench (#3530)
- [Improvement] Fixed FBGEMM fp8 rowwise for irregular shapes (#3491)
- [Improvement] Properly define preallocated output as mutable in fp8 rowwise gemm (#3476)
- [Improvement] Fix FP8 Rowwise Gemm Compilation with Auto-functionalize V2 (#3457)
- [Improvement] Support zero-size inputs in FP8 cuda quantize kernel (#3448)
- [Improvement] update FP8 GEMM tuning for emu1.7 7B shapes (#3391)
- [Improvement] Customize FP8 grouped GEMM for non-zero calculation for token choice MoE (#3383)
- [Improvement] Support FP8 grouped GEMM with cudagraph (#3373)
- [Improvement] Refactor FP8 grouped GEMM to prepare cudagraph support (#3369)
- [Improvement] Improve FP8 BMM heuristic for large shapes and MoE E2E performance (#3344)
- [Improvement] retune some of the EMU1.6 7B FP8 GEMM shapes (#3328)
- [Improvement] Make FP8 BMM output contiguous (#3270)
- [Improvement] Tune FP8 rowwise bmm tile heuristic (#3256)
- [Improvement] more FP8 GEMM tuning for LDM shapes (#3414)
- [Improvement] Split up f8f8bf16_rowwise_batched.cu (#3381)
- [Improvement] use sym int in quantize.cpp for f8f8bf16_rowwise_meta (#3410)
- [Improvement] Remove triton.ops dependency from fbgemm (#3329)...
FBGEMM_GPU v1.0.0 Release Notes
Stable API
We provide the stable API support starting from FBGEMM_GPU v1.0.0. This includes Table batched embedding (TBE) modules, Pooled embedding operators and modules, Sparse operators, Jagged tensor operators and Quantization operators.
- API backward compatibility guarantees via thorough testing. We guarantee that our stable APIs will be backward compatible within a major version, meaning that the stable APIs for v1.0.0 will be compatible with every future release unless explicitly announced in advance.
- Enhanced documentation, ensuring that every stable API has comprehensive and up-to-date documentation.
- Functionality guarantees are only provided through our unit testing framework. We do NOT guarantee any functionalities that are NOT explicitly tested and documented in our unit tests.
- No performance guarantees. However, we are committed to providing support on a best-effort basis.
More details can be found in the stable API documentation.
Highlights
Table Batched Embedding (TBE)
- New optimizer support for TBE Training
- Enhanced Global weight decay support in TBE
- Improvement and bug fixes for TBE training and inference modules and sparse operators
For SSD
- New pipeline prefetching enabled
- New cache and indices related ops
- Integration of L2 cache to TBE operators
- Many improvements to kernel and logging
For CPU
- New type support for CPU Sequence TBE
- Kernel improvements and bug fixes
Generative AI
- Gen AI Ops support and improvement
- Improvements to Triton-based and CUTLASS-based operators
- New and optimized FP8 GEMM and quantization operators
Others
- Optimized MX4 quantization operators
- New dequantization operator
- Removal of Python 3.8 support
Better engineering
- Code refactoring and reorganization for faster builds
- New and improved tests and benchmarks
- Improved AMD support
Software Requirements
FBGEMM_GPU v1.0.0 has been tested and known to work on the following setups:
- PyTorch: v2.5
- CUDA: v11.8, 12.1, 12.4
- Python: v3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.0.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.0.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
For GPU
- [New] Ensemble adagrad optimizer (#3197, #2955, #2954, #3161, #3091, #2981, #2889, #3180, #3158)
- [New] Bounds check in prefetch in TBE training (#3015)
- [New] Method to update internal hyperparameters for FBGEMM TBE (#3025)
- [Improvement] Enhanced Global Weight Decay and state tracking (#2904, #2897, #2882, #2896, #2890, #2884, #2883)
- [Improvement] masked_index_* values index type fix (#2979)
- [Improvement] generate_vbe_metadata fixes (#3095, #3087)
- [Improvement] Fixes on the efficiency of VBE TBE forward due to blocking D2H copy (#2862)
- [Improvement] Work around offsets and indices type mismatch in TBE training (#3037)
- [Improvement] Add a host map option for a UVM tensor alloc (#3073)
- [Improvement] uvm_to_device expose device as interface (#3030)
- [Improvement] Add Meta backend/dispatcher for new_unified_tensor (#3005)
- [Improvement] General TBE enhancements and bug fixes (#2892, #3114, #3022, #2958)
- [Improvement] Consolidate repeat code in TBE inference (#3028)
For CPU
- [New] Add int4 to int4 CPU Sequence TBE kernel (#2996, #2994)
- [New] Use auto-vec kernel in CPU sequential embedding lookup for int8 tables (#2863, #2878)
- [Improvement] Work around OMP barrier issue with MSVC and unused var error (#2918, #3084)
SSD Table batched embedding (TBE) operators
- [New] Enable pipeline prefetching (#2963)
- [New] Enable cache line locking support in SSD kernel (#2949)
- [New] Add L2 flush (#3110)
- [New] Added SSD ODS and IO/mem stats (#2906, #2913, #3035)
- [New] Add SSDScratchPadIndicesQueue (#2911, #2948)
- [New] Integrate l2 cache to TBE operator (#2959, #3032, #3031)
- [New] Add ssd_update_row_addrs (#2953)
- [New] Add bounds check in SSD-TBE (#3013)
- [New] Add 32-bit index support in SSD kernels (#3064)
- [New] Add kv cache related ops (#3001, #2968)
- [New] Add compact_indices op (#3075)
- [New] Create embedding cache interface and impl RocksDB cache (#2858)
- [New] Reduce prefetch SM usage when using pipeline prefetching (#2991)
- [New] Add a host map option for a UVM tensor alloc (#3003)
- [New] Add masked_index_select and refactor masked_index_put (#2910)
- [Improvement] Add parallelism on cache update (#3062)
- [Improvement] add parameter server attributes (#2947)
- [Improvement] Make the scratch pad tensor UVA (#2844)
- [Improvement] Use less thread blocks for find_uncached kernel (#3101)
- [Improvement] Fix stream sync for scratch pad eviction (#2843)
- [Improvement] Make indices related to cache eviction UVA tensors (#3077)
- [Improvement] Split cachelib cache into header and src (#3063)
- [Improvement] Record more functions and logging in SSD TBE (#2854, #2867, #2975)
- [Improvement] Attach eviction filling logic to set_cache (#3034)
- [Improvement] Move set_cache and set_async to background thread (#3033)
- [Improvement] Refactoring vec copy in masked_index_put_kernel (#2861, #2908)
- [Improvement] Increase memcpy and compute overlap (#2860)
- [Improvement] Add set_async in background thread (#3036)
- [Improvement] Make evicted_rows a UVA buffer (#3079)
- [Improvement] General enhancement and bug fixes (#2937, #2993, #3151, #3089, #2898, #2930)
GenAI Support and Operators
- [New] Decode and Prefill support (#3009)
- [New] Support rope with block tables (#3146)
- [New] EP support (#3071)
- [New] Implement SDPA kernel wrapper to use run_kernel flow for perf (#2820)
- [Improvement] Move mqa code (#3011)
- [Improvement] BE improvements to init_comms (#3103)
Triton GEMM support
- [New] Enable torch.compile compatibility for triton fp8 rowwise gemm (#2978)
- [New] Add 3D+ input support for fp8 rowwise GEMM (#2845)
- [New] GEMM custom op enablement (#3046)
- [Improvement] Add fused bias to Triton FP8 Rowwise Kernels (#2852)
- [Improvement] Triton dependency (#3027)
- [Improvement] Fix triton fp8 handling of non-contiguous inputs (#2919)
- [Improvement] More autotune configs and bug fixes in TMA kernel (#3078, #3066, #3072)
- [Improvement] Fp8 gemm tweak for 405B Decoding (#3104)
FP8 and other Quantization support
- [New] CK FP8 Optimizations and fixes (#2940, #2912, #2987, #3017, #2893)
- [New] FP8 kernel development and enablement (#2866)
- [New] GenAI CK Version update and integration (#2865, #2971)
- [Improvement] Also hipify the fp8 related cuda functions (#2834)
- [Improvement] Auto-generation of CUTLASS Extension Kernel Templates (#2932)
- [Improvement] Marlin Mixed Input Kernel Productionization (#3008)
- [Improvement] Remove redundant torch.abs (#3020, #2822)
- [Improvement] Tuning for 405B/70B Prefill with small seqlen (#3042)
- [Improvement] Added new instances for 405B decoding (#2936)
Permute and Pooled Embeddings Ops
- [New] Implementation of permute_multi_embedding (#2833)
- [Improvement] Clean up and removal of unused exception (#2832, #2891)
- [Improvement] Use at::parallel_for in cpu kernel (#2817)
- [Improvement] Add dispatch_to_cpu for the operators (#2874, #2881)
- [Improvement] Print the exact variable values triggering the alert in Merge Pooled Embedding (#3038)
Sparse Operators
- [New] Support original indices for FBGEMM block bucketization flag (#2999, #2925)
- [Improvement] Fix pack_segments backward when grad is non-contig (#3006)
- [Improvement] Fix FBGEMM_GPU_MEMCHECK in sparse_ops_cuda (#2943)
- [Improvement] Update sparse_ops.py to use generic gpu target fbgemm_gpu:input_combine to support both Nvidia and AMD (#2905)
- [Improvement] Add abstract impl and functions (#2962, #2983, #3000)
- [Improvement] Use guard_size_oblivious in tbe_input_combine_abstract fake kernel (#2923)
- [Improvement] Out variant for asynchronous_exclusive_cumsum_cpu + some more static dispatch kernels (#3090)
Quantize ops
- [New] Add a CPU nbit to float dequantization op that supports torch.quintMxN type (#2995)
MX4 Ops
- [New] Optimize FBGEMM Triton MX4 Quantize-Dequantize (#2838, #2837)
- [New] Rounding Mode Support (#2821, #2816, #2933, #2859)
- [New] FBGEMM/TorchRec MX4 padding support (#3055, #3047, #3010)
- [New] Add Stochastic downcasting to MX4 Quantization (#2899)
- [New] Support for other MX4 formats in Triton kernels (#2900)
- [Improvement] Refactor MX4 Kernel to operate on flat tensors (#2836)
- [Improvement] Optimize MX4 padding to minimize need for tuning (#3040)
Benchmarks / Tests
- [New] Add schema compatibility test (#3130)
- [New] Add SSD/UVM caching in TBE device benchmark (#3076)
- [New] Add EmbeddingSpMDM8BitBenchmarkOutTypeFloat16 (#2952 )
- [New] Add benchmark EmbeddingSpMDMNBitBenchmarkOutTypeFloat16 (#2901 )
- [New] Add unit test for int4 to int4 sequence CPU TBE (#2997)
- [New] Add rocm support for fp8 benchmarks (#2965)
- [New] Add rotating buffer feature to quantize_bench (#2857)
- [New] Benchmark of fbge...
FBGEMM_GPU v0.8.0 Release Notes
Release Note
Highlights
Table Batched Embedding
For GPU
- New Table Batched Embedding (TBE) operators and momentum type support
- New Intraining Embedding Pruning (ITEP) operators
- VBE support for Dense TBE
- Global weight decay support in TBE
- New type support and improvement to SSD TBE
- Improvement and bug fixes for TBE training and inference modules and sparse operators
For MTIA
- MTIA support for DenseTBE
Generative AI
- Gen AI Ops integration
- Support for Triton-based and CUTLASS-based operators (#2552, #2537)
- New FP8 GEMM and quantization operators
- New query attention operators
- New Car and All-To-All (NCCL-based) communication operators
- AMD Support for FP8
Others
- New MX4 quantization operators
- Support for CUDA 12.4
Better engineering
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- Improved AMD support
Software Requirements
FBGEMM_GPU v0.8.0 has been tested and known to work on the following setups:
- PyTorch: v2.4
- CUDA: v11.8, 12.1, 12.4
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.8.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.8.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
For GPU
- [New] VBE support for Dense TBE (#2628, #2620, #2641)
- [New] BF16 momentum support in PARTIAL_ROWWISE_ADAM (#2524, #2522, #2518)
- [New] Global weight decay support (#2516, #2507, #2506)
- [New] Multi-pass prefetch for memory efficiency (#2566)
- [Improvement] Work around masked_select for numel > MAX_INT (#2648)
- [Improvement] Fused optim in backward capability with aot_autograd (#2651)
- [Improvement] Weights mutations declaration in TBE backward ops schemas (#2698)
- [Improvement] Helper ops to support cache conflict misses (#2571)
- [Improvement] Fixed the hang issue in some TBE GPU optimizers (#2509)
- [Improvement] Misc TBE fixes and refactoring (#2583, #2597, #2529)
- [Improvement] Cache prefetch and conflict miss improvements (#2596, #2514)
For MTIA
- [New] Support MTIA in DenseTableBatchedEmbeddingBagsCodegen (#2680)
SSD Table batched embedding (TBE) operators
- [New] Add FP16 weight and output support to SSD TBE (#2638)
- [New] Implementation of PS KV DB for FBGEMM TBE operator (#2664, #2642)
- [Improvement] Removal of D->H sync when calling lxu_cache_lookup (#2672)
- [Improvement] Recording of functions in SSD TBE (#2670)
- [Improvement] Added options, assertions and logs for training and inference SSD TBE (#2689, #2657)
- [Improvement] SSD TBE backend fixes (#2645, #2671)
New Operator Groups
- [New] Intraining Embedding Pruning (ITEP) ops (#2700, #2690, #2682)
- [New] Populate bucketize permute kernel (#2533)
- [New] MX4 quantization support (#2709, #2703, #2696, #2675, #2659)
GenAI FP8 Operators
- [New] FP8 enablement (#2615, #2637)
- [New] CK FP8 GEMM kernels (#2630)
- [New] FP8 Rowwise GEMM (#2585, #2622)
- [New] FP8 quantization and conversions to FP32/FP16 (#2686, #2681, #2593, #2540, #2677)
- [New] FP8 blockwise GEMM (#2676, #2600)
- [New] Triton-based FP8 GEMM and quantization support (#2701, #2688, #2643)
- [New] AMD support for FP8 (#2582, #2658, #2611)
GenAI Support and Operators
- [New] Integrated Gen AI ops into the build (#2512)
- [New] Support for Triton-based operators (#2570, #2618)
- [New] Support for CUTLASS-based operators (#2552, #2537)
- [New] Car and All-To-All (NCCL-based) communication ops (#2606, #2667, #2631, #2624)
- [New] Grouped query attention ops (#2673, #2504)
- [New] CK BF16 GEMM (#2617)
- [New] W4A8 GEMM kernels (#2558, #2607)
Pooled Embeddings
- [Improvement] Clean up unused pooled embedding ops (#2626)
- [Improvement] PyTorch compatibility fixes (#2619, #2629)
Sparse Operators
- [Improvement] Increased dynamic shared memory size to support larger bucket sizes (#2500)
- [Improvement] UINT8 support for reorder sequence embedding operator (#2531)
- [Improvement] Fixed CPU blocking D2H in JaggedIndexSelect2dOp backward (#2510)
Benchmarks / Tests
- [New] Unified benchmarks and unit tests for FP8 (#2609, #2699, #2666)
- [Improvement] SSD TBE benchmarks (#2579, #2580)
- [Improvement] SSD TBE tests (#2665, #2647)
- [Improvement] Fixes for TBE tests and benchmarks (#2632)
- [Improvement] nbit_cache benchmark bandwidth calculation (#2511)
Build / CI improvements and Fixes
- [New] Support for CUDA 12.4 (#2565)
- [Improvement] Improved AMD support (#2541, #2679)
- [Improvement] Strengthened artifact installation process (#2491)
- [Improvement] Memcheck added across operators (#2576, #2574, #2572, #2612, #2594, #2589, #2578)
- [Improvement] Refactoring of large header files (#2650)
- [Improvement] Improved build scripts to support debug flags and custom (i.e. GenAI) variants (#2702)
FBGEMM_GPU v0.7.0
Release Note
Highlights
- New optimizer and output type supports for Table Batched Embedding (TBE) training
- Improvement and bug fixes for TBE variable batch size
- Enhanced TBE pipeline prefetching for UVM caching
- Many improvements on TBE CPU kernels
- New and enhanced low-precision operators
- Code refactoring and reorganization for faster builds
- New tests and benchmarks
- PyTorch 2 support for various operators
- Clang compilation support
Software Requirements
FBGEMM_GPU v0.7.0 has been tested and known to work on the following setups:
- PyTorch: v2.3
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.7.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.7.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [New] Added BF16 output support in TBE training (#2382)
- [New] Added support for int8 output for sequence embeddings (#2316)
- [New] Added an auto-vectorization implementation for CPU TBE-NBit kernel with user selection (#2182, #2299)
- [New] Added CowClip optimizer (#2226, #2243)
- [Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
- [Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
- [Improvement] Caching and cache lookup for pipeline prefetching fixes and enhancements (#2164, #2309, #2287, #2308)
- [Improvement] Built hip rules by default (#2380)
- [New] Added a method to TBE module to recompute buffers (#2338)
- [New] Added meta functions for PyTorch 2 support (#2347)
- [New] Added support for MTIA in TBE modules (#2273, #2286)
- [Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
- [Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
- [Deprecation] Optimizers deprecated (#2253, #2252)
- [Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
- [Deprecation] Removed INT8 weight/output support from TBE GPU training
Jagged Tensor Operators
- [Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
- [Improvement] Fixed half->int build error (#2240)
Index Select Operators
- [Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)
Low-precision operators
- [New] CPU implementation of per-channel quantize operator (#2341)
- [New] CPU implementation for qlinear_channelwise operator (#2343)
- [New] Enabled CPU int8 output to dequantization to bf16 on CUDA (#2242)
- [New] Enabled dequantization for bf16 (#2241)
Pooled Embedding
- [Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)
Misc
- [New] Implementation of CPU version of all_to_one_device (#2251)
- [Improvement] Performance improvement of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
- [New] Created cumem_utils_cpu and added to all_deps_cpu (#2215)
- [New] Added float support to asynchronous_complete_cumsum_cpu (#2383) (see the example after this list)
- [Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
- [New] STBE GPU coalescing kernel (#2275)
- [Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
- [New] GPU timing and basic reporting framework (#2314)
- [Improvement] Fixes and FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)
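For reference, asynchronous_complete_cumsum (extended above with float support on CPU) turns a lengths tensor into a complete offsets tensor, i.e. a cumulative sum with a prepended zero. A minimal sketch:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
lengths = torch.tensor([2, 0, 3], dtype=torch.int64)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 2, 2, 5])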
Benchmarks / Tests
- [New] Added dynamic quantize GEMM benchmark (#2297, #2295, #2271)
- [New] Added a new CPU nbit-TBE benchmark that tries to reduce CPU frequency noise (#2306)
- [New] Added unit test for stochastic rounding for UVM caching (#2324)
- [New] Added unit test AsyncSeriesTimer (#2364)
- [New] Added int32 overflow unit test for TBE UVM caching (#2303)
- [Improvement] Disabled dynamo testing in TBE (#2381)
- [Improvement] Refactored and re-organized tests (#2305, #2292, #2291, #2284, #2281, #2274, #2272, #2266, #2263, #2260, #2407, #2406, #2402, #2304, #2399, #2393)
- [Improvement] General fixes for tests and benchmarks (#2301, #2300, #2298, #2255, #2205, #2296)
Build / CI improvements and Fixes
- [Improvement] Optimized EmbeddingSpMDMNBit_autovec (#2267)
- [Improvement] Switched between hip and cuda c++ lib so load (#2236)
- [Improvement] Fixed bf16 support issues (#2238)
- [New] Enabled Clang compilation in OSS for fbgemm_gpu (CPU and CUDA) (#2334, #2345, #2330, #2323)
- [New] Upgraded ROCm version (#2405)
- [Improvement] Enabled -Winfinite-recursion in deeplearning/PACKAGE (#2329)
- [Improvement] Fixed shadowed variable in deeplearning/fbgemm/src/GroupwiseConv.cc (#2268)
- [Improvement] General CI and build system enhancement (#2489, #2430, #2427, #2423, #2356, #2348, #2342, #2328, #2307, #2211, #2219, #2220, #2228, #2233)
- [Improvement] Documentation enhancement (#2294, #2278, #2258, #2249, #2227, #2232, #2244, #2239, #2237)
FBGEMM_GPU v0.6.0
Release Note
Highlights
- Improvement and bug fixes for TBE variable batch size
- Many TBE extensions and benchmarks
- Enhanced TBE pipeline prefetching for UVM caching
- Code refactoring and reorganization for faster builds
- Many improvements and new sparse ops added
- Improved low precision ops
- Support for Python 3.12
- PyTorch 2 support for various operators
Software Requirements
FBGEMM_GPU v0.6.0 has been tested and known to work on the following setups:
- PyTorch: v2.2
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11, 3.12
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.6.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
- [Improvement] caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
- [New] Support MTIA device type in FBGEMM TBE training (#1994)
- [New] Enable sequence TBE CPU via AVX (#2195)
- [New] Enable subwarp only for unweighted (#2051)
- [New] Add meta functions (#2094, #2102)
- [New] Add reverse qparam option for MTIA (#2109)
- [New] uvm_cache_stats for direct mapped (#1951, #1952)
- [Improvement] use memcpy for cpu emb inplace update (#2166)
- [Improvement] Remove indices and offsets copying from prefetch (#2186)
- [Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
- [Improvement] General fixes and enhancements (#2030, #2009)
Jagged Tensor Operators
- [Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
- [Improvement] Fix non-contiguous tensor problem in jagged_index_select (#2060, #2061)
Index Select Operators
- [Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)
Low-precision operators
- [New] Add BF16 in padded FP8 quantize ops (#2010)
- [Improvement] Improve quantize_comm error message (#2018)
- [Improvement] Fix illegal memory access error and initialize empty values on fp8 quantize kernel (#2131, #2176)
Pooled Embedding
- [New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
- [Improvement] Use PyTorch's p2p access enable function (#2000)
- [New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
- [Improvement] Improve all_to_one error message (#2019)
- [New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
- [New] Add variable batch per feature support to EBC (tw/cw only) (#1986)
Misc
- [New] Add meta backend for new_managed_tensor and sparse ops (#1990, #2028, #2029, #2072)
- [New] Use 4k page instead of 2M for managed tensor (#2058)
- [New] Add BF16 support for reorder_batched_ad_indices (#2116)
- [New] SymInts for sparse ops (#2017, #2089)
- [New] Support for CPU/GPU compilation (#2040)
- [New] Add impl_abstract (#2084, #2087, #2090, #2097, #2098, #2129, #2132)
- [Improvement] Make FBGEMM PT2 compliant (#2174, #2172, #2170, #2180, #2181, #2201, #2198)
- [Improvement] Fix invalid CUDA configuration error for the empty input (#1993)
Benchmarks / Tests
- [New] Benchmark block_bucketize_sparse_features uneven sharding (#2140, #2169)
- [New] Add unit test for unique cache lookup (#2160)
- [New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
- [New] Add test for fbgemm ops. (#2136, #2082)
- [Improvement] Modified TBE testbench to use FBGEMM generate_requests function to generate indices and offsets (#1882)
- [Improvement] Remove FP64 from TBE CPU tests (#2049)
- [Improvement] Add warmup_runs to TBE benchmarks and run at least 1 warmup iter (#2163)
- [Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
- [Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)
Build / CI improvements and Fixes
- [Improvement] General CI and build system enhancement (#2065, #2071, #2078, #2149, #2189, #2203, #2204, #2209, #2047)
- [Improvement] Reorganized code to enable faster builds (#1881, #2083, #2085, #2095, #2141, #2112, #2133, #2145, #2196, #2100, #2103)
- [New] Add support for Python 3.12 (#2194)
- [New] Updates for ROCm 5.6, 5.7 and 6.0 support and Hip.cmake changes (#2066, #2088, #2106)
- [New] Add debug flags for HIP runs (#2206)
- [Improvement] unknown c++ flag detection in CMake (#2057)
- [Improvement] Fix inconsistent dll linkage warning (#2059, #2064)
- [Improvement] Fix heap-buffer-overflow in radix_sort_parallel (#2075)
- [Improvement] Update AVX2 and AVX512 flags (#2167)
FBGEMM_GPU v0.5.0
Release Notes
Highlights
- TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
- Many TBE extensions including defused TBE backward-optimizer, variable batch size support, pipeline prefetching support for UVM caching
- Many improvements and new sparse ops added
- ARM support
- SM 9.0 support for CUDA 12.1 for H100 GPUs
- PyTorch 2 support for various operators, i.e., jagged tensor, pooled embedding ops
Software Requirements
FBGEMM_GPU v0.5.0 has been tested and known to work on the following setups:
- PyTorch: v2.1
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [Improvement] TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement) (#1641, #1804, #1787, #1904)
- [New] Variable batch size support to TBE training (#1653, #1752, #1633, #1634, #1713, #1717, #1943)
- [New] BFloat16 support for TBE CPU (#1839, #1851)
- [New] Defused TBE backward-optimizer and SplitTBE optimizer (#1819, #1820, #1821)
- [New] Max norm support for rowwise_adagrad (#1781)
- [New] Support for 1024-2048 embedding dimension in TBE inference (#1656)
- [Improvement] Backends via PyTorch dispatcher (#1948, #1976)
- [Improvement] Deprecate many TBE optimizers (#1766, #1767, #1771, #1796, #1774, #1773, #1775, #1791, #1793)
- [New] TBE UVM cache pipeline prefetching (#1883, #1893)
Jagged Tensor Operators
- [New] New jagged tensor operators (#1690)
- [New] Backends (Meta) (#1880, #1960)
- [Improvement] Jagged operator optimizations (#1643, #1646, #1644, #1661, #1662, #1691, #1692, #1777)
- [Improvement] Symbolic shape tracing on jagged operators for PyTorch 2 (#1758)
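As background for the operators above: a jagged tensor is a flat values tensor plus an offsets tensor marking row boundaries, and ops such as jagged_to_padded_dense convert it to a padded dense tensor. A minimal sketch, assuming the (values, offsets, max_lengths, padding_value) schema:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
values = torch.arange(6, dtype=torch.float32).unsqueeze(1)  # (6, 1): inner dim of 1
offsets = torch.tensor([0, 2, 3, 6])  # three rows with lengths 2, 1, 3
dense = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [3], 0.0)
print(dense.shape)  # torch.Size([3, 3, 1])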
Index Select Operators
- [New] batch_index_select_dim0 with TBE backend (#1897)
- [New] Variable input sizes support for group_index_select_dim0 (#1968)
- [Improvement] Improve group_index_select (#1764, #1884) (see the example after this list)
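A minimal sketch of group_index_select_dim0, assuming it takes a list of input tensors and a matching list of index tensors and returns the per-tensor row selections:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
inputs = [torch.randn(10, 4, device="cuda"), torch.randn(20, 8, device="cuda")]
indices = [
    torch.tensor([0, 9], dtype=torch.int64, device="cuda"),
    torch.tensor([5], dtype=torch.int64, device="cuda"),
]
outs = torch.ops.fbgemm.group_index_select_dim0(inputs, indices)
print([tuple(o.shape) for o in outs])  # [(2, 4), (1, 8)]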
Low-precision operators
- [New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
- [New] Column-wise parallel quantization/dequantization (#1743)
- [New] BF16 Support in FP8 quantize ops (#1961)
- [Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)
Pooled Embedding
- [New] reduce_to_one (#1571)
- [New] permute_duplicate_pooled_embeddings op (#1912)
- [New] BF16 support for permute_pooled_embeddings op (#1937)
- [New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
- [New] Backends (Meta) (#1853)
- [Improvement] multi-gpu all_to_one enhancements (#1674, #1962)
Misc
- [New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
- [New] Backends (Meta) (#1709, #1905, #1970, #1971)
- [New] BF16 support in permute_indices_weights_kernel_2 (#1852)
- [New] FP16 and BF16 support in pack_segments (#1708) (see the example after this list)
- [New] BF16 support for HBC ops (#1744)
- [New] BFloat16 support (#1832, #1865)
- [Improvement] Speedup reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)
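A minimal sketch of pack_segments (which gained FP16/BF16 support above), assuming the (t_in, lengths, max_length) schema that packs a flat batch into a padded (num_segments, max_length, ...) tensor:
import torch
import fbgemm_gpu  # noqa: F401  (registers the fbgemm operators)
t_in = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # (10, 1)
lengths = torch.tensor([3, 2, 5])  # three segments covering all 10 rows
packed = torch.ops.fbgemm.pack_segments(t_in, lengths, 5)  # max_length = 5
print(packed.shape)  # torch.Size([3, 5, 1])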
Benchmarks / Tests
- [New] CLI support to GEMMsBenchmark (#1721, #1725)
- [New] Benchmark for variable batch on TBE (#1559)
- [New] BF16 output test coverage (#1835, #1838)
- [New] Benchmark for reorder_batched_ad_indices (#1895)
- [New] CPU support (#1874, #1926)
- [Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
- [Improvement] Add nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)
Build / CI improvements and Fixes
- [New] C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (#1652)
- [New] ARM Support in OSS CI (#1813)
- [New] SM 9.0 Support for CUDA 12.1 (#1825, #2002)
- [Improvement] General CI and build system enhancement (#1658, #1695, #1697, #1702, #1719, #1751, #1784, #1795, #1836, #1958, #2020, #2024)
- [Improvement] Reorganized code to enable faster builds (#1843, #1849, #1856, #1860, #1863, #1864, #1866, #1886, #1694, #1705, #1710, #1723, #1757, #1783, #1871, #1873, #1879, #1944, #1816, #1753)
FBGEMM_GPU v0.4.1
Release Notes
Software Requirements
FBGEMM_GPU v0.4.1 has been tested and known to work on the following setups:
- PyTorch: v2.0
- CUDA: v11.7, 11.8
- Python: v3.8, 3.9, 3.10, 3.11
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU may be fetched directly from PyPI:
# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.1
# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.1
Changes
This is a minor release whose main purpose is to deliver Python 3.11 support.
FBGEMM_GPU v0.4.0
Release Notes
Software Requirements
FBGEMM_GPU v0.4.0 has been tested and known to work on the following setups:
- PyTorch: v2.0
- CUDA: v11.7, 11.8
- Python: v3.8, 3.9, 3.10 (3.11 not supported yet)
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
Availability
FBGEMM_GPU may be fetched directly from PyPI:
# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0
# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0
Changes
Table batched embedding (TBE) operators
- [New] SSD for inference TBE (#1473, #1479, #1485, #1517, #1533, #1535)
- [New] Inplace TBE update (#1480, #1482, #1492, #1529)
- [New] BF16 support for inference TBE (#1498, #1503)
- [New] BF16 support for TBE on CPU (#1540, #1583)
- [Improvement] Training TBE backward performance improvement (#1563)
UVM cache improvement
- [New] Delta in-place update (#1436)
- [New] UVM caching stats report (#1623, #1462, #1433, #1570)
- [Improvement] [lfu|lru]_cache_insert_byte_kernel vectorization (#1475)
Jagged Tensor Operators
- [New] Backends (Meta and Autograd) (#1461, #1466, #1467, #1469, #1468, #1477, #1556)
- [New] BF16 support (#1472, #1560)
- [New] FP32 + BF16 hybrid support for jagged_dense_dense_elementwise_add_jagged (#1487)
- [New] Jagged tensors with no inner dense dimension support (#1267)
- [New] New jagged tensor operators (#1557, #1577, #1578, #1579, #1594, #1595)
Index Select Operators
- [New] group_index_select (#1421, #1592)
- [New] index_select for selecting KeyJaggedTensor dim 1 (previously only dim 0 was supported) (#1429)
- [New] jagged_index_select for CPU (#1586)
Low-precision operators
- [New] FP8 rowwise quantized communication (#1423)
Misc
- Support 2D inputs for asynchronous_complete_cumsum (#1573)
Benchmarks / Tests
- [New] nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
- [New] Variable bag sizes for TBE benchmark (#1450)
- [Improvement] Parallel bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)
Build / CI improvements and Fixes
v0.3.2
Minor release