
Releases: pytorch/FBGEMM

FBGEMM v1.2.0 Release Notes

27 Apr 08:31

Highlights

TBE GPU

  • Added support for int64_t table indices and offsets in TBE inference
  • Improved TBE benchmark utilities with the introduction of the Embeddings Estimator and Generator (EEG)

TBE CPU

  • Added Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf operator
  • Made FloatToFloat16 conversion 75x faster using SVE2 instructions
  • Added FP32 GEMM kernels

TBE SSD

  • Fixed OOM issues during initialization
  • Improvements to L1 and L2 flush

Gen AI Ops

  • GenAI ops are now packaged separately as the FBGEMM GenAI package for easier build and installation (see the usage sketch after this list)
  • Various FP8 grouped GEMM optimizations
  • BF16I4 preshuffled grouped GEMM
  • BF16 stacked grouped GEMM
  • F8I4 grouped GEMM optimizations
  • Added nccl_alltoall function
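
Once the FBGEMM GenAI package is installed, its operators are exposed under torch.ops.fbgemm. The snippet below is a minimal, hedged sketch of calling an FP8 rowwise GEMM through that namespace; the fbgemm_gpu.experimental.gen_ai import path, the f8f8bf16_rowwise op name, and its (XQ, WQ, x_scale, w_scale) argument order are assumptions based on current fbgemm-gpu-genai builds, not a definitive reference.

# Hedged sketch: FP8 rowwise GEMM via the FBGEMM GenAI package (requires a CUDA GPU)
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (assumed import path; registers GenAI ops)

def quantize_rowwise_fp8(t):
    # Per-row dequantization scale so each row's max maps to the FP8 e4m3 max (448).
    scale = t.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12) / 448.0
    return (t / scale).to(torch.float8_e4m3fn), scale.squeeze(1)

x = torch.randn(16, 128, device="cuda", dtype=torch.bfloat16)
w = torch.randn(64, 128, device="cuda", dtype=torch.bfloat16)
xq, x_scale = quantize_rowwise_fp8(x)
wq, w_scale = quantize_rowwise_fp8(w)
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)  # (16, 64) in bf16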

ROCm

  • Added preliminary ROCm OSS build support for GenAI ops

Better Engineering

  • Added build support for CUDA 12.8
  • Introduced a set of utilities to harden CUDA kernel launches against common runtime errors

Software Requirements

FBGEMM_GPU v1.2.0 has been tested and known to work on the following setups:

  • PyTorch: v2.7
  • CUDA: v11.8, 12.6, 12.8
  • Python: v3.9, 3.10, 3.11, 3.12, 3.13

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU and FBGEMM GenAI (see the installation instructions in the documentation).

Availability

FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:

# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.2.0

# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.2.0

# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0

Alternatively, they can be fetched from PyTorch PIP:

# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu128/

# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu

# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0 --index-url https://download.pytorch.org/whl/cpu
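
After installation, a quick sanity check is to import the package, which registers the FBGEMM operators under torch.ops.fbgemm, and round-trip a tensor through one of the quantization ops. A minimal sketch, assuming a working PyTorch install alongside either the CPU or CUDA wheel:

# Minimal post-install sanity check; works with fbgemm-gpu or fbgemm-gpu-cpu
import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm operators)

# Round-trip a random FP32 matrix through the fused 8-bit rowwise format.
x = torch.randn(4, 16)
xq = torch.ops.fbgemm.FloatToFused8BitRowwiseQuantized(x)
xd = torch.ops.fbgemm.Fused8BitRowwiseQuantizedToFloat(xq)
print("max abs round-trip error:", (x - xd).abs().max().item())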

Changes

CPU

GEMM

  • [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 5%-15% (#3860)
  • [New] Use enum to select floating point format in FbgemmEmbedding APIs (#3842)
  • [New] Add generic IEEE754 truncation code (#3820)
  • [New] Enable KleidiAI for FP32 (#3818)
  • [Improvement] Move float conversion functions from Types.h into new FloatConversion.h (#3760)
  • [Fix] Use KleidiAI on static builds (#3806)
  • [Fix] Fix KleidiAI FP16 (#3769)
  • [Improvement] Pull ARM's matrix transpose PR (#3660)
  • [New] Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#3707)
  • [Improvement] avoid extra copy in PackedGemmMatrixB constructor (#3691)
  • [Improvement] Remove FENV pragma (#3629)
  • [Improvement] Make FloatToFloat16 conversion 75x faster using SVE2 instructions (#3626)
  • [New] add a new constructor to PackedGemmMatrixB (#3598)
  • [New] Move FP32 kernels to OSS (#3568)

GenAI

GenAI Ops

  • [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790) (#3942)
  • [New] Add harness for comms benchmark (#3936)
  • [Improvement] Refactoring of NoPE (#3840)
  • [Improvement] support fp16 dtypes for input weight and bias (#3931)
  • [Fix] fix fp8 kv cache dequantize kernels (#3896)
  • [Improvement] scatter_add 0 size support (#3861)
  • [Improvement] Retuned CK GMM fp8/bf16 with perf fixes (#3851)
  • [Improvement] Enable groupwise scales for F8I4 Grouped Gemm (#3884)
  • [Fix] Fix empty input view. (#3880)
  • [New] FP8 Rowwise Dequant Kernel (#3873)
  • [New] torch.ops.fbgemm.gather_scale_dense_tokens for oss. (#3855)
  • [Improvement] Replace rms_norm as norm (#3841)
  • [Improvement] Move DeepGemm scale transpose to quantize (#3834)
  • [Improvement] follow up to reflect rowwise scale inputs for x, w in quantize_ops scripts (#3839)
  • [New] add rowwise scaling support (#3822)
  • [Improvement] update to tune for small ms and quantized gemv (#3712)
  • [New] Add Preshuffled FP8 x INT4 Grouped Gemm Kernel (#3800)
  • [New] FBGEMM Add Columnwise Weight Scaling to F8I4 GEMM (#3766)
  • [Improvement] update the sorting kernel for bf16 ck fmoe kernel (#3817)
  • [Fix] fix volatile synchronization with acquire/relax (#3728)
  • [Improvement] Force determinism by unswizzle (#3727)
  • [New] add fp8 kv nope (#3786)
  • [Improvement] move common op to vector utils (#3759)
  • [Improvement] Gather/Scatter. (#3743)
  • [Improvement] reduce scatter supports last dim (#3726)
  • [Improvement] Add custom reduce scatter to llama_comms (#3730)
  • [New] Adds shapes information to enable torch.compile. (#3724)
  • [Improvement] avoid propagation of NaN (#3723)
  • [New] torch.ops.fbgemm.scatter_add_along_first_dim (#3720)
  • [New] torch.ops.fbgemm.gather_along_first_dim. (#3719)
  • [New] Paged Attention Support (#3698)
  • [New] custom reduce scatter (#3686)
  • [Fix] Recover custom collective test (#3687)
  • [Improvement] update sweep_utils.py to test more precision gemv kernel (#3678)
  • [New] add fp8fp8 fast_gemv_quantized (#3677)
  • [New] add mixed precision fp8 fast_gemv_quantized kernel (#3675)
  • [Improvement] adjust interface (#3669)
  • [Improvement] CK MoE: cherry-pick #1808 (#3609)
  • [Improvement] fix llm shapes in quantize bench and add ldm shapes (#3611)
  • [Improvement] Return if no data to allreduce (#3586)
  • [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
  • [Improvement] Make zero_start_index_M optional for dynamic BF16 Grouped Gemm (#3553)
  • [New] Add nccl_alltoall function (#3551)
  • [New] Add fused_moe kernel to ck_extension (#3518)

GEMM

  • [Improvement] Update cutlass version to 3.8V2 (#3772)
  • [Improvement] Update Cutlass to V3.8-2 (#3767)
  • [Improvement] fp8_gemm (non_persistent): adding optimal configs for 8k & 16k shapes (#3764)
  • [New] new tuning for fp8 rowwise (#3756)
  • [Improvement] Add DeepGEMM blockwise GEMM in quantize bench (#3746)
  • [Improvement] Enable DeepGEMM in quantize bench (#3745)
  • [Improvement] reduce overhead for f8f8bf16_rowwise_grouped_dynamic on amd (#3742)
  • [Improvement] Performance Optimization: Optimized TileShape Configuration for f8 (#3617) (#3735)
  • [Improvement] Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) (#3710)
  • [Improvement] adding an option to skip zeroing output tensor for f8f8bf16_rowwise_grouped_dynamic (#3685)
  • [Improvement] Update CK (#3701)
  • [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu +10 (#3844)
  • [New] Make F8I4 grouped GEMM process M_sizes with INT32 (#3853)
  • [Improvement] Skip empty groups in FP8 Stacked Gemm (#3862)
  • [New] Enable preshuffled mixed dtype Cutlass Gemm (#3722)
  • [Improvement] [CUTLASS] Minor Cutlass change to fix CI (#3779)
  • [Improvement] Clean up cutlass FP8 Grouped Gemm Kernel Setup (#3864)
  • [New] Modernize bf16 cutlass grouped gemm (#3889)
  • [Improvement] [CUTLASS] Include new cutlass support for groupwise mixed dtype grouped gemm. (#3885)
  • [New] Add DEEPGEMM Masked API. (#3949)
  • [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
  • [Improvement] Add correctness testing for shuffled mixed dtype GEMMs. (#3932)
  • [New] BF16I4 Preshuffled Grouped Gemm (#3917)
  • [New] Preshuffled BF16I4 Gemm Kernel (#3913)
  • [New] Enable rowwise scaling for DeepGemm (#3874)
  • [New] bf16 stacked group gemm (#3888)
  • [New] F8I4 Grouped Gemm Optimization for Sparse M (#3854)

FP8

  • [Fix] FBGEMM fp8 ck GEMM fix for irregular GEMM shapes (#3894)
  • [Fix] fix stacked version fp8 rowwise group gemm registration in quantize_bench (#3902)
  • [Fix] A hotfix for FBGEMM fp8 rowwise with irregular gemm sizes (#3883)
  • [Improvement] Transpose FP8 GEMM inputs for better tuning (#3866)
  • [New] Enable FP8 Triton dequantized block-wise kernel (#3788)
  • [Improvement] Refactor stacked version of FP8 Grouped Gemm for reduced overhead (#3699)
  • [Improvement] changing config for fp8 gemm (#3668)
  • [Improvement] Add option to disable fast_accumulation for fp8 gemm. (#3714)
  • [New] Add cublas FP8 tensorwise GEMM in fbgemm quantize bench (#3693)
  • [Improvement] write_k_back for fp8 ROPE (#3679)
  • [Improvement] Moves utility functions into a standalone file. (#3671)
  • [Fix] Fix f8f8bf16_lite quantize op input in quantize_and_compute (#3667)
  • [Improvement] Optimize zero fill (#3666)
  • [Improvement] FP8 Grouped Gemm Optimization (#3655)
  • [New] Add sweep_utils.py script to tune heuristics (#3656)
  • [Improvement] loose unit test atol rtol tolerance to eliminate ut flakiness (#3664)
  • [New] Port oss f16_fast_gemv into fbcode (#3610)
  • [New] fp8 rowwise regular gemm tuning for llm new shapes (#3654)
  • [Improvement] k_norm in rope for fp8 kv cache (#3633)
  • [Improvement] Fix zero_start_index_M argument for triton rowwise quantize (#3639)
  • [Fix] Fix handling of dynamic FP8 grouped gemm on Nvidia (#3616)
  • [Improvement] Improve FP8 grouped GEMM perf via tileshape and cooperative (#3653)
  • [Improvement] Refactor FP8 grouped GEMM with dynamic and static versions (#3561)
  • [New] Support FP8 grouped GEMM with rowwise scaling (#3560)
  • [Fix] [CUTLASS] Use custom copy of cutlass to enable FP8 Grouped Gemm. (#3649)
  • [Fix] kv_dq zero initialization to avoid NaNs from FA3 (#3632)
  • [Improvement] amd fp8 rowwise batched gemm tuning (#3624)
  • [Improvement] Improve handling for FP8 grouped gemm without zero_start_index_M (#3615)
  • [New] amd fp8 rowwise gemm prefill shape tuning (#36...

FBGEMM_GPU v1.1.0 Release Notes

29 Jan 20:54

Highlights

TBE GPU

  • Introduced support for int32_t indices in TBE training
  • Extended TBE support for larger embedding dimensions
  • Made the learning rate a tensor value
  • Improved indices bounds checking

TBE CPU

  • Improved ARM support with SVE implementations for matrix multiplication and float matrix transpose
  • Improved the EmbeddingSpMDMAutovec API
  • Migrated FP32 ops to OSS

TBE SSD

  • Enabled VBE in SSD-TBE
  • Async initialization of RocksDB SSD tensors and padding before writing to RocksDB
  • Improved checking of indices bounds and other constraints

Gen AI Ops

  • Custom allgather now supports multiple dtypes, with dtype checking to prevent silent failures

ROCm

  • Added CK FP8 Batched GEMM and Rowwise GEMM kernels along with heuristic tuning
  • Fixed CK FP8 rowwise quantization for some GEMM shapes
  • Introduced HIP-specific optimizations to the TBE forward and backward passes

SLL ops

  • Migrated Sequence Learning Library (SLL) ops to OSS

Better Engineering

  • Restructured the build to produce multiple smaller shared libraries instead of a single large binary
  • New and improved tests and benchmarks
  • Improved ROCm build variant support
  • Added build support for CUDA 12.6 and Python 3.13

Software Requirements

FBGEMM_GPU v1.1.0 has been tested and known to work on the following setups:

  • PyTorch: v2.6
  • CUDA: v11.8, 12.4, 12.6
  • Python: v3.9, 3.10, 3.11, 3.12, 3.13

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.1.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.1.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu124/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu126/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table Batched Embedding (TBE) operators

For GPU

  • [New] Add support for int32_t indices in TBE training (#3377, #3375, #3374, #3372, #3371, #3324, #3267, #3264, #3263 #3257)
  • [New] Add support for int64_t indices and offsets in TBE inference (#3254, #3233)
  • [New] Extend TBE support for larger embedding dimensions (#3462, #3467)
  • [New] Make learning rate tensor (Backend) (#3287, #3310, #3332)
  • [New] Add PTA checks to embedding_bounds_check kernels (#3318)
  • [Fix] Fix PackedTensorAccessor for batch_index_select (#3281)
  • [Fix] Set cache_precision = weights_precision in TBE if it is not explicitly set (#3370)
  • [Fix] Fix pt2_wrapper registration for unified TBE interface (#3238)
  • [Fix] Fix PT2 compliant opcheck tests (#3404)
  • [Fix] Fix FBGEMM_GPU_MEMCHECK in Split optimizers (#3416)
  • [Fix] Fix learning rate as tensor for PT2 compile (#3407)
  • [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
  • [New] Add new optimizer state row_counter for Adam [Backend] (#3342)
  • [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [7C/N]" (#3258)
  • [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [8/N]" (#3255)
  • [Fix] Fix global weight decay Faketensor test (#3341)
  • [Fix] Fix pt2_wrapper registration for unified TBE interface (#3237)
  • [Fix] Fix ""Cannot call numel() on tensor with symbolic sizes/strides"" (#3368)
  • [Fix] Fix grid size overflow in generate_vbe_metadata (#3484)
  • [Fix] Fix an integer overflow in permute_multi_embedding() (#3465)
  • [Fix] Fix the sync point caused by iter_cpu.item() (#3401)
  • [Fix] Hot fix to skip VBE CPU reshaping for MTIA (#3466)
  • [Fix] address mem over used during flushing (#3460)
  • [Improvement] Add iter singular value into TBE optimizer state (#3228)
  • [Improvement] V2 fwd modified warps (#3570)
  • [Improvement] Add enable_async_update into tbe signature and config (#3431, #3461)
  • [Improvement] Adjust kNumThreads for bounds_check_indices_kernel (#3299)
  • [Improvement] Reduce registers in bounds_check_indices (#3298)
  • [Improvement] Mark unified autograd function traceable (#3378)
  • [Improvement] Improve bounds_check_indices for VBE (#3388, #3386)
  • [Improvement] Do not call scalar_type (#3394)
  • [Improvement] optimizer 1d -- EMA in place (fbgemm part) (#3402)
  • [Improvement] Clean up nbit_forward tests (#3286)
  • [Improvement] Remove unused-variable in some generated code (#3327)
  • [Improvement] Limit grid size of bounds_check_indices (#3282)
  • [Improvement] Support config based bound check version via extended modes (#3418)
  • [Improvement] Use int64_t index for SplitOptimizer grad (#3447)
  • [Improvement] Remove unused arg from generate_vbe_metadata frontend (#3453)
  • [Improvement] Add generate_vbe_metadata test (#3483)
  • [Improvement] Support config based bound check version via extended modes (#3454)
  • [Improvement] make iter PT2 compatible (#3253)
  • [Improvement] Add meta function for PT2 wrappers (#3240)
  • [Improvement] Nesterov (#3232)

For CPU

  • [New] Introduce SVE function for matrix multiplication (#3348)
  • [New] Add sve implementation for float matrix transpose (#3421)
  • [New] autovec specialization framework (#3393)
  • [New] Move FP32 kernels to OSS (#3568)
  • [Improvement] Pull in PR for Kleidi-based FP16 kernel (#3507)
  • [Improvement] Use local buffer where possible (#3304)
  • [Improvement] Refactor GenerateEmbeddingXXX functions (#3307)
  • [Improvement] Increase local_storage size to 512 floats (#3357)
  • [Improvement] Adjust EmbeddingSpMDMAutovec API (#3366)
  • [Improvement] Split loops to work around loop vectorizer weakness (#3406)
  • [Improvement] Do an early check that data_size is not negative (#3305)
  • [Improvement] Fix strict aliasing violation, code cleanup (#3306)

SSD TBE Operators

  • [New] Enable VBE in SSD-TBE (#3247)
  • [Improvement] put KVTensorWrapper in its own header (#3575)
  • [Improvement] Moving KVTensorWrapper to a header file to be used in ModelStore checkpointing code (#3276)
  • [Improvement] Async initialization of RocksDB SSD tensors (#3520)
  • [Improvement] pad before writing to RocksDB (#3245)
  • [Improvement] use RocksDB iterator to read key range from ssd embedding (#3495)
  • [Improvement] Log total duration spent prefetching (#3487)
  • [Improvement] address mem over used during flushing (#3460)
  • [Improvement] Create move TBE to right device, and set Cache Load in TBE class (#3438)
  • [Improvement] Unit test for new move tbe from device/cache_load method (#3437)
  • [Improvement] make L2/rocksdb update async optional (#3429)
  • [Improvement] Drop RoPE when filling KV cache (#3346)
  • [Improvement] Remove setting total_cache_hash_size as buffer (#3441)
  • [Improvement] Add meta registrations for kv_cache operators (#3442)
  • [Improvement] remove output dtype restriction in SSD TBE (#3524)
  • [Improvement] change pmt require grad to false when detached (#3525)
  • [Improvement] add more attributes to PartiallyMaterializedTensor (#3300)
  • [Improvement] skip broken inference test that uses ssd TBE (#3494)
  • [Improvement] "coro => fut" (#3430)
  • [Improvement] Reland of D65489998 Optimize sharding performance of embeddings (#3549)
  • [Improvement] Remove torch.jit.script (#3562)

GenAI Support and Operators

  • [New] Add nccl_alltoall function (#3551)
  • [New] custom allgather support multiple dtypes (#3498)
  • [Improvement] Make sure fake tensor functions return on proper device (#3258)
  • [Improvement] Add CPU registrations to custom operators (#3262)
  • [Improvement] Check src & dst dtypes in allgather to prevent silent failures (#3523)
  • [Improvement] Better shape function registration (#3237, #3340)
  • [Improvement] Package re-organization improvements (#3546, #3251, #3419, #3268, #3512)

FP8 and other Quantization support

  • [New] New autotune config for M=4 (#3277)
  • [New] MoE FP8 grouped GEMM (#3321)
  • [New] Add shape check on GroupedGEMM kernel (#3449)
  • [New] Tuning for fp8 gemm with emu1.7 shapes (#3436)
  • [Improvement] more fp8 tuning for decode and not need to pad (#3576)
  • [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
  • [Improvement] Split FP8 Grouped Gemm into dynamic and static version (#3543)
  • [Improvement] Warp-specialized FP8 rowwise GEMM kernel (#3532)
  • [Improvement] Add Cutlass FP8 Grouped Gemm to Quantize Bench (#3530)
  • [Improvement] Fixed FBGEMM fp8 rowwise for irregular shapes (#3491)
  • [Improvement] Properly define preallocated output as mutable in fp8 rowwise gemm (#3476)
  • [Improvement] Fix FP8 Rowwise Gemm Compilation with Auto-functionalize V2 (#3457)
  • [Improvement] Support zero-size inputs in FP8 cuda quantize kernel (#3448)
  • [Improvement] update FP8 GEMM tuning for emu1.7 7B shapes (#3391)
  • [Improvement] Customize FP8 grouped GEMM for non-zero calculation for token choice MoE (#3383)
  • [Improvement] Support FP8 grouped GEMM with cudagraph (#3373)
  • [Improvement] Refactor FP8 grouped GEMM to prepare cudagraph support (#3369)
  • [Improvement] Improve FP8 BMM heuristic for large shapes and MoE E2E performance (#3344)
  • [Improvement] retune some of the EMU1.6 7B FP8 GEMM shapes (#3328)
  • [Improvement] Make FP8 BMM output contiguous (#3270)
  • [Improvement] Tune FP8 rowwise bmm tile heuristic (#3256)
  • [Improvement] more FP8 GEMM tuning for LDM shapes (#3414)
  • [Improvement] Split up f8f8bf16_rowwise_batched.cu (#3381)
  • [Improvement] use sym int in quantize.cpp for f8f8bf16_rowwise_meta (#3410)
  • [Improvement] Remove triton.ops dependency from fbgemm (#3329)...

FBGEMM_GPU v1.0.0 Release Notes

19 Oct 20:57

Stable API

We provide stable API support starting from FBGEMM_GPU v1.0.0. This includes Table Batched Embedding (TBE) modules, pooled embedding operators and modules, sparse operators, jagged tensor operators, and quantization operators.

  • API backward compatibility guarantees via thorough testing. We guarantee that our stable APIs will be backward compatible within a major version, meaning that the stable APIs for v1.0.0 will be compatible with every future release unless explicitly announced in advance.
  • Enhanced documentation, ensuring that every stable API has comprehensive and up-to-date documentation.
  • Functionality guarantees are only provided through our unit testing framework. We do NOT guarantee any functionality that is not explicitly tested and documented in our unit tests.
  • No performance guarantees. However, we are committed to providing support on a best-effort basis.

More details can be found in the stable API documentation.
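
To make the scope of the stable API concrete, the sketch below exercises the TBE training module. It is illustrative only: the import paths, the embedding_specs tuple layout, and the fused-optimizer behavior shown are assumptions based on recent fbgemm_gpu releases; consult the stable API documentation for the authoritative interface.

# Hedged sketch of the TBE training module (assumes a CUDA GPU)
import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# One embedding table with 1000 rows of dimension 8, held in GPU memory.
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(1000, 8, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    learning_rate=0.01,
)

# Two bags in the batch: rows {1, 3} and {5}; offsets follow the usual CSR convention.
indices = torch.tensor([1, 3, 5], dtype=torch.long, device="cuda")
offsets = torch.tensor([0, 2, 3], dtype=torch.long, device="cuda")
pooled = tbe(indices=indices, offsets=offsets)  # shape (2, 8)
pooled.sum().backward()  # the optimizer update is fused into the backward pass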

Highlights

Table Batched Embedding (TBE)

  • New optimizer support for TBE Training
  • Enhanced Global weight decay support in TBE
  • Improvements and bug fixes for TBE training and inference modules and sparse operators

For SSD

  • New pipeline prefetching enabled
  • New cache and indices related ops
  • Integration of the L2 cache into TBE operators
  • Many improvements to kernels and logging

For CPU

  • New type support for CPU Sequence TBE
  • Kernel improvements and bug fixes

Generative AI

  • Gen AI Ops support and improvement
  • Improvements to Triton-based and CUTLASS-based operators
  • New and optimized FP8 GEMM and quantization operators

Others

  • Optimized MX4 quantization operators
  • New dequantization operator
  • Removal of Python 3.8 support

Better engineering

  • Code refactoring and reorganization for faster builds
  • New and improved tests and benchmarks
  • Improved AMD support

Software Requirements

FBGEMM_GPU v1.0.0 has been tested and known to work on the following setups:

  • PyTorch: v2.5
  • CUDA: v11.8, 12.1, 12.4
  • Python: v3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.0.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.0.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

For GPU

  • [New] Ensemble adagrad optimizer (#3197, #2955, #2954, #3161, #3091, #2981, #2889, #3180, #3158)
  • [New] Bounds check in prefetch in TBE training (#3015)
  • [New] Method to update internal hyperparameters for FBGEMM TBE (#3025)
  • [Improvement] Enhanced Global Weight Decay and state tracking (#2904, #2897, #2882, #2896, #2890, #2884, #2883 )
  • [Improvement] masked_index_* values index type fix (#2979)
  • [Improvement] generate_vbe_metadata fixes (#3095, #3087)
  • [Improvement] Fixes on the efficiency of VBE TBE forward due to blocking D2H copy (#2862)
  • [Improvement] Work around offsets and indices type mismatch in TBE training (#3037)
  • [Improvement] Add a host map option for a UVM tensor alloc (#3073)
  • [Improvement] uvm_to_device expose device as interface (#3030)
  • [Improvement] Add Meta backend/dispatcher for new_unified_tensor (#3005)
  • [Improvement] General TBE enhancements and bug fixes (#2892, #3114, #3022, #2958)
  • [Improvement] Consolidate repeat code in TBE inference (#3028)

For CPU

  • [New] Add int4 to int4 CPU Sequence TBE kernel (#2996, #2994)
  • [New] Use auto-vec kernel in CPU sequential embedding lookup for int8 tables (#2863, #2878)
  • [Improvement] Work around OMP barrier issue with MSVC and unused var error (#2918, #3084)

SSD Table batched embedding (TBE) operators

  • [New] Enable pipeline prefetching (#2963)
  • [New] Enable cache line locking support in SSD kernel (#2949)
  • [New] Add L2 flush (#3110)
  • [New] Added SSD ODS and IO/mem stats (#2906, #2913, #3035)
  • [New] Add SSDScratchPadIndicesQueue (#2911, #2948)
  • [New] Integrate l2 cache to TBE operator (#2959, #3032, #3031 )
  • [New] Add ssd_update_row_addrs (#2953)
  • [New] Add bounds check in SSD-TBE (#3013)
  • [New] Add 32-bit index support in SSD kernels (#3064)
  • [New] Add kv cache related ops (#3001, #2968)
  • [New] Add compact_indices op (#3075 )
  • [New] Create embedding cache interface and impl RocksDB cache (#2858)
  • [New] Reduce prefetch SM usage when using pipeline prefetching (#2991)
  • [New] Add a host map option for a UVM tensor alloc (#3003)
  • [New] Add masked_index_select and refactor masked_index_put (#2910)
  • [Improvement] Add parallelism on cache update (#3062)
  • [Improvement] add parameter server attributes (#2947)
  • [Improvement] Make the scratch pad tensor UVA (#2844)
  • [Improvement] Use less thread blocks for find_uncached kernel (#3101)
  • [Improvement] Fix stream sync for scratch pad eviction (#2843)
  • [Improvement] Make indices related to cache eviction UVA tensors (#3077)
  • [Improvement] Split cachelib cache into header and src (#3063)
  • [Improvement] Record more functions and logging in SSD TBE (#2854, #2867, #2975)
  • [Improvement] Attach eviction filling logic to set_cache (#3034)
  • [Improvement] Move set_cache and set_async to background thread (#3033)
  • [Improvement] Refactoring vec copy in masked_index_put_kernel (#2861, #2908)
  • [Improvement] Increase memcpy and compute overlap (#2860)
  • [Improvement] Add set_async in background thread (#3036 )
  • [Improvement] Make evicted_rows a UVA buffer (#3079 )
  • [Improvement] General enhancement and bug fixes (#2937, #2993, #3151, #3089, #2898, #2930)

GenAI Support and Operators

  • [New] Decode and Prefill support (#3009 )
  • [New] Support rope with block tables (#3146)
  • [New] EP support (#3071)
  • [New] Implement SDPA kernel wrapper to use run_kernel flow for perf (#2820)
  • [Improvement] Move mqa code (#3011)
  • [Improvement] BE improvements to init_comms (#3103)

Triton GEMM support

  • [New] Enable torch.compile compatibility for triton fp8 rowwise gemm (#2978)
  • [New] Add 3D+ input support for fp8 rowwise GEMM (#2845)
  • [New] GEMM custom op enablement (#3046)
  • [Improvement] Add fused bias to Triton FP8 Rowwise Kernels (#2852)
  • [Improvement] Triton dependency (#3027)
  • [Improvement] Fix triton fp8 handling of non-contiguous inputs (#2919)
  • [Improvement] More autotune configs and bug fixes in TMA kernel (#3078, #3066, #3072)
  • [Improvement] Fp8 gemm tweak for 405B Decoding (#3104 )

FP8 and other Quantization support

  • [New] CK FP8 Optimizations and fixes (#2940, #2912, #2987, #3017, #2893)
  • [New] FP8 kernel development and enablement (#2866)
  • [New] GenAI CK Version update and integration (#2865, #2971)
  • [Improvement] Also hipify the fp8 related cuda functions (#2834 )
  • [Improvement] Auto-generation of CUTLASS Extension Kernel Templates (#2932)
  • [Improvement] Marlin Mixed Input Kernel Productionization (#3008)
  • [Improvement] Remove redundant torch.abs (#3020, #2822 )
  • [Improvement] Tuning for 405B/70B Prefill with small seqlen (#3042)
  • [Improvement] Added new instances for 405B decoding (#2936 )

Permute and Pooled Embeddings Ops

  • [New] Implementation of permute_multi_embedding (#2833)
  • [Improvement] Clean up and removal of unused exception (#2832, #2891)
  • [Improvement] Use at::parallel_for in cpu kernel (#2817)
  • [Improvement] Add dispatch_to_cpu for the operators (#2874, #2881)
  • [Improvement] Print the exact variable values triggering the alert in Merge Pooled Embedding (#3038)

Sparse Operators

  • [New] Support original indices for FBGEMM block bucketization flag (#2999, #2925)
  • [Improvement] Fix pack_segments backward when grad is non-contig (#3006)
  • [Improvement] Fix FBGEMM_GPU_MEMCHECK in sparse_ops_cuda (#2943 )
  • [Improvement] Update sparse_ops.py to use generic gpu target fbgemm_gpu:input_combine to support both nvidia and AMD (#2905)
  • [Improvement] Add abstract impl and functions (#2962, #2983, #3000 )
  • [Improvement] Use guard_size_oblivious in tbe_input_combine_abstract fake kernel (#2923)
  • [Improvement] Out variant for asynchronous_exclusive_cumsum_cpu + some more static dispatch kernels (#3090)

Quantize ops

  • [New] Add a CPU nbit to float dequantization op that supports torch.quintMxN type (#2995)

MX4 Ops

  • [New] Optimize FBGEMM Triton MX4 Quantize-Dequantize (#2838, #2837)
  • [New] Rounding Mode Support (#2821, #2816, #2933, #2859 )
  • [New] FBGEMM/TorchRec MX4 padding support (#3055, #3047, #3010 )
  • [New] Add Stochastic downcasting to MX4 Quantization (#2899)
  • [New] Support for other MX4 formats in Triton kernels (#2900)
  • [Improvement] Refactor MX4 Kernel to operate on flat tensors (#2836)
  • [Improvement] Optimize MX4 padding to minimize need for tuning (#3040)

Benchmarks / Tests

  • [New] Add schema compatibility test (#3130)
  • [New] Add SSD/UVM caching in TBE device benchmark (#3076)
  • [New] Add EmbeddingSpMDM8BitBenchmarkOutTypeFloat16 (#2952 )
  • [New] Add benchmark EmbeddingSpMDMNBitBenchmarkOutTypeFloat16 (#2901 )
  • [New] Add unit test for int4 to int4 sequence CPU TBE (#2997)
  • [New] Add rocm support for fp8 benchmarks (#2965)
  • [New] Add rotating buffer feature to quantize_bench (#2857)
  • [New] Benchmark of fbge...

FBGEMM_GPU v0.8.0 Release Notes

29 Jul 22:40

Release Note

Highlights

Table Batched Embedding

For GPU

  • New Table Batched Embedding (TBE) operators and momentum type support
  • New In-Training Embedding Pruning (ITEP) operators
  • VBE support for Dense TBE
  • Global weight decay support in TBE
  • New type support and improvement to SSD TBE
  • Improvements and bug fixes for TBE training and inference modules and sparse operators

For MTIA

  • MTIA support for DenseTBE

Generative AI

  • Gen AI Ops integration
  • Support for Triton-based and CUTLASS-based operators (#2552, #2537)
  • New FP8 GEMM and quantization operators
  • New query attention operators
  • New Car and All-To-All (NCCL-based) communication operators
  • AMD Support for FP8

Others

  • New MX4 quantization operators
  • Support for CUDA 12.4

Better engineering

  • Code refactoring and reorganization for faster builds
  • New tests and benchmarks
  • Improved AMD support

Software Requirements

FBGEMM_GPU v0.8.0 has been tested and known to work on the following setups:

  • PyTorch: v2.4
  • CUDA: v11.8, 12.1, 12.4
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.8.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.8.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

For GPU

  • [New] VBE support for Dense TBE (#2628, #2620, #2641)
  • [New] BF16 momentum support in PARTIAL_ROWWISE_ADAM (#2524, #2522, #2518)
  • [New] Global weight decay support (#2516, #2507, #2506)
  • [New] Multi-pass prefetch for memory efficiency (#2566)
  • [Improvement] Work around masked_select for numel > MAX_INT (#2648)
  • [Improvement] Fused optim in backward capability with aot_autograd (#2651)
  • [Improvement] Weights mutations declaration in TBE backward ops schemas (#2698)
  • [Improvement] Helper ops to support cache conflict misses (#2571)
  • [Improvement] Fixed the hang issue in some TBE GPU optimizers (#2509)
  • [Improvement] Misc TBE fixes and refactoring (#2583, #2597, #2529)
  • [Improvement] Cache prefetch and conflict miss improvements (#2596, #2514)

For MTIA

  • [New] Support MTIA in DenseTableBatchedEmbeddingBagsCodegen (#2680)

SSD Table batched embedding (TBE) operators

  • [New] Add FP16 weight and output support to SSD TBE (#2638)
  • [New] Implementation of PS KV DB for FBGEMM TBE operator (#2664, #2642)
  • [Improvement] Removal of D->H sync when calling lxu_cache_lookup (#2672)
  • [Improvement] Recording of functions in SSD TBE (#2670)
  • [Improvement] Added options, assertions and logs for training and inference SSD TBE (#2689, #2657)
  • [Improvement] SSD TBE backend fixes (#2645, #2671)

New Operator Groups

GenAI FP8 Operators

GenAI Support and Operators

  • [New] Integrated Gen AI ops into the build (#2512)
  • [New] Support for Triton-based operators (#2570, #2618)
  • [New] Support for CUTLASS-based operators (#2552, #2537)
  • [New] Car and All-To-All (NCCL-based) communication ops (#2606, #2667, #2631, #2624)
  • [New] Grouped query attention ops (#2673, #2504)
  • [New] CK BF16 GEMM (#2617)
  • [New] W4A8 GEMM kernels (#2558, #2607)

Pooled Embeddings

  • [Improvement] Clean up unused pooled embedding ops (#2626)
  • [Improvement] PyTorch compatibility fixes (#2619, #2629)

Sparse Operators

  • [Improvement] Increased dynamic shared memory size to support larger bucket sizes (#2500)
  • [Improvement] UINT8 support for reorder sequence embedding operator (#2531)
  • [Improvement] Fixed CPU blocking D2H in JaggedIndexSelect2dOp backward (#2510)

Benchmarks / Tests

  • [New] Unified benchmarks and unit tests for FP8 (#2609, #2699, #2666)
  • [Improvement] SSD TBE benchmarks (#2579, #2580)
  • [Improvement] SSD TBE tests (#2665, #2647)
  • [Improvement] Fixes for TBE tests and benchmarks (#2632)
  • [Improvement] nbit_cache benchmark bandwidth calculation (#2511)

Build / CI improvements and Fixes

  • [New] Support for CUDA 12.4 (#2565)
  • [Improvement] Improved AMD support (#2541, #2679)
  • [Improvement] Strengthened artifact installation process (#2491)
  • [Improvement] Memcheck added across operators (#2576, #2574, #2572, #2612, #2594, #2589, #2578)
  • [Improvement] Refactoring of large header files (#2650)
  • [Improvement] Improved build scripts to support debug flags and custom (i.e. GenAI) variants (#2702)

FBGEMM_GPU v0.7.0

26 Apr 17:53

Release Note

Highlights

  • New optimizer and output type supports for Table Batched Embedding (TBE) training
  • Improvement and bug fixes for TBE variable batch size
  • Enhanced TBE pipeline prefetching for UVM caching
  • Many improvements on TBE CPU kernels
  • New and enhanced low-precision operators
  • Code refactoring and reorganization for faster builds
  • New tests and benchmarks
  • PyTorch 2 support for various operators
  • Clang compilation support

Software Requirements

FBGEMM_GPU v0.7.0 has been tested and known to work on the following setups:

  • PyTorch: v2.3
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.7.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.7.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.7.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

  • [New] Added BF16 output support in TBE training (#2382)
  • [New] Added support for int8 output for sequence embeddings (#2316)
  • [New] Added an auto-vectorization implementation for CPU TBE-NBit kernel with user selection (#2182, #2299)
  • [New] Added CowClip optimizer (#2226, #2243)
  • [Improvement] Extended support and bug fixes for variable batch size TBE (#2256, #2388, #2394, #2333)
  • [Improvement] Optimized cache fetch for forward split (#2216, #2282, #2289, #2262, #2218)
  • [Improvement] Caching and cache lookup for pipeline prefetching fixes and enhancements (#2164, #2309, #2287, #2308)
  • [Improvement] Built hip rules by default (#2380)
  • [New] Added a method to TBE module to recompute buffers (#2338)
  • [New] Added meta functions for PyTorch 2 support (#2347)
  • [New] Added support for MTIA in TBE modules (#2273, #2286)
  • [Improvement] Improved TBE logging and stats report (#2379, #2378, #2377, #2386, #2337)
  • [Improvement] General fixes and enhancements (#2235, #2398, #2212, #2269, #1782, #2270, #2265, #2385, #2370, #2349, #2312, #2411, #2400)
  • [Deprecation] Optimizers deprecated (#2253, #2252)
  • [Deprecation] Removed double type support from fbgemm_cuda_utils.cuh (#2335)
  • [Deprecation] Removed INT8 weight/output support from TBE GPU training

Jagged Tensor Operators

  • [Improvement] Removed device-host synchronization from keyed jagged index select (#2315)
  • [Improvement] Fixed half->int build error (#2240)

Index Select Operators

  • [Improvement] Fixed BF16 group_index_select_2d on AMD GPU (#2321)

Low-precision operators

  • [New] CPU implementation of per-channel quantize operator (#2341)
  • [New] CPU implementation for qlinear_channelwise operator (#2343)
  • [New] Enabled CPU int8 output to dequantization to bf16 on CUDA (#2242)
  • [New] Enabled dequantization for bf16 (#2241)

Pooled Embedding

  • [Improvement] Used gpu_library_selector for permute_pooled_embedding_ops_gpu (#2340)

Misc

  • [New] Implementation of CPU version of all_to_one_device (#2251)
  • [Improvement] Performance improvement of _block_bucketize_sparse_features_cuda_kernel1 (#2331)
  • [New] Created cumem_utils_cpu and added to all_deps_cpu (#2215)
  • [New] Added float support to asynchronous_complete_cumsum_cpu (#2383)
  • [Improvement] Added early exit to sparse ops (#2277, #2276, #2213, #2259)
  • [New] STBE GPU coalescing kernel (#2275)
  • [Improvement] Removed symint from tbe_input_combine_with_length_abstract (#2336)
  • [New] GPU timing and basic reporting framework (#2314)
  • [Improvement] Fixes and FBGEMM PT2 compliance (#2223, #2224, #2225, #2231, #2327)

Benchmarks / Tests

Build / CI improvements and Fixes

FBGEMM_GPU v0.6.0

31 Jan 19:40

Release Note

Highlights

  • Improvement and bug fixes for TBE variable batch size
  • Many TBE extensions and benchmarks
  • Enhanced TBE pipeline prefetching for UVM caching
  • Code refactoring and reorganization for faster builds
  • Many improvements and new sparse ops added
  • Improved low precision ops
  • Support for Python 3.12
  • PyTorch 2 support for various operators

Software Requirements

FBGEMM_GPU v0.6.0 has been tested and known to work on the following setups:

  • PyTorch: v2.2
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.6.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

  • [Improvement] Extended support and bug fixes for variable batch size (#2012, #2043, #2107, #2150, #2188)
  • [Improvement] Caching and cache lookup for pipeline prefetching (#2147, #2154, #2151)
  • [New] Support MTIA device type in FBGEMM TBE training (#1994)
  • [New] Enable sequence TBE CPU via AVX (#2195)
  • [New] Enable subwarp only for unweighted (#2051)
  • [New] Add meta functions (#2094, #2102)
  • [New] Add reverse qparam option for MTIA (#2109)
  • [New] uvm_cache_stats for direct mapped (#1951, #1952)
  • [Improvement] use memcpy for cpu emb inplace update (#2166)
  • [Improvement] Remove indices and offsets copying from prefetch (#2186)
  • [Improvement] Improve perf for L=0 cases for TBE v2 (#2046)
  • [Improvement] General fixes and enhancements (#2030, #2009)

Jagged Tensor Operators

  • [Improvement] Fix incorrect SymInt signature on dense_to_jagged (#2039)
  • [Improvement] Fix non-contiguous tensor problem in jagged_index_select (#2060, #2061)

Index Select Operators

  • [Improvement] Get total D from CPU buffer in batch_index_select_dim0 (#2079)

Low-precision operators

  • [New] Add BF16 in padded FP8 quantize ops (#2010)
  • [Improvement] Improve quantize_comm error message (#2018)
  • [Improvement] Fix illegal memory access error and initialize empty values on fp8 quantize kernel (#2131, #2176)

Pooled Embedding

  • [New] Add permute_duplicate_pooled_embeddings op for CPU (#1939)
  • [Improvement] Use PyTorch's p2p access enable function (#2000)
  • [New] Add support for duplicate in permutations for permute_pooled_embs_split (#1940)
  • [Improvement] Improve all_to_one error message (#2019)
  • [New] Add meta function for fbgemm::merge_pooled_embeddings operator (#2069)
  • [New] Add variable batch per feature support to EBC (tw/cw only) (#1986)

Misc

Benchmarks / Tests

  • [New] Benchmark block_bucketize_sparse_features uneven sharding (#2140, #2169)
  • [New] Add unit test for unique cache lookup (#2160)
  • [New] Add autogenerated opcheck tests (#2050, #2069, #2073, #2092, #2118, #2139, #2152, #2173, #2193)
  • [New] Add test for fbgemm ops. (#2136, #2082)
  • [Improvement] Modified TBE testbench to use FBGEMM generate_requests function to generate indices and offsets (#1882)
  • [Improvement] Remove FP64 from TBE CPU tests (#2049)
  • [Improvement] Add warmup_runs to TBE benchmarks and run at least 1 warmup iter (#2163)
  • [Improvement] Add --pooling in TBE nbit_cpu benchmark (#2200)
  • [Improvement] Fill embedding tables with randomized scales and bias in split-TBE benchmarks (#2031)

Build / CI improvements and Fixes

FBGEMM_GPU v0.5.0

05 Oct 23:52

Release Notes

Highlights

  • TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
  • Many TBE extensions including defused TBE backward-optimizer, variable batch size support, pipeline prefetching support for UVM caching
  • Many improvements and new sparse ops added
  • ARM support
  • SM 9.0 support for CUDA 12.1 for H100 GPUs
  • PyTorch 2 support for various operators, e.g., jagged tensor and pooled embedding ops (a small jagged tensor example follows below)
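
As an illustration of the operator families listed above, the sketch below pads a jagged batch into a dense tensor using one of the jagged tensor ops; the jagged_to_padded_dense name and its (values, [offsets], max_lengths, padding_value) argument layout are assumptions based on current fbgemm_gpu builds.

# Hedged sketch: pad a jagged batch to a dense tensor (runs on CPU)
import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm operators)

# Two "bags" of embedding rows with lengths 2 and 3, stored contiguously (jagged layout).
values = torch.arange(10, dtype=torch.float32).reshape(5, 2)  # (total_rows=5, D=2)
offsets = torch.tensor([0, 2, 5])                             # (B+1,) bag boundaries
dense = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [3], 0.0)
print(dense.shape)  # (2, 3, 2); the shorter bag is padded with 0.0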

Software Requirements

FBGEMM_GPU v0.5.0 has been tested and known to work on the following setups:

  • PyTorch: v2.1
  • CUDA: v11.8, 12.1
  • Python: v3.8, 3.9, 3.10, 3.11

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table batched embedding (TBE) operators

Jagged Tensor Operators

Index Select Operators

  • [New] batch_index_select_dim0 with TBE backend (#1897)
  • [New] Variable input sizes support for group_index_select_dim0 (#1968)
  • [Improvement] Improve group_index_select (#1764, #1884)

Low-precision operators

  • [New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
  • [New] Column-wise parallel quantization/dequantization (#1743)
  • [New] BF16 Support in FP8 quantize ops (#1961)
  • [Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)

Pooled Embedding

  • [New] reduce_to_one (#1571)
  • [New] permute_duplicate_pooled_embeddings op (#1912)
  • [New] BF16 support for permute_pooled_embeddings op (#1937)
  • [New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
  • [New] Backends (Meta) (#1853)
  • [Improvement] multi-gpu all_to_one enhancements (#1674, #1962)

Misc

  • [New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
  • [New] Backends (Meta) (#1709, #1905, #1970, #1971)
  • [New] BF16 support in permute_indices_weights_kernel_2 (#1852)
  • [New] FP16 and BF16 support in pack_segments (#1708)
  • [New] BF16 support for HBC ops. (#1744)
  • [New] BFloat16 support (#1832, #1865)
  • [Improvement] Speedup reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)

Benchmarks / Tests

  • [New] CLI support to GEMMsBenchmark (#1721, #1725)
  • [New] Benchmark for variable batch on TBE (#1559)
  • [New] BF16 output test coverage (#1835, #1838)
  • [New] Benchmark for reorder_batched_ad_indices (#1895)
  • [New] CPU support (#1874, #1926)
  • [Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
  • [Improvement] Add nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)

Build / CI improvements and Fixes

FBGEMM_GPU v0.4.1

24 Mar 23:37

Release Notes

Software Requirements

FBGEMM_GPU v0.4.1 has been tested and known to work on the following setups:

  • PyTorch: v2.0
  • CUDA: v11.7, 11.8
  • Python: v3.8, 3.9, 3.10, 3.11

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU may be fetched directly from PyPI:

# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.1

# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.1

Changes

This is a minor release whose main purpose is to deliver Python 3.11 support.

  • [New] Add support for Python 3.11 (#1646)
  • [Improvement] Add support for group size > 54 in group_index_select (#1611)
  • [Improvement] Implement cache miss emulation in UVM_CACHING (#1637)
  • [Improvement] Add TensorAccessor with memcheck (#1602)

FBGEMM_GPU v0.4.0

15 Mar 17:08

Release Notes

Software Requirements

FBGEMM_GPU v0.4.0 has been tested and known to work on the following setups:

  • PyTorch: v2.0
  • CUDA: v11.7, 11.8
  • Python: v3.8, 3.9, 3.10 (3.11 not supported yet)

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.

Availability

FBGEMM_GPU may be fetched directly from PyPI:

# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0

# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0

Changes

Table batched embedding (TBE) operators

UVM cache improvement

  • [New] Delta in-place update (#1436)
  • [New] UVM caching stats report (#1623, #1462, #1433, #1570)
  • [Improvement] [lfu|lru]_cache_insert_byte_kernel vectorization (#1475)

Jagged Tensor Operators

Index Select Operators

  • [New] group_index_select (#1421, #1592)
  • [New] index_select for selecting KeyJaggedTensor dim 1 (previously supported only dim 0) (#1429)
  • [New] jagged_index_select for CPU (#1586)

Low-precision operators

  • [New] FP8 rowwise quantized communication (#1423)

Misc

  • Support 2D inputs for asynchronous_complete_cumsum (#1573)

Benchmarks / Tests

  • [New] nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
  • [New] Variable bag sizes for TBE benchmark (#1450)
  • [Improvement] Parallel bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)

Build / CI improvements and Fixes

v0.3.2

19 Jan 22:22

Minor release