FBGEMM_GPU v0.4.0

@q10 released this 15 Mar 17:08

Release Notes

Software Requirements

FBGEMM_GPU v0.4.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.0
  • CUDA: v11.7, 11.8
  • Python: v3.8, 3.9, 3.10 (3.11 not supported yet)
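A quick pre-install check of the interpreter against the Python versions listed above might look like the following. This is only a sketch: the helper name is_supported_python is hypothetical and not part of FBGEMM_GPU, and the check covers the Python version only, not PyTorch or CUDA.

```python
import sys

# Python (major, minor) versions listed as tested for FBGEMM_GPU v0.4.0
# in these release notes. 3.11 is explicitly not supported yet.
SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10)}


def is_supported_python(version_info=sys.version_info):
    """Return True if the (major, minor) pair is one of the tested versions.

    Hypothetical helper for illustration; not an FBGEMM_GPU API.
    """
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


if __name__ == "__main__":
    status = "tested" if is_supported_python() else "not listed as tested"
    major, minor = sys.version_info[0], sys.version_info[1]
    print(f"Python {major}.{minor} is {status} for FBGEMM_GPU v0.4.0")
```

Running the script inside the Conda or Docker environment before installing shows whether the interpreter falls in the tested range.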

We recommend installing and running FBGEMM_GPU in an isolated environment, such as Conda and/or Docker.

Availability

FBGEMM_GPU may be fetched directly from PyPI:

# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0

# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0

Changes

Table batched embedding (TBE) operators

UVM cache improvement

  • [New] Delta in-place update (#1436)
  • [New] UVM caching stats report (#1623, #1462, #1433, #1570)
  • [Improvement] [lfu|lru]_cache_insert_byte_kernel vectorization (#1475)

Jagged Tensor Operators

Index Select Operators

  • [New] group_index_select (#1421, #1592)
  • [New] index_select for selecting KeyJaggedTensor dim 1 (previously only dim 0 was supported) (#1429)
  • [New] jagged_index_select for CPU (#1586)
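For readers unfamiliar with jagged index selection, the idea can be sketched in plain Python as below. This is a reference for the semantics only, under the assumption that a jagged tensor is stored as a flat values list plus per-row lengths; the function here is an illustrative stand-in and does not reproduce the FBGEMM_GPU operator's signature or tensor types.

```python
def jagged_index_select(values, lengths, indices):
    """Select whole rows of a jagged tensor by row index.

    values  -- flat list holding all rows back to back
    lengths -- lengths[i] is the number of elements in row i
    indices -- row indices to gather, in output order

    Illustrative pure-Python reference; not the FBGEMM_GPU API.
    """
    # Convert lengths to offsets so row i spans values[offsets[i]:offsets[i + 1]].
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)

    out_values, out_lengths = [], []
    for i in indices:
        out_values.extend(values[offsets[i]:offsets[i + 1]])
        out_lengths.append(lengths[i])
    return out_values, out_lengths


if __name__ == "__main__":
    # Rows: [1, 2], [3], [4, 5, 6]; gather row 2, then row 0.
    print(jagged_index_select([1, 2, 3, 4, 5, 6], [2, 1, 3], [2, 0]))
```

The output keeps the jagged layout: a flat values list for the selected rows together with their lengths.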

Low-precision operators

  • [New] FP8 rowwise quantized communication (#1423)

Misc

  • Support 2D inputs for asynchronous_complete_cumsum (#1573)
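As a rough illustration of what a "complete" cumulative sum computes, the sketch below prepends a zero to the running sum, so N lengths become N + 1 offsets. It assumes the 2D case applies the same operation to each row independently; the function names are hypothetical, and the code is a plain-Python reference rather than the FBGEMM_GPU operator itself.

```python
from itertools import accumulate


def complete_cumsum_1d(lengths):
    """Return [0, l0, l0 + l1, ...]: N inputs yield N + 1 offsets."""
    return [0] + list(accumulate(lengths))


def complete_cumsum_2d(rows):
    """Apply the complete cumsum to each row independently.

    Assumption: this mirrors how a 2D input would be handled, one row
    at a time along the last dimension.
    """
    return [complete_cumsum_1d(row) for row in rows]


if __name__ == "__main__":
    print(complete_cumsum_1d([2, 0, 3]))
    print(complete_cumsum_2d([[1, 2], [3, 4]]))
```

The leading zero is what makes the result usable directly as offsets into a flat values buffer.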

Benchmarks / Tests

  • [New] nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
  • [New] Variable bag sizes for TBE benchmark (#1450)
  • [Improvement] Parallel bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)

Build / CI improvements and Fixes