FBGEMM_GPU v0.4.0

q10 released this 15 Mar 17:08


Release Notes

Software Requirements

FBGEMM_GPU v0.4.0 has been tested and is known to work on the following setups:

PyTorch: v2.0
CUDA: v11.7, 11.8
Python: v3.8, 3.9, 3.10 (3.11 not supported yet)

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
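A quick way to confirm that an interpreter falls inside the tested range is a small check like the one below. This is a minimal sketch using only the standard library; the helper name `is_supported_python` and the constant `SUPPORTED_PYTHON` are hypothetical, and the version set simply restates the Python versions listed above.

```python
import sys

# Python minor versions listed as tested for FBGEMM_GPU v0.4.0
# (3.11 is not supported yet). Hypothetical constant for illustration.
SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) pair is in the tested set."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


if __name__ == "__main__":
    status = "in the tested set" if is_supported_python() else "not in the tested set"
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {status}")
```

Running this inside the Conda environment or Docker image before installing can save a failed install on an untested interpreter.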

Availability

FBGEMM_GPU may be fetched directly from PyPI:

# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0

# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0
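
After installation, the installed variant and its version can be checked from Python. The following is a minimal sketch that relies only on the standard-library `importlib.metadata` module (Python 3.8+); the distribution names are taken from the pip commands above, and the helper name `installed_fbgemm_versions` is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
DISTRIBUTIONS = ("fbgemm-gpu", "fbgemm-gpu-cpu")


def installed_fbgemm_versions(names=DISTRIBUTIONS):
    """Map each installed distribution name to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This variant is not installed in the current environment.
            continue
    return found


if __name__ == "__main__":
    versions = installed_fbgemm_versions()
    if not versions:
        print("No FBGEMM_GPU distribution found")
    for name, version in versions.items():
        print(f"{name}=={version}")
```

An empty result means neither variant is visible to the current interpreter, which usually points to the wrong environment being active.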

Changes

Table batched embedding (TBE) operators

[New] SSD for inference TBE (#1473, #1479, #1485, #1517, #1533, #1535)
[New] Inplace TBE update (#1480, #1482, #1492, #1529)
[New] BF16 support for inference TBE (#1498, #1503)
[New] BF16 support for TBE on CPU (#1540, #1583)
[Improvement] Training TBE backward performance improvement (#1563)

UVM cache improvement

[New] Delta in-place update (#1436)
[New] UVM caching stats report (#1623, #1462, #1433, #1570)
[Improvement] [lfu|lru]_cache_insert_byte_kernel vectorization (#1475)

Jagged Tensor Operators

[New] Backends (Meta and Autograd) (#1461, #1466, #1467, #1469, #1468, #1477, #1556)
[New] BF16 support (#1472, #1560)
[New] FP32 + BF16 hybrid support for jagged_dense_dense_elementwise_add_jagged (#1487)
[New] Jagged tensors with no inner dense dimension support (#1267)
[New] New jagged tensor operators (#1557, #1577, #1578, #1579, #1594, #1595)

Index Select Operators

[New] group_index_select (#1421, #1592)
[New] index_select for selecting KeyJaggedTensor dim 1 (previously only dim 0 was supported) (#1429)
[New] jagged_index_select for CPU (#1586)

Low-precision operators

[New] FP8 rowwise quantized communication (#1423)

Misc

Support 2D inputs for asynchronous_complete_cumsum (#1573)
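
For readers unfamiliar with the operator, the sketch below illustrates what a complete cumulative sum computes, written in pure Python. It assumes that the operator prepends a zero and accumulates along the last dimension, so an input row of length N yields N + 1 values, and that a 2D input is processed row by row. The function names are hypothetical and this is a reference illustration only, not the FBGEMM_GPU implementation.

```python
from itertools import accumulate


def complete_cumsum_1d(values):
    """Return [0, v0, v0+v1, ...]: one element longer than the input."""
    return [0] + list(accumulate(values))


def complete_cumsum_2d(rows):
    """Apply the 1D complete cumsum independently to each row."""
    return [complete_cumsum_1d(row) for row in rows]


# Typical use: turn per-row lengths into offsets.
lengths = [2, 0, 3]
offsets = complete_cumsum_1d(lengths)
print(offsets)  # [0, 2, 2, 5]
```

With a 2D input, each row of lengths gets its own offsets row, so a batch of shape (B, N) would map to a result of shape (B, N + 1) under the assumption stated above.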

Benchmarks / Tests

[New] nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
[New] Variable bag sizes for TBE benchmark (#1450)
[Improvement] Parallel bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)

Build / CI improvements and Fixes

[New] Linter integration (#1427)
[Improvement] General CI and build system enhancement (#1444, #1407, #1541, #1542, #1544, #1546, #1549, #1562, #1568, #1589, #1603, #1598, #1606, #1619, #1627, #1631, #1635)
[Improvement] AMD GPU CI and build system enhancement (#1537, #1552, #1543)
