refactor kv cache ops
Summary:
X-link: facebookresearch/FBGEMM#94

Simplify the kv cache code and reuse the APIs.

Differential Revision: D61082542
jianyuh authored and facebook-github-bot committed Aug 15, 2024
1 parent 7105d5f commit 8534573
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion fbgemm_gpu/experimental/gen_ai/README.md
@@ -6,13 +6,15 @@ FBGEMM FP8 rowwise quantization kernels have been officially adopted in the [Lla

FBGEMM GenAI FP8 supports a variety of configurations:

-* GEMM Operators: {hipBLASLt, CK, Triton} x {BF16, FP8} x {tensor-wise, row-wise, block-wise} x {Nvidia H100, AMD MI300x}.
+* GEMM Operators: {CUTLASS, CK, Triton} x {BF16, FP8} x {tensor-wise, row-wise, block-wise} x {Nvidia H100, AMD MI300x}.
* High/low Precision Conversion Kernels: (FP32 / BF16 <-> FP8) with scaling options {tensor-wise, row-wise, block-wise} across hardware platforms {Nvidia H100, AMD MI300x} and programming options of {Triton, CUDA/HIP}.

Besides FP8 support, FBGEMM GenAI operators also support:

* Customized AllReduce communications (reduce latency for small message sizes).
* GQA: optimized specifically for decoding cases, as detailed in PyTorch's blog on [INT4 decoding](https://pytorch.org/blog/int4-decoding/).
* KV cache quantizations.
* Rotary Positional Embedding (RoPE).

## **1.1 FP8 core API functions**

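For context on the FP8 path the README bullets above describe, the row-wise flow is: quantize activations and weights to FP8 with one scale per row, then call the row-wise FP8 GEMM op. The sketch below is illustrative only and is not part of this commit's diff; the op names `quantize_fp8_row` and `torch.ops.fbgemm.f8f8bf16_rowwise` and their signatures are assumptions based on the FBGEMM GenAI documentation and may differ across versions.

```python
# Minimal sketch of the FP8 row-wise quantize + GEMM flow described above.
# Op names and signatures are assumptions and may vary between FBGEMM versions.
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import quantize_fp8_row

x = torch.randn(512, 1024, device="cuda", dtype=torch.bfloat16)   # activations [M, K]
w = torch.randn(2048, 1024, device="cuda", dtype=torch.bfloat16)  # weights [N, K]

# Row-wise FP8 quantization: returns the FP8 tensor plus one scale per row.
xq, x_scale = quantize_fp8_row(x)
wq, w_scale = quantize_fp8_row(w)

# Row-wise scaled FP8 GEMM producing a BF16 output of shape [512, 2048].
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
```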
