This repo provides demos and packages for fast inference of BLOOM. Some of the solutions have their own repos, in which case a link to the corresponding repo is provided instead.
Some of the solutions provide both half-precision and int8-quantized variants.
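The half-precision and int8-quantized variants differ in how the model weights are stored. As a rough, illustrative sketch of what symmetric int8 weight quantization does (this is not the code used by any of the solutions listed here, just the general idea):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: the scale maps the largest
    # absolute weight to 127, then each weight is rounded to int8.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 tensor from the int8 codes.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.003, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Storing `q` instead of `w` halves memory again relative to fp16, at the cost of a small rounding error bounded by half the scale.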
Solutions developed to perform large batch inference locally:
Pytorch:
Thomas Wang is working on a Custom Fused Kernel solution - a link will be added once it's ready for general use.
JAX:
Solutions developed to be used in server mode (i.e. varied batch size, varied request rate):
Pytorch:
Rust:
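As a rough illustration of the "varied batch size" constraint that server-mode solutions handle (this is a minimal dynamic-batching sketch, not taken from any of the solutions above):

```python
from collections import deque

def dynamic_batches(requests, max_batch_size):
    # Group whatever requests are currently pending into batches of at
    # most max_batch_size, so the effective batch size varies with load.
    pending = deque(requests)
    while pending:
        batch = [pending.popleft()
                 for _ in range(min(max_batch_size, len(pending)))]
        yield batch

batches = list(dynamic_batches(["a", "b", "c", "d", "e"], 2))
# → [["a", "b"], ["c", "d"], ["e"]]
```

A real server additionally caps how long a request may wait before a partial batch is flushed, trading latency against throughput.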