The `llm` ecosystem of crates, including `llm`, `llm-base`, and `ggml`, supports various acceleration backends, selectable via `--features` flags. The set of supported backends varies by platform, and these crates can only be built with a single active acceleration backend at a time. If CuBLAS and CLBlast are both specified, CuBLAS takes priority and CLBlast is ignored.
Platform/OS | `cublas` | `clblast` | `metal` |
---|---|---|---|
Windows | ✔️ | ✔️ | ❌ |
Linux | ✔️ | ✔️ | ❌ |
MacOS | ❌ | ❌ | ✔️ |
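The platform table and the CuBLAS-over-CLBlast priority rule can be written down as a small lookup. This is only an illustrative sketch: the `Platform` and `Backend` enums and both function names are hypothetical and are not part of the `llm` crates, where the actual selection happens at build time through Cargo features.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cublas,
    Clblast,
    Metal,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Platform {
    Windows,
    Linux,
    MacOs,
}

/// Availability per platform, transcribed from the table above.
fn is_available(platform: Platform, backend: Backend) -> bool {
    match (platform, backend) {
        (Platform::MacOs, Backend::Metal) => true,
        (Platform::MacOs, _) => false,
        (_, Backend::Metal) => false,
        (_, _) => true,
    }
}

/// Which BLAS backend ends up active when the `cublas` and/or `clblast`
/// features are requested: CuBLAS wins and CLBlast is ignored.
fn resolve_blas_backend(cublas: bool, clblast: bool) -> Option<Backend> {
    if cublas {
        Some(Backend::Cublas)
    } else if clblast {
        Some(Backend::Clblast)
    } else {
        None
    }
}
```

For instance, `resolve_blas_backend(true, true)` yields `Some(Backend::Cublas)`, mirroring the rule that CLBlast is ignored when both are requested.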
To activate GPU support (assuming you have enabled one of the features above), set the `use_gpu` attribute of `ModelParameters` to `true`.

- CLI Users: You can enable GPU support by adding the `--use-gpu` flag.
- Backend Consideration: If you use the `cublas` or `clblast` backends, you can specify the number of layers to offload to your GPU with the `gpu_layers` parameter in `ModelParameters`. By default, all layers are offloaded. However, if your model size exceeds your GPU's VRAM, you can specify a limit, such as `20`, to offload only the first 20 layers. CLI users can do this with the `--gpu-layers` parameter.
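One way to pick a layer limit is a rough back-of-the-envelope estimate. The sketch below assumes every layer takes about the same amount of memory, which is a simplification; the function names are hypothetical and not part of `llm`, and returning `None` to mean "offload everything" is an illustrative choice. Real models also need VRAM for scratch buffers and the context, so pass a budget below your card's total.

```rust
/// Estimate how many layers fit into `vram_budget_bytes`.
/// Assumption: all layers are the same size, i.e. `model_bytes / n_layers` each.
fn layers_that_fit(model_bytes: u64, n_layers: u32, vram_budget_bytes: u64) -> u32 {
    if n_layers == 0 || model_bytes == 0 {
        return 0;
    }
    // Round the per-layer size up so the estimate errs on the safe side.
    let per_layer = (model_bytes + n_layers as u64 - 1) / n_layers as u64;
    let fit = vram_budget_bytes / per_layer;
    fit.min(n_layers as u64) as u32
}

/// Turn the estimate into a layer limit: `None` when everything fits
/// (keep the default of offloading all layers), otherwise `Some(limit)`.
fn suggested_gpu_layers(model_bytes: u64, n_layers: u32, vram_budget_bytes: u64) -> Option<u32> {
    let fit = layers_that_fit(model_bytes, n_layers, vram_budget_bytes);
    if fit >= n_layers {
        None
    } else {
        Some(fit)
    }
}
```

The resulting number is what you would pass as `gpu_layers` or `--gpu-layers`; treat it as a starting point and lower it if the model still fails to load.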
Example: To run a `llama` model with CUDA acceleration and offload all its layers, your CLI command might resemble:
cargo run --release --features cublas -- infer -a llama -m [path/to/model.bin] --use-gpu -p "Help a llama is standing in my garden!"
💡 Protip: If you have ample VRAM and use `cublas` or `clblast`, you can significantly reduce the time it takes to feed your prompt by increasing the batch size, for example to `256` or `512` (the default is `8`).

- Programmatic users of `llm` can adjust this by setting the `n_batch` parameter in the `InferenceSessionConfig` when initializing a session.
- CLI users can use the `--batch-size` parameter to achieve this.
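To see why a larger batch size helps, it is enough to count evaluation calls. The sketch below assumes the prompt is fed in chunks of at most `n_batch` tokens, one evaluation per chunk; `prompt_evaluations` is a hypothetical helper for illustration, not an `llm` function.

```rust
/// Number of evaluation calls needed to feed a prompt, assuming it is
/// split into chunks of at most `n_batch` tokens (one call per chunk).
fn prompt_evaluations(prompt_tokens: usize, n_batch: usize) -> usize {
    assert!(n_batch > 0, "batch size must be at least 1");
    // Ceiling division: a partial final chunk still costs one call.
    (prompt_tokens + n_batch - 1) / n_batch
}
```

Under that assumption, a 1000-token prompt takes 125 evaluations at the default batch size of 8, but only 4 at 256 and 2 at 512. Each call is larger, so the actual speedup depends on how well your GPU handles the bigger batches.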
Each accelerator supports only certain model architectures. Architectures not marked as supported may still function, but their performance is not guaranteed, since it depends on the operations used by the model's architecture. The table below lists models with confirmed compatibility for each accelerator:
Model/accelerator | `cublas` | `clblast` | `metal` |
---|---|---|---|
LLaMA | ✅ | ✅ | ✅ |
MPT | ❌ | ❌ | ❌ |
Falcon | ❌ | ❌ | ❌ |
GPT-NeoX | ❌ | ❌ | ❌ |
GPT-J | ✅ | ❌ | ❌ |
GPT-2 | ❌ | ❌ | ❌ |
BLOOM | ❌ | ❌ | ❌ |
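If you want to warn users before they combine an architecture with an unconfirmed accelerator, the table can be encoded directly. The enums and `is_confirmed` below are hypothetical names for illustration, not types from the `llm` crates, and a `false` result only means "not confirmed", not "will fail".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Accelerator {
    Cublas,
    Clblast,
    Metal,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Architecture {
    Llama,
    Mpt,
    Falcon,
    GptNeoX,
    GptJ,
    Gpt2,
    Bloom,
}

/// Confirmed combinations, transcribed from the table above:
/// LLaMA on every accelerator, GPT-J on CuBLAS only.
fn is_confirmed(arch: Architecture, acc: Accelerator) -> bool {
    matches!(
        (arch, acc),
        (Architecture::Llama, _) | (Architecture::GptJ, Accelerator::Cublas)
    )
}
```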
To build with acceleration support, certain dependencies must be installed. Which ones you need depends on your platform and the acceleration backend you choose.
For developers who want to distribute packages with acceleration support, our CI/CD setup is a useful starting point.
On Windows, CuBLAS requires CUDA to be installed. You can download CUDA from the official Nvidia site.
CLBlast can be installed via vcpkg using the command `vcpkg install clblast`. After installation, set the `OPENCL_PATH` and `CLBLAST_PATH` environment variables to the `opencl_x64-windows` and `clblast_x64-windows` directories respectively.
Here's an example of the required commands:
git clone https://github.com/Microsoft/vcpkg.git
.\vcpkg\bootstrap-vcpkg.bat
.\vcpkg\vcpkg install clblast
set OPENCL_PATH=....\vcpkg\packages\opencl_x64-windows
set CLBLAST_PATH=....\vcpkg\packages\clblast_x64-windows
When building with MSVC on Windows, set the `-Ctarget-feature=+crt-static` Rust flag. This flag enables static linking of the C runtime, which can be essential for certain deployment scenarios or runtime environments.
To set this flag, you can modify the `.cargo\config` file in your project directory by adding the following configuration snippet:
[target.x86_64-pc-windows-msvc]
rustflags = ["-Ctarget-feature=+crt-static"]
This will ensure the Rust flag is appropriately set for your compilation process.
For a comprehensive guide on the usage of Rust flags, including other possible ways to set them, please refer to this detailed StackOverflow discussion. Make sure to choose an option that best fits your project requirements and development environment.
For `llm` to function properly, it requires the `clblast.dll` and `OpenCL.dll` files. These files can be found in the `bin` subdirectory of their respective vcpkg packages. There are two ways to ensure `llm` can access them:

- Amend your `PATH` environment variable to include the `bin` directories of each respective package.
- Manually copy the `clblast.dll` and `OpenCL.dll` files into the `./target/release` or `./target/debug` directory, depending on the profile that was active during compilation.

Choose the option that best suits your needs and environment configuration.
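If you go with the manual copy, the step is easy to automate. The following is a minimal sketch using only the standard library; `copy_runtime_dlls` is a hypothetical helper, and the source paths you pass in depend on where your vcpkg checkout lives.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copy each DLL in `dlls` into `target_dir` (for example `./target/release`),
/// keeping the file names. Returns the destination paths.
fn copy_runtime_dlls(dlls: &[PathBuf], target_dir: &Path) -> io::Result<Vec<PathBuf>> {
    // The directory normally exists after a build; creating it is harmless.
    fs::create_dir_all(target_dir)?;
    let mut copied = Vec::new();
    for dll in dlls {
        let name = dll.file_name().ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "DLL path has no file name")
        })?;
        let dest = target_dir.join(name);
        fs::copy(dll, &dest)?;
        copied.push(dest);
    }
    Ok(copied)
}
```

You would call it with the paths to `clblast.dll` and `OpenCL.dll` inside the `bin` subdirectories of the two vcpkg packages, and with the target directory that matches your build profile.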
You need to have CUDA installed on your system. CUDA can be downloaded and installed from the official Nvidia site. On Linux distributions that do not have `CUDA_PATH` set, the environment variables `CUDA_INCLUDE_PATH` and `CUDA_LIB_PATH` can be set to their corresponding paths.
CLBlast can be installed on Linux through various package managers. For example, with `apt` you can install it via `sudo apt install clblast`. After installation, make sure that the `OPENCL_PATH` and `CLBLAST_PATH` environment variables are correctly set. Additionally, the environment variables `OPENCL_INCLUDE_PATH`/`OPENCL_LIB_PATH` and `CLBLAST_INCLUDE_PATH`/`CLBLAST_LIB_PATH` can be used to specify the location of the files. All environment variables are supported by all listed operating systems.
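To make the relationship between the base variable and the more specific ones concrete, here is one plausible resolution order. This is an assumption for illustration, not a description of what the build script actually does: the sketch prefers `<PREFIX>_INCLUDE_PATH` and `<PREFIX>_LIB_PATH` when set and otherwise falls back to `include` and `lib` subdirectories of `<PREFIX>_PATH`. `LibraryPaths` and `resolve_paths` are hypothetical names, and the subdirectory layout is assumed as well.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, PartialEq, Eq)]
struct LibraryPaths {
    include: PathBuf,
    lib: PathBuf,
}

/// Resolve include and library directories for `prefix` (e.g. "CLBLAST").
/// `lookup` abstracts over the environment so the logic can be tested.
/// Assumed precedence: specific variable first, then `<PREFIX>_PATH`.
fn resolve_paths(prefix: &str, lookup: impl Fn(&str) -> Option<String>) -> Option<LibraryPaths> {
    let base = lookup(&format!("{prefix}_PATH")).map(PathBuf::from);
    let include = lookup(&format!("{prefix}_INCLUDE_PATH"))
        .map(PathBuf::from)
        .or_else(|| base.as_ref().map(|b| b.join("include")))?;
    let lib = lookup(&format!("{prefix}_LIB_PATH"))
        .map(PathBuf::from)
        .or_else(|| base.as_ref().map(|b| b.join("lib")))?;
    Some(LibraryPaths { include, lib })
}
```

Against the real environment you would call `resolve_paths("CLBLAST", |key| std::env::var(key).ok())`. A `None` result means neither the base variable nor the specific one was set for at least one of the two directories.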
Xcode and the associated command-line tools should be installed on your system, and you should be running a version of MacOS that supports Metal. For more detailed information, please consult the official Metal documentation.
To enable Metal using the CLI, ensure it was built successfully using `--features=metal` and then pass the `--use-gpu` flag.
The current underlying implementation of Metal in GGML is still in flux and has some limitations:
- Evaluating a model with more than one token at a time is not currently supported in GGML's Metal implementation. An `llm` inference session will fall back to the CPU implementation (typically during the 'feed prompt' phase) but will automatically use the GPU once a single token is passed per evaluation (typically after prompt feeding).
- Not all model architectures will be equally stable when used with Metal due to ongoing work in the underlying implementation. Expect `llama` models to work fine, though.
- With Metal, it is possible but not required to use `mmap`. However, since buffers do not need to be copied to VRAM on M1, `mmap` is the most efficient option.
- Debug messages may be logged by the underlying GGML Metal implementation. This will likely go away in the future for release builds of `llm`.
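The first limitation amounts to a simple rule: evaluations with more than one token run on the CPU, single-token evaluations run on the GPU. The sketch below models that rule and, as an additional assumption, feeds the prompt in chunks of at most `n_batch` tokens. `Device` and both functions are hypothetical names for illustration, not part of `llm`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Gpu,
}

/// With Metal, multi-token evaluations fall back to the CPU;
/// single-token evaluations use the GPU.
fn device_for_evaluation(n_tokens: usize) -> Device {
    if n_tokens > 1 {
        Device::Cpu
    } else {
        Device::Gpu
    }
}

/// Devices used while feeding a prompt, assuming it is split into
/// chunks of at most `n_batch` tokens (one evaluation per chunk).
fn prompt_feed_devices(prompt_tokens: usize, n_batch: usize) -> Vec<Device> {
    assert!(n_batch > 0, "batch size must be at least 1");
    let mut devices = Vec::new();
    let mut remaining = prompt_tokens;
    while remaining > 0 {
        let chunk = remaining.min(n_batch);
        devices.push(device_for_evaluation(chunk));
        remaining -= chunk;
    }
    devices
}
```

Under these assumptions, a 16-token prompt with a batch size of 8 is evaluated entirely on the CPU, while every token generated afterwards is a single-token evaluation and runs on the GPU.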