diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 50cfc67dce4f..259249fd0b37 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,4 +1,6 @@
-**Which issue is resolved by this Pull Request:**
+# Changes
+
+**Which issue is resolved by this Pull Request:**
Resolves #

**Description of your changes:**
diff --git a/docs/README.md b/docs/README.md
index b8cffb7d825d..8f65baecd6b1 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,10 +1,17 @@
-Workflow figure is generated using [PlantUML](https://plantuml.com/ditaa) with the [ditaa](https://ditaa.sourceforge.net).
-To generate it yourself, the easiest way is to install the [PlantUML plugin in VS Code](https://marketplace.visualstudio.com/items?itemName=jebbs.plantuml) (with the prerequisite installed), open the file and click preview.
+# Workflow PlantUML
+
+The workflow figure is generated using [PlantUML](https://plantuml.com/ditaa) with
+[ditaa](https://ditaa.sourceforge.net).
+To generate it yourself, the easiest way is to install the
+[PlantUML plugin in VS Code](https://marketplace.visualstudio.com/items?itemName=jebbs.plantuml)
+(with its prerequisites installed), open the file, and click preview.
+
+If you don't want to install the dependencies locally, you can use the following
+settings to make the preview work with a remote renderer:

-If you don't want to install the dependencies locally, you can use the following settings to make the preview work with a remote render:
```json
"plantuml.render": "PlantUMLServer",
"plantuml.server": "https://www.plantuml.com/plantuml",
```

-[ASCIIFlow](https://asciiflow.com/#/) is a helpful tool to edit the source code.
\ No newline at end of file
+[ASCIIFlow](https://asciiflow.com/#/) is a helpful tool to edit the source code.
diff --git a/docs/containerization.md b/docs/containerization.md
index 2dc1958b0f69..0c773e639bbd 100644
--- a/docs/containerization.md
+++ b/docs/containerization.md
@@ -1,10 +1,12 @@
-# Putting `lab` in a Container AND making it go fast
+# Putting `lab` in a Container AND making it go fast

-Containerization of `lab` allows for portability and ease of setup. With this, users can now run lab on OpenShift to test the speed of `lab train` and `generate` using dedicated GPUs. This guide shows you how to put the `lab`CLI, all of its dependencies,
-and your GPU into a container for an isolated and easily reproducible experience.
+Containerization of `lab` allows for portability and ease of setup. With this,
+users can now run `lab` on OpenShift to test the speed of `lab train` and `lab generate`
+using dedicated GPUs. This guide shows you how to put the `lab` CLI, all of its
+dependencies, and your GPU into a container for an isolated and easily reproducible
+experience.

-
-## Steps to build an image then run a container:
+## Steps to build an image and run a container

**Containerfile:**

@@ -30,25 +32,35 @@ CMD ["/bin/bash"]

Or image: TBD (am I allowed to have a public image with references to lab in it?)

-This containerfile is based on Nvidia's CUDA image, which lucky for us plugs directly into Podman via their `nvidia-container-toolkit`! The ubi9 base image does not have most packages installed. The bulk of the `containerfile` is spent configuring your system so `lab` can be installed and run properly. ubi9 as compared to ubuntu cannot install the entire nvidia-12-4 toolkit. This did not impact performance during testing.
+This containerfile is based on Nvidia's CUDA image, which conveniently plugs
+directly into Podman via the `nvidia-container-toolkit`! The ubi9 base image
+ships with very few packages installed. The bulk of the `containerfile` is spent
+configuring your system so `lab` can be installed and run properly. Unlike
+Ubuntu, ubi9 cannot install the entire nvidia-12-4 toolkit. This did not impact
+performance during testing.

-1. Podman build –ssh=default -f
+```shell
+1. podman build --ssh=default -f
2. curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
3. sudo yum-config-manager --enable nvidia-container-toolkit-experimental
4. sudo dnf install -y nvidia-container-toolkit
5. sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
6. nvidia-ctk cdi list
Example output:
-   INFO[0000] Found 2 CDI devices
+   INFO[0000] Found 2 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=all
7. podman run --device nvidia.com/gpu=0 --security-opt=label=disable -it
+```

Voila! You now have a container with CUDA and GPUs enabled!

-#### Sources:
-https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html – nvidia container toolkit
-https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html podman
+### Sources
+
+[Nvidia Container Toolkit Install Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+
+[Podman Support for Container Device Interface](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html)
+
+### Notes

-#### Notes:
Thanks to Taj Salawu for figuring out how to pass the git ssh keys properly!
diff --git a/docs/converting_GGUF.md b/docs/converting_GGUF.md
index 45ab259e1698..c9d2e962f5bd 100644
--- a/docs/converting_GGUF.md
+++ b/docs/converting_GGUF.md
@@ -1,9 +1,12 @@
-
+# Optional: Converting a Model to GGUF and Quantizing
-# Optional: Converting a Model to GGUF and Quantizing
-
-The latest [llama.cpp](https://github.com/ggerganov/llama.cpp) framework
-requires the model to be converted into [GGUF](https://medium.com/@sandyeep70/ggml-to-gguf-a-leap-in-language-model-file-formats-cd5d3a6058f9) format. [GGUF](https://medium.com/@sandyeep70/ggml-to-gguf-a-leap-in-language-model-file-formats-cd5d3a6058f9) is a quantization technique. [Quantization](https://www.tensorops.ai/post/what-are-quantized-llms) is a technique used to reduce the size of large neural networks, including large language models (LLMs) by modifying the precision of their weights. If you have a model already in GGUF format, you can skip this step.
+The latest [llama.cpp](https://github.com/ggerganov/llama.cpp) framework
+requires the model to be converted into [GGUF](https://medium.com/@sandyeep70/ggml-to-gguf-a-leap-in-language-model-file-formats-cd5d3a6058f9)
+format. [GGUF](https://medium.com/@sandyeep70/ggml-to-gguf-a-leap-in-language-model-file-formats-cd5d3a6058f9)
+is a file format for packaging model weights for inference.
+[Quantization](https://www.tensorops.ai/post/what-are-quantized-llms)
+is a technique used to reduce the size of large neural networks, including large
+language models (LLMs), by modifying the precision of their weights. If you have a
+model already in GGUF format, you can skip this step.
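+
+To put rough numbers on that size reduction, the figures below are assumptions
+for a hypothetical 7B-parameter model rather than measurements; the `ls` check
+works on whatever directory holds your converted models:
+
+```shell
+# Rough, assumed sizes for a hypothetical 7B-parameter model:
+#   f16 GGUF:    7B params * 2 bytes        -> roughly 13 GiB
+#   Q4_K_M GGUF: roughly 4-5 bits per param -> roughly 4 GiB
+# Compare what you actually end up with on disk after converting and quantizing:
+ls -lh {model_directory}/*.gguf
+```
+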
## Clone the llama.cpp repository

@@ -42,7 +45,8 @@ def write(self):

## Convert a model to GGUF

-The following command converts a Hugging Face model (safetensors) to [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format and saves it in your model directory with a `.gguf` extension.
+The following command converts a Hugging Face model (safetensors) to [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
+format and saves it in your model directory with a `.gguf` extension.

```shell
export MODEL_DIR={model_directory}
python convert-hf-to-gguf.py $MODEL_DIR --outtype f16

@@ -53,9 +57,10 @@

## Quantize

-Optionally, for smaller/faster models with varying loss of quality use a quantized model.
+Optionally, you can quantize the model to make it smaller and faster, with a
+varying loss of quality.

-#### Make the llama.cpp binaries
+### Make the llama.cpp binaries

Build binaries like `quantize` etc. for your environment.

@@ -65,15 +70,17 @@ make

#### Run quantize command

-
```shell
./quantize {model_directory}/{f16_gguf_model}
```

-For example, the following command converts the f16 GGUF model to a Q4_K_M quantized model and saves it in your model directory with a `.gguf` suffix (e.g. ggml-model-Q4_K_M.gguf).
+For example, the following command converts the f16 GGUF model to a Q4_K_M
+quantized model and saves it in your model directory with a `.gguf`
+suffix (e.g. ggml-model-Q4_K_M.gguf).

```shell
./quantize $MODEL_DIR/ggml-model-f16.gguf Q4_K_M
```

-> Tip: Use `./quantize help` for a list of quantization types with their relative size and output quality along with additional usage parameters.
+> Tip: Use `./quantize help` for a list of quantization types with their
+> relative size and output quality along with additional usage parameters.
diff --git a/docs/gpu-acceleration.md b/docs/gpu-acceleration.md
index d65ac9076f7b..e3fdf6fb848b 100644
--- a/docs/gpu-acceleration.md
+++ b/docs/gpu-acceleration.md
@@ -1,14 +1,26 @@
# 🏎️ Making `lab` go fast

-By default, `lab` will attempt to use your GPU for inference and synthesis. This works on a wide variety of common systems, but less-common configurations may require some additional tinkering to get it enabled. This document aims to describe how you can GPU-accelerate `lab` on a variety of different environments.
+By default, `lab` will attempt to use your GPU for inference and synthesis. This
+works on a wide variety of common systems, but less-common configurations may
+require some additional tinkering to get it enabled. This document aims to
+describe how you can GPU-accelerate `lab` in a variety of different
+environments.

-`lab` relies on two Python packages that can be GPU accelerated: `torch` and `llama-cpp-python`. In short, you'll need to replace the default versions of these packages with versions that have been compiled for GPU-specific support, recompile `lab`, then run it.
+`lab` relies on two Python packages that can be GPU accelerated: `torch`
+and `llama-cpp-python`. In short, you'll need to replace the default versions of
+these packages with versions that have been compiled with GPU support,
+recompile `lab`, then run it.

-### Python 3.11 (Linux only)
+## Python 3.11 (Linux only)

-> **NOTE:** This section may be outdated. At least AMD ROCm works fine with Python 3.12 and Torch 2.2.1+rocm5.7 binaries.
+> **NOTE:** This section may be outdated. AMD ROCm, at least, works fine with
+> Python 3.12 and Torch 2.2.1+rocm5.7 binaries.
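+
+Before changing anything, it can help to confirm which interpreter and Torch
+build your environment currently uses (a quick check; it assumes the `venv`
+that `lab` was installed into is active):
+
+```shell
+python3 --version
+python3 -c 'import torch; print("torch", torch.__version__)'
+```
+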
-Unfortunately, at the time of writing, `torch` does not have GPU-specific support for the latest Python (3.12), so if you're on Linux, it's recommended to set up a Python 3.11-specific `venv` and install `lab` to that to minimize issues. (MacOS ships Python 3.9, so this step shouldn't be necessary.) Here's how to do that on Fedora with `dnf`:
+Unfortunately, at the time of writing, `torch` does not have GPU-specific
+support for the latest Python (3.12), so if you're on Linux, it's recommended
+to set up a Python 3.11-specific `venv` and install `lab` into that to minimize
+issues. (macOS ships Python 3.9, so this step shouldn't be necessary.) Here's
+how to do that on Fedora with `dnf`:

```shell
# Install python3.11
@@ -32,22 +44,27 @@ With Python 3.11 installed, it's time to replace some packages!

### Nvidia/CUDA

-`torch` should already ship with CUDA support, so you only have to replace `llama-cpp-python`.
+`torch` should already ship with CUDA support, so you only have to replace
+`llama-cpp-python`.

-Ensure you have the latest proprietary Nvidia drivers installed. You can easily validate whether you are using `nouveau` or `nvidia` kernel drivers with the following command. If your output shows "Kernel driver in use: nouveau", you are **not running** with the proprietary Nvidia drivers.
+Ensure you have the latest proprietary Nvidia drivers installed. You can
+easily validate whether you are using `nouveau` or `nvidia` kernel drivers with
+the following command. If your output shows "Kernel driver in use: nouveau",
+you are **not running** with the proprietary Nvidia drivers.

```shell
# Check video driver
lspci -n -n -k | grep -A 2 -e VGA -e 3D
```

-If needed, install the proprietary NVidia drivers
+If needed, install the proprietary Nvidia drivers:

```shell
# Enable RPM Fusion Repos
sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Install Nvidia Drivers
+
# There may be extra steps for enabling secure boot. View the following blog for further details: https://blog.monosoul.dev/2022/05/17/automatically-sign-nvidia-kernel-module-in-fedora-36/
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda

@@ -59,7 +76,8 @@ sudo reboot
lspci -n -n -k | grep -A 2 -e VGA -e 3D
```

-You should now see "Kernel driver in use: nvidia". The next step is to ensure CUDA 12.4 is installed.
+You should now see "Kernel driver in use: nvidia". The next step is to ensure
+CUDA 12.4 is installed.

```shell
# Install CUDA 12.4 and nvtop to monitor GPU usage
@@ -69,7 +87,12 @@ sudo dnf clean all
sudo dnf -y install cuda-toolkit-12-4 nvtop
```

-Go to the project's Github to see the [supported backends](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends). Find the `cuBLAS (CUDA)` backend. You'll see a `pip3 install` command. You'll want to add a few options to ensure it gets installed over the existing package: `--force-reinstall` and `--no-cache-dir`. Your final command should look like this:
+Go to the project's GitHub page to see the
+[supported backends](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends).
+Find the `cuBLAS (CUDA)` backend. You'll see a `pip3 install` command.
+You'll want to add a few options to ensure it gets installed over the
+existing package: `--force-reinstall` and `--no-cache-dir`.
+Your final command should look like this:

```shell
# Veryify CUDA can be found in your PATH variable
echo $CUDA_HOME
echo $PATH

# Compile and install llama-cpp-python with CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip3 install --force-reinstall --no-cache-dir llama-cpp-python

# Re-install lab
pip3 install cli/.
```

-Proceed to the `Initialize` section of the [CLI Readme](https://github.com/instruct-lab/cli?tab=readme-ov-file#%EF%B8%8F-initialize-lab), and use the `nvtop` utility to validate GPU utilization when interacting with `lab chat` or `lab generate`
+Proceed to the `Initialize` section of
+the [CLI Readme](https://github.com/instruct-lab/cli?tab=readme-ov-file#%EF%B8%8F-initialize-lab),
+and use the `nvtop` utility to validate GPU utilization when interacting
+with `lab chat` or `lab generate`.

### AMD/ROCm

-Your user account must be in the `video` and `render` group to have permission to access the GPU hardware. If the `id` command does not show both groups, then run the following command. You have to log out log and log in again to refresh your current user session.
+Your user account must be in the `video` and `render` groups to have permission
+to access the GPU hardware. If the `id` command does not show both groups, then
+run the following command. You have to log out and log in again to refresh
+your current user session.

```shell
sudo usermod -a -G render,video $LOGNAME
```

-`torch` does not yet ship with AMD ROCm support, so you'll need to install a version compiled with support.
+`torch` does not yet ship with AMD ROCm support, so you'll need to install a
+version compiled with ROCm support.

-Visit [Pytorch's "Get Started Locally" page](https://pytorch.org/get-started/locally/) and use the matrix installer tool to find the ROCm package. `Stable, Linux, Pip, Python, ROCm 5.7` in the matrix installer spits out the following command:
+Visit [Pytorch's "Get Started Locally" page](https://pytorch.org/get-started/locally/)
+and use the matrix installer tool to find the ROCm package. Selecting
+`Stable, Linux, Pip, Python, ROCm 5.7` in the matrix produces the following command:

```shell
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
```

-You don't need `torchvision` or `torchaudio`, so get rid of those. You also want to make _very_ sure you're installing the right package, and not the old one that doesn't have GPU support, so you should add these options: `--force-reinstall` and `--no-cache-dir`. Your command should look like below. Run it to install the new version of `torch`.
+You don't need `torchvision` or `torchaudio`, so remove them from the command.
+You also want to make _very_ sure you're installing the right package, and not
+the old one that doesn't have GPU support, so you should add these options:
+`--force-reinstall` and `--no-cache-dir`. Your command should look like the one
+below. Run it to install the new version of `torch`.

```shell
pip3 install torch --force-reinstall --no-cache-dir --index-url https://download.pytorch.org/whl/rocm5.7
@@ -110,14 +146,23 @@

With that done, it's time to move on to `llama-cpp-python`.

-Go to the project's Github to see the [supported backends](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends). There are several possible backends that may work on AMD; `CLBlast (OpenCL)` and `hipBLAS (ROCm)` have been tested to work. It may be worth installing others to see if they work for you, but your mileage may vary.
Instructions for the tested backends are included below!
+Go to the project's GitHub page to see
+the [supported backends](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends).
+There are several possible backends that may work on AMD; `CLBlast (OpenCL)`
+and `hipBLAS (ROCm)` have been tested to work. It may be worth installing others
+to see if they work for you, but your mileage may vary. Instructions for the
+tested backends are included below!

-Whichever backend you choose, you'll see a `pip3 install` command. You'll want to add a few options to ensure it gets installed over the existing package: `--force-reinstall` and `--no-cache-dir`.
+Whichever backend you choose, you'll see a `pip3 install` command. You'll want
+to add a few options to ensure it gets installed over the existing package:
+`--force-reinstall` and `--no-cache-dir`.

#### hipBLAS

-If using hipBLAS you may need to install additional ROCm and hipBLAS Dependencies:
-```
+If using hipBLAS, you may need to install additional ROCm and hipBLAS
+dependencies:
+
+```shell
# Optionally enable repo.radeon.com repository, available through AMD documentation or Radeon Software for Linux for RHEL 9.3 at https://www.amd.com/en/support/linux-drivers
# The above will get you the latest 6.x drivers, and will not work with rocm5.7 pytorch
# to grab rocm 5.7 drivers: https://repo.radeon.com/amdgpu-install/23.30.3/rhel/9.2/

sudo dnf install rocm-dev rocm-utils rocm-llvm rocminfo
sudo dnf install hipblas-devel hipblas rocblas-devel
```

-With those dependencies installed, you should be able to install (and build) `llama-cpp-python`!
+With those dependencies installed, you should be able to install (and build)
+`llama-cpp-python`!

-You can use `rocminfo | grep gfx` from `rocminfo` package or `amdgpu-arch` from `clang-tools-extra` package to find our GPU model to include in the build command - this may not be necessary in Fedora 40+ or ROCm 6.0+. You should see something like the following if you have an AMD Integrated and Dedicated GPU:
-```
+You can use `rocminfo | grep gfx` from the `rocminfo` package or `amdgpu-arch`
+from the `clang-tools-extra` package to find your GPU model to include in the
+build command - this may not be necessary in Fedora 40+ or ROCm 6.0+. You should
+see something like the following if you have an AMD integrated and dedicated GPU:
+
+```shell
$ rocminfo | grep gfx
Name: gfx1100
Name: amdgcn-amd-amdhsa--gfx1100
@@ -139,15 +189,29 @@ $ rocminfo | grep gfx
Name: amdgcn-amd-amdhsa--gfx103
```

-In this case, `gfx1100` is the model we're looking for (our dedicated GPU) so we'll include that in our build command as follows:
+In this case, `gfx1100` is the model we're looking for (our dedicated GPU), so
+we'll include that in our build command as follows:

```shell
CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS=gfx1100" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir
```

-> **Note:** This is explicitly forcing the build to use the ROCm compilers and prefix path for dependency resolution in the CMake build. This works around an issue in the CMake and ROCm version in Fedora 39 and below and is fixed in Fedora 40. With Fedora 40's ROCm packages, use `CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DAMDGPU_TARGETS=gfx1100"` instead.
-
-Once that package is installed, recompile `lab` with `pip3 install .`. You also need to tell `HIP` which GPU to use - you can find this out via `rocminfo` although it is typically GPU 0. To set which device is visible to HIP, we'll set `export HIP_VISIBLE_DEVICES=0` for GPU 0. You may also have to set `HSA_OVERRIDE_GFX_VERSION` to override ROCm's GFX version detection, for example `export HSA_OVERRIDE_GFX_VERSION=10.3.0` to force an unsupported `gfx1032` card to use use supported `gfx1030` version. The environment variable `AMD_LOG_LEVEL` enables debug logging of ROCm libraries, for example `AMD_LOG_LEVEL=3` to print API calls to stderr.
+> **Note:** This is explicitly forcing the build to use the ROCm compilers and
+> prefix path for dependency resolution in the CMake build. This works around
+> an issue in the CMake and ROCm versions in Fedora 39 and below that is fixed
+> in Fedora 40. With Fedora 40's ROCm packages, use
+> `CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DAMDGPU_TARGETS=gfx1100"`
+> instead.
+
+Once that package is installed, recompile `lab` with `pip3 install .`. You also
+need to tell `HIP` which GPU to use - you can find this out via `rocminfo`,
+although it is typically GPU 0. To set which device is visible to HIP, we'll
+set `export HIP_VISIBLE_DEVICES=0` for GPU 0. You may also have to set
+`HSA_OVERRIDE_GFX_VERSION` to override ROCm's GFX version detection, for example
+`export HSA_OVERRIDE_GFX_VERSION=10.3.0` to force an unsupported `gfx1032` card
+to use the supported `gfx1030` code path. The environment variable
+`AMD_LOG_LEVEL` enables debug logging of ROCm libraries, for example
+`AMD_LOG_LEVEL=3` to print API calls to stderr.
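+
+Putting those variables together, here is a minimal sketch of the environment
+setup before serving; the device index and GFX override below are examples for
+the hardware described above, not values that apply to every system:
+
+```shell
+# Expose only the dedicated GPU (device 0 in rocminfo) to HIP
+export HIP_VISIBLE_DEVICES=0
+# Only needed for unsupported cards, e.g. to run a gfx1032 card on the gfx1030 code path
+export HSA_OVERRIDE_GFX_VERSION=10.3.0
+# Optional: verbose ROCm API logging to stderr while serving
+AMD_LOG_LEVEL=3 lab serve
+```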

Now you can skip to the `Testing` section.

@@ -159,25 +223,41 @@ Your final command should look like so (this uses `CLBlast`):
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip3 install --force-reinstall --no-cache-dir llama-cpp-python
```

-Once that package is installed, recompile `lab` with `pip3 install .` and skip to the `Testing` section.
+Once that package is installed, recompile `lab` with `pip3 install .` and skip
+to the `Testing` section.

### Metal/Apple Silicon

-The `lab` default installation should have Metal support by default. If that isn't the case, these steps might help to enable it.
+The default `lab` installation should already include Metal support. If that
+isn't the case, these steps might help to enable it.

-`torch` should already ship with Metal support, so you only have to replace `llama-cpp-python`. Go to the project's Github to see the [supported backends](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends). Find the `Metal` backend. You'll see a `pip3 install` command. You'll want to add a few options to ensure it gets installed over the existing package: `--force-reinstall` and `--no-cache-dir`. Your final command should look like so:
+`torch` should already ship with Metal support, so you only have to
+replace `llama-cpp-python`. Go to the project's GitHub page to see the
+[supported backends](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#supported-backends).
+Find the `Metal` backend. You'll see a `pip3 install` command. You'll want to
+add a few options to ensure it gets installed over the existing package:
+`--force-reinstall` and `--no-cache-dir`. Your final command should look like so:

```shell
CMAKE_ARGS="-DLLAMA_METAL=on" pip3 install --force-reinstall --no-cache-dir llama-cpp-python
```

-Once that package is installed, recompile `lab` with `pip3 install .` and skip to the `Testing` section.
+Once that package is installed, recompile `lab` with `pip3 install .` and skip
+to the `Testing` section.

### Testing

-Test your changes by chatting to the LLM. Run `lab serve` and `lab chat` and chat to the LLM. If you notice significantly faster inference, congratulations! You've enabled GPU acceleration. You should also notice that the `lab generate` step will take significantly less time. You can use tools like `nvtop` and `radeontop` to monitor GPU usage.
+Test your changes by chatting with the LLM: run `lab serve` and `lab chat`. If
+you notice significantly faster inference, congratulations! You've enabled GPU
+acceleration. You should also notice that the `lab generate` step will take
+significantly less time. You can use tools like `nvtop` and `radeontop` to
+monitor GPU usage.

-The `torch` and `llama_cpp` packages provide functions to debug GPU support. Here is an example from an AMD ROCm system with a single GPU, ROCm build of PyTorch and llama-cpp with HIPBLAS. Don't be confused by the fact that PyTorch uses `torch.cuda` API for ROCm or llama-cpp reports HIPBLAS as CUBLAS. The packages treat ROCm like a variant of CUDA.
+The `torch` and `llama_cpp` packages provide functions to debug GPU support.
+Here is an example from an AMD ROCm system with a single GPU, a ROCm build of
+PyTorch, and llama-cpp built with hipBLAS. Don't be confused by the fact that
+PyTorch uses the `torch.cuda` API for ROCm, or that llama-cpp reports HIPBLAS
+as CUBLAS. The packages treat ROCm like a variant of CUDA.

```python
>>> import torch
@@ -210,7 +290,13 @@ ggml_init_cublas: found 1 ROCm devices:

## Training

-`lab train` also experimentally supports GPU acceleration on Linux. Details of a working set up is included above. Training is memory-intensive and requires a modern GPU to work. The GPU must support `bfloat16` or `fp16` and have at least 17 GiB of free GPU memory. Nvidia CUDA on WSL2 is able to use shared host memory (USM) if GPU memory is not sufficient, but that comes with a performance penalty. Training on Linux Kernel requires all data to fit in GPU memory. We are working on improvements like 4-bit quantization.
+`lab train` also experimentally supports GPU acceleration on Linux. Details
+of working setups are included above. Training is memory-intensive and requires
+a modern GPU to work. The GPU must support `bfloat16` or `fp16` and have at
+least 17 GiB of free GPU memory. Nvidia CUDA on WSL2 is able to use shared host
+memory (USM) if GPU memory is not sufficient, but that comes with a performance
+penalty. Training on native Linux requires all data to fit in GPU memory. We are
+working on improvements like 4-bit quantization.

It has been successfully tested on:

@@ -219,16 +305,20 @@ It has been successfully tested on:
- Nvidia Tesla V100 (16 GB) on AWS `p3.2xlarge`, Fedora 39, PyTorch 2.2.1, 4-bit quantization
- AMD Radeon RX 7900 XT (20 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7
- AMD Radeon RX 7900 XTX (24 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7
-- AMD Radeon RX 6700 XT (12 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7, 4-bit quantization
+- AMD Radeon RX 6700 XT (12 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7, 4-bit
+  quantization

Incompatible devices:

- NVidia cards with Turing architecture (GeForce RTX 20 series) or older.
They lack support for `bfloat16` and `fp16`. -> **Note:** PyTorch implements AMD ROCm support on top of its `torch.cuda` API and treats AMD GPUs as CUDA devices. In a ROCm build of PyTorch, `cuda:0` is actually the first ROCm device. - -> **Note:** Training does not use a local lab server. You can stop `lab serve` to free up GPU memory. +> **Note:** PyTorch implements AMD ROCm support on top of its `torch.cuda` API +> and treats AMD GPUs as CUDA devices. In a ROCm build of PyTorch, `cuda:0` is +> actually the first ROCm device. + +> **Note:** Training does not use a local lab server. You can stop `lab serve` +> to free up GPU memory. ```shell lab train --device cuda @@ -243,4 +333,4 @@ LINUX_TRAIN.PY: PyTorch device is 'cuda:0' Free GPU memory: 19.9 GiB of 20.0 GiB LINUX_TRAIN.PY: NUM EPOCHS IS: 1 ... -``` \ No newline at end of file +```
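+
+If training stops with out-of-memory errors, it is worth confirming how much GPU
+memory is actually free before starting a run. A small sketch; `nvidia-smi` ships
+with the Nvidia driver and `rocm-smi` with the ROCm packages installed above:
+
+```shell
+# Nvidia: used and total memory per GPU
+nvidia-smi --query-gpu=memory.used,memory.total --format=csv
+# AMD ROCm: VRAM usage per GPU
+rocm-smi --showmeminfo vram
+```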