Skip to content

Commit

Permalink
Add "Fine Tuning LLMs" how to guide (ROCm#3124)
Browse files Browse the repository at this point in the history
* Add Fine Tuning LLMs how to guide

* Reorg and refactor Fine-tuning LLMs with ROCm

Update index and headings

Fix formatting and update toc

Split out content from index to overview.rst

Add metadata

Clean up overview

Add inference sections, fix rst errors, clean up single-gpu-fine-tuning

Combine fine-tuning and inference guides

Fix some links and formatting

Update toc and add formatting fixes

Add ck kernel fusion content

Update toc

Clean up model quantization and acceleration

Add CK images

Clean up profiling

Update triton kernel performance optimization

Update llm inference frameworks guide

Disable automatic number of figures and tables in Sphinx conf

Change tabs to spaces

Change heading to end with -ing

Add link fixes and heading updates

Add rocprof/Omniperf/Omnitrace section

Update profiling and debugging guide

Add formatting fixes

Satisfy spellcheck

Fix words

Delete unused file

Finish overview

Clean up first 4 sections

Multi-gpu fine-tuning guide: slight fixes

Update toc

Remove tabs

Formatting fixes

* Minor wording updates

* Add some clean-up

* Update profiling and debugging gudie

* Fix Omnitrace link

* Update ck kernel fusion with latest

* Update CK formatting

* Fix perfetto link syntax

* Fix typos and add blurbs

* Add fixes to Triton optimization doc

* Tabify saving adapters / models section

* Fix linting errors - spellcheck

Fix spelling and grammar

Satisfy linter

Update wording in profiling guide

Add fixes to satisfy linter

More fixes for linting in Triton guide

More linting fixes

Spellcheck in CK guide

* Improve triton guide

Fix linting errors and optics

* Add occupancy / vgpr table

Change some wording

* Re-add tunableop

* Add missing indent in _toc.yml

* Remove ckProfiler references

* Add links to resources

* Add refs in CK optimization guide

* Rename files and fix internal links

* Organize tuning guides

Reorg triton

* Add compute unit diagram

* Remove AutoAWQ

* Add higher res image for Perfetto trace example

* Update link text

* Update fig nums

* Update some formatting

* Update "Inductor"

* Change "Inductor" to TorchInductor

* Add link to official TorchInductor docs
  • Loading branch information
peterjunpark committed Jun 4, 2024
1 parent bf19dd1 commit 1e55e01
Show file tree
Hide file tree
Showing 32 changed files with 2,763 additions and 2 deletions.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
14 changes: 14 additions & 0 deletions docs/data/how-to/fine-tuning-llms/perfetto-trace.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/data/how-to/fine-tuning-llms/tunableop.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
5 changes: 4 additions & 1 deletion docs/how-to/deep-learning-rocm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,9 @@ Find information on version compatibility and framework release notes in :doc:`T

For guidance on installing ROCm itself, refer to :doc:`ROCm installation for Linux <rocm-install-on-linux:index>`.

Learn how to use your ROCm deep learning environment for training, fine-tuning, and inference through the following guides.
Learn how to use your ROCm deep learning environment for training, fine-tuning, inference, and performance optimization
through the following guides.

* :doc:`rocm-for-ai/index`

* :doc:`fine-tuning-llms/index`
20 changes: 20 additions & 0 deletions docs/how-to/fine-tuning-llms/fine-tuning-and-inference.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, inference, usage, tutorial

*************************
Fine-tuning and inference
*************************

Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries <rocm:reference/api-libraries>` and
:doc:`tools <rocm:reference/rocm-tools>` to optimize and train deep learning models. ROCm provides a comprehensive
ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and
ROCm-aware versions of :doc:`deep learning frameworks <../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX.

Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for
smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately
sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`.

Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are
typically used in LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of
massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`.
37 changes: 37 additions & 0 deletions docs/how-to/fine-tuning-llms/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial

**************************
Fine-tuning LLMs with ROCm
**************************

ROCm empowers the fine-tuning and optimization of large language models, making them accessible and efficient for
specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks,
models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_

Throughout the following topics, this guide discusses the goals and :ref:`challenges of fine-tuning a large language
model <fine-tuning-llms-concept-challenge>` like Llama 2. Then, it introduces :ref:`common methods of optimizing your
fine-tuning <fine-tuning-llms-concept-optimizations>` using techniques like LoRA with libraries like PEFT. In the
sections that follow, you'll find practical guides on libraries and tools to accelerate your fine-tuning.

- :doc:`Conceptual overview of fine-tuning LLMs <overview>`

- :doc:`Fine-tuning and inference <fine-tuning-and-inference>` using a
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` or
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` system.

- :doc:`Model quantization <model-quantization>`

- :doc:`Model acceleration libraries <model-acceleration-libraries>`

- :doc:`LLM inference frameworks <llm-inference-frameworks>`

- :doc:`Optimizing with Composable Kernel <optimizing-with-composable-kernel>`

- :doc:`Optimizing Triton kernels <optimizing-triton-kernel>`

- :doc:`Profiling and debugging <profiling-and-debugging>`

218 changes: 218 additions & 0 deletions docs/how-to/fine-tuning-llms/llm-inference-frameworks.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,218 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, inference, vLLM, TGI, text generation inference

************************
LLM inference frameworks
************************

This section discusses how to implement `vLLM <https://docs.vllm.ai/en/latest>`_ and `Hugging Face TGI
<https://huggingface.co/docs/text-generation-inference/en/index>`_ using
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` and
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` systems.

.. _fine-tuning-llms-vllm:

vLLM inference
==============

vLLM is renowned for its paged attention algorithm that can reduce memory consumption and increase throughput thanks to
its paging scheme. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token lengths of the
models, the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths. This paged attention
is also effective when multiple requests share the same key and value contents for a large value of beam search or
multiple parallel requests.

vLLM also incorporates many modern LLM acceleration and quantization algorithms, such as Flash Attention, HIP and CUDA
graphs, tensor parallel multi-GPU, GPTQ, AWQ, and token speculation.

Installing vLLM
---------------

1. To install vLLM, run the following commands.

.. code-block:: shell
# Install from the source
git clone https://github.com/ROCm/vllm.git
cd vllm
PYTORCH_ROCM_ARCH=gfx942 python setup.py install #MI300 series
.. _fine-tuning-llms-vllm-rocm-docker-image:

2. Run the following commands to build a Docker image ``vllm-rocm``.

.. code-block:: shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.rocm -t vllm-rocm .
.. tab-set::

.. tab-item:: vLLM on a single-accelerator system
:sync: single

3. To use vLLM as an API server to serve reference requests, first start a container using the :ref:`vllm-rocm
Docker image <fine-tuning-llms-vllm-rocm-docker-image>`.

.. code-block:: shell
docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
vllm-rocm \
bash
4. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command.

.. code-block:: shell
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 --port 8000 &
The following log message is displayed in your command line indicates that the server is listening for requests.

.. image:: ../../data/how-to/fine-tuning-llms/vllm-single-gpu-log.png
:alt: vLLM API server log message
:align: center

5. To test, send it a curl request containing a prompt.

.. code-block:: shell
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }'
You should receive a response like the following.

.. code-block:: text
{"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]}
.. tab-item:: vLLM on a multi-accelerator system
:sync: multi

3. To use vLLM as an API server to serve reference requests, first start a container using the :ref:`vllm-rocm
Docker image <fine-tuning-llms-vllm-rocm-docker-image>`.

.. code-block:: shell
docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
vllm-rocm \
bash
4. To run API server on multiple GPUs, use the ``-tp`` or ``--tensor-parallel-size`` parameter. For example, to use two
GPUs, start the API server using the following command.

.. code-block:: shell
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 &
5. To run multiple instances of API Servers, specify different ports for each server, and use ``ROCR_VISIBLE_DEVICES`` to
isolate each instance to a different accelerator.

For example, to run two API servers, one on port 8000 using GPU 0 and 1, one on port 8001 using GPU 2 and 3, use a
a command like the following.

.. code-block:: shell
ROCR_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 –tp 2 --port 8000 &
ROCR_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 –tp 2--port 8001 &
6. To test, send it a curl request containing a prompt.

.. code-block:: shell
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }'
You should receive a response like the following.

.. code-block:: text
{"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]}
.. _fine-tuning-llms-tgi:

Hugging Face TGI
================

Text Generation Inference (TGI) is LLM serving framework from Hugging
Face, and it also supports the majority of high-performance LLM
acceleration algorithms such as Flash Attention, Paged Attention,
CUDA/HIP graph, tensor parallel multi-GPU, GPTQ, AWQ, and token
speculation.

.. tip::

In addition to LLM serving capability, TGI also provides the `Text Generation Inference benchmarking tool
<https://github.com/huggingface/text-generation-inference/blob/main/benchmark/README.md>`_.

Install TGI
-----------

1. To install the TGI Docker image, run the following commands.

.. code-block:: shell
# Install from Dockerfile
git clone https://github.com/huggingface/text-generation-inference.git -b mi300-compat
cd text-generation-inference
docker build . -f Dockerfile.rocm
.. tab-set::

.. tab-item:: TGI on a single-accelerator system
:sync: single

2. Launch a model using TGI server on a single accelerator.

.. code-block:: shell
export ROCM_USE_FLASH_ATTN_V2_TRITON=True
text-generation-launcher --model-id NousResearch/Meta-Llama-3-70B --dtype float16 --port 8000 &
3. To test, send it a curl request containing a prompt.

.. code-block:: shell
curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
You should receive a response like the following.

.. code-block:: shell
data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2822266,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null}
.. tab-item:: TGI on a multi-accelerator system

2. Launch a model using TGI server on multiple accelerators (4 in this case).

.. code-block:: shell
export ROCM_USE_FLASH_ATTN_V2_TRITON=True
text-generation-launcher --model-id NousResearch/Meta-Llama-3-8B --dtype float16 --port 8000 --num-shard 4 &
3. To test, send it a curl request containing a prompt.

.. code-block:: shell
curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
You should receive a response like the following.

.. code-block:: shell
data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2773438,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null}
Loading

0 comments on commit 1e55e01

Please sign in to comment.