Add "How to use ROCm for AI" (ROCm#3117)
* Add Using ROCm for AI

  Add PyTorch Docker installation images
  Split doc into subtopics
  Add metadata
  Clean up index
  Clean up hugging face guide
  Clean up installation guide
  Fix rST formatting
  Clean up install and train-a-model
  Clean up MAD
  Delete unused file
  Add ref anchors and clean up MAD doc
  Add formatting fixes
  Update toc and section index
  Format some code blocks
  Remove install guide and update toc
  Chop installation guide
  Clean up deployment and hugging face sections
  Change headings to end in -ing
  Fix spelling in Training a model
  Delete MAD and split out install content
  Fix formatting
  Change words to satisfy spellcheck linter

* Add review suggestions and add helpful links

  Co-authored-by: Leo Paoletti <[email protected]>

  Add helpful links and add review suggestions
  Remove fine-tuning link and links to D5 and MAGMA
  Update docs/how-to/rocm-for-ai/deploy-your-model.rst

  Co-authored-by: Young Hui - AMD <[email protected]>

  Update DeepSpeed link
  Add subheading to ML framework installation and closing blurb to hugging face models guide

* Reorder topics
1 parent 64c2ef8 · commit 3c06011 · 11 changed files with 553 additions and 1 deletion
@@ -0,0 +1,113 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
Deploying your model
********************

ROCm enables inference and deployment for various classes of models including CNN, RNN, LSTM, MLP, and transformers.
This section focuses on deploying transformer-based LLMs.

ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks.

.. _rocm-for-ai-serve-vllm:

Serving using vLLM
==================

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.

See the `GitHub repository <https://github.com/vllm-project/vllm>`_ and `official vLLM documentation
<https://docs.vllm.ai/>`_ for more information.

For guidance on using vLLM with ROCm, refer to `Installation with ROCm
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html>`_.

vLLM installation
-----------------

vLLM supports two ROCm-capable installation methods. Refer to the official documentation using the following links.

- `Build from source with Docker
  <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-docker-rocm>`_ (recommended)

- `Build from source <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm>`_

vLLM walkthrough
----------------

For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
Blogs <https://rocm.blogs.amd.com/artificial-intelligence/vllm/README.html>`_.

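Before standing up a server, you can sanity-check a ROCm-capable vLLM build with offline batched generation. The
following is a minimal sketch, assuming vLLM is installed using one of the methods above; the small
``facebook/opt-125m`` model is used here only as a stand-in for your own model.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Load a small model as a smoke test; swap in your own model ID.
   llm = LLM(model="facebook/opt-125m")
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

   # Generate completions for a batch of prompts on the ROCm device.
   outputs = llm.generate(["What is deep learning?", "ROCm is"], sampling_params)
   for output in outputs:
       print(output.prompt, "->", output.outputs[0].text)
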
.. _rocm-for-ai-serve-hugging-face-tgi:

Serving using Hugging Face TGI
==============================

The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-generation-inference/index>`_
(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
<https://huggingface.co/docs/text-generation-inference/quicktour>`_ for more details.

TGI installation
----------------

The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.

TGI walkthrough
---------------

#. Set up the LLM server.

   Deploy the Llama2 7B model with TGI using the official Docker image.

   .. code-block:: shell

      model=TheBloke/Llama-2-7B-fp16
      volume=$PWD
      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
         --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host \
         --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd \
         ghcr.io/huggingface/text-generation-inference:1.2-rocm \
         --model-id $model

#. Set up the client.

   a. Open another shell session and run the following command to access the server with the client URL.

      .. code-block:: shell

         curl 127.0.0.1:8080/generate \
             -X POST \
             -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
             -H 'Content-Type: application/json'

   b. Access the server with request endpoints.

      .. code-block:: shell

         pip install requests
         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py

      ``requests_model.py`` should look like:

      .. code-block:: python

         import requests

         headers = {
             "Content-Type": "application/json",
         }

         data = {
             'inputs': 'What is Deep Learning?',
             'parameters': {'max_new_tokens': 20},
         }

         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
         print(response.json())

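If you prefer a client library over hand-written HTTP calls, the following is a minimal sketch of the same request
using ``InferenceClient`` from the ``huggingface_hub`` package (assumed to be installed separately with
``pip install huggingface_hub``):

.. code-block:: python

   from huggingface_hub import InferenceClient

   # Point the client at the local TGI server started above.
   client = InferenceClient("http://127.0.0.1:8080")

   print(client.text_generation("What is Deep Learning?", max_new_tokens=20))
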
vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.
@@ -0,0 +1,210 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world’s largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)

Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started (a short usage sketch follows the list):

- `BERT <https://huggingface.co/bert-base-uncased>`_

- `BLOOM <https://huggingface.co/bigscience/bloom>`_

- `Llama <https://huggingface.co/huggyllama/llama-7b>`_

- `OPT <https://huggingface.co/facebook/opt-66b>`_

- `T5 <https://huggingface.co/t5-base>`_

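As a quick check that a model from this list runs on your accelerator, the following is a minimal sketch, assuming the
Transformers library and a ROCm-enabled PyTorch build are installed as described above; the model and prompt are only
examples.

.. code-block:: python

   from transformers import pipeline

   # device=0 selects the first GPU; ROCm devices are addressed through the
   # same device indices as CUDA devices in PyTorch.
   unmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)

   print(unmasker("The capital of France is [MASK]."))
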
.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to use
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri \
         --group-add=video --ipc=host --cap-add=SYS_PTRACE \
         --security-opt seccomp=unconfined -v $volume:/workspace \
         --name transformer_amd \
         transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )

.. _rocm-for-ai-gptq:

GPTQ
----

To enable `GPTQ <https://arxiv.org/abs/2210.17323>`_, hosted wheels are available for ROCm.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )

.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ...
          provider="ROCMExecutionProvider"
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer

      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
@@ -0,0 +1,23 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************

ROCm offers a suite of optimizations for AI workloads from large language models (LLMs) to image and video detection
and recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the
broader AI software ecosystem, including open frameworks, models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_.

In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`

- :doc:`Training a model <train-a-model>`

- :doc:`Running models from Hugging Face <hugging-face-models>`

- :doc:`Deploying your model <deploy-your-model>`