Add "How to use ROCm for AI" (ROCm#3117)
* Add Using ROCm for AI

Add PyTorch Docker installation images

Split doc into subtopics

Add metadata

Clean up index

Clean up hugging face guide

Clean up installation guide

Fix rST formatting

Clean up install and train-a-model

Clean up MAD

Delete unused file

Add ref anchors and clean up MAD doc

Add formatting fixes

Update toc and section index

Format some code blocks

Remove install guide and update toc

Chop installation guide

Clean up deployment and hugging face sections

Change headings to end in -ing

Fix spelling in Training a model

Delete MAD and split out install content

Fix formatting

Change words to satisfy spellcheck linter

* Add review suggestions and add helpful links

Co-authored-by: Leo Paoletti <[email protected]>

Add helpful links and add review suggestions

Remove fine-tuning link and links to D5 and MAGMA

Update docs/how-to/rocm-for-ai/deploy-your-model.rst

Co-authored-by: Young Hui - AMD <[email protected]>

Update DeepSpeed link

Add subheading to ML framework installation and closing blurb to hugging face models guide

* Reorder topics
peterjunpark committed May 30, 2024
1 parent 64c2ef8 commit 3c06011
Showing 11 changed files with 553 additions and 1 deletion.
(Four of the changed files cannot be displayed.)
113 changes: 113 additions & 0 deletions docs/how-to/rocm-for-ai/deploy-your-model.rst
@@ -0,0 +1,113 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
Deploying your model
********************

ROCm enables inference and deployment for various classes of models including CNN, RNN, LSTM, MLP, and transformers.
This section focuses on deploying transformer-based LLMs.

ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks.

.. _rocm-for-ai-serve-vllm:

Serving using vLLM
==================

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.

See the `GitHub repository <https://github.com/vllm-project/vllm>`_ and `official vLLM documentation
<https://docs.vllm.ai/>`_ for more information.

For guidance on using vLLM with ROCm, refer to `Installation with ROCm
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html>`_.

vLLM installation
-----------------

vLLM supports two ROCm-capable installation methods. Refer to the official documentation through the following links.

- `Build from source with Docker
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-docker-rocm>`_ (recommended)

- `Build from source <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm>`_

vLLM walkthrough
----------------

For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
Blogs <https://rocm.blogs.amd.com/artificial-intelligence/vllm/README.html>`_.
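
As a quick sanity check of a ROCm-enabled vLLM installation, a minimal offline-generation script might look like the
following sketch. The model name and sampling settings are only examples.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Any Hugging Face causal LM repository ID works here; this small model is only an example.
   llm = LLM(model="facebook/opt-125m")
   sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

   outputs = llm.generate(["What is deep learning?"], sampling_params)
   for output in outputs:
       print(output.outputs[0].text)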

.. _rocm-for-ai-serve-hugging-face-tgi:

Serving using Hugging Face TGI
==============================

The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-generation-inference/index>`_
(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
<https://huggingface.co/docs/text-generation-inference/quicktour>`_ for more details.

TGI installation
----------------

The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.

TGI walkthrough
---------------

#. Set up the LLM server.

   Deploy the Llama 2 7B model with TGI using the official Docker image.

   .. code-block:: shell

      model=TheBloke/Llama-2-7B-fp16
      volume=$PWD
      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model

#. Set up the client.

   a. Open another shell session and run the following command to access the server with the client URL.

      .. code-block:: shell

         curl 127.0.0.1:8080/generate \
             -X POST \
             -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
             -H 'Content-Type: application/json'

   b. Access the server with request endpoints.

      .. code-block:: shell

         pip install requests

         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py

      ``requests_model.py`` should look like this:

      .. code-block:: python

         import requests

         headers = {
             "Content-Type": "application/json",
         }

         data = {
             'inputs': 'What is Deep Learning?',
             'parameters': { 'max_new_tokens': 20 },
         }

         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)

         print(response.json())

vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.
210 changes: 210 additions & 0 deletions docs/how-to/rocm-for-ai/hugging-face-models.rst
@@ -0,0 +1,210 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world’s largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)

Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT <https://huggingface.co/bert-base-uncased>`_

- `BLOOM <https://huggingface.co/bigscience/bloom>`_

- `Llama <https://huggingface.co/huggyllama/llama-7b>`_

- `OPT <https://huggingface.co/facebook/opt-66b>`_

- `T5 <https://huggingface.co/t5-base>`_
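
As an illustration, any of these checkpoints can be exercised through the high-level ``pipeline`` API. This is a
minimal sketch; the task and model shown here are arbitrary examples.

.. code-block:: python

   from transformers import pipeline

   # BERT is used only as an example; the other models listed above work the same way
   # with a task that matches their architecture.
   fill_mask = pipeline(task="fill-mask", model="bert-base-uncased")
   print(fill_mask("Paris is the [MASK] of France."))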

.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to run
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )
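
   To confirm the model responds, a short generation call can follow the load. This is a minimal sketch continuing
   from the snippet above; the prompt and token count are arbitrary.

   .. code-block:: python

      # Tokenize a prompt, generate on the GPU, and decode the result.
      inputs = tokenizer("What is deep learning?", return_tensors="pt").to("cuda")
      outputs = model.generate(**inputs, max_new_tokens=30)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
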
.. _rocm-for-ai-gptq:

GPTQ
----

Hosted ROCm wheels are available for enabling `GPTQ <https://arxiv.org/abs/2210.17323>`_ quantization.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )

.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification

      ..

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ..
          provider="ROCMExecutionProvider"
      )

#. Try running a `BERT text classification
<https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
23 changes: 23 additions & 0 deletions docs/how-to/rocm-for-ai/index.rst
@@ -0,0 +1,23 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************

ROCm offers a suite of optimizations for AI workloads from large language models (LLMs) to image and video detection and
recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the broader
AI software ecosystem, including open frameworks, models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_.

In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`

- :doc:`Training a model <train-a-model>`

- :doc:`Running models from Hugging Face <hugging-face-models>`

- :doc:`Deploying your model <deploy-your-model>`