Update fine-tuning guide: title, improve readability in code blocks, fix typos (ROCm#3222)

* Fix typo

* Add torchtune link

* Add newlines before comments in code blocks for readability

* Update title
peterjunpark committed Jun 4, 2024
1 parent df1d0e4 commit 3515950
Showing 9 changed files with 117 additions and 112 deletions.
6 changes: 3 additions & 3 deletions docs/how-to/fine-tuning-llms/index.rst
@@ -2,9 +2,9 @@
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial

**************************
Fine-tuning LLMs with ROCm
**************************
*******************************************
Fine-tuning LLMs and inference optimization
*******************************************

ROCm empowers the fine-tuning and optimization of large language models, making them accessible and efficient for
specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks,
2 changes: 1 addition & 1 deletion docs/how-to/fine-tuning-llms/llm-inference-frameworks.rst
@@ -32,7 +32,7 @@ Installing vLLM

.. code-block:: shell
# Install from the source
# Install from source
git clone https://github.com/ROCm/vllm.git
cd vllm
PYTORCH_ROCM_ARCH=gfx942 python setup.py install #MI300 series
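
As a quick sanity check after the build, vLLM's offline ``LLM`` API can generate text directly from Python. The following is a minimal, illustrative sketch rather than part of the guide; the model name is only a small placeholder and can be swapped for any ROCm-supported causal LM.

.. code-block:: python

# Minimal sketch: offline generation to verify the vLLM install (placeholder model).
from vllm import LLM, SamplingParams

prompts = ["Write a haiku about GPUs."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# facebook/opt-125m is a small placeholder; substitute the model you intend to serve.
llm = LLM(model="facebook/opt-125m")

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)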
12 changes: 7 additions & 5 deletions docs/how-to/fine-tuning-llms/model-acceleration-libraries.rst
@@ -40,7 +40,7 @@ ROCm provides two different implementations of Flash Attention 2 modules. They c

.. code-block:: shell
# Install from the source
# Install from source
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/
GPU_ARCHS=gfx942 python setup.py install #MI300 series
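
Once the build completes, a common way to exercise these kernels is through Hugging Face Transformers by requesting the ``flash_attention_2`` attention implementation. The sketch below is illustrative; the model name is a placeholder, and half-precision weights are assumed because the Flash Attention kernels operate on fp16/bf16 tensors.

.. code-block:: python

# Illustrative sketch: load a causal LM with the Flash Attention 2 backend (placeholder model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any supported causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16 or bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Flash Attention reduces memory traffic by", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))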
@@ -156,7 +156,7 @@ of the PyTorch compilation.

.. code-block:: python
# Sample script to run LLM with the static key-value cache and pytorch compilation
# Sample script to run LLM with the static key-value cache and PyTorch compilation
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
from typing import Optional
@@ -180,7 +180,8 @@ of the PyTorch compilation.
return new_token
batch_size, seq_length = inputs["input_ids"].shape
# static key-value cache
# Static key-value cache
max_cache_length = 1024
max_new_tokens = 10
model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)
@@ -190,6 +191,7 @@ of the PyTorch compilation.
logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
# torch compilation
decode_one_tokens = torch.compile(decode_one_tokens, mode="max-autotune-no-cudagraphs",fullgraph=True)
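
For readers who prefer the higher-level ``generate`` API over the manual decode loop above, the following condensed sketch captures the same idea. It assumes a recent Transformers release that accepts ``cache_implementation="static"`` in the generation config, and the model name is a placeholder.

.. code-block:: python

# Condensed sketch: static KV cache plus torch.compile through the generate API (placeholder model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# A static cache keeps tensor shapes fixed, so the compiled graph can be reused at every decode step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="max-autotune-no-cudagraphs", fullgraph=True)

inputs = tokenizer("My favourite condiment is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))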
@@ -221,10 +223,10 @@ page describes the options.

.. code-block:: python
# To turn on TunableOps, simply set this environmental variable
# To turn on TunableOp, simply set this environment variable
export PYTORCH_TUNABLEOP_ENABLED=1
# python
# Python
import torch
import torch.nn as nn
import torch.nn.functional as F
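
The environment variable can also be set from inside the Python process, provided it happens before ``torch`` is imported and the first GEMM is dispatched. Below is a self-contained, illustrative sketch; the matrix sizes are arbitrary and only serve to trigger tuning for one GEMM shape.

.. code-block:: python

# Self-contained sketch: enable TunableOp from Python, then run a GEMM so it gets tuned.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # equivalent to exporting it in the shell

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# The first matmul with this shape triggers tuning; subsequent calls reuse the selected kernel.
c = torch.matmul(a, b)
torch.cuda.synchronize()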
18 changes: 10 additions & 8 deletions docs/how-to/fine-tuning-llms/model-quantization.rst
@@ -32,19 +32,19 @@ The AutoGPTQ library implements the GPTQ algorithm.

.. code-block:: shell
# This will install pre-built wheel for a specific ROCm version
# This will install pre-built wheel for a specific ROCm version.
pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
Or, install AutoGPTQ from source for the appropriate ROCm version (for example, ROCm 6.1).

.. code-block:: shell
# Clone the source code
# Clone the source code.
git clone https://github.com/AutoGPTQ/AutoGPTQ.git
cd AutoGPTQ
# Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device
# Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device.
PYTORCH_ROCM_ARCH=gfx942 ROCM_VERSION=6.1 pip install .
# Show the package after the installation
@@ -93,12 +93,14 @@ Using GPTQ with AutoGPTQ

.. code-block:: python
# import auto_gptq class
# Import auto_gptq class.
from auto_gptq import AutoGPTQForCausalLM
# load non-quantized model
# Load non-quantized model.
base_model = AutoGPTQForCausalLM.from_pretrained(base_model_name, quantize_config, device_map = "auto")
base_model.quantize(examples)
# save quantized model
# Save quantized model.
base_model.save_quantized(quantized_model_name)
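
To use the quantized checkpoint later, it can be reloaded with ``from_quantized``. The sketch below is only a rough illustration and reuses the ``base_model_name`` and ``quantized_model_name`` variables from the snippet above.

.. code-block:: python

# Rough sketch: reload the quantized checkpoint for inference (names carried over from above).
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

quantized_model = AutoGPTQForCausalLM.from_quantized(quantized_model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

inputs = tokenizer("GPTQ quantization works by", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))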
Using GPTQ with Hugging Face Transformers
@@ -201,7 +203,7 @@ Installing bitsandbytes
Using bitsandbytes primitives
-----------------------------

To get started with bitsandbytes primitives, use the following code a reference.
To get started with bitsandbytes primitives, use the following code as reference.

.. code-block:: python
@@ -230,7 +232,7 @@ To load a Transformers model in 4-bit, set ``load_int_4bt=true`` in ``BitsAndByt
device_map="auto",
quantization_config=quantization_config)
# check the memory footprint with get_memory_footprint method
# Check the memory footprint with get_memory_footprint method
print(bnb_model_4bit.get_memory_footprint())
To load a model in 8-bit for inference, use the ``load_in_8bit`` option.
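
A condensed sketch of that 8-bit path, assuming a placeholder model name and the standard ``BitsAndBytesConfig`` options from Transformers, might look like this:

.. code-block:: python

# Condensed sketch: 8-bit inference via BitsAndBytesConfig (placeholder model name).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any supported causal LM works

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
bnb_model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

# Check the memory footprint, as with the 4-bit model above.
print(bnb_model_8bit.get_memory_footprint())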
134 changes: 67 additions & 67 deletions docs/how-to/fine-tuning-llms/multi-gpu-fine-tuning-and-inference.rst
@@ -130,8 +130,8 @@ After loading the model in this way, the model is fully ready to use the resourc
torchtune for fine-tuning and inference
=============================================
torchtune is a PyTorch-native library for easy single and multi-accelerator or GPU model fine-tuning and inference with
LLMs.
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-accelerator or
GPU model fine-tuning and inference with LLMs.
#. Install torchtune using pip.
@@ -157,80 +157,80 @@ LLMs.
subcommands:
{download,ls,cp,run,validate}
torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction
getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes.
.. code-block:: shell
RECIPE CONFIG
full_finetune_single_device llama2/7B_full_low_memory
llama3/8B_full_single_device
mistral/7B_full_low_memory
full_finetune_distributed llama2/7B_full
llama2/13B_full
llama3/8B_full
mistral/7B_full
gemma/2B_full
lora_finetune_single_device llama2/7B_lora_single_device
llama2/7B_qlora_single_device
llama3/8B_lora_single_device
llama3/8B_qlora_single_device
llama2/13B_qlora_single_device
mistral/7B_lora_single_device
The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning
techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training,
evaluation, quantization, or inference recipes.
The snippet shows the architecture of a model's YAML configuration file:
.. code-block:: yaml
# Model Arguments
model:
_component_: torchtune.models.llama2.lora_llama2_7b
lora_attn_modules: ['q_proj', 'v_proj']
apply_lora_to_mlp: False
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/Llama-2-7b-hf/tokenizer.model
# Dataset and Sampler
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
train_on_input: True
#. torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction
getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes.
.. code-block:: shell
RECIPE CONFIG
full_finetune_single_device llama2/7B_full_low_memory
llama3/8B_full_single_device
mistral/7B_full_low_memory
full_finetune_distributed llama2/7B_full
llama2/13B_full
llama3/8B_full
mistral/7B_full
gemma/2B_full
lora_finetune_single_device llama2/7B_lora_single_device
llama2/7B_qlora_single_device
llama3/8B_lora_single_device
llama3/8B_qlora_single_device
llama2/13B_qlora_single_device
mistral/7B_lora_single_device
The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning
techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training,
evaluation, quantization, or inference recipes.
The snippet shows the architecture of a model's YAML configuration file:
.. code-block:: yaml
# Model arguments
model:
_component_: torchtune.models.llama2.lora_llama2_7b
lora_attn_modules: ['q_proj', 'v_proj']
apply_lora_to_mlp: False
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/Llama-2-7b-hf/tokenizer.model
# Dataset and sampler
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
train_on_input: True
This configuration file defines the fine-tuning base model path, data set, hyper-parameters for optimizer and scheduler,
and training data type. To download the base model for fine-tuning, run the following command:
#. This configuration file defines the fine-tuning base model path, data set, hyper-parameters for optimizer and scheduler,
and training data type. To download the base model for fine-tuning, run the following command:
.. code-block:: shell
.. code-block:: shell
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token
The output directory argument for ``--output-dir`` should map to the model path specified in the YAML config file.
The output directory argument for ``--output-dir`` should map to the model path specified in the YAML config file.
To launch ``lora_finetune_distributed`` on four devices, run the following
command:
#. To launch ``lora_finetune_distributed`` on four devices, run the following
command:
.. code-block:: shell
.. code-block:: shell
tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora
tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora
If successful, you should see something like the following output:
If successful, you should see something like the following output:
.. code-block:: shell
.. code-block:: shell
INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ...
INFO:torchtune.utils.logging:Model instantiation took 7.32 secs
INFO:torchtune.utils.logging:Memory Stats after model init:
{'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448}
INFO:torchtune.utils.logging:Optimizer and loss are initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
1|111|Loss: 1.5790324211120605: 7%|█ | 114/1618
INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ...
INFO:torchtune.utils.logging:Model instantiation took 7.32 secs
INFO:torchtune.utils.logging:Memory Stats after model init:
{'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448}
INFO:torchtune.utils.logging:Optimizer and loss are initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
1|111|Loss: 1.5790324211120605: 7%|█ | 114/1618
Read more about inference frameworks in :doc:`LLM inference frameworks <llm-inference-frameworks>`.
4 changes: 2 additions & 2 deletions docs/how-to/fine-tuning-llms/overview.rst
@@ -7,7 +7,7 @@ Conceptual overview of fine-tuning LLMs
***************************************

Large language models (LLMs) are trained on massive amounts of text data to generate coherent and fluent text. The
underlying *transformer* architecture is the fundamental building block of all LLMs. Transformers serve as the
underlying *transformer* architecture is the fundamental building block of all LLMs. Transformers
enable LLMs to understand and generate text by capturing contextual relationships and long-range dependencies. To better
understand the philosophy of the transformer architecture, review the foundational
`Attention is all you need <https://arxiv.org/pdf/1706.03762.pdf>`_ paper.
@@ -60,7 +60,7 @@ overcome this issue of high memory consumption.
LoRA accelerates the adjustment process and reduces related memory costs. To be precise, LoRA decomposes the portion of
weight changes :math:`ΔW` into high-precision low-rank representations, which do not require the calculations of all
:math:`ΔW`. It learns the decomposition representation of :math:`ΔW` during training, as shown in
:ref:`the weight update diagram <fine-tuning-llms-concept-challenge>`. This is how LoRA saves on
the :ref:`weight update diagram <fine-tuning-llms-concept-challenge>`. This is how LoRA saves on
computing resources.
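
As a compact restatement (using the LoRA paper's notation, which this guide does not spell out), the update to a frozen weight matrix :math:`W_0 \in \mathbb{R}^{d \times k}` is factored into two small trainable matrices, so the number of trained parameters drops from :math:`d \times k` to :math:`r(d + k)`:

.. math::

   h = W_0 x + \Delta W x = W_0 x + B A x,
   \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)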

LoRA is integrated into the `Hugging Face Parameter-Efficient Fine-Tuning (PEFT)