In this project, I fine-tune the `bert-base-uncased` model for text classification on the `hotels-reviews` dataset. The dataset is artificially generated and contains 100k reviews with five labels: Excellent, Very good, Average, Poor, Terrible. Spoiler alert: the model learns the task and reaches 100% accuracy with no more than 200 samples.
There are a few learning goals for this project:
- Provisioning/Infrastructure: Run the training pipeline in the cloud on a GPU instance in the most efficient way across multiple cloud providers (cost, performance, checkpointing, spot instances, etc.).
- Machine Learning: how fine-tuning improves the performance of the model.
- MLOps: compare ML experiments in Weights & Biases vs. DVC Studio - which tool works better, and the advantages and disadvantages of each.
Tools used in this project:
- HuggingFace Transformers for fine-tuning the model.
- DVC for defining machine learning pipelines and their dependencies.
- SkyPilot for provisioning infrastructure and running the training pipeline in the cloud.
- Weights & Biases for logging metrics and artifacts.
Tasks

- Preprocess the custom `hotels-reviews` dataset (see the preprocessing sketch below the task list).
  - Convert the dataset to the HuggingFace format.
  - Split the dataset into train and test sets.
  - Tokenize the dataset.
- Evaluate the `bert-base-uncased` model on the preprocessed dataset.
- Fine-tune the `bert-base-uncased` model.
- Set up infrastructure to run the training pipeline in the cloud on a GPU instance.
  - Register a new account on:
    - Lambda (AI cloud platform),
    - Cloudflare (R2 storage with zero egress charges),
    - AWS, Azure, GCP (major GPU cloud providers).
  - Request a quota increase for GPU instances.
  - Install SkyPilot, DVC, and Weights & Biases.
  - Authenticate with AWS, Azure, GCP, etc. SkyPilot will choose the cloud provider based on GPU availability and pricing.
  - Upload the data to S3 (tracked by DVC).
  - Create a SkyPilot configuration to run the training job in the cloud.
    - Configure resources (cloud provider, instance type, etc.).
    - Configure file mounts (artifacts, data, etc.).
    - Configure the training job (command, environment variables, etc.).
  - Create SSH keys to connect to GitHub (DVC needs them, since it works with Git).
- Implement checkpointing to save model weights and metrics to WandB. This allows training to resume from the last checkpoint, so the job can run for a long time on spot instances (`sky spot launch`) with automatic recovery from preemption.
- Implement early stopping to avoid wasting time training a model that is not improving.

Bonus tasks:
- Benchmark the performance and cost of different GPU instances on different cloud providers (`sky bench launch`).
  - Make a table with the results.
- Check the Sky Spot Instances Dashboard.
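The preprocessing task above (convert to the HuggingFace format, split, tokenize) looks roughly like the sketch below. The file path and the `text`/`label` column names are assumptions on my part, not the project's actual layout.

```python
# Sketch of the preprocessing stage. Paths and column names ("text", "label")
# are assumptions; adapt them to the actual hotels-reviews files.
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

df = pd.read_csv("data/hotels_reviews.csv")        # hypothetical raw file
ds = Dataset.from_pandas(df)

# Encode the string labels (Excellent, Very good, ...) as a ClassLabel feature
# so the split can be stratified by label.
ds = ds.class_encode_column("label")

# Split into train and test sets.
splits = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)

# Tokenize both splits with the bert-base-uncased tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

splits = splits.map(tokenize, batched=True)
splits.save_to_disk("data/processed")               # hypothetical output path
```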
Install SkyPilot, DVC, and Weights & Biases:
pip install -r requirements.txt
Next, configure AWS, Azure, GCP, etc. credentials. SkyPilot will choose the cloud provider based on GPU availability and pricing.
Example of AWS configuration:
pip install boto3
aws configure
Confirm the setup with the following command:
sky check
Define the resources, file mounts, setup, and command for the training job in the SkyPilot configuration file `sky-vscode.yaml`.
File mounts are used to mount the data, SSH keys, and gitconfig to the cloud instance. The last two are needed for DVC to work with Git.
file_mounts:
/data: ~/azml_finetune_llm/data
~/.ssh/id_ed25519: ~/.ssh/id_ed25519
~/.ssh/id_ed25519.pub: ~/.ssh/id_ed25519.pub
~/.gitconfig: ~/.gitconfig
The setup step runs only once, when the instance is created. It is used to install dependencies.
Finally, set the commands to run the training job. SkyPilot creates a new working directory, `sky_workdir`, so we need to change into it to reach the project root. Then we can run the ML pipeline with one command thanks to DVC.
run: |
cd ~/sky_workdir
source activate pytorch
dvc exp run
Note
Usually the remote URL for origin uses HTTPS. If you want to use SSH keys for authentication, change this URL to the SSH format:
git remote set-url origin [email protected]:avoytkiv/azml_finetune_llm.git
Also, check the permissions of the SSH key and change them if needed. The following error may occur if the permissions are not correct:
Warning
The remote server unexpectedly closed the connection.
Bad owner or permissions on /home/ubuntu/.ssh/config
This can be fixed by changing the permissions of the config file:
chmod 600 ~/.ssh/config
More details can be found here.
To launch the job on spot instances, run:
sky launch sky-vscode.yaml -c mycluster -i 30 -d --use-spot
This SkyPilot command uses spot instances to save costs and automatically terminates the instance after 30 minutes of idleness. Once the experiment is complete, its artifacts such as model weights and metrics are logged to Weights & Biases.
Add `--env DVC_STUDIO_TOKEN` to the `sky launch`/`sky exec` command to see the experiment running live in DVC Studio.
Add `--env WANDB_API_KEY` to the `sky launch`/`sky exec` command to see the experiment running live in Weights & Biases.
In both cases, first make the variable available in your current shell (e.g., by exporting it) so that SkyPilot can forward it to the job.
While the model is training, you can monitor the logs by running the following command.
sky logs mycluster
HuggingFace Transformers supports checkpointing and has an integration with Weights & Biases. To enable checkpointing, we need to:
- Set the environment variable `WANDB_LOG_MODEL=checkpoint`.
- Set `--run_name` to `$SKYPILOT_TASK_ID` so that the logs for all recoveries of the same job are saved to the same run in Weights & Biases.
Any Transformers Trainer you initialize from now on will upload models to your W&B project. Model checkpoints will be logged and include the full model lineage.
Any time the instance is preempted (interrupted), SkyPilot will automatically resume the training job from the last checkpoint.
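As a rough illustration, here is how the training setup might look with these pieces wired together: W&B checkpointing, the run name tied to the SkyPilot task ID, and early stopping. The hyperparameters, `output_dir`, and dataset paths are placeholders, not the project's actual values.

```python
# Sketch of a Trainer configured for W&B checkpointing and preemption recovery.
# Hyperparameters, paths, and the project layout are assumptions.
import os
import numpy as np
from datasets import load_from_disk
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments, EarlyStoppingCallback)
from transformers.trainer_utils import get_last_checkpoint

# Normally passed via --env in the SkyPilot config; set here for illustration.
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

splits = load_from_disk("data/processed")   # output of the preprocessing sketch
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="artifacts/model",                           # placeholder path
    run_name=os.environ.get("SKYPILOT_TASK_ID", "local"),   # one W&B run per job
    report_to="wandb",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    num_train_epochs=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# Resume from the newest checkpoint in output_dir if one survived preemption.
last_ckpt = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
trainer.train(resume_from_checkpoint=last_ckpt)
```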
Note
There’s one edge case to handle: during a checkpoint write, the instance may get preempted suddenly and only partial state is written to the cloud bucket. When this happens, resuming from a corrupted partial checkpoint will crash the program. The `cleanup_incomplete_checkpoints` function deletes any such incomplete checkpoints.
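A minimal sketch of what such a cleanup might look like, assuming checkpoints live in `checkpoint-*` directories under the output directory and that a completed checkpoint always contains `trainer_state.json` (both assumptions on my part, not necessarily how the project's function is implemented):

```python
# Hypothetical cleanup of partial checkpoints before resuming training.
# Assumes a checkpoint is complete only if trainer_state.json exists in it.
import shutil
from pathlib import Path

def cleanup_incomplete_checkpoints(output_dir: str) -> None:
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        if ckpt.is_dir() and not (ckpt / "trainer_state.json").exists():
            print(f"Removing incomplete checkpoint: {ckpt}")
            shutil.rmtree(ckpt)

# Call this before trainer.train(resume_from_checkpoint=...)
```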
- Evaluate the `bert-base-uncased` model on the `hotels-reviews-small` dataset for baseline performance (it's 20% accuracy, i.e. chance level for five classes).
- Fine-tune the `bert-base-uncased` model for text classification on the `hotels-reviews` dataset.
- Evaluate the fine-tuned model on the `hotels-reviews-small` dataset (see the evaluation sketch after this list).
- Use WandB to track metrics, the model, and parameters across the train and evaluate stages.
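For reference, a bare-bones version of the evaluation stage might look like the following. The dataset path, the metric choice, and the model checkpoint are assumptions rather than the project's exact code.

```python
# Sketch of the evaluation stage: load the processed test split and report accuracy.
# The dataset path comes from the preprocessing sketch above and is a placeholder.
import numpy as np
import evaluate
from datasets import load_from_disk
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

splits = load_from_disk("data/processed")             # placeholder path
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)                # or a fine-tuned checkpoint

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="artifacts/eval", report_to="wandb"),
    compute_metrics=compute_metrics,
)
print(trainer.evaluate(eval_dataset=splits["test"]))
```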
Now that the ML pipeline is defined and the cloud infrastructure is optimized for cost, we can run and then compare our experiments. Not only the `train` and `evaluate` stages, but also system metrics such as GPU utilization and memory usage are logged to Weights & Biases.
The model is able to learn the task and achieve 100% accuracy with no more than 200 samples.
Use the Weights & Biases Model Registry to register models and prepare them for staging or deployment in a production environment.
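One possible way to do this programmatically (the W&B UI works just as well) is to link a logged model artifact into the registry. The entity, artifact name, and registered model name below are hypothetical placeholders.

```python
# Sketch: link a previously logged model artifact into the W&B Model Registry.
# "my-team", "model-best", and the registered model name are placeholders.
import wandb

run = wandb.init(project="azml_finetune_llm", job_type="model-registration")

# Fetch the model artifact logged by the training run (name is an assumption).
artifact = run.use_artifact("model-best:latest", type="model")

# Link it into the registry under an alias such as "staging".
run.link_artifact(artifact, "my-team/model-registry/hotel-reviews-classifier",
                  aliases=["staging"])
run.finish()
```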
Freeze only the packages that are required to run the project.
pip freeze -q -r requirements.txt | sed '/freeze/,$ d' > requirements-froze.txt
mv requirements-froze.txt requirements.txt
- SkyPilot Documentation
- SkyPilot - Configure access to cloud providers
- SkyPilot - Source code for sky.Task - debugging
- SkyPilot - SkyCallback
- SkyPilot - LLM
- SkyPilot - Request quota increase
- Azure - GPU optimized virtual machine sizes
- DVC Documentation
- ML experiments in the cloud with SkyPilot and DVC
- Fine-Tuning Large Language Models with a Production-Grade Pipeline
- Create SSH key
- WANDB - Logging with Weights and Biases in Transformer