In this project, I fine-tune the `bert-base-uncased` model for text classification on the `hotels-reviews` dataset. The dataset is artificially generated and contains 100k reviews with five labels: Excellent, Very good, Average, Poor, Terrible. Spoiler alert: the model learns the task and reaches 100% accuracy with no more than 200 samples.
There are a few learning goals for this project:
- Provisioning/Infrastructure: Run the training pipeline in the cloud on a GPU instance in the most efficient way across multiple cloud providers (cost, performance, checkpointing, spot instances, etc.).
- Machine Learning: how fine-tuning improves the performance of the model.
- MLOps: compare ML experiments in Weights & Biases vs. DVC Studio - which tool works better, and the advantages and disadvantages of each.
Tools used in this project:
- HuggingFace Transformers for fine-tuning the model.
- DVC for defining machine learning pipelines and their dependencies.
- SkyPilot for provisioning infrastructure and running the training pipeline in the cloud.
- Weights & Biases for logging metrics and artifacts.
Tasks

- Preprocess the custom `hotels-reviews` dataset (see the preprocessing sketch below the task list).
  - Convert the dataset to the HuggingFace format.
  - Split the dataset into train and test sets.
  - Tokenize the dataset.
- Evaluate the `bert-base-uncased` model on the preprocessed dataset.
- Fine-tune the `bert-base-uncased` model.
- Set up infrastructure to run the training pipeline in the cloud on a GPU instance.
  - Register a new account on:
    - Lambda (AI cloud platform),
    - Cloudflare (R2 storage with zero egress charges),
    - AWS, Azure, GCP (major GPU cloud providers).
  - Request a quota increase for GPU instances.
  - Install SkyPilot, DVC, and Weights & Biases.
  - Authenticate with AWS, Azure, GCP, etc. SkyPilot will choose the cloud provider based on GPU availability and pricing.
  - Upload the data to S3 (tracked by DVC).
  - Create a SkyPilot configuration to run the training job in the cloud.
    - Configure resources (cloud provider, instance type, etc.).
    - Configure file mounts (artifacts, data, etc.).
    - Configure the training job (command, environment variables, etc.).
  - Create SSH keys to connect to GitHub (DVC needs them, since it works with Git).
- Implement checkpointing to save model weights and metrics to WandB. This allows training to resume from the last checkpoint, so the job can run for a long time on spot instances (`sky spot launch`) with automatic recovery from preemption.
- Implement early stopping to avoid wasting time training a model that is not improving.

Bonus tasks:
- Benchmark the performance and cost of different GPU instances on different cloud providers (`sky bench launch`).
  - Make a table with the results.
- Check the Sky Spot Instances Dashboard.
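The preprocessing task above (convert to the HuggingFace format, split, tokenize) looks roughly like the sketch below. The file path and the `text`/`label` column names are assumptions on my part, not the project's actual layout.

```python
# Sketch of the preprocessing stage. Paths and column names ("text", "label")
# are assumptions; adapt them to the actual hotels-reviews files.
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

df = pd.read_csv("data/hotels_reviews.csv")        # hypothetical raw file
ds = Dataset.from_pandas(df)

# Encode the string labels (Excellent, Very good, ...) as a ClassLabel feature
# so the split can be stratified by label.
ds = ds.class_encode_column("label")

# Split into train and test sets.
splits = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)

# Tokenize both splits with the bert-base-uncased tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

splits = splits.map(tokenize, batched=True)
splits.save_to_disk("data/processed")               # hypothetical output path
```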
Install SkyPilot, DVC, and Weights & Biases:
pip install -r requirements.txt
Next, configure AWS, Azure, GCP, etc. credentials. SkyPilot will choose the cloud provider based on GPU availability and pricing.
Example of AWS configuration:
pip install boto3
aws configure
Confirm the setup with the following command:
sky check
Define the resources, file mounts, setup, and command for the training job in the SkyPilot configuration file `sky-vscode.yaml`.
File mounts are used to mount the data, SSH keys, and gitconfig to the cloud instance. The last two are needed for DVC to work with Git.
file_mounts:
/data: ~/azml_finetune_llm/data
~/.ssh/id_ed25519: ~/.ssh/id_ed25519
~/.ssh/id_ed25519.pub: ~/.ssh/id_ed25519.pub
~/.gitconfig: ~/.gitconfig
The setup step runs only once, when the instance is created. It is used to install dependencies.
Finally, set the commands to run the training job. SkyPilot creates a new working directory, `sky_workdir`, so we need to change into it to reach the project root. Then we can run the ML pipeline with one command thanks to DVC.
run: |
cd ~/sky_workdir
source activate pytorch
dvc exp run
Note
Usually the remote URL for origin uses HTTPS. If you want to use SSH keys for authentication, change this URL to the SSH format:
git remote set-url origin [email protected]:avoytkiv/azml_finetune_llm.git
Also, check the permissions of the SSH key and change them if needed. The following error may occur if the permissions are not correct:
Warning
The remote server unexpectedly closed the connection.
Bad owner or permissions on /home/ubuntu/.ssh/config
This can be fixed by changing the permissions of the config file:
chmod 600 ~/.ssh/config
More details can be found here.
To launch the job on spot instances, run:
sky launch sky-vscode.yaml -c mycluster -i 30 -d --use-spot
This SkyPilot command uses spot instances to save costs and automatically terminates the instance after 30 minutes of idleness. Once the experiment is complete, its artifacts such as model weights and metrics are logged to Weights & Biases.
Add `--env DVC_STUDIO_TOKEN` to the `sky launch`/`sky exec` command to see the experiment running live in DVC Studio.
Add `--env WANDB_API_KEY` to the `sky launch`/`sky exec` command to see the experiment running live in Weights & Biases.
In both cases, first make the variable available in your current shell (e.g., by exporting it) so that SkyPilot can forward it to the job.
While the model is training, you can monitor the logs by running the following command.
sky logs mycluster
HuggingFace Transformers supports checkpointing and has an integration with Weights & Biases. To enable checkpointing, we need to:
- Set the environment variable `WANDB_LOG_MODEL=checkpoint`.
- Set `--run_name` to `$SKYPILOT_TASK_ID` so that the logs for all recoveries of the same job are saved to the same run in Weights & Biases.
Any Transformers Trainer you initialize from now on will upload models to your W&B project. Model checkpoints will be logged and include the full model lineage.
Any time the instance is preempted (interrupted), SkyPilot will automatically resume the training job from the last checkpoint.
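As a rough illustration, here is how the training setup might look with these pieces wired together: W&B checkpointing, the run name tied to the SkyPilot task ID, and early stopping. The hyperparameters, `output_dir`, and dataset paths are placeholders, not the project's actual values.

```python
# Sketch of a Trainer configured for W&B checkpointing and preemption recovery.
# Hyperparameters, paths, and the project layout are assumptions.
import os
import numpy as np
from datasets import load_from_disk
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments, EarlyStoppingCallback)
from transformers.trainer_utils import get_last_checkpoint

# Normally passed via --env in the SkyPilot config; set here for illustration.
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

splits = load_from_disk("data/processed")   # output of the preprocessing sketch
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="artifacts/model",                           # placeholder path
    run_name=os.environ.get("SKYPILOT_TASK_ID", "local"),   # one W&B run per job
    report_to="wandb",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    num_train_epochs=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# Resume from the newest checkpoint in output_dir if one survived preemption.
last_ckpt = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
trainer.train(resume_from_checkpoint=last_ckpt)
```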
Note
There’s one edge case to handle: during a checkpoint write, the instance may get preempted suddenly and only partial state is written to the cloud bucket. When this happens, resuming from a corrupted partial checkpoint will crash the program. The `cleanup_incomplete_checkpoints` function deletes any such incomplete checkpoints.
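A minimal sketch of what such a cleanup might look like, assuming checkpoints live in `checkpoint-*` directories under the output directory and that a completed checkpoint always contains `trainer_state.json` (both assumptions on my part, not necessarily how the project's function is implemented):

```python
# Hypothetical cleanup of partial checkpoints before resuming training.
# Assumes a checkpoint is complete only if trainer_state.json exists in it.
import shutil
from pathlib import Path

def cleanup_incomplete_checkpoints(output_dir: str) -> None:
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        if ckpt.is_dir() and not (ckpt / "trainer_state.json").exists():
            print(f"Removing incomplete checkpoint: {ckpt}")
            shutil.rmtree(ckpt)

# Call this before trainer.train(resume_from_checkpoint=...)
```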
- Evaluate the `bert-base-uncased` model on the `hotels-reviews-small` dataset for baseline performance (it's 20% accuracy, i.e. chance level for five classes).
- Fine-tune the `bert-base-uncased` model for text classification on the `hotels-reviews` dataset.
- Evaluate the fine-tuned model on the `hotels-reviews-small` dataset (see the evaluation sketch after this list).
- Use WandB to track metrics, the model, and parameters across the train and evaluate stages.
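For reference, a bare-bones version of the evaluation stage might look like the following. The dataset path, the metric choice, and the model checkpoint are assumptions rather than the project's exact code.

```python
# Sketch of the evaluation stage: load the processed test split and report accuracy.
# The dataset path comes from the preprocessing sketch above and is a placeholder.
import numpy as np
import evaluate
from datasets import load_from_disk
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

splits = load_from_disk("data/processed")             # placeholder path
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)                # or a fine-tuned checkpoint

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="artifacts/eval", report_to="wandb"),
    compute_metrics=compute_metrics,
)
print(trainer.evaluate(eval_dataset=splits["test"]))
```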
Now that the ML pipeline is defined and the cloud infrastructure is optimized for cost, we can run and then compare our experiments. Not only the `train` and `evaluate` stages, but also system metrics such as GPU utilization and memory usage are logged to Weights & Biases.
The model is able to learn the task and achieve 100% accuracy with no more than 200 samples.
Use the Weights & Biases Model Registry to register models and prepare them for staging or deployment in a production environment.
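One possible way to do this programmatically (the W&B UI works just as well) is to link a logged model artifact into the registry. The entity, artifact name, and registered model name below are hypothetical placeholders.

```python
# Sketch: link a previously logged model artifact into the W&B Model Registry.
# "my-team", "model-best", and the registered model name are placeholders.
import wandb

run = wandb.init(project="azml_finetune_llm", job_type="model-registration")

# Fetch the model artifact logged by the training run (name is an assumption).
artifact = run.use_artifact("model-best:latest", type="model")

# Link it into the registry under an alias such as "staging".
run.link_artifact(artifact, "my-team/model-registry/hotel-reviews-classifier",
                  aliases=["staging"])
run.finish()
```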
Freeze only the packages that are required to run the project.
pip freeze -q -r requirements.txt | sed '/freeze/,$ d' > requirements-froze.txt
mv requirements-froze.txt requirements.txt
- SkyPilot Documentation
- SkyPilot - Configure access to cloud providers
- SkyPilot - Source code for sky.Task - debugging
- SkyPilot - SkyCallback
- SkyPilot - LLM
- SkyPilot - Request quota increase
- Azure - GPU optimized virtual machine sizes
- DVC Documentation
- ML experiments in the cloud with SkyPilot and DVC
- Fine-Tuning Large Language Models with a Production-Grade Pipeline
- Create SSH key
- WANDB - Logging with Weights and Biases in Transformer