Add docker image for BERT e2e training task #454
Had a discussion with @cartermckinnon. I think we could reuse the

I mean that's fine from a base image perspective, since many of the dependencies will be shared among test types (i.e. unit/training/inference), but training and inference will both require unique dependencies on top of what's included in

Can we add those unique dependencies to the

@cartermckinnon I agree though that further thought needs to be put into the test directory structure before we go too much further. I'm not sure how many more tests we're looking to add, but the current approach doesn't scale particularly well.

@weicongw I mean sure we can, but then you're adding another ~7GB of deps to that image, which are totally unnecessary for the unit tests. Also, if we ever added another test (e.g. ResNet), it could very well have its own unique dependencies as well. This will especially be true if we ever want to validate frameworks other than the ones currently in use.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Use the NVIDIA CUDA devel image as the parent image | ||
FROM nvidia/cuda:12.5.0-devel-ubuntu22.04 | ||
|
||
# Set environment variable to disable interactive prompts | ||
ENV DEBIAN_FRONTEND=noninteractive | ||
|
||
# Set default values for MASTER_ADDR and MASTER_PORT | ||
ENV MASTER_ADDR=127.0.0.1 | ||
ENV MASTER_PORT=12355 | ||
|
||
# Install Python 3.11 | ||
RUN apt-get update && apt-get install -y \ | ||
    software-properties-common && \ | ||
    add-apt-repository ppa:deadsnakes/ppa && \ | ||
    apt-get update && \ | ||
    apt-get install -y \ | ||
    python3.11 \ | ||
    python3.11-dev \ | ||
    python3.11-distutils \ | ||
    python3-pip && \ | ||
    rm -rf /var/lib/apt/lists/* | ||
|
||
# Create a symbolic link to use python3.11 as python | ||
RUN ln -sf /usr/bin/python3.11 /usr/bin/python | ||
|
||
# Set the working directory in the container | ||
WORKDIR /app | ||
|
||
# Copy only the necessary files into the container at /app | ||
COPY train.py /app/ | ||
COPY requirements.txt /app/ | ||
|
||
# Install any needed packages specified in requirements.txt | ||
RUN python -m pip install --upgrade pip && \ | ||
pip install --no-cache-dir -r requirements.txt | ||
|
||
ARG EFA_INSTALLER_VERSION=latest | ||
# 1.7.4+ is required, to enforce proper EFA function with OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=0 | ||
ARG AWS_OFI_NCCL_VERSION=1.9.1 | ||
ARG NCCL_TESTS_VERSION=master | ||
|
||
# Install necessary dependencies and remove old ones | ||
RUN apt-get update -y && \ | ||
    apt-get remove -y --allow-change-held-packages \ | ||
    libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1 libnccl2 libnccl-dev && \ | ||
    rm -rf /opt/hpcx /usr/local/mpi /usr/local/ucx /etc/ld.so.conf.d/hpcx.conf && \ | ||
    ldconfig && \ | ||
    DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \ | ||
    sudo git gcc vim kmod openssh-client openssh-server build-essential \ | ||
    wget curl autoconf libtool gdb automake python3-distutils cmake \ | ||
    apt-utils devscripts debhelper libsubunit-dev check pkg-config libhwloc-dev | ||
|
||
# SSH configuration | ||
RUN mkdir -p /var/run/sshd && \ | ||
    sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \ | ||
    echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \ | ||
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config | ||
|
||
# Set environment variables for OpenMPI and CUDA | ||
ENV LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib64:/opt/aws-ofi-nccl/install/lib:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib/:/usr/lib64:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH | ||
ENV PATH=/usr/local/cuda/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/usr/sbin:/usr/bin:/usr/local/bin:$PATH | ||
|
||
# Install EFA | ||
RUN cd $HOME \ | ||
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \ | ||
    && tar -xf $HOME/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \ | ||
    && cd aws-efa-installer \ | ||
    && ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \ | ||
    && rm -rf $HOME/aws-efa-installer | ||
|
||
# Install NCCL | ||
RUN apt-key del 7fa2af80 && \ | ||
    curl -L -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb && \ | ||
    dpkg -i cuda-keyring_1.0-1_all.deb && \ | ||
    apt-get install -y libnccl2=2.18.5-1+cuda12.2 libnccl-dev=2.18.5-1+cuda12.2 | ||
|
||
# Install AWS-OFI-NCCL plugin | ||
RUN export OPAL_PREFIX="" && \ | ||
    git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \ | ||
    cd /opt/aws-ofi-nccl && \ | ||
    git checkout v${AWS_OFI_NCCL_VERSION}-aws && \ | ||
    ./autogen.sh && \ | ||
    ./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi/ && \ | ||
    make && make install | ||
|
||
# Set default values for MASTER_ADDR and MASTER_PORT for local testing | ||
ENV MASTER_ADDR=127.0.0.1 | ||
ENV MASTER_PORT=12355 | ||
|
||
# Set environment variables for NCCL and clean up | ||
ENV NCCL_PROTO=simple | ||
RUN rm -rf /var/lib/apt/lists/* | ||
# Ensure NCCL library is found first | ||
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
torch==2.3 | ||
transformers==4.29 | ||
numpy==1.23 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,135 @@ | ||
import os | ||
import json | ||
import time | ||
import torch | ||
import torch.distributed as dist | ||
from torch.nn.parallel import DistributedDataParallel as DDP | ||
from transformers import BertForPreTraining, BertTokenizer | ||
from torch.utils.data import DataLoader, TensorDataset | ||
import numpy as np | ||
|
||
|
||
def create_dummy_data(tokenizer, num_samples=100, max_length=128): | ||
    # Create dummy input data | ||
    sentences = [ | ||
        "This is a dummy sentence number {}".format(i) for i in range(num_samples) | ||
    ] | ||
    tokenized_inputs = tokenizer( | ||
        sentences, | ||
        max_length=max_length, | ||
        padding="max_length", | ||
        truncation=True, | ||
        return_tensors="pt", | ||
    ) | ||
    labels = tokenized_inputs.input_ids.detach().clone() | ||
|
||
    # MLM task: randomly mask some tokens | ||
    mlm_probability = 0.15 | ||
    input_ids, labels = mask_tokens( | ||
        tokenized_inputs.input_ids, tokenizer, mlm_probability | ||
    ) | ||
|
||
    # NSP task: create dummy pairs | ||
    next_sentence_labels = torch.randint(0, 2, (num_samples,)) | ||
|
||
    return TensorDataset( | ||
        input_ids, tokenized_inputs.attention_mask, labels, next_sentence_labels | ||
    ) | ||
|
||
|
||
def mask_tokens(inputs, tokenizer, mlm_probability): | ||
    labels = inputs.clone() | ||
    probability_matrix = torch.full(labels.shape, mlm_probability) | ||
    special_tokens_mask = [ | ||
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) | ||
        for val in labels.tolist() | ||
    ] | ||
    probability_matrix.masked_fill_( | ||
        torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0 | ||
    ) | ||
    masked_indices = torch.bernoulli(probability_matrix).bool() | ||
    labels[~masked_indices] = -100  # We only compute loss on masked tokens | ||
|
||
    inputs[masked_indices] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) | ||
|
||
    return inputs, labels | ||
|
||
|
||
def setup(rank, world_size): | ||
    master_addr = os.environ["MASTER_ADDR"] | ||
    master_port = os.environ["MASTER_PORT"] | ||
    dist.init_process_group( | ||
        "nccl", | ||
        init_method=f"tcp://{master_addr}:{master_port}", | ||
        rank=rank, | ||
        world_size=world_size, | ||
    ) | ||
    torch.cuda.set_device(rank) | ||
    print(f"Process {rank} initialized, using GPU {rank}") | ||
|
||
|
||
def cleanup(): | ||
    dist.destroy_process_group() | ||
|
||
|
||
def train_bert(rank, world_size, model, tokenizer): | ||
    setup(rank, world_size) | ||
|
||
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") | ||
    model = BertForPreTraining.from_pretrained("bert-base-uncased").to(rank) | ||
    ddp_model = DDP(model, device_ids=[rank]) | ||
|
||
    dataset = create_dummy_data(tokenizer) | ||
    train_sampler = torch.utils.data.distributed.DistributedSampler( | ||
        dataset, num_replicas=world_size, rank=rank | ||
    ) | ||
    train_dataloader = DataLoader(dataset, sampler=train_sampler, batch_size=8) | ||
|
||
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=0.001) | ||
    criterion = torch.nn.CrossEntropyLoss() | ||
|
||
    start_time = time.time() | ||
|
||
    for epoch in range(1):  # Short run for testing | ||
Should we let the program read the epoch count from an environment variable or argument? This way we can allow larger instances (e.g. p5) to run more epochs without changing the code.

What would be the purpose of having more epochs for larger instance sizes? Are you thinking about it purely from the perspective of wanting the tests to last the same amount of time for each instance type?

Just some random thoughts. I was thinking we could run more epochs for larger instances to get more accurate performance data. Additionally, we could reuse this code for our future long-running tests (like soak tests).

Gotcha... Yeah, I certainly appreciate the idea behind reusability, but there's a good chance this current test isn't the best option for a soak test anyway. As far as more epochs for larger instance types, it depends on what your end goal is. For the tests we're running, and the metrics we're looking to gather, I don't see any benefit in doing this at this time. |
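
A minimal sketch of the suggestion in this thread, in case it is picked up later. The `--epochs` flag and the `NUM_EPOCHS` environment variable are hypothetical names chosen for illustration; neither exists in this PR. The flag takes precedence, then the environment variable, then the current hard-coded value of 1.

```python
import argparse
import os


def resolve_num_epochs(argv=None, environ=None, default=1):
    """Pick the epoch count from --epochs, then NUM_EPOCHS, then the default.

    Both names are hypothetical; they are not part of the PR.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="BERT e2e training test")
    parser.add_argument(
        "--epochs",
        type=int,
        default=None,
        help="number of training epochs (overrides NUM_EPOCHS)",
    )
    # parse_known_args ignores any extra arguments the launcher may append
    args, _unknown = parser.parse_known_args(argv)
    if args.epochs is not None:
        epochs = args.epochs
    else:
        epochs = int(environ.get("NUM_EPOCHS", default))
    if epochs < 1:
        raise ValueError(f"epoch count must be >= 1, got {epochs}")
    return epochs
```

With this helper, the loop header would become `for epoch in range(resolve_num_epochs()):`, which keeps the short single-epoch run as the default.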
||
        ddp_model.train() | ||
        for batch in train_dataloader: | ||
            optimizer.zero_grad() | ||
            inputs, masks, labels, next_sentence_labels = batch | ||
            inputs, masks, labels, next_sentence_labels = ( | ||
                inputs.to(rank), | ||
                masks.to(rank), | ||
                labels.to(rank), | ||
                next_sentence_labels.to(rank), | ||
            ) | ||
            outputs = ddp_model( | ||
                input_ids=inputs, | ||
                attention_mask=masks, | ||
                labels=labels, | ||
                next_sentence_label=next_sentence_labels, | ||
            ) | ||
            loss = outputs.loss | ||
            loss.backward() | ||
            optimizer.step() | ||
|
||
    end_time = time.time() | ||
    training_time = end_time - start_time | ||
    throughput = len(dataset) / training_time | ||
|
||
    print(f"Process {rank} - Training time: {training_time:.2f} seconds") | ||
    print(f"Process {rank} - Throughput: {throughput:.2f} samples/second") | ||
Comment on lines +113 to +114

Do we need to dump this output to disk so we can upload it to S3?

Potentially... I was also considering writing directly to S3, but was curious to hear others' perspectives. My intuition says writing to S3 is the long-term solution (once a stable schema is solidified), but short term just doing something like writing to disk or stdout might be the way to go.

I agree it can go to S3, CloudWatch, etc. once we know where this is going. We should have this output also printed for sure though.

It should be fine to dump to disk for the short term, and you have enough flexibility to run a POC with different long-term destinations.

@Issacwww Are there any concerns/considerations with writing from the container to the host machine?

Oh, good callout. This is on the tod worker, so dumping to disk is no different from stdout... so stdout should be fine for now. |
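
As a sketch of the stdout option this thread settles on, the two `print` calls could emit one machine-readable JSON line per rank, which would be easy to forward to S3 or CloudWatch later. The field names below are hypothetical, since no schema has been agreed on. Note one deliberate difference from the PR: the PR divides `len(dataset)` by the training time, while this sketch assumes per-rank throughput should count only the samples that rank processed, which with `DistributedSampler` defaults is `ceil(len(dataset) / world_size)`.

```python
import json
import sys


def format_metrics(rank, world_size, num_samples, training_time):
    """Build one JSON line of per-rank metrics (field names are hypothetical)."""
    # With its defaults, DistributedSampler pads the dataset so that every
    # rank iterates over ceil(num_samples / world_size) samples.
    samples_this_rank = -(-num_samples // world_size)
    record = {
        "rank": rank,
        "world_size": world_size,
        "samples": samples_this_rank,
        "training_time_seconds": round(training_time, 2),
        "throughput_samples_per_second": round(
            samples_this_rank / training_time, 2
        ),
    }
    return json.dumps(record, sort_keys=True)


def emit_metrics(rank, world_size, num_samples, training_time, stream=None):
    """Print the metrics line and flush so log collectors see it immediately."""
    stream = sys.stdout if stream is None else stream
    line = format_metrics(rank, world_size, num_samples, training_time)
    print(line, file=stream, flush=True)
```

In `train_bert`, a call such as `emit_metrics(rank, world_size, len(dataset), training_time)` would sit where the two `print` calls are now.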
||
|
||
    cleanup() | ||
|
||
|
||
def main(): | ||
    # Pre-download model and tokenizer | ||
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") | ||
    model = BertForPreTraining.from_pretrained("bert-base-uncased") | ||
|
||
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"]) | ||
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"]) | ||
    train_bert(rank, world_size, model, tokenizer) | ||
|
||
|
||
if __name__ == "__main__": | ||
    main() |
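
One thing worth noting about `setup` and `train_bert` above: they use the global rank as the CUDA device index, which only lines up when all processes run on a single node. On a multi-node run the global rank can exceed the number of GPUs per node. A minimal sketch of one way to handle this, assuming the job is launched with Open MPI's `mpirun`, which exports `OMPI_COMM_WORLD_LOCAL_RANK` alongside the two variables the script already reads. The helper name `resolve_ranks` is hypothetical.

```python
import os


def resolve_ranks(environ=None):
    """Return (global_rank, world_size, local_rank) from Open MPI's environment."""
    environ = os.environ if environ is None else environ
    rank = int(environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(environ["OMPI_COMM_WORLD_SIZE"])
    # Fall back to the global rank when the local rank is not exported,
    # which matches the PR's current single-node behavior.
    local_rank = int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", rank))
    return rank, world_size, local_rank
```

The global rank would still be passed to `dist.init_process_group`, while `local_rank` would replace `rank` in `torch.cuda.set_device(...)`, the `.to(...)` calls, and `device_ids=[...]`.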
An idea for the future (since there will be more to come): maybe we standardize on an
images/XXX/{Dockerfile, ...}
structure for all our images and then create a job matrix for the test image builds.
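
A rough sketch of how such a job matrix could be generated, assuming every image lives in its own directory directly under an `images/` root and that CI accepts a JSON object with an `include` list (the shape GitHub Actions matrices use). The function name and the entry keys are hypothetical.

```python
import json
from pathlib import Path


def discover_image_matrix(images_root):
    """Return one matrix entry per <images_root>/<name>/Dockerfile."""
    root = Path(images_root)
    entries = []
    # Sort for a stable job order; directories without a Dockerfile are skipped
    for dockerfile in sorted(root.glob("*/Dockerfile")):
        context = dockerfile.parent
        entries.append(
            {
                "image": context.name,
                "context": context.as_posix(),
                "dockerfile": dockerfile.as_posix(),
            }
        )
    return {"include": entries}


if __name__ == "__main__":
    # Print the matrix as JSON so a CI step can capture it
    print(json.dumps(discover_image_matrix("images")))
```

Adding a new test image would then only require a new directory; the build job list would pick it up without workflow changes.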