Training is interrupted without error with MultiGPU #5604
Comments
This happens to me too, but I don't get "Terminated" at the end of the progress bar; it just stops. When I check the system with `top`, I see 4 Python processes running at 100% and 4 GPUs at about 90% capacity, but nothing is changing. Sometimes, after an hour or two, it just randomly starts moving again, but the s/it jumps from 2.5 to around 85.
@angadkalra +1. For me, switching to DP solves the problem, but at the cost of speed. Update:
Hi everybody! If any of you can provide a snippet, that would be awesome. Otherwise we are flying blind trying to fix it.
Code snippet? Is the code in the original post good enough, or do you need a Colab?
Need something I can run. Preferably a Colab 👍
@carmocca Sorry for the late reproducible example! Please see below for a self-contained example that uses two GPUs with DDP. For me, the code gets stuck at epoch 13 while the two GPUs stay busy at 100%. Switching to DP solves the problem. Some hints:
My settings:
import torch
import pytorch_lightning as pl
import torch.nn.functional as F
import torch.optim as optim
from torch import nn
from torch.utils.data import Dataset, DataLoader
# cuDNN settings
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True
# Define Dataset
class CCDataset(Dataset):
def __init__(self, data):
self.data = data
def __len__(self):
return self.data.shape[0]
def __getitem__(self, idx):
target = self.data[idx,0].clone().detach()
features = self.data[idx,1:].clone().detach()
return target, features
# then define DataModule
class CCDataModule(pl.LightningDataModule):
def __init__(self):
super().__init__()
# Dataset
    def setup(self, stage=None):
# read the train and test dataset
# targets_train = feather.read_feather('targets_train.feather')
# targets_val = feather.read_feather('targets_test.feather')
targets_train = torch.rand(12157,16)
targets_val = torch.rand(12157,16)
self.train_dataset = CCDataset(targets_train)
self.val_dataset = CCDataset(targets_val)
# DataLoader
def train_dataloader(self):
return DataLoader(self.train_dataset, batch_size=64,
shuffle=True, drop_last=False, num_workers=2,
pin_memory=True)
def val_dataloader(self):
return DataLoader(self.val_dataset, batch_size=64,
num_workers=2, pin_memory=True,
drop_last=False)
# # def Model
class Model(pl.LightningModule):
'''Mainly define the `*_step_end` methods
'''
def __init__(self):
super().__init__()
# dropout layers
self.dropout_1 = nn.Dropout(0.5)
# fc layers
self.fc_1 = nn.Linear(15, 16)
self.fc_2 = nn.Linear(16, 1)
def shared_step(self, batch):
t, x = batch
x = self.dropout_1(F.relu(self.fc_1(x)))
y = self.fc_2(x) # (N, 1)
return y.squeeze(), t
# train step
def training_step(self, batch, idx):
y, t = self.shared_step(batch)
return {'y': y, 't': t}
# validation step
def validation_step(self, batch, idx):
y, t = self.shared_step(batch)
return {'y': y, 't': t}
# loss
def mse_loss(self, y, t):
return F.mse_loss(y, t)
# def training_step_end
def training_step_end(self, outputs):
y = outputs['y']
t = outputs['t']
loss = self.mse_loss(y, t)
return {'loss':loss}
# def validation_step_end
def validation_step_end(self, outputs):
y = outputs['y']
t = outputs['t']
return {'y': y, 't': t}
# validation step
def validation_epoch_end(self, outputs):
y = torch.cat([x['y'] for x in outputs])
t = torch.cat([x['t'] for x in outputs])
loss = self.mse_loss(y, t)
rmse = torch.sqrt(loss)
self.log('val_rmse', rmse, on_step=False)
# optimizer
def configure_optimizers(self):
optimizer = optim.Adam(self.parameters(), lr=1e-4)
return optimizer
# # Run
# checkpoint
checkpoint_callback = pl.callbacks.ModelCheckpoint(
verbose=True,
mode='min',
monitor='val_rmse',
save_top_k=1)
# trainer
trainer = pl.Trainer(gpus=[0,1],
checkpoint_callback=checkpoint_callback,
accelerator='ddp',
min_epochs=10,
max_epochs=500)
# loop over windows
torch.manual_seed(42)
# init model
model = Model()
# create datamodule
datamodule = CCDataModule()
datamodule.setup()
# train the model
trainer.fit(model, datamodule)
I observe the same behaviour as @angadkalra:
This is a pretty severe issue, basically making training impossible for non-toy models. In my case, removing the ModelCheckpoint callback lets training run again.
For me, turning my VM off and on makes it train fine for many epochs...
Regarding the reproducible code above, I think I can confirm that this is due to the `ModelCheckpoint` callback seeing a different `val_rmse` on each process. I know this is incorrect metrically, but as a test I'd suggest changing the `self.log('val_rmse', ...)` call to pass `sync_dist=True` and checking whether the hang goes away. If this does fix it, two longer-term solutions would be to define your own `pl.Metric` (I can assist here, should be easy), which will handle distributed syncing and gathering of all the outputs for you, or to use the `all_gather` utility in `validation_epoch_end`.
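For reference, the suggested test amounts to a one-line change to the reproduction code above; this is a sketch, and note that reducing the per-process RMSE values this way is the metrically imprecise shortcut mentioned above:

```python
# Sketch of the suggested test against the Model class above: logging the
# monitored value with sync_dist=True reduces it across processes, so every
# rank reports the same 'val_rmse' to the ModelCheckpoint callback.
def validation_epoch_end(self, outputs):
    y = torch.cat([x['y'] for x in outputs])
    t = torch.cat([x['t'] for x in outputs])
    rmse = torch.sqrt(self.mse_loss(y, t))
    self.log('val_rmse', rmse, on_step=False, sync_dist=True)
```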
Hi @SeanNaren, I can confirm that adding `sync_dist=True` to the logging call solves the problem. Could you share an example of the `pl.Metric` approach when you get a chance?
I'm only calling
Sure, I'll try to get some time. In the meantime, I tested the `all_gather` approach and it also works:

# validation step
def validation_epoch_end(self, outputs):
    y = torch.cat([x['y'] for x in outputs])
    t = torch.cat([x['t'] for x in outputs])
    y = self.all_gather(y)
    t = self.all_gather(t)
    loss = self.mse_loss(y, t)
    rmse = torch.sqrt(loss)
    self.log('val_rmse', rmse, on_step=False)

Let me give some context as to why this fix works. When we run validation across distributed processes, each GPU/process gets a different set of data batches. This means the score calculated on every GPU is different unless we do some form of synchronisation between the processes. This can be done either by:
- passing `sync_dist=True` to `self.log`, which reduces the logged value across processes, or
- gathering all the outputs on every process (as `self.all_gather` does above) and computing the metric on the full set.
@marrrcin maybe the explanation above could be insightful? If you're running into a different error and can get a reproducible script, let me know and I can help resolve it.

@PyTorchLightning/core-contributors @justusschock this has come up a few times as a bug. How about we throw a warning if we receive different monitor scores in the model checkpoint across processes? I don't see too many cases where we'd have different processes giving different results to the model checkpoint. The check will involve gathering the monitor score across processes, but considering this happens only at saving time, it might be worth it.
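For illustration, a rough sketch of the kind of check being proposed (this is not Lightning internals; the helper name and placement are hypothetical, and a real version would live inside the checkpointing code):

```python
# Hypothetical sketch: gather the monitored scalar from every DDP process and
# warn when the values disagree, since that is the situation that can deadlock
# checkpointing. Not actual PyTorch Lightning code.
import warnings

import torch
import torch.distributed as dist


def warn_if_monitor_differs(monitor_value: torch.Tensor, atol: float = 1e-6) -> None:
    if not (dist.is_available() and dist.is_initialized()):
        return  # single-process run, nothing to compare
    gathered = [torch.zeros_like(monitor_value) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, monitor_value)
    if any(not torch.allclose(gathered[0], g, atol=atol) for g in gathered[1:]):
        warnings.warn(
            "Monitor value differs across processes; log it with sync_dist=True "
            "or gather the validation outputs before computing the metric."
        )
```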
Sorry for the long wait, but here is a Google Colab for the code from the original post: PyTorchImageGPT
Good to know, I will try it without ModelCheckpoint, and also with `sync_dist=True`.
No VMs installed on the training system.
I see iGPT, which may require `find_unused_parameters=True`:

from pytorch_lightning.plugins import DDPPlugin

trainer = pl.Trainer(
    gpus=2,
    checkpoint_callback=checkpoint_callback,
    accelerator='ddp',
    plugins=DDPPlugin(find_unused_parameters=True),
)
Have you guys tried updating to v1.2? I'm using the Metrics API now instead of returning a batch dict, and everything is working fine, using the ModelCheckpoint callback too. I haven't hit any freezes so far.
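For readers who want to follow this route, below is a minimal sketch of the Metrics API approach applied to the reproduction above, assuming `torchmetrics` is installed (in Lightning 1.2 the equivalent classes live under `pl.metrics`). The metric accumulates state per process and reduces it across DDP ranks in `compute()`, so ModelCheckpoint sees the same `val_rmse` everywhere:

```python
# Minimal sketch (class name is illustrative): replace the manual RMSE
# computation with a torchmetrics object whose state is synchronised across
# DDP processes when compute() is called.
import torch
import torch.nn.functional as F
import torchmetrics
import pytorch_lightning as pl
from torch import nn


class MetricModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.dropout_1 = nn.Dropout(0.5)
        self.fc_1 = nn.Linear(15, 16)
        self.fc_2 = nn.Linear(16, 1)
        self.val_mse = torchmetrics.MeanSquaredError()

    def forward(self, x):
        return self.fc_2(self.dropout_1(F.relu(self.fc_1(x)))).squeeze(-1)

    def training_step(self, batch, idx):
        t, x = batch
        return F.mse_loss(self(x), t)

    def validation_step(self, batch, idx):
        t, x = batch
        self.val_mse.update(self(x), t)  # accumulate per-process state

    def validation_epoch_end(self, outputs):
        # compute() reduces the accumulated state across all processes
        self.log('val_rmse', torch.sqrt(self.val_mse.compute()))
        self.val_mse.reset()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)
```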
@edenlightning @SeanNaren sorry for the late reply. Yes, the problem is solved, either by using a `pl.Metric` or by overriding the DDP plugin with `find_unused_parameters=True`.
Thanks @marrrcin for coming back to us! We've got a discussion here about turning this flag back on, or exposing it on the Trainer: #6219. If anyone has thoughts, please leave them there; it seems that at this point we should turn `find_unused_parameters` back on by default.
Actually, keeping this open to track setting the default to true, or any other solution you come up with.
I suggest:
I have the same issue, but only running on one GPU.
This suggestion didn't work for me, but setting
In my case, the training stops after epoch 0 (right before the validation end). Setting
🐛 Bug
The training is interrupted randomly in the middle of an epoch without errors. The console only says: Terminated.
The error does not always occur; when it does, it is mostly between epochs 2 and 4. Notably, processes are still running after the termination, and the graphics cards are still in use by Python processes.
We train the PyTorch version of the ImageGPT model with Hugging Face Transformers. It could also be a problem with the transformers library; we are not sure.
Epoch 1: 29%|█▍ | 9413/32393 [3:28:18<8:28:33, 1.33s/it, loss=3.23, v_num=9]Terminated
Please reproduce using the BoringModel
Can't reproduce with the BoringModel.
Code
Expected behavior
The training completes fully across all epochs.
Environment
Additional context
We have tried the following to solve the problem: