Set correct device ids in DDP [wip] #4297
Changes from all commits
@@ -62,7 +62,7 @@ def setup(self, model):
         self._call_children_scripts()

         # set the task idx
-        self.task_idx = int(os.environ['PL_DDP_PID'])
+        self.task_idx = int(os.environ['LOCAL_RANK'])

     def _call_children_scripts(self):
         assert self.trainer.global_rank == 0
@@ -106,19 +106,14 @@ def _call_children_scripts(self):
         if self.trainer.logger is not None:
             os.environ['PL_EXP_VERSION'] = str(self.trainer.logger.version)

-        gpu_ids = os.environ.get('CUDA_VISIBLE_DEVICES', '')
-        if len(gpu_ids) == 1:
-            gpu_ids = f'{gpu_ids},'
-
-        num_gpus = max(1, len(gpu_ids.split(',')))
+        num_gpus = len(self.trainer.data_parallel_device_ids)
         os.environ['WORLD_SIZE'] = f'{num_gpus * self.trainer.num_nodes}'

         self.interactive_ddp_procs = []
         for local_rank in range(1, self.trainer.num_processes):
             env_copy = os.environ.copy()
             env_copy['LOCAL_RANK'] = f'{local_rank}'
-            env_copy['PL_DDP_PID'] = str(self.trainer.data_parallel_device_ids[local_rank])

             # remove env var if global seed not set
             if os.environ.get('PL_GLOBAL_SEED') is None and 'PL_GLOBAL_SEED' in env_copy:
                 del env_copy['PL_GLOBAL_SEED']
@@ -137,8 +132,6 @@ def _call_children_scripts(self):
             delay = np.random.uniform(1, 5, 1)[0]
             sleep(delay)

-        os.environ['PL_DDP_PID'] = str(0)
-
     def train(self):
         model = self.trainer.model
@@ -180,7 +173,7 @@ def set_world_ranks(self, process_idx):
         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx):
-        self.trainer.root_gpu = process_idx
+        self.trainer.root_gpu = self.trainer.data_parallel_device_ids[self.trainer.local_rank]
         torch.cuda.set_device(self.trainer.root_gpu)
         model.cuda(self.trainer.root_gpu)

Review comment (on model_to_device): do we need process_idx at all then?

Reply: yes, we can probably remove it. The device the model is on is anyway not related to the process index, which was actually the root cause of the bug here.
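To illustrate the reply above, here is a minimal, hypothetical sketch (plain Python, not Lightning code) of the lookup the new line performs. The gpus=[2, 3] scenario is an assumption, chosen to show that the local rank and the CUDA device id only coincide when the selected GPUs happen to be 0..N-1.

# Hypothetical example: the user selects two non-contiguous GPUs, e.g. Trainer(gpus=[2, 3]).
data_parallel_device_ids = [2, 3]              # the user's device selection
num_processes = len(data_parallel_device_ids)

for local_rank in range(num_processes):
    root_gpu = data_parallel_device_ids[local_rank]   # the lookup introduced in this PR
    print(f"local_rank={local_rank} -> root_gpu={root_gpu}")

# Prints:
#   local_rank=0 -> root_gpu=2
#   local_rank=1 -> root_gpu=3
# Using the process index directly as the device id is only correct when the
# selected GPUs are exactly 0..N-1, which is why the old line was buggy here.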
Second file in the diff:
@@ -162,7 +162,7 @@ def set_world_ranks(self, process_idx):
         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx, is_master):
-        gpu_idx = process_idx
+        gpu_idx = self.trainer.data_parallel_device_ids[self.trainer.local_rank]
         self.trainer.root_gpu = gpu_idx
         torch.cuda.set_device(self.trainer.root_gpu)
         model.cuda(self.trainer.root_gpu)

Review comment (on gpu_idx): does this same fix apply to the ddp_torchelastic_accelerator? Are there other DDP accelerators which need this change too?

Reply: yes, some of these accelerators like torchelastic and ddp2 should also get this fix. The problem is, I didn't want to touch these because I cannot test the fix myself, lacking a multi-node setup. It would be good to follow up on this.

Follow-up reply: sorry, I actually don't think the others need this fix. The reason is that torchelastic and co derive their process numbers from the environment (local rank, global rank, etc.), so they don't need to do this. Our DDP is a special case where we are acting like torchelastic, so we need to set this stuff up manually.
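As a rough sketch of what "derive their process numbers from the environment" means: a torchelastic-style launcher exports rank information before starting each worker, so the accelerator only has to read it. This assumes the standard LOCAL_RANK / RANK / WORLD_SIZE variables and is illustrative, not the actual accelerator code.

import os

# Exported by a torchelastic-style launcher for each worker process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
global_rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"local_rank={local_rank} global_rank={global_rank} world_size={world_size}")

# Lightning's plain DDP accelerator has no external launcher, which is why the
# parent process in this diff exports LOCAL_RANK and WORLD_SIZE itself before
# launching the child scripts.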
Review comment: is there documentation around what env vars lightning sets anywhere?

Reply: No, I am sure there is no documentation about these. They are used internally only and not meant to be modified by the user. Should they be documented? I'm not sure.

Follow-up: ah, I meant for contributors/developers primarily. I should probably read the code again too haha

Reply: ah, good point. Where could env variables be documented? They are like global variables. Maybe at the top of the file of the accelerator base class. I have no good suggestions :)
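For what it's worth, a hypothetical module-level docstring along the lines of that suggestion could look like the sketch below. The variable list covers only what appears in this diff and is an assumption, not an exhaustive inventory.

"""Hypothetical docstring for the accelerator base class module.

Environment variables used internally by Lightning's DDP accelerator
(illustrative, not exhaustive):

- LOCAL_RANK: index of the process on its node; exported for every spawned child script.
- WORLD_SIZE: total number of processes, i.e. num_gpus * num_nodes.
- PL_EXP_VERSION: logger version, forwarded so all processes log to the same experiment.
- PL_GLOBAL_SEED: global seed, forwarded to children only when the user set one.

These variables are set and consumed internally and are not meant to be modified by users.
"""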