
🐛[bug] Distributed training not working for PyTorchTrial when using evaluate_full_dataset and more than 1 GPU #9916

Closed
charles-viss opened this issue Sep 11, 2024 · 1 comment

@charles-viss

Describe the bug

When training a PyTorchTrial across multiple GPUs, the PyTorchTrialController sets validation_loader to None on every process with rank > 0: https://github.com/determined-ai/determined/blob/main/harness/determined/pytorch/_pytorch_trial.py#L463-L465

But when validation runs through evaluate_full_dataset, the PyTorchTrialController asserts on every process that validation_loader is not None: https://github.com/determined-ai/determined/blob/main/harness/determined/pytorch/_pytorch_trial.py#L995
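To make the conflict concrete, here is a toy paraphrase of the two linked code paths. This is not the actual Determined source; the function names and rank logic are simplified stand-ins for what the linked lines describe.

```python
def build_validation_loader(rank, make_loader):
    # Setup path (around L463-L465): when evaluate_full_dataset is defined,
    # only the chief (rank 0) keeps a validation loader.
    return make_loader() if rank == 0 else None


def validate(rank, validation_loader):
    # Validation path (around L995): every rank runs this assertion, so any
    # rank > 0 raises AssertionError even though its None loader is expected.
    assert validation_loader is not None
    if rank == 0:
        ...  # only the chief actually calls trial.evaluate_full_dataset(...)
```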

Reproduction Steps

1. Train a PyTorchTrial using more than one GPU together with an evaluate_full_dataset override (a minimal sketch follows below).
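A minimal trial along these lines reproduces the crash for me. The dataset is random placeholder data and the exact method signatures may vary slightly between Determined versions; running it with resources.slots_per_trial > 1 in the experiment config hits the assertion described above on the non-chief ranks.

```python
from typing import Any, Dict

import torch
from torch.utils.data import DataLoader as TorchDataLoader, TensorDataset

from determined import pytorch


def _toy_dataset(n: int = 256) -> TensorDataset:
    # Random placeholder data just to keep the sketch self-contained.
    return TensorDataset(torch.randn(n, 10), torch.randint(0, 2, (n,)))


class ReproTrial(pytorch.PyTorchTrial):
    def __init__(self, context: pytorch.PyTorchTrialContext) -> None:
        self.context = context
        self.model = context.wrap_model(torch.nn.Linear(10, 2))
        self.optimizer = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.01)
        )

    def build_training_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(_toy_dataset(), batch_size=32)

    def build_validation_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(_toy_dataset(), batch_size=32)

    def train_batch(self, batch, epoch_idx: int, batch_idx: int) -> Dict[str, Any]:
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.model(x), y)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    # Overriding evaluate_full_dataset (instead of evaluate_batch) is the
    # trigger: with more than one slot, non-chief ranks get a None
    # validation_loader and fail the assertion during validation.
    def evaluate_full_dataset(self, data_loader: TorchDataLoader) -> Dict[str, Any]:
        correct, total = 0, 0
        for x, y in data_loader:
            preds = self.model(x).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
        return {"accuracy": correct / total}
```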

Expected Behavior

With evaluate_full_dataset, only the rank 0 (chief) process should be required to have a non-None validation_loader. Could the assertion be moved after the if self.is_chief check? https://github.com/determined-ai/determined/blob/main/harness/determined/pytorch/_pytorch_trial.py#L996
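Something like the following ordering is what I have in mind. This is a self-contained sketch with hypothetical names, not the actual _pytorch_trial.py code; the point is only that the assertion sits inside the chief-only branch.

```python
from typing import Callable, Dict, Optional

import torch


def run_full_dataset_validation(
    is_chief: bool,
    validation_loader: Optional[torch.utils.data.DataLoader],
    evaluate_full_dataset: Callable[..., Dict[str, float]],
) -> Dict[str, float]:
    if is_chief:
        # The assertion now lives inside the chief-only branch, so non-chief
        # ranks never touch their (intentionally) None validation_loader.
        assert validation_loader is not None, (
            "the chief rank needs a validation_loader for evaluate_full_dataset"
        )
        return evaluate_full_dataset(data_loader=validation_loader)
    # Non-chief ranks skip full-dataset evaluation; in this scheme the chief
    # alone computes and reports the metrics.
    return {}
```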

Screenshot

N/A

Environment

N/A

Additional Context

No response

@azhou-determined
Contributor

This indeed seems like a pretty obvious bug on our part. 🤦

We'll land this fix before the next release (0.37.0), slated to go out in about two weeks. Thank you for reporting this.
