
🐛[bug] Distributed training not working for PyTorchTrial when using evaluate_full_dataset and more than 1 GPU #9916

Closed
charles-viss opened this issue Sep 11, 2024 · 1 comment

@charles-viss

Describe the bug

When training a PyTorchTrial across multiple GPUs, the PyTorchTrialController sets validation_loader to None on every process with rank > 0: https://github.com/determined-ai/determined/blob/main/harness/determined/pytorch/_pytorch_trial.py#L463-L465

But when validation runs through evaluate_full_dataset, the PyTorchTrialController asserts on every process that validation_loader is not None: https://github.com/determined-ai/determined/blob/main/harness/determined/pytorch/_pytorch_trial.py#L995
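To make the conflict concrete, here is a toy paraphrase of the two linked code paths. This is not the actual Determined source; the function names and rank logic are simplified stand-ins for what the linked lines describe.

```python
def build_validation_loader(rank, make_loader):
    # Setup path (around L463-L465): when evaluate_full_dataset is defined,
    # only the chief (rank 0) keeps a validation loader.
    return make_loader() if rank == 0 else None


def validate(rank, validation_loader):
    # Validation path (around L995): every rank runs this assertion, so any
    # rank > 0 raises AssertionError even though its None loader is expected.
    assert validation_loader is not None
    if rank == 0:
        ...  # only the chief actually calls trial.evaluate_full_dataset(...)
```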

Reproduction Steps

1. Train a PyTorchTrial using more than one GPU together with an evaluate_full_dataset override (a minimal sketch follows below).
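A minimal trial along these lines reproduces the crash for me. The dataset is random placeholder data and the exact method signatures may vary slightly between Determined versions; running it with resources.slots_per_trial > 1 in the experiment config hits the assertion described above on the non-chief ranks.

```python
from typing import Any, Dict

import torch
from torch.utils.data import DataLoader as TorchDataLoader, TensorDataset

from determined import pytorch


def _toy_dataset(n: int = 256) -> TensorDataset:
    # Random placeholder data just to keep the sketch self-contained.
    return TensorDataset(torch.randn(n, 10), torch.randint(0, 2, (n,)))


class ReproTrial(pytorch.PyTorchTrial):
    def __init__(self, context: pytorch.PyTorchTrialContext) -> None:
        self.context = context
        self.model = context.wrap_model(torch.nn.Linear(10, 2))
        self.optimizer = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.01)
        )

    def build_training_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(_toy_dataset(), batch_size=32)

    def build_validation_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(_toy_dataset(), batch_size=32)

    def train_batch(self, batch, epoch_idx: int, batch_idx: int) -> Dict[str, Any]:
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.model(x), y)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    # Overriding evaluate_full_dataset (instead of evaluate_batch) is the
    # trigger: with more than one slot, non-chief ranks get a None
    # validation_loader and fail the assertion during validation.
    def evaluate_full_dataset(self, data_loader: TorchDataLoader) -> Dict[str, Any]:
        correct, total = 0, 0
        for x, y in data_loader:
            preds = self.model(x).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
        return {"accuracy": correct / total}
```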

Expected Behavior

With evaluate_full_dataset, only the rank 0 (chief) process should be required to have a non-None validation_loader. Could the assertion be moved after the if self.is_chief check? https://github.com/determined-ai/determined/blob/main/harness/determined/pytorch/_pytorch_trial.py#L996
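Something like the following ordering is what I have in mind. This is a self-contained sketch with hypothetical names, not the actual _pytorch_trial.py code; the point is only that the assertion sits inside the chief-only branch.

```python
from typing import Callable, Dict, Optional

import torch


def run_full_dataset_validation(
    is_chief: bool,
    validation_loader: Optional[torch.utils.data.DataLoader],
    evaluate_full_dataset: Callable[..., Dict[str, float]],
) -> Dict[str, float]:
    if is_chief:
        # The assertion now lives inside the chief-only branch, so non-chief
        # ranks never touch their (intentionally) None validation_loader.
        assert validation_loader is not None, (
            "the chief rank needs a validation_loader for evaluate_full_dataset"
        )
        return evaluate_full_dataset(data_loader=validation_loader)
    # Non-chief ranks skip full-dataset evaluation; in this scheme the chief
    # alone computes and reports the metrics.
    return {}
```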

Screenshot

N/A

Environment

N/A

Additional Context

No response

@azhou-determined
Contributor

This indeed seems like a pretty obvious bug on our part. 🤦

We'll land this fix before the next release (0.37.0), slated to go out in about two weeks. Thank you for reporting this.
