Problem in training iterable dataset #6437
Comments
Has anyone ever encountered this problem before?
However, if you use a Dataset you'll end up with the same amount of data on each node: because the length of the dataset is known, it can be split exactly where we want. Also, Dataset objects don't load the full dataset in memory; instead they memory-map Arrow files from disk.
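For reference, a minimal sketch of this map-style alternative, assuming a JSON loader and file path that are placeholders rather than anything from this issue:

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Placeholder loader and data_files path; substitute the real corpus.
ds = load_dataset("json", data_files="data/train.jsonl", split="train")  # map-style Dataset
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

# The length is known, so each rank gets an (almost) equal, memory-mapped slice.
print(f"rank {rank}: {len(ds)} examples")
```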
Thanks for your answer! I finally solved it by using torch.distributed.algorithms.join.Join. I think other beginners like me may run into the same question sooner or later.
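A minimal sketch of how the Join context manager might be used here, assuming a standard DDP training loop (the model, dataloader, and hyperparameters below are placeholders, not the poster's exact script):

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
device = int(os.environ["LOCAL_RANK"])

# Placeholder model, loss, and dataloader; substitute the real ones.
model = DDP(torch.nn.Linear(128, 2).to(device), device_ids=[device])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# DDP modules implement the Joinable protocol: ranks that exhaust their
# (uneven) shard early keep answering collective ops as shadow ranks,
# so the ranks that still have batches left do not hang.
with Join([model]):
    for inputs, targets in train_dataloader:  # per-rank dataloader built from the split dataset
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
```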
Great! It might be worth having an example that we can include in the docs for other people. Did you need anything else besides the Join context manager used with the model and optimizer?
Nothing else, I think. I had also tried barrier() to solve the problem, but that didn't work; maybe it's a tool for other situations.
Describe the bug
I am using PyTorch DDP (Distributed Data Parallel) to train my model. Since the data is too large to load into memory at once, I use load_dataset to read the data as an iterable dataset, and datasets.distributed.split_dataset_by_node to distribute it across processes. However, I have noticed that this split gives different processes different amounts of data to train on. As a result, when the earliest process finishes training and starts predicting on the test set, the other processes are still training, which makes the overall training very slow.
Steps to reproduce the bug
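A minimal sketch of the setup described above (the loader script, file paths, and batch size are placeholders, not the original code): with streaming=True the dataset is an IterableDataset, and split_dataset_by_node distributes its underlying shards across ranks, so ranks can end up with different numbers of examples when the shards are uneven.

```python
import os

import torch.distributed as dist
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()

# Placeholder loader and files; the original issue used its own large corpus.
ds = load_dataset("json", data_files="data/*.jsonl", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

# Counting batches per rank makes the uneven split visible.
n_batches = sum(1 for _ in DataLoader(ds, batch_size=8))
print(f"rank {rank}: {n_batches} batches")
```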
And here's part of the output:
Expected behavior
I'd like to know how to fix this problem.
Environment info