
[BUG] Bulk Sampler Batch Distribution is Uneven #4201

Closed
alexbarghi-nv opened this issue Feb 28, 2024 · 0 comments · Fixed by #4278

Currently, the bulk sampler naively divides output batches across N files based on the `batches_per_partition` parameter. This nearly always results in an uneven batch distribution, which is theoretically acceptable but in practice wastes significant GPU resources, since a few workers receive far more batches than the others.
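The unevenness is easy to reproduce with a few lines of arithmetic. The sketch below is purely illustrative, not the bulk sampler's actual code; in particular, it assumes partition files are dealt to workers round-robin, which is an assumption:

```python
# Illustrative sketch of the naive scheme (NOT the actual bulk sampler
# implementation). Assumption: partition files go to workers round-robin.
def naive_batches_per_worker(num_batches, num_workers, batches_per_partition):
    # Split the batches into fixed-size partitions, plus one remainder partition.
    sizes = [batches_per_partition] * (num_batches // batches_per_partition)
    if num_batches % batches_per_partition:
        sizes.append(num_batches % batches_per_partition)
    # Deal the partitions to workers and count batches per worker.
    counts = [0] * num_workers
    for i, size in enumerate(sizes):
        counts[i % num_workers] += size
    return counts

print(naive_batches_per_worker(27, 3, 4))  # [11, 8, 8] -- worker 0 gets ~38% more work
```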

The problem compounds because the number of batches per worker must be equal in order to prevent a hang. `model.join` is supposed to handle this, but for many workflows it does not work, so the only remaining option is to throw out batches. Depending on how the batches were distributed across partitions, a large number of them may have to be discarded, lowering training accuracy.
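Continuing the hypothetical numbers above, equalizing by truncation means cutting every worker down to the smallest per-worker count:

```python
# Hypothetical continuation of the sketch above: truncate every worker
# to the smallest per-worker batch count so all workers stay in sync.
per_worker = [11, 8, 8]                    # from naive_batches_per_worker(27, 3, 4)
kept = min(per_worker) * len(per_worker)   # 24 batches kept
dropped = sum(per_worker) - kept           # 3 of 27 batches (~11%) discarded
```

Less favorable partition layouts discard an even larger fraction.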

The native GNN frameworks do not have this issue, since each batch is tied to a worker. A good workaround would be to apply `batches_per_partition` on a per-worker basis. With 27 output batches and 3 workers, we would first divide the batches across the workers (9 batches per worker), then use the `batches_per_partition` parameter (4) to create the final partitions, resulting in partitions of sizes (4, 4, 1) per worker and 9 partitions in total. Although there are now more partitions (9 vs. 7), we better utilize GPU resources and ensure no batch is dropped. A sketch of this scheme follows.
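A minimal sketch of the proposed per-worker scheme; the function name is hypothetical, and only `batches_per_partition` comes from the existing API:

```python
# Sketch of the proposed scheme (not cuGraph's actual API): split
# batches evenly across workers first, then partition each worker's
# share using batches_per_partition.
def per_worker_partitions(num_batches, num_workers, batches_per_partition):
    base, rem = divmod(num_batches, num_workers)
    shares = [base + (1 if w < rem else 0) for w in range(num_workers)]
    partitions = []
    for share in shares:
        sizes = [batches_per_partition] * (share // batches_per_partition)
        if share % batches_per_partition:
            sizes.append(share % batches_per_partition)
        partitions.append(sizes)
    return partitions

print(per_worker_partitions(27, 3, 4))
# [[4, 4, 1], [4, 4, 1], [4, 4, 1]] -- 9 partitions, 9 batches per worker,
# so every worker has the same batch count and nothing is dropped.
```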
