Currently, the bulk sampler naively divides output batches across N files based on the `batches_per_partition` parameter. This nearly always results in an uneven batch distribution across workers, which is theoretically acceptable but in practice wastes significant GPU resources, since a few workers end up with far more batches than the rest.
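A minimal sketch of the current behavior (not the actual cuGraph implementation): batches are packed into partition files of size `batches_per_partition`, and the files are then handed out to workers. The round-robin file assignment below is an assumption for illustration only.

```python
def naive_partition(num_batches, batches_per_partition):
    """Pack batch indices into partitions of at most batches_per_partition."""
    batches = list(range(num_batches))
    return [
        batches[i:i + batches_per_partition]
        for i in range(0, num_batches, batches_per_partition)
    ]

def assign_partitions_to_workers(partitions, num_workers):
    """Assume partition files are distributed round-robin across workers."""
    workers = [[] for _ in range(num_workers)]
    for i, part in enumerate(partitions):
        workers[i % num_workers].extend(part)
    return workers

partitions = naive_partition(27, batches_per_partition=4)
print([len(p) for p in partitions])   # [4, 4, 4, 4, 4, 4, 3] -> 7 partitions
per_worker = assign_partitions_to_workers(partitions, num_workers=3)
print([len(w) for w in per_worker])   # [11, 8, 8] -> uneven batch counts per worker
```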
This problem compounds when the number of batches per worker must be equal in order to prevent a hang. `model.join` is supposed to handle this case, but for many workflows it does not work, so the only remaining option is to throw out some batches. Depending on how the batches were distributed across partitions, a large number of batches may be discarded, reducing training accuracy.
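As a rough illustration of that workaround (assuming the per-worker counts from the sketch above), keeping all ranks in lockstep means every worker truncates its local batch list to the smallest count seen across workers:

```python
per_worker_batches = [11, 8, 8]          # batches each worker ended up with
usable = min(per_worker_batches)         # every worker can only run this many
dropped = sum(n - usable for n in per_worker_batches)

print(usable)    # 8 batches actually trained on per worker
print(dropped)   # 3 batches thrown away; worse with more skewed splits
```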
The native GNN frameworks do not have this issue, since each batch is tied to a worker. A good workaround would be to make `batches_per_partition` apply on a per-worker basis. For example, with 27 output batches and 3 workers, we would first divide the batches across the workers (9 batches per worker), then apply the `batches_per_partition` parameter (4) to create the final partitions, yielding partitions of (4, 4, 1) per worker and 9 partitions in total. Although this produces more partitions than before (9 vs. 7), GPU resources are better utilized and no batch is dropped.
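A sketch of the proposed scheme (function names are hypothetical, not cuGraph API): first split the batches evenly across workers, then apply `batches_per_partition` within each worker's share.

```python
def per_worker_partition(num_batches, num_workers, batches_per_partition):
    """Return, for each worker, its list of partitions of batch indices."""
    batches = list(range(num_batches))
    # Split batches across workers as evenly as possible (27 / 3 -> 9 each).
    shares = [batches[r::num_workers] for r in range(num_workers)]
    # Within each worker's share, pack into partitions of batches_per_partition.
    return [
        [share[i:i + batches_per_partition]
         for i in range(0, len(share), batches_per_partition)]
        for share in shares
    ]

result = per_worker_partition(27, num_workers=3, batches_per_partition=4)
print([[len(p) for p in worker] for worker in result])
# [[4, 4, 1], [4, 4, 1], [4, 4, 1]] -> 9 partitions total, no dropped batches
```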