Currently, the bulk sampler naively divides output batches across N files based on the `batches_per_partition` parameter. This nearly always results in an uneven batch distribution across workers, which is theoretically acceptable but in practice wastes significant GPU resources, since a few workers end up with far more batches than the rest.
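A minimal sketch of the current behavior (not the actual cuGraph implementation): batches are packed into partition files of size `batches_per_partition`, and the files are then handed out to workers. The round-robin file assignment below is an assumption for illustration only.

```python
def naive_partition(num_batches, batches_per_partition):
    """Pack batch indices into partitions of at most batches_per_partition."""
    batches = list(range(num_batches))
    return [
        batches[i:i + batches_per_partition]
        for i in range(0, num_batches, batches_per_partition)
    ]

def assign_partitions_to_workers(partitions, num_workers):
    """Assume partition files are distributed round-robin across workers."""
    workers = [[] for _ in range(num_workers)]
    for i, part in enumerate(partitions):
        workers[i % num_workers].extend(part)
    return workers

partitions = naive_partition(27, batches_per_partition=4)
print([len(p) for p in partitions])   # [4, 4, 4, 4, 4, 4, 3] -> 7 partitions
per_worker = assign_partitions_to_workers(partitions, num_workers=3)
print([len(w) for w in per_worker])   # [11, 8, 8] -> uneven batch counts per worker
```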
This problem compounds when the number of batches per worker must be equal in order to prevent a hang. `model.join` is supposed to handle this case, but for many workflows it does not work, so the only remaining option is to throw out some batches. Depending on how the batches were distributed across partitions, a large number of batches may be discarded, reducing training accuracy.
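As a rough illustration of that workaround (assuming the per-worker counts from the sketch above), keeping all ranks in lockstep means every worker truncates its local batch list to the smallest count seen across workers:

```python
per_worker_batches = [11, 8, 8]          # batches each worker ended up with
usable = min(per_worker_batches)         # every worker can only run this many
dropped = sum(n - usable for n in per_worker_batches)

print(usable)    # 8 batches actually trained on per worker
print(dropped)   # 3 batches thrown away; worse with more skewed splits
```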
The native GNN frameworks do not have this issue, since each batch is tied to a worker. A good workaround would be to make `batches_per_partition` apply on a per-worker basis. For example, with 27 output batches and 3 workers, we would first divide the batches across the workers (9 batches per worker), then apply the `batches_per_partition` parameter (4) to create the final partitions, yielding partitions of (4, 4, 1) per worker and 9 partitions in total. Although this produces more partitions than before (9 vs. 7), GPU resources are better utilized and no batch is dropped.
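A sketch of the proposed scheme (function names are hypothetical, not cuGraph API): first split the batches evenly across workers, then apply `batches_per_partition` within each worker's share.

```python
def per_worker_partition(num_batches, num_workers, batches_per_partition):
    """Return, for each worker, its list of partitions of batch indices."""
    batches = list(range(num_batches))
    # Split batches across workers as evenly as possible (27 / 3 -> 9 each).
    shares = [batches[r::num_workers] for r in range(num_workers)]
    # Within each worker's share, pack into partitions of batches_per_partition.
    return [
        [share[i:i + batches_per_partition]
         for i in range(0, len(share), batches_per_partition)]
        for share in shares
    ]

result = per_worker_partition(27, num_workers=3, batches_per_partition=4)
print([[len(p) for p in worker] for worker in result])
# [[4, 4, 1], [4, 4, 1], [4, 4, 1]] -> 9 partitions total, no dropped batches
```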