Describe the bug
The sharding of IterableDatasets with respect to distributed and dataloader worker processes appears problematic, with significant performance traps and inconsistencies between distributed train processes and dataloader worker processes.
Splitting across num_workers (the dataloader worker processes per train process) and world_size (the distributed training processes) appears inconsistent (see datasets/src/datasets/iterable_dataset.py, lines 1266 to 1283 and lines 1335 to 1356 at 9d6d161).
In the case of the distributed split, there is a modulus check that flips between two very different behaviours. Why is this different from the split across the dataloader workers? For IterableDatasets the DataLoader worker processes are independent, so whether they are workers within one train process or spread across a distributed world, the shards should be distributed the same way: across world_size * num_workers independent workers in either case.
Further, the fallback taken when the n_shards % world_size == 0 check fails is a rather extreme change in behaviour. I argue it is not desirable to do that implicitly; it should be an explicit option for specific scenarios (i.e. reliable validation). A train scenario would likely be much better handled with improved wrapping/stopping behaviour, which would e.g. also fix #6437. Changing from stepping over shards to stepping over samples means that every single process reads ALL of the shards. This was never an intended default for sharded training: shards gain their performance advantage in large-scale distributed training precisely by avoiding the need for processes to overlap in the data they read. By default, only the data allocated to each process via its assigned shards should be read in each pass of the dataset.
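For reference, a rough sketch of the two behaviours being compared. This is an illustration only, not the library's actual code; the exact partitioning of shards to workers is simplified here to a strided slice.

```python
def worker_split(shard_ids, worker_id, num_workers):
    # DataLoader workers: each worker keeps a disjoint subset of the shards
    # (shown here as a strided slice for illustration).
    return shard_ids[worker_id::num_workers]

def node_split(shard_ids, rank, world_size):
    # Distributed split: shard-level assignment only when evenly divisible.
    if len(shard_ids) % world_size == 0:
        # Assign whole shards to this rank, analogous to the worker split.
        return {"mode": "shards", "shards": shard_ids[rank::world_size]}
    # Fallback: this rank still iterates over *all* shards and keeps only
    # every world_size-th example, so every rank reads the full dataset.
    return {"mode": "samples", "shards": shard_ids,
            "keep_every": world_size, "offset": rank}

print(node_split(list(range(8)), rank=1, world_size=4))   # shard-level split
print(node_split(list(range(10)), rank=1, world_size=4))  # sample-level fallback
```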
Using a large-scale CLIP example: some of the larger datasets have 10-20k shards across 100+ TB of data. Training with 1000 GPUs, we switch from reading ~100 terabytes per epoch to ~100 petabytes if, say, 20k % 1000 == 0 becomes 20k % 992 != 0 because one GPU node is dropped.
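A back-of-the-envelope illustration of that difference, using the example numbers above:

```python
n_shards = 20_000
dataset_tb = 100                # ~100 TB of data in total

# 20_000 % 1_000 == 0 -> shard-level split: the cluster reads the dataset once.
tb_read_even_split = dataset_tb

# 20_000 % 992 != 0 -> sample-level fallback: every one of the 992 processes
# reads every shard, so the cluster reads the dataset world_size times.
world_size = 992
tb_read_fallback = dataset_tb * world_size

print(f"even split: ~{tb_read_even_split} TB read per epoch")
print(f"fallback:   ~{tb_read_fallback / 1000:.0f} PB read per epoch")  # ~99 PB
```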
The 'step over samples' case might be worth the overhead in specific validation scenarios, where at-least-once/at-most-once guarantees on samples seen matter more, and which either do not make up a significant portion of train time or are run at smaller world sizes outside of training.
Steps to reproduce the bug
N/A
Expected behavior
We have an iterable dataset with N shards to split across workers; the expected behaviour (sketched in code after this list) is to:
- shuffle shards (same seed across all train processes)
- step the shard iterator across distributed processes
- step the shard iterator across dataloader worker processes
- shuffle samples in every worker via a shuffle buffer (different seed in each worker, but ideally controllable, e.g. based on base seed + worker id + epoch)
- end up with a (possibly uneven) number of shards per worker, but each shard only ever accessed by 1 worker per pass (epoch)
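A minimal sketch of that scheme. Purely illustrative: the function names and the exact seed formula are assumptions, not existing datasets API.

```python
import random

def shards_for_worker(all_shards, rank, world_size, worker_id, num_workers,
                      base_seed, epoch):
    # 1. Shuffle shards identically in every process (same seed everywhere).
    shards = list(all_shards)
    random.Random(base_seed + epoch).shuffle(shards)
    # 2./3. Step the shard iterator across distributed ranks AND dataloader
    # workers, i.e. across world_size * num_workers independent workers.
    global_worker_id = rank * num_workers + worker_id
    return shards[global_worker_id::world_size * num_workers]

def shuffle_buffer_seed(base_seed, rank, num_workers, worker_id, epoch):
    # 4. Per-worker sample shuffle: different but controllable seed per worker.
    return base_seed + rank * num_workers + worker_id + epoch

# 5. Each shard lands with exactly one worker per epoch; counts may be uneven.
print(shards_for_worker(list(range(10)), rank=0, world_size=2,
                        worker_id=1, num_workers=2, base_seed=42, epoch=0))
```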
Environment info
N/A
I do not know if this is the same problem as mine. I think num_workers should be either the number of dataloader processes for the dataloader mapped to one card, or the total number of processes across all cards.
But when I set num_workers larger than the count of training split files, it reports num_workers > n_shards and shuts down the extra workers. As a result, only n_shards workers are left, where n_shards = total file count / total card count.
Does that mean num_workers should be the process count for one card? OK, I lowered num_workers to treat it as the number of loader processes for one card, but data loading is still very slow; it seems that only num_workers dataloader processes are working, not num_workers * n_cards as I thought.
So how should I set the parameters to get good data loading?
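If the behaviour is as described above (workers beyond the per-card shard count get no shards and are shut down), a rough rule of thumb might look like the following. This is an assumption based on that warning, not official guidance.

```python
# Assumption: num_workers is set per training process, and only workers that
# receive at least one shard do useful work.
def active_workers_per_card(n_shard_files, n_cards, num_workers):
    shards_per_card = n_shard_files // n_cards   # shards each process can use
    return min(num_workers, shards_per_card)

# e.g. 64 shard files on 8 cards -> 8 shards per card, so num_workers > 8
# per card only creates idle workers that the loader shuts down.
print(active_workers_per_card(n_shard_files=64, n_cards=8, num_workers=16))  # 8
```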