Introduce load balancing dataset samplers #10163
Conversation
/azp run onnxruntime-binary-size-checks-ci-pipeline

Azure Pipelines successfully started running 1 pipeline(s).
A couple questions:

I have tried it on one of our internal models and observed around a 3-4% improvement with PyTorch. I am in the process of evaluating this on another model right now.

I don't think there is any memory penalty. The data sampler holds onto a list of pairs whose values are the index of a sample in the dataset and the complexity of that sample. From my observations, this sampler does not incur any significant memory overhead.
Force-pushed from dfe9f80 to 43ec5d3
This pull request introduces the `LoadBalancingDistributedSampler`, which helps with straggler problems observed in distributed training tasks where different workers work on data samples of varying complexity, resulting in stragglers.

The data sampler does the following:

- Sorts the dataset indices based on the complexity of each sample.
- Shards the sorted indices among the workers.
- Shuffles each worker's shard using a random permutation shared across workers, so that at every step all workers receive samples of similar complexity.
Here is an example of what the data sampler does:

- Assume that the complexities of the samples in the dataset are `[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]` and the number of workers is 2.
- The dataset indices are `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`, and the sorted dataset indices (sorted on complexities) will be `[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]`.
- The sorted indices are sharded among the two workers as `[9, 7, 5, 3, 1]` and `[8, 6, 4, 2, 0]` respectively.
- After shuffling each shard with a permutation shared across the workers, the workers see `[5, 7, 3, 1, 9]` and `[4, 6, 2, 0, 8]` respectively.

In addition, the data sampler provides a mechanism to control the degree of load balancing using the `random_level` argument. This argument modifies the sample complexities randomly so that the dataset is not sorted purely in ascending order of complexity.

The data sampler also allows users to change the sorting and shuffling strategy slightly by working with groups within the dataset. This helps decouple any correlation that sorting the dataset might introduce with the training loss.
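The core idea in the example above can be sketched in a few lines. This is a minimal illustration, not the actual `LoadBalancingDistributedSampler` implementation: it assumes round-robin sharding and a shuffle seed shared by all workers, and the function name and signature are hypothetical.

```python
import random

def load_balanced_indices(complexities, num_workers, rank, seed=0):
    """Hypothetical sketch: return one rank's sample order for an epoch."""
    # 1. Sort dataset indices by ascending sample complexity.
    order = sorted(range(len(complexities)), key=lambda i: complexities[i])
    # 2. Shard the sorted indices round-robin, so neighbouring
    #    (similarly complex) samples land on different workers.
    shard = order[rank::num_workers]
    # 3. Shuffle with a permutation derived from a seed shared by all
    #    workers: position k then holds similarly complex samples on
    #    every rank, which mitigates stragglers.
    perm = list(range(len(shard)))
    random.Random(seed).shuffle(perm)
    return [shard[p] for p in perm]

complexities = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
w0 = load_balanced_indices(complexities, num_workers=2, rank=0)
w1 = load_balanced_indices(complexities, num_workers=2, rank=1)
# At every step the two workers now process samples whose complexities
# differ by exactly 1, instead of by up to 9 with naive sharding.
```

Because both ranks draw the permutation from the same seed, no communication is needed to keep the shards aligned step by step.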
With the `group_size` argument, the dataset is grouped into `group_size`-sized groups; sorting on complexities then happens within each group, and the order of the groups is shuffled.

Here is an example of the data sampler working with groups:
- Assume that the complexities of the samples in the dataset are `[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]`, the number of workers is 2, and the group size is 4.
- The dataset indices are grouped as `[0, 1, 2, 3 | 4, 5, 6, 7 | 8, 9]`. Sorting within each group gives complexities `[6, 7, 8, 9 | 2, 3, 4, 5 | 0, 1]`, so the sorted dataset indices (sorted on complexities within each group) will be `[3, 2, 1, 0 | 7, 6, 5, 4 | 9, 8]`.
- The groups are then shuffled into `[7, 6, 5, 4 | 9, 8 | 3, 2, 1, 0]`.
- The shuffled indices are sharded among the two workers as `[7, 5, 9, 3, 1]` and `[6, 4, 8, 2, 0]` respectively.
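The grouped variant can be sketched the same way. Again this is an assumed illustration, not the real API: the function name is hypothetical, and it assumes the group order is shuffled with a seed shared across workers so that all ranks shard the same flattened order.

```python
import random

def group_balanced_indices(complexities, num_workers, rank, group_size, seed=0):
    """Hypothetical sketch of the group_size variant."""
    n = len(complexities)
    # 1. Partition dataset indices into consecutive group_size-sized groups
    #    (the last group may be smaller).
    groups = [list(range(s, min(s + group_size, n)))
              for s in range(0, n, group_size)]
    # 2. Sort indices by complexity within each group only, so any global
    #    ordering correlation with the training loss is limited to a group.
    groups = [sorted(g, key=lambda i: complexities[i]) for g in groups]
    # 3. Shuffle the order of the groups with a seed shared by all workers.
    random.Random(seed).shuffle(groups)
    order = [i for g in groups for i in g]
    # 4. Shard the flattened order round-robin among the workers.
    return order[rank::num_workers]

complexities = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
w0 = group_balanced_indices(complexities, 2, rank=0, group_size=4)
w1 = group_balanced_indices(complexities, 2, rank=1, group_size=4)
```

Since only whole groups move during the shuffle, samples stay roughly complexity-sorted within a group while the epoch-level ordering is still randomized.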