Introduce load balancing dataset samplers #10163

Merged: 3 commits into master on Feb 14, 2022

Conversation

@baijumeswani (Contributor) commented Jan 3, 2022

This pull request introduces the LoadBalancingDistributedSampler, which addresses the straggler problem observed in distributed training tasks where different workers process data samples of varying complexity.
The data sampler does the following:

  • Sorts the dataset based on sample complexities.
  • Distributes the data across workers.
  • Deterministically shuffles the data each worker sees, so that samples are not consumed purely in ascending order of complexity (which may result in convergence issues).

Here is an example of what the data sampler does:

  • Assume the data sampler is working with a dataset with complexities [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] and that the number of workers is 2.
  • The sorted complexities will be [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], and the dataset indices sorted on complexity will be [9, 8, 7, 6, 5, 4, 3, 2, 1, 0].
  • The data is then distributed among the workers, so the two workers will work with the indices [9, 7, 5, 3, 1] and [8, 6, 4, 2, 0] respectively.
  • The indices seen by each worker are then shuffled deterministically, so the indices seen by each worker may look like [5, 7, 3, 1, 9] and [4, 6, 2, 0, 8] respectively (a code sketch of these steps follows).
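
For concreteness, here is a minimal, self-contained sketch of these three steps (sort by complexity, round-robin distribution, deterministic per-worker shuffle). It is illustrative only and not the PR's implementation; the function name, argument names, and seed handling are made up for the example.

```python
# Illustrative sketch of the sort -> distribute -> shuffle steps described above.
import random
from typing import List, Sequence


def load_balanced_indices(complexities: Sequence[float], num_workers: int,
                          rank: int, seed: int = 0) -> List[int]:
    # Sort the dataset indices in ascending order of complexity.
    order = sorted(range(len(complexities)), key=lambda i: complexities[i])

    # Round-robin distribution: worker `rank` takes every num_workers-th index,
    # so every worker receives a comparable mix of easy and hard samples.
    shard = order[rank::num_workers]

    # Deterministic shuffle with a seed shared by all workers, so no worker
    # consumes its samples in strictly ascending order of complexity.
    random.Random(seed).shuffle(shard)
    return shard


complexities = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
# Shuffled versions of [9, 7, 5, 3, 1] and [8, 6, 4, 2, 0]; the exact order
# differs from the example above because it depends on the RNG used here.
print(load_balanced_indices(complexities, num_workers=2, rank=0))
print(load_balanced_indices(complexities, num_workers=2, rank=1))
```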

In addition, the data sampler provides a mechanism to control the degree of load balancing through the random_level argument. This argument randomly perturbs the sample complexities so that the dataset is not sorted purely in ascending order of complexity. A hypothetical sketch of such a perturbation follows.
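
Purely as an illustration of what such a knob could do (this is a hypothetical formula, not necessarily what the PR implements), a random_level in [0, 1] might add uniform noise scaled by the largest complexity before the sort:

```python
import random
from typing import List, Sequence


def perturb_complexities(complexities: Sequence[float], random_level: float,
                         seed: int = 0) -> List[float]:
    # Hypothetical perturbation: random_level = 0 keeps the original complexities
    # (maximum load balancing), while random_level = 1 adds noise on the order of
    # the largest complexity, largely washing out the subsequent sort.
    rng = random.Random(seed)
    max_complexity = max(complexities)
    return [c + random_level * max_complexity * rng.random() for c in complexities]
```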

The data sampler also allows users to slightly change the sorting and shuffling strategy by working with groups within the dataset. This helps decouple the training loss from any correlation that sorting the dataset might introduce.

  • If the user provides the group_size argument, the dataset is partitioned into groups of group_size samples.
  • The sorting of data samples based on complexities happens within each group as opposed to across the entire dataset.
  • The shuffling happens on the group order and not on the dataset sample indices.

Here is an example of the data sampler working with groups:

  • Assume the data sampler is working with a dataset with complexities [9, 8, 7, 6, 5, 4, 3, 2, 1, 0], the number of workers is 2, and the group size is 4.
  • The sorted complexities will be [6, 7, 8, 9 | 2, 3, 4, 5 | 0, 1], and the dataset indices sorted on complexity within each group will be [3, 2, 1, 0 | 7, 6, 5, 4 | 9, 8].
  • The group order is then shuffled deterministically, so the indices could become [7, 6, 5, 4 | 9, 8 | 3, 2, 1, 0].
  • The data is then distributed among the 2 workers, so the workers will see the data with indices [7, 5, 9, 3, 1] and [6, 4, 8, 2, 0] respectively (see the sketch below).
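
As with the earlier example, here is a minimal sketch of the grouped strategy (sort within each group, shuffle the group order, then distribute). Again, this only illustrates the described behavior and is not the PR's code; all names are made up for the example.

```python
import random
from typing import List, Sequence


def grouped_load_balanced_indices(complexities: Sequence[float], num_workers: int,
                                  rank: int, group_size: int, seed: int = 0) -> List[int]:
    indices = list(range(len(complexities)))

    # Partition into consecutive groups of group_size (the last group may be
    # smaller) and sort by complexity within each group only.
    groups = [sorted(indices[i:i + group_size], key=lambda j: complexities[j])
              for i in range(0, len(indices), group_size)]

    # Shuffle the order of the groups deterministically; samples inside a group
    # keep their sorted-by-complexity order.
    random.Random(seed).shuffle(groups)

    # Flatten the groups and distribute the result round-robin across workers.
    flat = [j for group in groups for j in group]
    return flat[rank::num_workers]


complexities = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
# Analogous to the example above; the exact group order depends on the RNG used.
print(grouped_load_balanced_indices(complexities, num_workers=2, rank=0, group_size=4))
print(grouped_load_balanced_indices(complexities, num_workers=2, rank=1, group_size=4))
```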

@baijumeswani added the component:training-frontend label (training issues related to ONNX Runtime training; typically submitted using template) on Jan 3, 2022
@baijumeswani (Contributor, Author) commented:

/azp run onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@ytaous (Contributor) commented Feb 2, 2022

A couple of questions:

  1. Has anyone tried the util on any model? If so, what perf % are we expecting?
  2. What's the trade-off, i.e., memory penalty and wait time, when using the util?

@ytaous previously approved these changes Feb 2, 2022
@baijumeswani (Contributor, Author) commented:

  1. Has anyone tried the util on any model? If so, what perf % are we expecting?

I have tried it on one of our internal models and observed around a 3~4% improvement with PyTorch. I am in the process of evaluating it on another model right now.

  2. What's the trade-off, i.e., memory penalty and wait time, when using the util?

I don't think there is any memory penalty. The data sampler holds a list of pairs, each containing the index of a sample in the dataset and that sample's complexity; from my observations, this does not incur any significant memory overhead.
The sampler also sorts this list before every epoch, so it may incur some wait-time penalty. But on the model I ran, I still observed around a 3~4% performance improvement despite the sorting.

@baijumeswani merged commit 7691e7e into master on Feb 14, 2022
@baijumeswani deleted the bmeswani/datasampler branch on February 14, 2022 at 21:46