Introduce load balancing dataset samplers #10163

Merged: 3 commits into master on Feb 14, 2022

Conversation

@baijumeswani (Contributor) commented Jan 3, 2022

This pull request introduces the LoadBalancingDistributedSampler, which addresses the straggler problem observed in distributed training tasks where different workers process data samples of varying complexity.
The data sampler does the following:

  • Sorts the dataset based on sample complexities.
  • Distributes the data across workers.
  • Deterministically shuffles the data each worker sees, so that samples are not consumed purely in ascending order of complexity (which may result in convergence issues).

Here is an example of what the data sampler does:

  • Assume the data sampler is working with a dataset with complexities [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] and that the number of workers is 2.
  • The sorted complexities will be [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], and the dataset indices sorted on complexity will be [9, 8, 7, 6, 5, 4, 3, 2, 1, 0].
  • The data is then distributed among the workers, so the two workers will work with the indices [9, 7, 5, 3, 1] and [8, 6, 4, 2, 0] respectively.
  • The indices seen by each worker are then shuffled deterministically, so the indices seen by each worker may look like [5, 7, 3, 1, 9] and [4, 6, 2, 0, 8] respectively (a code sketch of these steps follows).
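
For concreteness, here is a minimal, self-contained sketch of these three steps (sort by complexity, round-robin distribution, deterministic per-worker shuffle). It is illustrative only and not the PR's implementation; the function name, argument names, and seed handling are made up for the example.

```python
# Illustrative sketch of the sort -> distribute -> shuffle steps described above.
import random
from typing import List, Sequence


def load_balanced_indices(complexities: Sequence[float], num_workers: int,
                          rank: int, seed: int = 0) -> List[int]:
    # Sort the dataset indices in ascending order of complexity.
    order = sorted(range(len(complexities)), key=lambda i: complexities[i])

    # Round-robin distribution: worker `rank` takes every num_workers-th index,
    # so every worker receives a comparable mix of easy and hard samples.
    shard = order[rank::num_workers]

    # Deterministic shuffle with a seed shared by all workers, so no worker
    # consumes its samples in strictly ascending order of complexity.
    random.Random(seed).shuffle(shard)
    return shard


complexities = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
# Shuffled versions of [9, 7, 5, 3, 1] and [8, 6, 4, 2, 0]; the exact order
# differs from the example above because it depends on the RNG used here.
print(load_balanced_indices(complexities, num_workers=2, rank=0))
print(load_balanced_indices(complexities, num_workers=2, rank=1))
```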

In addition, the data sampler provides a mechanism to control the degree of load balancing through the random_level argument. This argument randomly perturbs the sample complexities so that the dataset is not sorted purely in ascending order of complexity. A hypothetical sketch of such a perturbation follows.
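
Purely as an illustration of what such a knob could do (this is a hypothetical formula, not necessarily what the PR implements), a random_level in [0, 1] might add uniform noise scaled by the largest complexity before the sort:

```python
import random
from typing import List, Sequence


def perturb_complexities(complexities: Sequence[float], random_level: float,
                         seed: int = 0) -> List[float]:
    # Hypothetical perturbation: random_level = 0 keeps the original complexities
    # (maximum load balancing), while random_level = 1 adds noise on the order of
    # the largest complexity, largely washing out the subsequent sort.
    rng = random.Random(seed)
    max_complexity = max(complexities)
    return [c + random_level * max_complexity * rng.random() for c in complexities]
```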

The data sampler also allows users to slightly change the sorting and shuffling strategy by working with groups within the dataset. This helps decouple the training loss from any correlation that sorting the dataset might introduce.

  • If the user provides the group_size argument, the dataset is partitioned into groups of group_size samples.
  • The sorting of data samples based on complexities happens within each group as opposed to across the entire dataset.
  • The shuffling happens on the group order and not on the dataset sample indices.

Here is an example of the data sampler working with groups:

  • Assume the data sampler is working with a dataset with complexities [9, 8, 7, 6, 5, 4, 3, 2, 1, 0], the number of workers is 2, and the group size is 4.
  • The sorted complexities will be [6, 7, 8, 9 | 2, 3, 4, 5 | 0, 1], and the dataset indices sorted on complexity within each group will be [3, 2, 1, 0 | 7, 6, 5, 4 | 9, 8].
  • The group order is then shuffled deterministically, so the indices could become [7, 6, 5, 4 | 9, 8 | 3, 2, 1, 0].
  • The data is then distributed among the 2 workers, so the workers will see the data with indices [7, 5, 9, 3, 1] and [6, 4, 8, 2, 0] respectively (see the sketch below).
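
As with the earlier example, here is a minimal sketch of the grouped strategy (sort within each group, shuffle the group order, then distribute). Again, this only illustrates the described behavior and is not the PR's code; all names are made up for the example.

```python
import random
from typing import List, Sequence


def grouped_load_balanced_indices(complexities: Sequence[float], num_workers: int,
                                  rank: int, group_size: int, seed: int = 0) -> List[int]:
    indices = list(range(len(complexities)))

    # Partition into consecutive groups of group_size (the last group may be
    # smaller) and sort by complexity within each group only.
    groups = [sorted(indices[i:i + group_size], key=lambda j: complexities[j])
              for i in range(0, len(indices), group_size)]

    # Shuffle the order of the groups deterministically; samples inside a group
    # keep their sorted-by-complexity order.
    random.Random(seed).shuffle(groups)

    # Flatten the groups and distribute the result round-robin across workers.
    flat = [j for group in groups for j in group]
    return flat[rank::num_workers]


complexities = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
# Analogous to the example above; the exact group order depends on the RNG used.
print(grouped_load_balanced_indices(complexities, num_workers=2, rank=0, group_size=4))
print(grouped_load_balanced_indices(complexities, num_workers=2, rank=1, group_size=4))
```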

@baijumeswani added the component:training-frontend label (training issues related to ONNX Runtime training; typically submitted using template) on Jan 3, 2022
@baijumeswani (Contributor, Author) commented:

/azp run onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@ytaous (Contributor) commented Feb 2, 2022

A couple of questions:

  1. Has anyone tried the util on any model? If so, what perf % are we expecting?
  2. What's the trade-off, i.e., memory penalty and wait time, when using the util?

@ytaous previously approved these changes Feb 2, 2022
@baijumeswani (Contributor, Author) commented:

  1. Has anyone tried the util on any model? If so, what perf % are we expecting?

I have tried it on one of our internal models and observed around a 3~4% improvement with PyTorch. I am in the process of evaluating it on another model right now.

  2. What's the trade-off, i.e., memory penalty and wait time, when using the util?

I don't think there is any memory penalty. The data sampler holds a list of pairs, each containing the index of a sample in the dataset and that sample's complexity; from my observations, this does not incur any significant memory overhead.
The sampler also sorts this list before every epoch, so it may incur some wait-time penalty. But on the model I ran, I still observed around a 3~4% performance improvement despite the sorting.

@baijumeswani merged commit 7691e7e into master on Feb 14, 2022
@baijumeswani deleted the bmeswani/datasampler branch on February 14, 2022 at 21:46