
Elastic training support #602

Merged · 34 commits · Dec 23, 2020

Conversation

jeffra (Collaborator) commented Dec 14, 2020

Supports scaling training up or down to compatible GPU counts. Adds a new 'elasticity' key to our config JSON. Users indicate their maximum acceptable train batch size and their acceptable micro batch sizes; DeepSpeed will find a batch size that is usable with the largest set of compatible GPU counts. The intended consumers of this API and JSON addition are both user training code and the infrastructure scheduler.

    "elasticity": {
        "enabled": true,
        "max_train_batch_size": 2000,
        "micro_batch_sizes": [2,4,6],
        "min_gpus": 1,
        "max_gpus" : 10000,
        "min_time": 20,
        "version": 0.1
    }
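
To make the selection concrete, here is a minimal, self-contained sketch of how such a batch size could be chosen. This is not the actual DeepSpeed implementation (which also honors keys like min_time and version); the names find_elastic_batch and valid_gpu_counts are hypothetical, and the brute-force search is written for clarity rather than speed:

    from typing import List, Set, Tuple

    def valid_gpu_counts(batch_size: int, micro_batches: List[int],
                         min_gpus: int, max_gpus: int) -> Set[int]:
        # A GPU count g is compatible with batch_size if some micro batch m
        # and an integer gradient-accumulation step count gas >= 1 satisfy
        # batch_size == m * gas * g.
        valid = set()
        for micro in micro_batches:
            if batch_size % micro:
                continue
            per_micro = batch_size // micro  # equals gas * num_gpus
            for gpus in range(min_gpus, min(max_gpus, per_micro) + 1):
                if per_micro % gpus == 0:
                    valid.add(gpus)
        return valid

    def find_elastic_batch(max_batch: int, micro_batches: List[int],
                           min_gpus: int, max_gpus: int) -> Tuple[int, Set[int]]:
        # Pick the train batch size <= max_batch reachable by the largest
        # set of GPU counts; ">=" breaks ties toward the larger batch size.
        best_batch, best_gpus = 0, set()
        for batch in range(min(micro_batches), max_batch + 1):
            gpus = valid_gpu_counts(batch, micro_batches, min_gpus, max_gpus)
            if len(gpus) >= len(best_gpus):
                best_batch, best_gpus = batch, gpus
        return best_batch, best_gpus

    # With the config above: find_elastic_batch(2000, [2, 4, 6], 1, 10000)

Under this reading, a scheduler could restrict scale-up/scale-down decisions to the returned set of GPU counts, while training code uses the chosen batch size to derive its gradient accumulation steps at the current world size.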

g-karthik commented

@jeffra I haven't looked at this closely, but am I right to assume this requires the user to also use the training_data argument of deepspeed.initialize()? Also, how does the infrastructure scheduler tie into this config?
