Allow "auto" for gradient clipping in YAML #2649

Merged: 3 commits into huggingface:main on Apr 12, 2024

Conversation

@regisss (Contributor) commented on Apr 10, 2024

What does this PR do?

If we have gradient_clipping: auto in the DeepSpeed config inside an Accelerate config, the code currently fails with:

Traceback (most recent call last):
  File "/root/workspace/lora/tmp.py", line 3, in <module>
    args = TrainingArguments(
  File "<string>", line 124, in __init__
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1828, in __post_init__
    self.deepspeed_plugin = DeepSpeedPlugin()
  File "<string>", line 14, in __init__
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/dataclasses.py", line 727, in __post_init__
    self.gradient_clipping = float(gradient_clipping)
ValueError: could not convert string to float: 'auto'

This can be reproduced by running accelerate launch --config_file my_config.yaml script.py, where script.py is

from transformers import TrainingArguments, Trainer, AutoModel

args = TrainingArguments(
    output_dir="/tmp/test",
    max_grad_norm=0.5,
)

trainer = Trainer(
    model=AutoModel.from_pretrained("bert-base-uncased"),
    args=args,
)

and my_config.yaml is

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: auto
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

This PR fixes the issue by checking whether the given value can actually be cast to float before converting it, and otherwise keeping the literal string "auto".
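
A minimal sketch of the idea (the helper name below is hypothetical, not the exact code merged in 5056d32): preserve the literal "auto" and only cast other values to float, so that "auto" can be resolved later from TrainingArguments (e.g. max_grad_norm):

def _cast_gradient_clipping(value):
    # Hypothetical helper illustrating the approach: keep the literal
    # "auto" so it can be filled in later from TrainingArguments, and
    # cast anything else to float as before.
    if isinstance(value, str) and value.strip().lower() == "auto":
        return "auto"
    return float(value)

# Both of these now succeed instead of raising ValueError on "auto":
assert _cast_gradient_clipping("auto") == "auto"
assert _cast_gradient_clipping("0.5") == 0.5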

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr (Collaborator) commented:

cc @pacman100

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pacman100 (Contributor) left a comment:
Thank you @regisss for supporting auto for gradient_clipping, in line with gradient_accumulation_steps, giving users more flexibility! Left a suggestion.

src/accelerate/utils/dataclasses.py (suggestion marked outdated and resolved)
@regisss requested a review from @pacman100 on April 12, 2024 at 07:46
@pacman100 (Contributor) left a comment:

Thank you @regisss!

@pacman100 merged commit 5056d32 into huggingface:main on Apr 12, 2024
23 checks passed
@regisss deleted the fix_gradient_clipping_yaml_auto branch on April 12, 2024 at 08:48