Remove .data usages in optimizations.py #23417

Merged
1 commit merged into huggingface:main on May 19, 2023

Conversation

alanwaketan
Contributor

What does this PR do?

.data usage is deprecated in recent releases of PyTorch. See pytorch/pytorch#91093 (comment)

This change replaces all .data usages in optimizations.py with modern alternatives.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@connor-henderson @stas00

@JackCaoG

@muellerzr the usage of .data will greatly slow down pytorch/xla on nightly; we were hoping we could fix this issue before the next release.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@stas00
Contributor

stas00 commented May 17, 2023

This is a very old and deprecated implementation, since it doesn't even follow the AdamW algorithm exactly. One should use torch.optim.AdamW instead, which since pt-2.0.0 also has a fused version that is almost as fast as apex's fused AdamW. So really you shouldn't be using this version anyway.

The only reason it was kept is BC for those who rely on exact results remaining exact after new transformers versions are released; otherwise we would have just replaced it with torch.optim.AdamW in the first place.

p.s. no objections though to making it better...

@alanwaketan
Contributor Author

@stas00 Thanks for the reply. How about Adafactor then?

@stas00
Contributor

stas00 commented May 17, 2023

Oh, sorry, I didn't see it was Adafactor too. It's hard to tell from the diff as it doesn't show the class names.

This Adafactor is definitely being used, but its implementation is super old as well. So it'd certainly be a blessing to bring it up to a more modern code standard.

@JackCaoG

@stas00 Do you mind giving this PR a review? Thanks.

Contributor

@stas00 stas00 left a comment


I think once .data is removed, the copy_ in Adafactor's part of the diff becomes somewhat difficult to understand. An explicit downcast at the very end would be more readable, but since p has to remain the same pointer I couldn't quite think of a better way.

Otherwise looks good. Thank you for modernizing, @alanwaketan

Let me just invite @sgugger to have a quick look before we merge.

@stas00 stas00 requested a review from sgugger May 18, 2023 22:14
@sgugger
Collaborator

sgugger commented May 19, 2023

Thanks for your PR. Just to be sure though, is this all going to work with PyTorch 1.8+? 1.8 is the minimum version we officially support at the moment (for a couple more weeks at least, then 1.9 starting mid-June).

@stas00
Contributor

stas00 commented May 19, 2023

I'm almost 100% sure that's the case; the whole direct .data usage deprecation is at least a few years old.

Let me quickly test it with pt-1.8.

@stas00
Contributor

stas00 commented May 19, 2023

$ pytest tests/optimization/test_optimization.py -k test_adafactor
========================================================== test session starts ===========================================================
platform linux -- Python 3.8.8, pytest-7.3.1, pluggy-0.13.1
rootdir: /mnt/nvme0/code/huggingface/transformers-master
configfile: setup.cfg
plugins: timeout-1.4.2, typeguard-2.12.1, flakefinder-1.0.0, forked-1.3.0, monitor-1.6.0, hypothesis-6.47.0, instafail-0.4.2, xdist-2.2.1
collected 3 items / 2 deselected / 1 selected

tests/optimization/test_optimization.py .                                                                                          [100%]

================================================================= PASSES =================================================================
======================================================== short test summary info =========================================================
PASSED tests/optimization/test_optimization.py::OptimizationTest::test_adafactor
============================================== 1 passed, 2 deselected, 4 warnings in 0.16s ===============================================

$ pt-ver
pt=1.8.2, cuda=11.1, nccl=2708

At least the Adafactor test that we have is passing.

@sgugger sgugger merged commit 8aa8513 into huggingface:main May 19, 2023
@alanwaketan
Contributor Author

Thanks @sgugger and @stas00 for reviewing the changes.

sheonhan pushed a commit to sheonhan/transformers that referenced this pull request Jun 1, 2023
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023