
fix overflow when training mDeberta in fp16 #24116

Merged: 18 commits merged into huggingface:main on Jun 13, 2023

Conversation

@sjrl (Contributor) commented on Jun 8, 2023

What does this PR do?

Fixes microsoft/DeBERTa#77 (an issue about transformers opened in the Microsoft repo)

  • This issue was originally raised in the https://github.com/microsoft/DeBERTa repo: mDeberta could not be trained in fp16 due to numerical overflow. A fix was implemented in the Microsoft repo by @BigBird01 but had not yet made it into the HuggingFace implementation. Since I was interested in training mDeberta models on small hardware (e.g. a 3070 or a T4), I updated the HF implementation with the changes from the Microsoft repo, bringing over only the minimal changes needed to get fp16 training to work (a rough sketch of the kind of change is shown after this list).

  • I checked that the existing tests pass, and I used this code to successfully train an mDeberta model in fp16 on SQuAD 2.0 (the resulting model can be found here), which is not currently possible with the main branch of transformers. I'm unsure whether there is a good way to add a CI test that verifies mDeberta-V3 training works in fp16.
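To give a sense of the kind of change involved, here is a minimal, hedged sketch (illustrative only, not the exact diff): compute the scale with torch.sqrt instead of math.sqrt, cast it to the activations' dtype, and apply the division before the query/key matmul so the fp16 intermediates stay in range.

import torch

# Hedged sketch of the idea behind the fix; names and shapes are illustrative,
# not the exact code that was merged.
def scaled_attention_scores(query_layer, key_layer, scale_factor):
    # Use torch.sqrt on a tensor instead of math.sqrt on a Python float,
    # then cast the scale to the activations' dtype (e.g. fp16).
    scale = torch.sqrt(torch.tensor(query_layer.size(-1), dtype=torch.float) * scale_factor)
    # Scale before the matmul so the fp16 product does not overflow.
    return torch.bmm(query_layer, key_layer.transpose(-1, -2) / scale.to(dtype=query_layer.dtype))

# Toy usage (fp32 tensors shown for portability):
q = torch.randn(2, 8, 64)
k = torch.randn(2, 8, 64)
print(scaled_attention_scores(q, k, scale_factor=3).shape)  # torch.Size([2, 8, 8])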

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Hey, based on the recommendations from the PR template (and git blame) I decided to tag @ArthurZucker and @sgugger in case you are interested.

@amyeroberts (Collaborator) commented:

cc @younesbelkada @ArthurZucker

@ArthurZucker (Collaborator) left a comment:

Hey! Thanks a lot for opening this PR and fixing a very old issue!
This might cause problems with quantization, as the errors are probably going to be different. Could you check whether using torch.float16 or load_in_8bit gives the same outputs? (This might fix training but break inference; let's just check that everything runs correctly!)

See #22444

@HuggingFaceDocBuilderDev commented on Jun 8, 2023

The documentation is not available anymore as the PR was closed or merged.

@sjrl (Contributor, Author) commented on Jun 9, 2023

I used this code block to check results.
This was run on:

  • Ubuntu 20.04.4 LTS
  • NVIDIA 3070
  • CUDA Version: 11.7
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers.pipelines import QuestionAnsweringPipeline

tokenizer = AutoTokenizer.from_pretrained("sjrhuschlee/mdeberta-v3-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained(
    "sjrhuschlee/mdeberta-v3-base-squad2",
    # Uncomment exactly one of the following to test a given dtype / quantization:
#     torch_dtype=torch.float16,
#     torch_dtype=torch.bfloat16,
#     load_in_8bit=True,
)
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))  # device=... was removed for 8bit
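(Note, not part of the original comment: the dtype comparisons below come from calling the pipeline on a question/context pair; the exact example isn't shown, so the call here is purely hypothetical.)

# Hypothetical call for illustration; the actual question/context used for
# the scores below is not included in the original comment.
result = pipe(
    question="Where does he live?",
    context="His name is Tim and he lives in Berlin.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': ' Berlin.'}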

Running on Main Branch

Running the above code using torch.float16 on the main branch gives me no answer

# with torch.float16
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))
# []

Running with torch.bfloat16 and torch.float32 gives me the expected answer

# with torch.bfloat16
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))
# {'score': 0.98369300365448, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

# with torch.float32
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))
# {'score': 0.9850791096687317, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

Also running in 8bit works

# with load_in_8bit=True
pipe = QuestionAnsweringPipeline(model, tokenizer)
# {'score': 0.9868391752243042, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

Running on the PR
The change in this PR also enables mDeberta models to run inference in torch.float16, which wasn't possible before, and it doesn't appear to affect any of the other dtypes.

# with torch.float16
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))
# {'score': 0.9848804473876953, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

# with torch.bfloat16
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))
# {'score': 0.9841369986534119, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

# with torch.float32
pipe = QuestionAnsweringPipeline(model, tokenizer, device=torch.device("cuda:0"))
# {'score': 0.9850791096687317, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

# with load_in_8bit=True
pipe = QuestionAnsweringPipeline(model, tokenizer)
# {'score': 0.9870386719703674, 'start': 33, 'end': 41, 'answer': ' Berlin.'}

@ArthurZucker (Collaborator) left a comment:

Cool! Seems like a very subtle but effective change! Pinging @amyeroberts for a second pair of eyes.

@ArthurZucker requested a review from @amyeroberts on June 9, 2023 at 07:24
@sjrl (Contributor, Author) commented on Jun 9, 2023

I also noticed that the TF implementation in DebertaV2 contains the same scale computation:

scale = tf.math.sqrt(tf.cast(shape_list(query_layer)[-1] * scale_factor, tf.float32))
attention_scores = tf.matmul(query_layer, tf.transpose(key_layer, [0, 2, 1])) / scale

I'm not too familiar with TF, though, so I'm not sure whether this change should be made there as well.
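For reference, a minimal sketch of how the equivalent TF change might look, assuming the same idea of computing the scale in the query's dtype (same context as the snippet above, where shape_list, query_layer, key_layer, and scale_factor come from the surrounding modeling code; this is not necessarily the exact change that was merged):

# Sketch only: cast the scale to query_layer's dtype instead of tf.float32,
# so fp16 activations are not mixed with a float32 scale.
scale = tf.math.sqrt(tf.cast(shape_list(query_layer)[-1] * scale_factor, dtype=query_layer.dtype))
attention_scores = tf.matmul(query_layer, tf.transpose(key_layer, [0, 2, 1])) / scale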

@amyeroberts (Collaborator) commented:

@sjrl To the best of my knowledge, we don't support training in fp16 in TF, so there's less risk here. I'd be in favor of updating the TF code as well, so that the implementations are aligned and it's potentially safer. cc @Rocketknight1 for his thoughts.

@Rocketknight1 (Member) commented:

Yes, we support mixed-precision float16/bfloat16 training in TensorFlow, but in general we still expect a 'master' copy of the weights to remain in float32. We're planning some exploration to see if we can get Keras to accept full (b)float16 training, but it might require some refactoring!
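(For context, not part of the original reply: this is roughly what the Keras mixed-precision setup looks like, using a global dtype policy that computes in float16 while keeping the variables, i.e. the "master" weights, in float32.)

import tensorflow as tf

# Enable mixed precision: computations run in float16, variables stay in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

policy = tf.keras.mixed_precision.global_policy()
print(policy.compute_dtype)   # float16
print(policy.variable_dtype)  # float32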

@younesbelkada (Contributor) left a comment:

LGTM

@sjrl (Contributor, Author) commented on Jun 12, 2023

@Rocketknight1 should I go ahead and update the TF implementation as well then?

@Rocketknight1 (Member) commented:

@sjrl Yes please! Better numerical stability will be nice to have once we've enabled full float16 training

@amyeroberts (Collaborator) left a comment:

Thanks for this contribution and making our models more stable ❤️

@amyeroberts (Collaborator) commented:

@sjrl - Are there any other changes to add? Otherwise I think we're good to merge :)

@sjrl (Contributor, Author) commented on Jun 13, 2023

@amyeroberts You're welcome, and that's it for the changes!

@amyeroberts merged commit 3e142cb into huggingface:main on Jun 13, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request on Jun 23, 2023; its message lists the 18 squashed commits from this PR:
* Porting changes from https://github.com/microsoft/DeBERTa/ that hopefully allows for fp16 training of mdeberta

* Updates to deberta modeling from microsoft repo

* Performing some cleanup

* Undoing changes that weren't necessary

* Undoing float calls

* Minimally change the p2c block

* Fix error

* Minimally changing the c2p block

* Switch to torch sqrt

* Remove math

* Adding back the to calls to scale

* Undoing attention_scores change

* Removing commented out code

* Updating modeling_sew_d.py to satisfy utils/check_copies.py

* Missed changed

* Further reduce changes needed to get fp16 working

* Reverting changes to modeling_sew_d.py

* Make same change in TF
@sjrl deleted the mdeberta-fp16-overflow branch on July 26, 2023 at 05:58
Successfully merging this pull request may close these issues.

mDeBERTa on HuggingFace hub does not seem to work
6 participants