fix random attention for pytorch's bigbird/pegasus_bigbird #23056

Merged · 4 commits into huggingface:main · May 7, 2023

Conversation

Bearnardd (Contributor)

Fixes #23055

What does this PR do?

Add control over the use of BigBird's random attention based on the current mode (training/eval).
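A minimal usage sketch of what this means in practice (illustrative only: the config values simply force the block-sparse path, and the equality check describes the intended post-fix behaviour rather than a guarantee):

```python
import torch
from transformers import BigBirdConfig, BigBirdModel

# A BigBird model in block-sparse mode; the sequence is long enough that the
# block-sparse path (and hence random attention) is actually exercised.
config = BigBirdConfig(attention_type="block_sparse", block_size=64, num_random_blocks=3)
model = BigBirdModel(config)
input_ids = torch.randint(0, config.vocab_size, (1, 1024))

model.eval()  # eval mode: random attention should be disabled, so repeated runs match
with torch.no_grad():
    out1 = model(input_ids).last_hidden_state
    out2 = model(input_ids).last_hidden_state
print(torch.allclose(out1, out2))  # expected True once random attention is gated on self.training

model.train()  # training mode: random attention blocks may be resampled on each forward pass
```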

Who can review?

@sanchit-gandhi @ydshieh

HuggingFaceDocBuilderDev commented Apr 29, 2023

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator) commented May 2, 2023

Hi @Bearnardd, thank you for the PR.

I have one question: why is `def _bigbird_block_rand_mask_with_head` not modified in this PyTorch BigBird file?

sanchit-gandhi (Contributor) left a comment

Very cool @Bearnardd - thanks for jumping on fixing the PyTorch model so quickly! Just a few small suggestions / fixes from me!

src/transformers/models/big_bird/modeling_big_bird.py (two outdated review comments, resolved)
@@ -1054,7 +1060,7 @@ def _get_rand_attn_plan(from_seq_length, from_block_size, num_rand_blocks):

     @staticmethod
     def _bigbird_block_rand_mask(
-        from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1
+        from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, deterministic, last_idx=-1
sanchit-gandhi (Contributor)

Wonder if it's better to make this a non-static method? E.g. so that we can do self.training from within the method and not have to pass deterministic as an argument? Just a suggestion, fine either way for me.
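A hedged sketch of the two options being discussed (simplified stand-ins rather than the actual modeling code; the all-zeros fallback for the deterministic branch is an illustrative assumption):

```python
import numpy as np
import torch.nn as nn


class BlockSparseAttentionSketch(nn.Module):
    """Illustrative stand-in for the BigBird attention module, not the transformers class."""

    # Option A: keep a @staticmethod and thread `deterministic` through as an argument.
    @staticmethod
    def _block_rand_mask_static(num_blocks, num_rand_blocks, deterministic):
        if deterministic:
            # illustrative assumption: a fixed (all-zeros) block choice when no randomness is wanted
            return np.zeros((num_blocks - 2, num_rand_blocks), dtype=np.int32)
        return np.stack(
            [np.random.permutation(num_blocks - 2)[:num_rand_blocks] + 1 for _ in range(num_blocks - 2)]
        )

    # Option B: drop @staticmethod so the flag can be read from nn.Module's train/eval state.
    def _block_rand_mask(self, num_blocks, num_rand_blocks):
        if not self.training:  # eval mode -> deterministic attention pattern
            return np.zeros((num_blocks - 2, num_rand_blocks), dtype=np.int32)
        return np.stack(
            [np.random.permutation(num_blocks - 2)[:num_rand_blocks] + 1 for _ in range(num_blocks - 2)]
        )
```

With option B the call sites no longer need to pass `deterministic` around, which matches the direction the PR eventually takes ("remove staticmethod").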

Bearnardd (Contributor, Author)

@sanchit-gandhi yeah, I was thinking about it, and that is why I asked about @staticmethod for this method :)

Bearnardd (Contributor, Author)

Hi @sanchit-gandhi! I have removed the static method, as I think that is the best approach.

Bearnardd (Contributor, Author)

> Hi @Bearnardd, thank you for the PR.
> I have one question: why is `def _bigbird_block_rand_mask_with_head` not modified in this PyTorch BigBird file?

Thanks for the comment! To be honest, I am not sure if I understand you correctly, since from what I can see this function is updated. Could you elaborate on what exactly is missing?

Bearnardd requested a review from sanchit-gandhi, May 3, 2023 12:20
ydshieh (Collaborator) commented May 3, 2023

> Hi @Bearnardd, thank you for the PR.
> I have one question: why is `def _bigbird_block_rand_mask_with_head` not modified in this PyTorch BigBird file?
>
> Thanks for the comment! To be honest, I am not sure if I understand you correctly, since from what I can see this function is updated. Could you elaborate on what exactly is missing?

Sorry, my bad. You are right :-)

ydshieh (Collaborator), May 3, 2023

@Bearnardd Do you know why, before this PR, a test like test_inference_block_sparse_pretraining could get deterministic outputs? It seems somewhat strange to me that this was the case, as the goal of this PR is to fix:

> random attention is used no matter whether we are in training/eval mode. Correct behaviour is that during inference (eval) we should not introduce any randomness, hence random attention should not be used.

Bearnardd (Contributor, Author), May 3, 2023

@ydshieh Yeah, sure. It has deterministic output because the random seed for the numpy random operations is hardcoded, which results in the same random attention indices every time you run the tests.
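For context, a minimal illustration of that effect (the helper and numbers below are made up, not the test code): seeding numpy pins the sampled block indices, so even a "random" pattern repeats across runs.

```python
import numpy as np

def sample_rand_blocks(num_blocks, num_rand_blocks, seed=None):
    # hypothetical helper: pick `num_rand_blocks` random block indices
    if seed is not None:
        np.random.seed(seed)  # a hardcoded seed makes the "random" choice identical on every run
    return np.random.permutation(num_blocks)[:num_rand_blocks]

print(sample_rand_blocks(16, 3, seed=0))  # same indices on every execution
print(sample_rand_blocks(16, 3, seed=0))  # identical to the line above
print(sample_rand_blocks(16, 3))          # unseeded: indices differ between runs
```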

ydshieh (Collaborator)

Thanks for the explanation!

sanchit-gandhi (Contributor) left a comment

Awesome - thanks for fixing the BigBird models @Bearnardd!

Bearnardd (Contributor, Author)

cc @sgugger

sgugger (Collaborator) left a comment

Thanks for your PR! Just two small nits and it should be good to merge.

Comment on lines 571 to 576:

-        self.max_seqlen, self.max_seqlen, from_block_size, to_block_size, n_rand_blocks, last_idx=1024
+        self.max_seqlen,
+        self.max_seqlen,
+        from_block_size,
+        to_block_size,
+        n_rand_blocks,
+        last_idx=1024,
sgugger (Collaborator)

There is no need for the restyle here anymore; it might be a leftover from when deterministic was passed?

Bearnardd (Contributor, Author)

Yup, exactly :)

Comment on lines 392 to 397
self.max_seqlen,
self.max_seqlen,
from_block_size,
to_block_size,
n_rand_blocks,
last_idx=1024,
sgugger (Collaborator)

Same here

Bearnardd (Contributor, Author)

I have pushed the changes @sgugger :)

sgugger merged commit 6f8a028 into huggingface:main on May 7, 2023
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request on Jun 5, 2023:

…ce#23056)

* fix random attention usage for bigbird and pegasus_bigbird
* remove staticmethod, update test target values
* revert style changes
novice03 pushed a commit to novice03/transformers that referenced this pull request on Jun 23, 2023:

…ce#23056)

* fix random attention usage for bigbird and pegasus_bigbird
* remove staticmethod, update test target values
* revert style changes