fix random attention for pytorch's bigbird/pegasus_bigbird #23056
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hi @Bearnardd Thank you for the PR. I have one question: why
Very cool @Bearnardd - thanks for jumping on fixing the PyTorch model so quickly! Just a few small suggestions / fixes from me!
@@ -1054,7 +1060,7 @@ def _get_rand_attn_plan(from_seq_length, from_block_size, num_rand_blocks):

     @staticmethod
     def _bigbird_block_rand_mask(
-        from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1
+        from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, deterministic, last_idx=-1
Wonder if it's better to make this a non-static method? E.g. so that we can do self.training from within the method and not have to pass deterministic as an argument? Just a suggestion, fine either way for me.
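For illustration, a rough sketch of what the non-static variant could look like. The class name comes from the modeling file, but the sampling loop below is a simplified stand-in for the real block-sampling logic, not the actual implementation:

```python
import numpy as np
import torch.nn as nn


class BigBirdBlockSparseAttention(nn.Module):
    # Sketch: only the training/eval gating is shown; the real method also
    # builds the random block indices used by block-sparse attention.
    def _bigbird_block_rand_mask(
        self, from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1
    ):
        rand_attn = np.zeros((from_seq_length // from_block_size - 2, num_rand_blocks), dtype=np.int32)
        if not self.training:
            # inference/eval: introduce no randomness, return a deterministic (empty) block map
            return rand_attn
        for i in range(1, from_seq_length // from_block_size - 1):
            # illustrative stand-in for the original per-block sampling logic
            rand_attn[i - 1, :] = np.random.permutation(to_seq_length // to_block_size)[:num_rand_blocks]
        return rand_attn
```

With this shape, callers no longer need to thread a deterministic flag through; the method reads the module's own train/eval state.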
@sanchit-gandhi yeah I was thinking about it and that is why I asked about @staticmethod for this method :)
Hi @sanchit-gandhi! I have removed the static method as I think it is the best approach.
Thanks for the comment! To be honest I am not sure if I understand you correctly, since from what I can see this function is updated. Could you elaborate on what exactly is missing?
Sorry, my bad. You are right :-)
@Bearnardd Do you know why, before this PR, a test like test_inference_block_sparse_pretraining could get deterministic outputs? It seems strange to me, since the goal of this PR is to fix the issue that random attention is used no matter whether we are in training/eval mode. The correct behaviour is that during inference (eval) we should not introduce any randomness, hence random attention should not be used.
@ydshieh Yeah, sure. It has deterministic output because the random seed for the random numpy operations is hardcoded, which results in the same random attention indices every time you run the tests.
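As a side note, a minimal standalone illustration of that effect; the helper name and seed value here are made up for the example and are not taken from the test suite:

```python
import numpy as np


def sample_rand_blocks(num_blocks, num_rand_blocks, seed=None):
    # with a fixed seed, every call yields the same "random" block indices
    if seed is not None:
        np.random.seed(seed)
    return np.random.permutation(num_blocks)[:num_rand_blocks]


a = sample_rand_blocks(16, 3, seed=0)
b = sample_rand_blocks(16, 3, seed=0)
assert (a == b).all()  # identical draws, hence the deterministic test outputs
```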
Thanks for the explanation!
Awesome - thanks for fixing the big bird models @Bearnardd!
cc @sgugger
Thanks for your PR! Just two small nits and it should be good to merge.
-            self.max_seqlen, self.max_seqlen, from_block_size, to_block_size, n_rand_blocks, last_idx=1024
+            self.max_seqlen,
+            self.max_seqlen,
+            from_block_size,
+            to_block_size,
+            n_rand_blocks,
+            last_idx=1024,
There is no need for the restyle anymore here, might be some leftover from when deterministic was passed?
Yup, exactly :)
+            self.max_seqlen,
+            self.max_seqlen,
+            from_block_size,
+            to_block_size,
+            n_rand_blocks,
+            last_idx=1024,
Same here
…to fix_pytorch_big_bird_rand_attn
I have pushed the changes @sgugger :)
…ce#23056) * fix random attention usage for bigbird and pegasus_bigbird * remove staticmethod, update tests target valus * revert style changes
Fixes #23055

What does this PR do?

Add control over the usage of BigBird's random attention based on the current mode (training/eval).

Who can review?
@sanchit-gandhi @ydshieh
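For reference, a quick sketch of the behaviour this PR targets; the checkpoint name and sequence length are illustrative, not prescribed by the PR:

```python
import torch
from transformers import BigBirdModel

# any BigBird checkpoint using block-sparse attention works; this one is illustrative
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", attention_type="block_sparse")
input_ids = torch.randint(0, model.config.vocab_size, (1, 1024))

# eval mode: random attention (and dropout) is off, so repeated forward passes agree
model.eval()
with torch.no_grad():
    out1 = model(input_ids).last_hidden_state
    out2 = model(input_ids).last_hidden_state
print(torch.allclose(out1, out2))  # expected: True once random attention is gated on training mode

# train mode: random blocks are re-sampled, so the attention pattern may differ between calls
model.train()
```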