Add Multi Resolution Analysis (MRA) (New PR) #24513
Conversation
Copied all files over from #20573.
The documentation is not available anymore as the PR was closed or merged.
Could you fix the failing tests?
Hello @sgugger, I've made sure all checks pass and fixed conflicts.
Thanks for cleaning everything! I just have one tiny nit. @amyeroberts could you have one final look and merge?
Thanks for adding this model!
Really nice PR. Mostly a few very, very small nits. The only main comment to be addressed before merging is the implementation of test_attention_outputs.
query_layer.float(),
key_layer.float(),
value_layer.float(),
attention_mask.float(),
I'm not super familiar with the assumptions we have about our models and the layer types. @sgugger - is it OK to call float() like this?
There is a custom CUDA kernel which I'm guessing cannot handle other dtypes.
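To make the pattern concrete, here is a minimal sketch of an upcast-then-downcast wrapper, assuming the custom kernel only supports float32. The wrapper name and the `kernel_fn` argument are hypothetical and stand in for the actual kernel entry point in modeling_mra.py:

import torch

def call_float32_kernel(query, key, value, mask, kernel_fn):
    # Hypothetical wrapper: kernel_fn represents a custom CUDA kernel that
    # only handles float32, so inputs are upcast before the call.
    orig_dtype = query.dtype
    out = kernel_fn(query.float(), key.float(), value.float(), mask.float())
    # Cast back so downstream layers keep the model's original dtype.
    return out.to(orig_dtype)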
batch_size=2,
seq_length=256,
is_training=True,
use_input_mask=True,
use_token_type_ids=True,
use_labels=True,
vocab_size=99,
hidden_size=128,
num_hidden_layers=5,
num_attention_heads=2,
intermediate_size=36,
hidden_act="gelu",
hidden_dropout_prob=0.0,
attention_probs_dropout_prob=0.0,
max_position_embeddings=512,
Some of the argument values defining the model architecture here are quite large, which will make running the test suite slow. Could you reduce seq_length and hidden_size?
Reduced seq_length to 8 and hidden_size to 16.
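For reference, a sketch of what the reduced tester defaults could look like; the argument list is abbreviated from the diff above and only the values discussed here are shown:

class MraModelTester:
    def __init__(
        self,
        parent,
        batch_size=2,
        seq_length=8,    # reduced from 256
        hidden_size=16,  # reduced from 128
        num_hidden_layers=5,
        num_attention_heads=2,
        intermediate_size=36,
        vocab_size=99,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.vocab_size = vocab_size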
def test_attention_outputs(self):
    return
Tests that are skipped should be skipped explicitly with a unittest.skip(reason) decorator. In this case, as the model outputs attentions, a custom implementation should be added.
Hello, actually, MRA does not output attentions. All of the computation is done by the kernels, and the output of mra2_attention is the product of attention and value. For this reason, I've removed output_attentions from the modeling file and skipped this test.
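For reference, a minimal sketch of the explicit-skip pattern requested above, written as a standalone illustration rather than the actual test class; the reason string is illustrative:

import unittest

class MraModelTest(unittest.TestCase):
    # Explicit skip with a reason, instead of a bare `return` in the test body.
    @unittest.skip(reason="MRA does not output attentions")
    def test_attention_outputs(self):
        pass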
Co-authored-by: Sylvain Gugger <[email protected]> Co-authored-by: amyeroberts <[email protected]>
Hello @amyeroberts, I've addressed your comments and made some code changes. Please take a look at the updated files.
Hi @amyeroberts, I've addressed the suggestions from the code review. Please take a look at the updated code.
Thanks for adding this model and iterating!
All LGTM - just two tiny, tiny nits. Otherwise, we're good to merge :)
Co-authored-by: amyeroberts <[email protected]>
Thanks for catching these errors @amyeroberts! I've applied both changes.
It seems the CI hits (line 403) ValueError: sequence length must be divisible by the block_size. when the custom CUDA kernel is loaded. It's likely because the reduced seq_length (8) is not divisible by the kernel's block_size (32). Note that our daily CI (with torch 2.0.1 + CUDA 11.8) fails to load the kernel, so mra2_attention returns early via

if cuda_kernel is None:
    return torch.zeros_like(query).requires_grad_()

However, in our CI with torch 1.13 (and with CUDA 11.6.2), the kernel is loaded, but the tests fail. It would be great if you could help us find the correct settings where the CI will pass when the kernel is loaded. Thanks in advance 🤗.
You can run

python3 -m pytest -v tests/models/mra/test_modeling_mra.py::MraModelTest::test_for_masked_lm

The full error log (if the custom CUDA kernel is loaded successfully) is:

self = <tests.models.mra.test_modeling_mra.MraModelTest testMethod=test_for_masked_lm>
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
> self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)
tests/models/mra/test_modeling_mra.py:322:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/models/mra/test_modeling_mra.py:210: in create_and_check_for_masked_lm
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194: in _call_impl
return forward_call(*input, **kwargs)
src/transformers/models/mra/modeling_mra.py:1093: in forward
outputs = self.mra(
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194: in _call_impl
return forward_call(*input, **kwargs)
src/transformers/models/mra/modeling_mra.py:1028: in forward
encoder_outputs = self.encoder(
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194: in _call_impl
return forward_call(*input, **kwargs)
src/transformers/models/mra/modeling_mra.py:782: in forward
layer_outputs = layer_module(hidden_states, attention_mask)
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194: in _call_impl
return forward_call(*input, **kwargs)
src/transformers/models/mra/modeling_mra.py:729: in forward
self_attention_outputs = self.attention(hidden_states, attention_mask)
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194: in _call_impl
return forward_call(*input, **kwargs)
src/transformers/models/mra/modeling_mra.py:681: in forward
self_outputs = self.self(hidden_states, attention_mask)
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194: in _call_impl
return forward_call(*input, **kwargs)
src/transformers/models/mra/modeling_mra.py:615: in forward
context_layer = mra2_attention(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
query = tensor([[[[ 0.0500, -0.0523, -0.0260, ..., 0.0000, 0.0000, 0.0000],
[-0.1339, 0.0844, 0.0287, ..., 0... [ 0.0293, 0.1609, 0.0547, ..., 0.0000, 0.0000, 0.0000]]]],
device='cuda:0', grad_fn=<CatBackward0>)
key = tensor([[[[ 0.0185, -0.0316, 0.0150, ..., 0.0000, 0.0000, 0.0000],
[-0.0575, -0.1123, 0.0832, ..., 0... [ 0.0608, 0.0932, -0.0973, ..., 0.0000, 0.0000, 0.0000]]]],
device='cuda:0', grad_fn=<CatBackward0>)
value = tensor([[[[ 0.0131, 0.1242, 0.0672, ..., 0.0000, 0.0000, 0.0000],
[-0.0212, 0.0600, 0.0269, ..., 0... [-0.1005, -0.0048, 0.0561, ..., 0.0000, 0.0000, 0.0000]]]],
device='cuda:0', grad_fn=<CatBackward0>)
mask = tensor([[-2.1475e+09, 1.0000e+00, 1.0000e+00, -2.1475e+09, 1.0000e+00,
-2.1475e+09, -2.1475e+09, 1.0000e+... 1.0000e+00, 1.0000e+00, -2.1475e+09, 1.0000e+00,
-2.1475e+09, -2.1475e+09, 1.0000e+00]], device='cuda:0')
num_blocks = 64, approx_mode = 'full', block_size = 32, initial_prior_first_n_blocks = 0, initial_prior_diagonal_n_blocks = 0
    def mra2_attention(
        query,
        key,
        value,
        mask,
        num_blocks,
        approx_mode,
        block_size=32,
        initial_prior_first_n_blocks=0,
        initial_prior_diagonal_n_blocks=0,
    ):
        """
        Use Mra to approximate self-attention.
        """
        if cuda_kernel is None:
            return torch.zeros_like(query).requires_grad_()

        batch_size, num_head, seq_len, head_dim = query.size()
        meta_batch = batch_size * num_head

        if seq_len % block_size != 0:
>           raise ValueError("sequence length must be divisible by the block_size.")
E           ValueError: sequence length must be divisible by the block_size.

src/transformers/models/mra/modeling_mra.py:403: ValueError
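To make the failing constraint concrete, here is a small standalone illustration using the reduced test values; the padding shown only demonstrates the divisibility requirement and is not necessarily what the modeling code does:

import torch
import torch.nn.functional as F

block_size = 32
batch_size, seq_len, hidden_size = 2, 8, 16  # reduced ModelTester values

hidden_states = torch.zeros(batch_size, seq_len, hidden_size)
# mra2_attention requires seq_len % block_size == 0, so a sequence of length 8
# would have to be padded up to the next multiple of 32 before the kernel call.
pad_len = (block_size - seq_len % block_size) % block_size
padded = F.pad(hidden_states, (0, 0, 0, pad_len))
print(padded.shape)  # torch.Size([2, 32, 16])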
Hi @novice03, really appreciate you taking the time on this. I tried it, and there are still 5 failures (it's already a great improvement!). However, we try to keep the values used in tests small so the test suite stays fast and light. I think it would be great if we could have a configurable attribute for this (for example, the block size). This way, we will have a way to use small values in the tests. Furthermore, the users of this model will have more flexibility when running it, and we can also have better documentation about how to set the config values and the inputs to make it work. Let me know WDYT 🙏 Thanks again!
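A hypothetical sketch of the kind of configurable attribute being suggested; the field name block_size and its presence on the config are assumptions, not the model's current API:

class MraConfigSketch:
    # Hypothetical config: exposing block_size would let the ModelTester pass
    # a small value instead of relying on the kernel's default of 32.
    def __init__(self, hidden_size=64, block_size=32, **kwargs):
        self.hidden_size = hidden_size
        self.block_size = block_size

# Tests could then use small values, e.g.:
config = MraConfigSketch(hidden_size=16, block_size=4)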
Hello @ydshieh, thanks for your reply. I understand that using large values increases the time and memory cost. However, since MRA was specifically designed for long sequences, it will be very tricky to run tests with small sequence lengths and block sizes. Unfortunately, I don't think that the tests can be fixed by lowering the block size. I've tried setting the block size to 4 or 8, and got multiple other errors (index out of bounds errors, CUDA errors, etc.). Also, all of the released checkpoints use block size = 32, so users cannot use the pretrained models with a different block size. I hope I'm not asking too much, but is there an alternative/exception that can be made? Either by allowing larger values or by running MRA tests without CUDA kernels. I've already verified that the HF model and the original code output similar logits and hidden states when CUDA kernels are loaded (with large sequence lengths).
Fair point! We will discuss internally how to deal with this model's testing, but could you check the following 5 remaining failed tests, which come from the new values you provided in an earlier comment, and see if you are able to fix them 🙏? Thanks! (Run on torch 1.13 + CUDA 11.6.2)

FAILED tests/models/mra/test_modeling_mra.py::MraModelTest::test_determinism - ValueError: zero-size array to reduction operation maximum which has no identity
FAILED tests/models/mra/test_modeling_mra.py::MraModelTest::test_feed_forward_chunking - AssertionError: False is not true
FAILED tests/models/mra/test_modeling_mra.py::MraModelTest::test_load_with_mismatched_shapes - ValueError: sequence length must be divisible by the block_size.
FAILED tests/models/mra/test_modeling_mra.py::MraModelTest::test_model_outputs_equivalence - TypeError: forward() got an unexpected keyword argument 'output_attentions'
FAILED tests/models/mra/test_modeling_mra.py::MraModelTest::test_retain_grad_hidden_states_attentions - TypeError: 'NoneType' object is not subscriptable
* Add all files
* Update masked_language_modeling.md
* fix mlm models
* fix conflicts
* fix conflicts
* fix copies
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
* Reduce seq_len and hidden_size in ModelTester
* remove output_attentions
* fix conflicts
* remove copied from statements
* Apply suggestions from code review
Co-authored-by: amyeroberts <[email protected]>
---------
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Add Multi Resolution Analysis (MRA) for Approximate Self-Attention
This PR adds the MRA model to the repository.
Paper: https://arxiv.org/pdf/2207.10284.pdf
Code: https://github.com/mlpen/mra-attention
To-do: