CAB aims to provide a comprehensive evaluation of efficient attentions. CAB contains seven real-world tasks from different research areas to evaluate efficient attentions under four fine-grained attention patterns. See our paper "CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling" (ICML 2023) for more details about CAB.
We are actively extending CAB with more long-range tasks and efficient models, and any suggestion of suitable tasks or efficient models with reproducible codebases is welcome. Please send us an email if you have any suggestions for improving CAB. As a thank-you, we will buy you a cup of Nayuki (奈雪) or Hey Tea (喜茶) for helping improve this work :).
The repository structure is organized as follows:
- `efficient-attention` is a plugin library of efficient attentions; detailed usage can be found in `efficient-attention`.
- `tasks` contains all the long-range tasks of CAB with reproducible scripts.
- `scripts` includes useful scripts for efficient attention, such as compositional index calculation and the causality test.
- `imgs` contains the experimental results of the CAB paper.
We put forward a fine-grained attention taxonomy that considers the conditionality and causality of attention. The Cartesian product {noncausal, causal} × {self, cross} yields four attention patterns: noncausal self (NS), causal self (CS), noncausal cross (NC), and causal cross (CC). Under this taxonomy, the four patterns capture different attentive functionality, as shown in Figure 1.
We investigate whether efficient attentions perform consistently across different attention patterns. For intra-benchmark comparison, we compute the Pearson correlation between each pair of attention patterns over the same arrangement of efficient attentions. For inter-benchmark comparison, we also calculate the correlation between CAB's four attention patterns and LRA. The pattern correlations are as follows:
We collect seven real-world long-sequence modeling tasks in CAB, with data lengths ranging from 300 to 16,000, and include eight widely used models as backbones to assess typical efficient attention mechanisms. Table 1 summarizes the tasks' statistics, evaluation metrics, backbone neural networks, and required attention patterns. We provide the environment setup and model training for the tasks of Text-to-Speech Synthesis (TTS), Summarization (Sum), Long Sequence Time-series Forecasting (LSTF), Point Cloud Completion (PCC), Language Modeling (LM), Masked Language Modeling (MLM), and Super-Resolution (SR).
We also show the task correlation between each pair of tasks in LRA and CAB as follows:
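For reference, each of these correlations is a Pearson correlation computed over the scores that the same set of efficient attentions obtains in two settings (two attention patterns, or a CAB task versus an LRA task). A minimal sketch of that computation, with placeholder numbers rather than CAB results:

```python
import numpy as np
from scipy.stats import pearsonr

# Scores of the same set of efficient attentions under two settings
# (e.g., two attention patterns). Placeholder values, not CAB results.
scores_pattern_a = np.array([0.98, 0.47, 0.27, -0.03, -0.29])
scores_pattern_b = np.array([0.62, 0.43, -0.24, -0.62, -0.10])

# Pearson correlation between the two arrangements of efficient attentions.
r, p_value = pearsonr(scores_pattern_a, scores_pattern_b)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```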
We report the compositional index (CI) in the leaderboards. CI is a normalized score that balances the influence of different evaluation metrics; a higher CI indicates better performance. The calculation of CI is shown in `scripts/compositional_index`. ΔCI denotes the difference in CI between an efficient attention and vanilla attention.
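As a rough, hedged sketch of the idea behind such a normalized composite score (z-score each metric across models, flip the sign of metrics where lower is better, then average), the following illustrates the computation; the authoritative implementation is in `scripts/compositional_index` and may differ in details:

```python
import numpy as np

def compositional_index(scores, lower_is_better):
    """Illustrative normalized composite score.

    scores: (n_models, n_metrics) raw metric values for each model.
    lower_is_better: per-metric flags; those metrics get their sign flipped
    so that a higher normalized score is always better.
    Note: an approximation of the idea, not necessarily the exact formula
    used in scripts/compositional_index.
    """
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-12)
    z[:, np.asarray(lower_is_better)] *= -1.0
    return z.mean(axis=1)  # one CI value per model

# Toy example with two metrics: accuracy (higher is better), latency (lower is better).
ci = compositional_index([[0.80, 120.0], [0.75, 90.0], [0.82, 150.0]],
                         lower_is_better=[False, True])
print(ci)
```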
Noncausal self (NS) leaderboard:

Model | TTS | Sum | SR | MLM | Avg. | ΔCI |
---|---|---|---|---|---|---|
local | 0.362 | 2.617 | 0.421 | 0.515 | 0.978 | 1.002 |
cosFormer | 0.516 | -0.190 | 1.042 | 0.498 | 0.466 | 0.490 |
LongShort | 0.572 | -0.322 | 0.298 | 0.528 | 0.269 | 0.293 |
vanilla | -0.301 | -0.246 | -0.077 | 0.527 | -0.024 | 0.000 |
LARA | -0.639 | -0.522 | 0.554 | 0.479 | -0.032 | -0.008 |
Performer | -0.282 | -0.088 | 0.439 | -1.211 | -0.285 | -0.261 |
Nyströmformer | -2.011 | -0.321 | 0.088 | 0.481 | -0.440 | -0.416 |
ProbSparse | 0.705 | -0.246 | -0.697 | -2.408 | -0.661 | -0.637 |
ABC | 0.043 | -0.678 | -2.525 | 0.414 | -0.686 | -0.662 |
FlashAttention | -0.383 | -0.201 | -5.503 | 0.530 | -1.389 | -1.365 |
S4D | 1.035 | - | 0.457 | 0.176 | - | - |
Causal self (CS) leaderboard:

Model | TTS | Sum | LM | Avg. | ΔCI |
---|---|---|---|---|---|
S4D | 1.030 | 1.143 | 0.780 | 0.985 | - |
LongShort | 0.701 | 0.340 | 0.812 | 0.617 | - |
FlashAttention | -0.033 | 0.562 | 0.751 | 0.426 | - |
local | -1.361 | 0.337 | 0.305 | -0.239 | - |
ABC | 0.707 | -1.461 | -1.117 | -0.623 | - |
vanilla | -0.047 | 0.784 | - | - | - |
Noncausal cross (NC) leaderboard:

Model | PCC | LSTF | Avg. | ΔCI |
---|---|---|---|---|
vanilla | 0.449 | 0.744 | 0.596 | 0.000 |
ABC | 0.573 | -0.164 | 0.204 | -0.392 |
Performer | 0.473 | -0.445 | 0.014 | -0.582 |
cosFormer | -1.496 | -0.135 | -0.815 | -1.411 |
Causal cross (CC) leaderboard:

Model | TTS | Sum | Avg. | ΔCI |
---|---|---|---|---|
vanilla | 1.047 | 0.867 | 0.956 | 0.000 |
ABC | -0.112 | 0.228 | 0.058 | -0.898 |
Performer | -0.935 | -1.094 | -1.014 | -1.970 |
We report the efficiency length to measure the utility of efficient attentions. The efficiency length is defined as the intersection point of the computational time (or memory) curves of an efficient model and vanilla attention; it represents the minimum sequence length at which a sub-quadratic efficient model surpasses vanilla attention in efficiency.
Model | Efficiency Length (Running Time) | Efficiency Length (Memory Usage) |
---|---|---|
Performer | 2,361 | 37 |
LARA | 2,541 | 68 |
ABC | 1,877 | 70 |
Nyströmformer | 3,045 | 94 |
cosFormer | 2,539 | 164 |
ProbSparse | 3,450 | 281 |
LongShort | 5,652 | 342 |
S4D | 6,011 | 949 |
local | 4,195 | 1,169 |
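To illustrate how such an efficiency length can be read off from profiling curves, here is a minimal sketch; the measurements below are placeholders, not CAB's profiling data:

```python
import numpy as np

def efficiency_length(lengths, efficient_cost, vanilla_cost):
    """Return the smallest measured sequence length at which the efficient
    model's cost (time or memory) drops below vanilla attention's cost.
    Returns None if the curves never cross in the measured range."""
    lengths = np.asarray(lengths)
    better = np.asarray(efficient_cost) < np.asarray(vanilla_cost)
    return int(lengths[better][0]) if better.any() else None

# Placeholder measurements (milliseconds per step) at increasing lengths.
lengths   = [256, 512, 1024, 2048, 4096, 8192]
vanilla   = [1.0, 2.1, 5.0, 14.0, 50.0, 190.0]
efficient = [1.4, 2.4, 4.6, 9.0, 18.0, 36.0]
print(efficiency_length(lengths, efficient, vanilla))  # -> 1024
```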
We explore the benefit of attention under the noncausal self attention pattern. The experimental results are as follows. The results show that the attention mechanism improves performance on most tasks. Surprisingly, after removing the attention mechanism, PoinTr and Informer achieve improvements on the PCC and LSTF tasks, respectively.
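Conceptually, this ablation replaces an attention module with an identity-style mapping and retrains the backbone. A minimal PyTorch sketch of such a replacement (the module and attribute names are hypothetical, not the actual PoinTr or Informer code):

```python
import torch.nn as nn

class NoAttention(nn.Module):
    """Hypothetical drop-in module that skips attention and returns the
    query stream unchanged (for ablating the attention mechanism)."""
    def forward(self, query, key=None, value=None, **kwargs):
        return query

# Hypothetical usage: swap the self-attention submodule of a backbone block.
# block.self_attn = NoAttention()
```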
We also examine efficient attention's interpolation/extrapolation capability. On the one hand, as the context size grows to 8,192, language models achieve decreasing perplexity; this indicates that longer contexts indeed improve language modeling, and efficient attentions handle interpolation well, as expected. On the other hand, for sequences with more than 8,192 tokens, the perplexities of all these efficient attention-equipped language models increase.
Please follow our experimental settings when evaluating your method. For the attention hyperparameters, we expect your method to follow, or stay close to, our settings (e.g., the window size). In addition, your attention method should contain no more than 10% of the backbone model's parameters.
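One simple way to check this parameter budget is to compare the attention module's parameter count with the backbone's; a minimal sketch (the helper name and usage below are illustrative, not part of the CAB codebase):

```python
def attention_param_ratio(attention_module, backbone_model):
    """Fraction of the backbone's parameters taken up by the attention module
    (illustrative helper, not part of the CAB codebase)."""
    attn_params = sum(p.numel() for p in attention_module.parameters())
    backbone_params = sum(p.numel() for p in backbone_model.parameters())
    return attn_params / backbone_params

# Submissions should keep this ratio at or below 0.10:
# assert attention_param_ratio(my_attention, my_backbone) <= 0.10
```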
You are welcome to contribute code and reproducible scripts for your efficient attention models to the efficient attention library. The results of individual tasks and the leaderboards will be updated through pull requests, and we will review submitted pull requests promptly.
@inproceedings{zhang2023cab,
title={CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling},
author={Jun Zhang and Shuyang Jiang and Jiangtao Feng and Lin Zheng and Lingpeng Kong},
booktitle={International Conference on Machine Learning},
year={2023}
}