Infinite Memory Transformers

Pedro Henrique Martins et al., DeepMind - arXiv 2021

Summarized by Seungwook Kim


Overall impressions:

  • This was my first encounter with the concept of continuous attention, which I found intriguing; it might be applicable to vision tasks as well.
  • The compared baselines are Transformer-XL and the Compressive Transformer; comparisons against a wider range of (especially efficient) baselines would have been welcome.
  • The main novelty appears to be "approximating" a long, discrete sequence with a fixed-length, continuous signal, so it seems natural that Transformer-XL performs better when the sequence length is not very long.
  • As one of the few papers to propose continuous attention, there may be a lot of room for further development.
  • Worth reading if you work on sequence modeling, or if a vanilla transformer's memory cost is too high for your use case.

Task: Transformer Architecture

  • For long sequences

Motivation / Contributions

  • Transformers have computational limitations (memory)
  • Previous approaches to this problem: sparse attention, reducing attention complexity, ...
    • These still depend on the context length
    • They cannot handle unbounded context
  • Proposes the Infinite-former
    • A transformer extended with an unbounded long-term memory (LTM)
    • Allows the model to attend to arbitrarily long contexts
    • Introduces "sticky memories" so that important information persists in the LTM

Method outline

Represent input sequence as a "continuous signal"

  • Expressed as a linear combination of radial basis functions (RBFs)
    • Coefficients can be obtained in closed form (multivariate ridge regression)
  • The number of basis RBFs can be fixed -> fixed memory requirements! (see the sketch below)
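A minimal NumPy sketch of this step, assuming Gaussian RBFs centered on a uniform grid over [0, 1]; the function names and hyperparameters (`num_basis`, `lam`, `sigma`) are illustrative choices, not the paper's values.

```python
import numpy as np

def fit_continuous_signal(X, num_basis=64, lam=0.5, sigma=0.05):
    """Fit RBF coefficients B so that psi(t_j)^T B ~= x_j
    (multivariate ridge regression in closed form).

    X: (L, e) array of token embeddings / hidden states.
    Returns B: (num_basis, e) coefficients and mu: RBF centers.
    """
    L, e = X.shape
    t = np.linspace(0.0, 1.0, L)              # token positions mapped to [0, 1]
    mu = np.linspace(0.0, 1.0, num_basis)     # RBF centers on a uniform grid
    # F[n, j] = exp(-(t_j - mu_n)^2 / (2 sigma^2)), shape (num_basis, L)
    F = np.exp(-((t[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
    # Ridge regression closed form: B = (F F^T + lam I)^{-1} F X
    B = np.linalg.solve(F @ F.T + lam * np.eye(num_basis), F @ X)
    return B, mu

def evaluate_signal(B, mu, t_query, sigma=0.05):
    """Reconstruct the continuous signal at arbitrary positions t_query in [0, 1]."""
    psi = np.exp(-((t_query[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))  # (T, num_basis)
    return psi @ B                                                             # (T, e)
```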

Attend over continuous signal

  • Previously: a discrete probability distribution over the input sequence
  • Now: a probability density over positions -> modelled as a Gaussian (see the sketch below)
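A minimal sketch of continuous attention, assuming the Gaussian density is simply integrated numerically on a grid (the paper derives a closed-form expectation for Gaussian RBFs); `continuous_attention` and its arguments are illustrative names.

```python
import numpy as np

def continuous_attention(B, mu_basis, q_mean, q_var, sigma=0.05, num_points=1000):
    """Context vector as the expectation of the continuous signal under a
    Gaussian attention density N(q_mean, q_var) over positions t in [0, 1].

    Approximated here by numerical integration on a grid; names and defaults
    are illustrative.
    """
    t = np.linspace(0.0, 1.0, num_points)
    # Gaussian attention density p(t), renormalized on [0, 1]
    p = np.exp(-(t - q_mean) ** 2 / (2 * q_var))
    p /= p.sum()
    # Signal values X_bar(t) = psi(t)^T B at each integration point
    psi = np.exp(-((t[:, None] - mu_basis[None, :]) ** 2) / (2 * sigma ** 2))  # (T, N)
    X_bar = psi @ B                                                            # (T, e)
    # Expected value of the signal under the attention density
    return p @ X_bar                                                           # (e,)
```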

NOTE: Everything up to this point was already proposed in the NeurIPS 2020 paper "Sparse and Continuous Attention Mechanisms"

Infinite Memory Transformer

  • Vanilla transformer + LTM (long-term memory)
    • LTM: stores input embeddings / hidden states of previous steps
    • May also include an STM (short-term memory), as in Transformer-XL
  • The overall Q, K, V projections and residual connections are similar
    • But they are, of course, formulated differently (over the continuous signal)
  • Unbounded memory?
    • Input: existing vectors from the LTM, new vectors coming from the STM
    • Sample & contract the LTM signal, concatenate with the new vectors from the STM
    • Recalculate the coefficient matrix (see the sketch after this list)
  • No positional encoding is needed: memory vectors represent basis coefficients in a predefined continuous space
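A rough sketch of one memory-update step under the same assumptions as the first sketch (uniform contraction of the old signal into [0, tau], ridge refit over the concatenation); the names and defaults are illustrative, not the paper's notation.

```python
import numpy as np

def update_ltm(B_old, mu, X_new, tau=0.5, num_old_samples=256, lam=0.5, sigma=0.05):
    """One unbounded-memory update step (sketch): the old LTM signal is
    sampled and contracted into [0, tau], the incoming STM vectors fill
    (tau, 1], and the RBF coefficients are refit on the concatenation.
    """
    num_basis = mu.shape[0]

    # 1) Read back the old continuous signal at contracted positions in [0, tau]
    t_old = np.linspace(0.0, tau, num_old_samples)
    psi_old = np.exp(-((t_old[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
    X_old = psi_old @ B_old                                   # (num_old_samples, e)

    # 2) Place the incoming STM vectors on the remaining part of the domain
    L_new = X_new.shape[0]
    t_new = np.linspace(tau, 1.0, L_new + 1)[1:]              # positions in (tau, 1]

    # 3) Refit coefficients on the concatenated samples (same ridge regression as before)
    t_all = np.concatenate([t_old, t_new])
    X_all = np.concatenate([X_old, X_new], axis=0)
    F = np.exp(-((t_all[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
    B_new = np.linalg.solve(F @ F.T + lam * np.eye(num_basis), F @ X_all)
    return B_new
```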

Wouldn't this approach lose resolution for old memories?

  • Introduces "sticky memories"
  • Allocates more space in the LTM's new signal to the regions of the previous signal that received more attention
  • Better captures long context and prevents losing relevant information (see the sketch below)
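A hypothetical sketch of how the resampling positions for the old signal could be drawn proportionally to past attention instead of uniformly; the binning and jitter details are assumptions for illustration only.

```python
import numpy as np

def sticky_memory_positions(attention_mass, tau=0.5, num_old_samples=256):
    """Choose where to resample the old LTM signal (sketch of "sticky memories").

    attention_mass: (num_bins,) nonnegative attention each region of [0, 1]
    received in the previous step. Regions with more attention get more of the
    num_old_samples resampling points, so they keep more resolution after the
    contraction into [0, tau].
    """
    num_bins = attention_mass.shape[0]
    p = attention_mass / attention_mass.sum()
    # Sample bins proportionally to attention, then jitter within each bin
    bins = np.random.choice(num_bins, size=num_old_samples, p=p)
    offsets = np.random.uniform(0.0, 1.0 / num_bins, size=num_old_samples)
    t_old = np.sort(bins / num_bins + offsets)    # positions in [0, 1)
    return t_old * tau                            # contracted into [0, tau]
```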

Evaluation / Results

Synthetic task: sorting tokens by frequency of appearance

  • To ensure the decision is not based only on recent tokens, the frequency distribution changes over time
  • Transformer-XL > Infinite-former >= Compressive Transformer for short sequences
  • Infinite-former > Compressive Transformer >> Transformer-XL for longer sequences
    • (Open question: how does the Compressive Transformer work?)

Language Modeling task

  • Trained from scratch on the WikiText-103 dataset
    • Infinite-former > Compressive Transformer > Transformer-XL
    • Infinite-former with sticky memories > Infinite-former

Visualizations show that sticky memories indeed help.