
RoPE: model-agnostic RoPE refactor #31999

Open · wants to merge 42 commits into base: main

Conversation

@gante (Member) commented on Jul 16, 2024

What does this PR do?

This PR:

  • Refactors RoPE such that it is model-agnostic.
    • RoPE models now only need one class.
    • The class is parameterized by the model config.
    • Based on the model config, the appropriate type of RoPE will be loaded into the class (a minimal sketch of the pattern follows this list).
  • Adds longrope as part of the model-agnostic refactor, applied to Phi3 (closes "Plans to Integrate LongRoPE into LLaMA?" #31992). With longrope, Phi3's checkpoints are now loadable.
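
For illustration, here is a minimal sketch of what a single, config-parameterized RoPE class can look like. It is a simplified assumption based on the description above, not the exact code in this PR; in particular, `ROPE_INIT_FUNCTIONS` and the config fields used here are illustrative.

```python
import torch
import torch.nn as nn


def _compute_default_inv_freq(config, device=None):
    # default RoPE: inverse frequencies derived from theta and the head dimension
    dim = config.hidden_size // config.num_attention_heads
    inv_freq = 1.0 / (config.rope_theta ** (torch.arange(0, dim, 2, dtype=torch.float, device=device) / dim))
    return inv_freq, 1.0  # (inverse frequencies, attention scaling factor)


# each RoPE variant ("default", "linear", "dynamic", "yarn", "longrope", ...) would
# register a function that derives its parameters from the model config
ROPE_INIT_FUNCTIONS = {"default": _compute_default_inv_freq}


class RotaryEmbedding(nn.Module):
    def __init__(self, config, device=None):
        super().__init__()
        scaling = getattr(config, "rope_scaling", None) or {}
        rope_type = scaling.get("rope_type", scaling.get("type", "default"))
        inv_freq, self.attention_scaling = ROPE_INIT_FUNCTIONS[rope_type](config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    @torch.no_grad()
    def forward(self, x, position_ids):
        # angles = positions (batch, seq, 1) * inverse frequencies (1, 1, dim/2)
        freqs = position_ids[:, :, None].float() * self.inv_freq[None, None, :].to(x.device)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = (emb.cos() * self.attention_scaling).to(x.dtype)
        sin = (emb.sin() * self.attention_scaling).to(x.dtype)
        return cos, sin
```

Under this pattern, adding a new scaling method becomes a matter of registering another entry in the mapping, without touching the per-model code.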

👉 Built on top of the Yarn PR (#30910)


Review

Key files to check, IN THIS SPECIFIC ORDER:

src/transformers/models/llama/modeling_llama.py 
src/transformers/models/llama/configuration_llama.py
src/transformers/modeling_rope_utils.py

👉 Other relevant files include phi3 (longrope) and recurrentgemma (a few custom changes)


Models that require future changes for standardization

⚠️ Some models don't support cache_positions, and therefore they are not changed as part of this PR (the new class is built with the new pattern in mind). A future PR is needed for these models, where both cache_positions and this new model-agnostic RoPE are added.

Models that were NOT changed but have RoPE:

  • ESM
  • Falcon
  • GPTNeoX
  • GPTNeoXJapanese
  • Idefics
  • Mixtral
  • Persimmon
  • Phi
  • Qwen2
  • Qwen2MoE
  • StableLM
  • Starcoder2

mig-mfreitas and others added 12 commits May 20, 2024 10:21
YaRN (Yet another RoPE extension method) combines the NTK-By-Parts
Interpolation and Attention Scaling methods, improving upon existing
RoPE interpolation methods for longer context window sizes.

Fine-tuned models maintain their original performance across benchmarks
while enabling efficient extrapolation and transfer learning for
quicker convergence, especially in compute-limited environments.

We implement YaRN and Dynamic-YaRN for the following list of models:

 - LLaMA
 - Falcon
 - GPT-NeoX
 - Olmo
 - Persimmon
 - Phi
 - StableLM
 - OpenLLaMA

New unit tests are added to assert YaRN's correct behavior on both
short and long sequence inputs.

For more details, please refer to https://arxiv.org/abs/2309.00071.

Co-authored-by: Miguel Almeida <[email protected]>
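
For reference, here is a minimal sketch of the two YaRN ingredients described above (NTK-by-parts interpolation of the inverse frequencies plus attention scaling), assuming a LLaMA-style rotary embedding. The helper names and the `beta_fast`/`beta_slow` defaults follow the YaRN paper and are illustrative, not necessarily the exact implementation added in these commits.

```python
import math

import torch


def yarn_attention_factor(factor: float) -> float:
    # attention scaling ("temperature") from the YaRN paper: 0.1 * ln(s) + 1.0,
    # applied to cos/sin so attention entropy stays stable at longer contexts
    return 0.1 * math.log(factor) + 1.0


def yarn_inv_freq(dim: int, base: float, factor: float, original_max_position_embeddings: int,
                  beta_fast: float = 32.0, beta_slow: float = 1.0) -> torch.Tensor:
    # NTK-by-parts: interpolate low-frequency dimensions, keep high-frequency ones,
    # and blend linearly in between
    pos_freqs = base ** (torch.arange(0, dim, 2, dtype=torch.float) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs              # original frequencies
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)   # position-interpolated frequencies

    def correction_dim(num_rotations: float) -> float:
        # dimension whose wavelength completes `num_rotations` over the original context
        return (dim * math.log(original_max_position_embeddings / (num_rotations * 2 * math.pi))) / (
            2 * math.log(base)
        )

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    ramp = ((torch.arange(dim // 2, dtype=torch.float) - low) / max(high - low, 0.001)).clamp(0, 1)
    extrapolation_mask = 1.0 - ramp  # 1 -> keep original frequency, 0 -> interpolate
    return inv_freq_interpolation * (1 - extrapolation_mask) + inv_freq_extrapolation * extrapolation_mask
```

The resulting inverse frequencies replace the default ones, and the attention factor is multiplied into the cos/sin tensors.
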
Iterate on YaRN implementation for LLaMA and remove diff from remaining
models for increased PR modularity.

This commit includes the following changes:
- Merge 'yarn_rope_scaling' and 'rope_scaling' dictionaries
- Remove unnecessary attributes ('extrapolation_factor' and 'finetuned')
  from YaRN classes
- Inherit 'forward' method in YaRN classes from superclass
- Rename 'yarn' method to 'compute_yarn_scaling'
- Extend YaRN tests with further assertions
- Fix style inconsistencies

Co-authored-by: Miguel Monte e Freitas <[email protected]>
- Comply with the tensor building logic introduced in huggingface#30743
- Add a reference to the optimized attention factor equation
- Remove Dynamic YaRN for a more agile deployment

Co-authored-by: mig-mfreitas <[email protected]>
@ArthurZucker (Collaborator) left a comment


Off to a good start.
We want this to be easily configurable IMO, and with the least amount of checks on our side!

src/transformers/models/llama/modeling_llama.py (review thread: outdated, resolved)
Comment on lines 150 to 151
cos = cos * self.rope_config["attention_factor"]
sin = sin * self.rope_config["attention_factor"]
Collaborator


If this lives in the config rather than in a tensor or buffer, we will have device issues, plus we have less freedom IMO and no idea about the dtype.
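
A possible sketch of the "tensor or buffer" alternative suggested here, under the assumption that the factor is computed once at init time; this is illustrative, not the code the PR ends up with:

```python
import torch
import torch.nn as nn


class AttentionScaledRoPE(nn.Module):
    def __init__(self, rope_config: dict):
        super().__init__()
        # assumed config key; registering it as a non-persistent buffer means it follows
        # the module through .to(device) instead of being re-read from a plain dict
        factor = float(rope_config.get("attention_factor", 1.0))
        self.register_buffer("attention_factor", torch.tensor(factor), persistent=False)

    def apply_scaling(self, cos: torch.Tensor, sin: torch.Tensor):
        # cast to the cos/sin dtype at the point of use, so the dtype is never ambiguous
        factor = self.attention_factor.to(cos.dtype)
        return cos * factor, sin * factor
```

Keeping it as a plain Python float on the module would be an equally simple option, since a scalar multiplication has no device to worry about.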

src/transformers/models/llama/modeling_llama.py (four more review threads: outdated, resolved)
Comment on lines 113 to 121
config = LlamaConfig(**kwargs)
config.rope_theta = base
config.max_position_embeddings = max_position_embeddings
config.head_dim = dim  # this one doesn't actually exist, will only be used in the deprecation transition
if scaling_factor == 1.0 and len(kwargs) == 0:
    config.rope_scaling = None
else:
    config.rope_scaling = {"type": "default", "factor": scaling_factor}
    config.rope_scaling |= kwargs  # may overwrite "type"
Collaborator


That's fairly weird (initializing a config), but it only happens once, so it should be alright.

Member Author


It's the easiest path for the deprecation: in v4.45 we just delete these lines 👼

@gante gante marked this pull request as ready for review July 21, 2024 17:15
@gante (Member, Author) commented on Jul 21, 2024

(all RoPE models with cache_positions upgraded, now fixing CI)

@LysandreJik (Member) left a comment


Just gave a quick look at the API which looks good to me. Very nice and clean changes with the deprecation cycle.

Thanks for iterating on the PR! (Would really like to have @amyeroberts take a look at the PR as well if possible)

src/transformers/models/llama/modeling_llama.py (review thread: outdated, resolved)
@ritwickchaudhry commented on Jul 29, 2024

I'm trying to train the Phi-3-small-128k-instruct model, and loading its configuration leads to an error in the rope_scaling validation function here, because the config has more than 3 hyper-parameters and fails the check.

Will the PR fix this issue? If yes, when can we expect this to merge in main?
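
For context, a rough illustration of the mismatch being described: a longrope-style `rope_scaling` dictionary carries more keys than an older-style validator expects. The dictionary contents and the check below are simplified assumptions, not the actual Phi-3 config or the library's validation code.

```python
# a longrope-style entry has per-dimension factor lists plus the type
rope_scaling = {
    "type": "longrope",
    "short_factor": [1.0] * 48,  # values illustrative
    "long_factor": [4.0] * 48,   # values illustrative
}

# an older-style validator that only expects a small, fixed set of keys rejects it
expected_keys = {"type", "factor"}
unexpected = set(rope_scaling) - expected_keys
if unexpected:
    print(f"validation would fail: unexpected `rope_scaling` keys {sorted(unexpected)}")
```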

@ArthurZucker (Collaborator)

Hmm, what's weird is that this model uses code on the Hub. Anyway, if we broke something we can do a patch, but we need a proper reproducer!

@ritwickchaudhry commented on Jul 30, 2024

> Hmm, what's weird is that this model uses code on the Hub. Anyway, if we broke something we can do a patch, but we need a proper reproducer!

Way to Reproduce:

from transformers import Phi3ForCausalLM

Phi3ForCausalLM.from_pretrained("<path/to/Phi3_small_128k_instruct>")

@ArthurZucker (Collaborator)

That model is "code on the Hub", so it's kind of expected.

@Fazziekey

@gante Hello, can you help me review this PR for fixing NTK scaling?

@gante (Member, Author) commented on Sep 10, 2024

Note: splitting this PR into multiple smaller ones, as the refactor needs extra attention in some models (e.g. cohere's RoPE is not exactly the same as llama's)

Keeping the PR open as a reference, until all models have the new RoPE structure

Successfully merging this pull request may close these issues.

Plans to Integrate LongRoPE into LLaMA?
8 participants