FEAT / Trainer: LOMO optimizer support #30178
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @muellerzr, would love to have a first review! 🙏
Thanks! This is a very interesting optimizer, so good to have this. Left some comments; I think it'd be best to downstream the support so accelerate can do `.backward()` easily and then bring it up to here.
src/transformers/trainer.py
Outdated
if not is_lomo_available():
    raise ImportError(
        "You need to install `galore_torch` in order to use GaLore optimizers"
        " install it with `pip install git+https://github.com/jiaweizzhao/GaLore`"
This can be updated to use their PyPI package, `galore-torch`.
src/transformers/trainer.py
Outdated
@@ -2111,6 +2135,8 @@ def _inner_training_loop(
         self._globalstep_last_logged = self.state.global_step
         model.zero_grad()
         grad_norm: Optional[float] = None
+        # LOMO has a slightly different optimizer API, see: https://github.com/OpenLMLab/LOMO/issues/73#issuecomment-2049612639
+        _is_lomo_optimizer = "Lomo" in self.optimizer.optimizer.__class__.__name__
Can you check if we can peek at `self.optimizer.optimizer.__module__` maybe? (Or similar.) That would be a bit more robust of a check than relying on class names.
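For illustration, a minimal sketch of such a module-based check, assuming the package name is `lomo_optim` (the import used later in this PR); `_looks_like_lomo` is a hypothetical helper name, not code from this PR:

```python
def _looks_like_lomo(optimizer) -> bool:
    # Unwrap Accelerate's wrapper if present, then inspect the module the optimizer
    # class was defined in rather than matching on its class name.
    inner = getattr(optimizer, "optimizer", optimizer)
    return inner.__class__.__module__.split(".")[0] == "lomo_optim"
```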
src/transformers/trainer.py
Outdated
@@ -2261,8 +2287,9 @@ def _inner_training_loop(
                 else:
                     grad_norm = _grad_norm

-                # Optimizer step
-                self.optimizer.step()
+                if not _is_lomo_optimizer:
Let's keep the comment here, and explain why lomo doesn't need the step
src/transformers/trainer.py
Outdated
if not is_lomo_optimizer:
    self.accelerator.backward(loss)
else:
    self.optimizer.optimizer.fused_backward(loss, self._get_learning_rate())
This makes me believe that we want to do this at the Accelerate level, to be quite honest, simply because `accelerator.backward` handles grad-accum dividing and a bunch of gradient scaling. Let's discuss this offline for a bit.
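For illustration, a rough sketch of the two backward paths being discussed; this is a simplified dispatch, not the Trainer's actual code. `fused_backward(loss, lr)` is the LOMO call quoted above, and `accelerator.backward` is the standard path that applies gradient-accumulation dividing and loss scaling:

```python
def backward_step(accelerator, optimizer, loss, lr, is_lomo: bool):
    """Illustrative only: dispatch between the standard and the LOMO backward styles."""
    if is_lomo:
        # LOMO fuses the parameter update into the backward pass, so there is no
        # separate optimizer.step() afterwards; calling it directly would bypass the
        # scaling logic that accelerator.backward normally applies.
        optimizer.fused_backward(loss, lr)
    else:
        # Standard path: Accelerate divides the loss for gradient accumulation and
        # applies mixed-precision scaling before calling loss.backward().
        accelerator.backward(loss)
```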
Yeah, makes sense. Happy to upstream that into accelerate to make things cleaner! OK, will ping you offline.
Co-authored-by: Zach Mueller <[email protected]>
Thanks @muellerzr! I offloaded most of the logic in huggingface/accelerate#2695 - wdyt? 🙏
Nice! Very well done. cc @LysandreJik for a final review
src/transformers/trainer.py
Outdated
self._is_lomo_optimizer = is_lomo_available() and isinstance(
    _unwrap_optimizer(self.optimizer), (Lomo, AdaLomo)
)
We can certainly do this, or just do `optimizer.optimizer` (since we know it'll be wrapped by accelerate). This is a bit safer, so seems good to me :)
Thanks for the work adding this!
Some general comments about the structure. Similar to badam, the fact that parts of the code need to know which optimizers are being used indicates we might be drawing the wrong boundaries around our abstractions.
@@ -3225,7 +3282,7 @@ def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor,
             with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                 scaled_loss.backward()
         else:
-            self.accelerator.backward(loss)
+            self.accelerator.backward(loss, **kwargs)
What happens if we pass the learning rate through when lomo isn't being used?
It will break .. 😢 but we:
1- raise an error if users do not have the correct accelerate version when init-ing the Trainer with LOMO
2- pass `learning_rate` only if the optimizer is a LOMO optimizer
3- removed `kwargs` in `training_step`
So hopefully this should be safe enough 🙏
src/transformers/trainer.py
Outdated
_is_lomo = False

if is_lomo_available():
    from lomo_optim import AdaLomo, Lomo

    _is_lomo = isinstance(_unwrap_optimizer(self.optimizer), (Lomo, AdaLomo))

# For LOMO optimizers you need to explicitly use the learning rate
if _is_lomo:
    kwargs["learning_rate"] = self._get_learning_rate()
If we have the optimizer set, don't we also have `self._is_lomo_optimizer`?
Yes indeed! Changed that.
src/transformers/trainer.py
Outdated
if isinstance(optimizer, AcceleratedOptimizer):
    optimizer = optimizer.optimizer
return optimizer
Is it guaranteed to only ever be one level of wrapping?
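For illustration, a sketch of an unwrap helper that would tolerate nested wrapping, in case more than one level were ever possible (assumes Accelerate's `AcceleratedOptimizer`, which is what the PR unwraps; this is not the PR's final code):

```python
from accelerate.optimizer import AcceleratedOptimizer


def _unwrap_optimizer(optimizer):
    # Peel off Accelerate wrappers until we reach the underlying optimizer.
    while isinstance(optimizer, AcceleratedOptimizer):
        optimizer = optimizer.optimizer
    return optimizer
```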
src/transformers/trainer.py
Outdated
if is_lomo_available():
    from lomo_optim import AdaLomo, Lomo

    _is_lomo = isinstance(_unwrap_optimizer(self.optimizer), (Lomo, AdaLomo))
I'd move the common logic out from here and L1086 into something like `_is_lomo_optimizer`, which handles importing Lomo and AdaLomo and unwrapping the optimizer.
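For illustration, a sketch of such a shared helper (assuming `is_lomo_available` and `_unwrap_optimizer` from this PR are in scope; the helper name simply mirrors the suggestion above):

```python
def _is_lomo_optimizer(optimizer) -> bool:
    # Centralize the optional import, the unwrapping, and the isinstance check so
    # both call sites can reuse it.
    if not is_lomo_available():
        return False
    from lomo_optim import AdaLomo, Lomo

    return isinstance(_unwrap_optimizer(optimizer), (Lomo, AdaLomo))
```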
src/transformers/trainer.py
Outdated
    Return:
        `torch.Tensor`: The tensor with training loss on this batch.
    """
    model.train()
    inputs = self._prepare_inputs(inputs)
    _is_lomo = False
hmmmm..... needing to have this in the training_step is a good indication that the abstractions here are leaky. Once we have the optimizer created, we shouldn't really need to know what type of optimizer it is in the rest of the code.
Nice catch .. I think it was old code; it should be much cleaner now!
Looks great! Thanks for all the work adding this feature + tests.
I just have one comment about the `self._is_lomo_optimizer` flag. LMKWYT
src/transformers/trainer.py
Outdated
@@ -398,6 +410,7 @@ def __init__(
         self.hp_name = None
         self.deepspeed = None
         self.is_in_train = False
+        self._is_lomo_optimizer = False
Thinking about this more, could we do something like `self._optimizer_type` instead, which then maps to e.g. enum values? This way, if new optimizers are added which require special logic, we just need one instance attribute.
Makes sense. In that case, what about just using `args.optim`, since its values can be used directly as enums:
transformers/src/transformers/trainer.py, line 1174 in 15c74a2:
elif args.optim == OptimizerNames.ADAMW_APEX_FUSED:
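For illustration, a sketch of what an `args.optim`-based check could look like (`OptimizerNames.LOMO` and `OptimizerNames.ADALOMO` are assumed to be the enum members added by this PR; the helper itself is hypothetical):

```python
from transformers.training_args import OptimizerNames


def uses_lomo(optim) -> bool:
    # Checks the configured optimizer name instead of inspecting the optimizer
    # instance, so no unwrapping is needed.
    return optim in (OptimizerNames.LOMO, OptimizerNames.ADALOMO)
```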
WDYT of 9d547be? With that I also removed the `_unwrap_optimizer` logic, which makes some parts of the code cleaner!
Looks great - thanks for iterating on this!
_ = trainer.train()

for name, param in tiny_llama.named_parameters():
    self.assertFalse(torch.allclose(param, previous_params[name].to(param.device), rtol=1e-12, atol=1e-12))
This tolerance is super small; do we expect optimizers to make changes on this order?
It is OK to put it higher; I decided to put it low so that even small changes would be captured by the test (sometimes higher tolerances would fail even though the weights are properly updated, even with a high learning rate), so this is just to be on the safe side.
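For illustration, the assertion pattern under discussion as a standalone sketch (hypothetical helper; a tight `rtol`/`atol` makes `torch.allclose` strict, so even tiny parameter updates count as a change, while a looser tolerance would only flag larger updates):

```python
import torch


def assert_all_params_changed(model, previous_params, rtol=1e-12, atol=1e-12):
    # Compare every parameter against its snapshot taken before training.
    for name, param in model.named_parameters():
        before = previous_params[name].to(param.device)
        assert not torch.allclose(param, before, rtol=rtol, atol=atol), f"{name} was not updated"
```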
Very nice - thanks for all the work adding this, including tests and iterating on the solution!
Thanks so much for the extensive review and help @amyeroberts @muellerzr!
What does this PR do?
Fixes: #29649
As requested by the community, this PR integrates LOMO optimizers into the HF Trainer: https://github.com/OpenLMLab/LOMO
I am facing some issues with respect to AdaLOMO, which seems to have DeepSpeed as a hard requirement in the optimizer's init: https://github.com/OpenLMLab/LOMO/blob/85d8105c48cbd676dbf6915ee755461cd241da9b/lomo_optim/adalomo.py#L85 - so I am leaving this as a draft for now until I figure out this issue.
cc @amyeroberts