Making train_parallel.py work on a single machine (non-distributed) #29

levinkhho · 2024-11-13T03:19:17Z

Small tweaks to train_parallel.py (with related changes to distributed.py and generate_batch.py) so that it works on a single machine. The main change: the model is now wrapped in the nn.parallel.DistributedDataParallel container only when running in a distributed environment.
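For illustration, here is a minimal sketch of this kind of conditional wrapping. It is not the code in this PR. It assumes the distributed environment is detected from the `WORLD_SIZE` environment variable (which launchers such as torchrun set). `is_distributed_env` and `maybe_wrap` are hypothetical names, and `wrap` stands in for a wrapper such as `nn.parallel.DistributedDataParallel`.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def is_distributed_env(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumption for this sketch: a world size above 1 means a distributed run."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1")) > 1


def maybe_wrap(model: T, wrap: Callable[[T], T], distributed: bool) -> T:
    """Apply `wrap` only when running distributed; otherwise return the model unchanged."""
    return wrap(model) if distributed else model
```

In a real run, `wrap` might be something like `lambda m: torch.nn.parallel.DistributedDataParallel(m)`, called as `maybe_wrap(model, wrap, is_distributed_env())`. On a single machine the model is returned unwrapped, which is why it then has no `module` attribute.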

luke-carlson · 2024-11-19T16:22:00Z

ml_mdm/clis/train_parallel.py

@@ -274,12 +288,15 @@ def main(args):
                    "args": args,
                }  # save full config.
                ema_model.save(vision_model_file, other_items=other_items)
-                diffusion_model.model.module.vision_model.save(
+                getattr(diffusion_model.model, "module", diffusion_model.model).vision_model.save(


Why do we need getattr here? Does diffusion_model.model sometimes not have a module attribute? Or is this connected to the torch change?

levinkhho · Nov 19, 2024 (edited)

It seems that diffusion_model.model only has a module attribute when it is wrapped in the nn.parallel.DistributedDataParallel container, so this getattr falls back to diffusion_model.model itself when the model is not wrapped.

If there's a better way to account for this (e.g. just adding the module attribute to the model instance if it's missing), please let me know and I can make the change!
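One possible alternative, sketched under assumptions: keep the same `getattr` fallback but move it into a small helper so call sites stay readable. `unwrap_model` is a hypothetical name, and the classes below are plain-Python stand-ins for the real model and the DistributedDataParallel container, used only so the example runs without torch.

```python
def unwrap_model(model):
    """Return model.module if the model is wrapped, else the model itself."""
    return getattr(model, "module", model)


class VisionModel:
    """Stand-in for the real vision model (hypothetical)."""

    def save(self, path):
        return f"saved to {path}"


class PlainModel:
    """Stand-in for an unwrapped model that owns a vision_model."""

    def __init__(self):
        self.vision_model = VisionModel()


class WrappedModel:
    """Stand-in for a DDP-style container exposing the inner model as .module."""

    def __init__(self, module):
        self.module = module


plain = PlainModel()
wrapped = WrappedModel(plain)

# Both paths resolve to the same underlying model.
same_target = unwrap_model(plain) is unwrap_model(wrapped)
```

The call site could then use `unwrap_model(diffusion_model.model).vision_model` in place of the inline `getattr`. This avoids patching a `module` attribute onto the unwrapped model, which could blur the distinction between wrapped and unwrapped instances.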

levinkhho added 5 commits November 12, 2024 21:37

Update distributed.py

e024eb0

Update trainer.py

367131c

Update train_parallel.py

571a6b9

Update train_parallel.py

3124abc

Update train_parallel.py

274ee5f

luke-carlson reviewed Nov 19, 2024

Update pyproject.toml

3b1eca2
