[Submitit] Restructure the plugin configuration #682

jrapin · 2020-06-15T17:17:41Z

Motivation

Only use AutoExecutor for submitit plugin, since it is the recommended method

Todo

clarify parameter names (timeout_min -> timeout_minute(s)? gpus_per_node -> gpus_per_task for forward compatibility?). TBD
clean DictConf instantiation in init
can we get rid of executor parameter (local/slurm)?
manual test on slurm
update documentation

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

To do

jrapin

The PR does not work at all for now; it just states what the configs should be, in my opinion.
Note that it's actually shorter, simpler and actually provides more options.
We'll discuss names later, some may require renaming

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

omry · 2020-06-16T14:35:34Z

ask for review when ready.

jrapin · 2020-06-16T17:11:44Z

ask for review when ready.

I wouldn't call it ready, but I need help sorting this out. I couldn't make the config group work.

The first thing to deal with, in my opinion, is what the structure of the conf should look like.
For now it's:

folder
queue
params:
  gpus_per_task
  etc

I tried to change it to something like the following, removing one level

folder
queue
gpus_per_task
etc

but that makes the init of the plugin a bit more complicated.
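
The two layouts can be compared with plain dictionaries. The sketch below is only an illustration of the extra step a flat layout would push into the init: `split_flat_config` and `TOP_LEVEL_KEYS` are hypothetical names that do not exist in the plugin, and the values are made up.

```python
from typing import Any, Dict

# Keys that sit at the top level in the nested layout shown above.
# Everything else is assumed (for this sketch) to belong under "params".
TOP_LEVEL_KEYS = {"folder", "queue"}


def split_flat_config(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the nested folder/queue/params layout from a flat mapping."""
    nested: Dict[str, Any] = {"params": {}}
    for key, value in flat.items():
        if key in TOP_LEVEL_KEYS:
            nested[key] = value
        else:
            nested["params"][key] = value
    return nested


flat_conf = {"folder": "outputs", "queue": "slurm", "gpus_per_task": 1}
nested_conf = split_flat_config(flat_conf)
print(nested_conf)
```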

When that is dealt with, the second question will of course be "how to change from local to slurm"

omry · 2020-06-16T23:30:30Z

The plugin is instantiated using hydra.util.instantiate.
The structure it expects is ObjectConf.
This is where the first params is coming from. This is not flexible.

There is an open rfc to eliminate that and support recursive instantiation (#566).
For now, keep the first params.

Try something like this (note that I renamed *Conf to *Params to reflect that I want to use them as the params object):

@dataclass
class BaseParams:
    # Where does this belong? Maybe in base?
    # If not, move it to one of the two specific configs.
    folder: str = "${hydra.sweep.dir}/.${hydra.launcher.params.executor}"
    ...

@dataclass
class SlurmParams(BaseParams):
   ...

@dataclass
class LocalParams(BaseParams):
    ...


@dataclass
class SlurmConf(ObjectConf):
  cls: ...
  params: SlurmParams = SlurmParams()

@dataclass
class LocalConf(ObjectConf):
  cls: ...
  params: LocalParams = LocalParams()

# finally, register two different choices:
ConfigStore.instance().store(
    group="hydra/launcher",
    name="submitit_local",
    node=LocalConf,
    provider="submitit_launcher",
)

ConfigStore.instance().store(
    group="hydra/launcher",
    name="submitit_slurm",
    node=SlurmConf,
    provider="submitit_launcher",
)

There is a similar example here (without the ObjectConf).
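
For readers without Hydra at hand, the shape of this proposal can be sketched with the standard library only. In the sketch below, `REGISTRY` and `store` are hypothetical stand-ins for `ConfigStore`, the `cls` strings are placeholders, and the fields (`timeout_min`, `partition`) are borrowed from names mentioned in this thread rather than from the final config.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BaseParams:
    # Option shared by both launchers (illustrative field).
    timeout_min: int = 60


@dataclass
class SlurmParams(BaseParams):
    # Slurm-only option (illustrative field).
    partition: Optional[str] = None


@dataclass
class LocalParams(BaseParams):
    pass


@dataclass
class SlurmConf:
    cls: str = "SlurmLauncherPlaceholder"
    params: SlurmParams = field(default_factory=SlurmParams)


@dataclass
class LocalConf:
    cls: str = "LocalLauncherPlaceholder"
    params: LocalParams = field(default_factory=LocalParams)


# Hypothetical stand-in for ConfigStore: maps "group/name" to a config class.
REGISTRY: Dict[str, type] = {}


def store(group: str, name: str, node: type) -> None:
    REGISTRY[f"{group}/{name}"] = node


store("hydra/launcher", "submitit_local", LocalConf)
store("hydra/launcher", "submitit_slurm", SlurmConf)

# Choosing a launcher by name yields a config with the matching params type.
chosen = REGISTRY["hydra/launcher/submitit_slurm"]()
print(chosen)
```

The sketch uses `field(default_factory=...)` for the nested `params` default, since recent Python versions (3.11 and later, to my knowledge) reject an unhashable dataclass instance as a field default.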

omry · 2020-06-16T23:34:09Z

An alternative style which might be easier to read is this:
(Not sure which one I prefer yet).
I do this in Hydra's config when it's not getting too deep.

@dataclass
class BaseParams:
  ...

@dataclass
class SlurmConf(ObjectConf):
  cls :str = "hydra_plugins.....SubmititLauncher"

  @dataclass
  class SlurmParams(BaseParams):
     ...

  params: SlurmParams = SlurmParams()

@dataclass
class LocalConf(ObjectConf):
  cls :str = "hydra_plugins.....SubmititLauncher"
  @dataclass
  class LocalParams(BaseParams):
      ...

  params: LocalParams = LocalParams()

jrapin

Thanks for the help.
It's now working locally. There are a few questions left before I move on to testing manually on slurm (see comments).

jrapin · 2020-06-17T11:48:24Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

-    # cluster to use (currently either "slurm" or "local" are supported,
-    # None defaults to an available cluster)
-    cluster: str = "slurm"
+    executor: ExecutorName = ExecutorName.slurm  # can we get rid of this?


Is this needed here? Is there another way to get this information since we now call submitit_local or submitit_slurm. That would prevent a user from setting it to the wrong value.

jrapin · 2020-06-17T11:48:45Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

+        self.params = OmegaConf.structured(
+            LocalParams if params["executor"].value == "local" else SlurmParams
+        )
+        for key, val in params.items():
+            OmegaConf.update(self.params, key, val if key != "executor" else val.value)


How can this be simplified?

jrapin · 2020-06-17T11:50:14Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

            }
        )
+        init_keys = specific_init_keys | set(init_renamer)  # used config keys
+        executor = submitit.AutoExecutor(**init_params)


Now everything goes through AutoExecutor, whatever the choice of the executor. This makes it more maintainable.

If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.

I won't call SlurmExecutor and LocalExecutor directly, that would make the plugin harder to maintain. The alignment will actually be much simpler to maintain. Also, there's a PR on its way in submitit that would make it even easier.

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

plugins/hydra_submitit_launcher/tests/test_submitit_launcher.py

omry

I like this more.
see my comments, I think we can simplify more by splitting into two different launcher classes.
they can inherit from your current launcher and have custom __init__() methods.

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

omry · 2020-06-17T15:41:08Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

-    # cluster to use (currently either "slurm" or "local" are supported,
-    # None defaults to an available cluster)
-    cluster: str = "slurm"
+    executor: ExecutorName = ExecutorName.slurm  # can we get rid of this?


Right now you are using this as the flag to gate slurm or local initialization logic.
An alternative I think will be much cleaner is to use a different class. I will comment below.

omry · 2020-06-17T15:49:41Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py



 @dataclass
-class SubmititLauncherConf(ObjectConf):
+class LocalConf(ObjectConf):
    cls: str = "hydra_plugins.hydra_submitit_launcher.submitit_launcher.SubmititLauncher"


You can define a launcher class for each mode and have the custom logic in it.
They can share a superclass or otherwise reuse most of the logic through other means.

cls: str = "hydra_plugins.hydra_submitit_launcher.submitit_launcher.SubmititLocalLauncher"

This will also allow you to use named parameters in the __init__ of the class:

class SubmititLauncher(Launcher):
    def __init__(
        self,
        partition: Optional[str],
        comment: Optional[str],
        constraint: Optional[str],
        exclude: Optional[str],
        ...
    ) -> None:
        # forward directly to the implementation
        # (my preference would be to use SlurmExecutor and LocalExecutor at this point).
        ...

I'd rather not duplicate the name of each parameter in both the config and the init. Naming them here would just add code for nothing, in my opinion (we don't need type hints here since it's never called directly by a user, and all the checks can be performed efficiently through the config). So is there a shortcut to initialize the OmegaConf object from a dictionary? I couldn't find one that doesn't require a for loop.

You asked how to simplify; this is my advice.
I don't think you will need to duplicate much:
between config inheritance and passing only the needed parameters into each init, there will be no duplication, in my opinion.

regardless, it's your call.
Changing submitit can also help.

But is there a better way to build the OmegaConf structure from a dict? Currently I am doing:

self.params = OmegaConf.structured(LocalParams)
for key, val in params.items():
    OmegaConf.update(self.params, key, val)

but I would think I am missing something (and it would not work with Enum, but fortunately there is no more Enum now).
Otherwise I'll use the Config directly, but I prefer the flexibility of OmegaConf.

Try:

self.params = OmegaConf.structured(LocalParams(**params))

The general answer to:

but is there a better way to build the OmegaConf structure from a dict

is:

OmegaConf.create({})

In this example you want to get the type safety from the dataclass.
Here is some context:
https://omegaconf.readthedocs.io/en/latest/structured_config.html#merging-with-other-configs

You can also try what I suggested, which is possibly a better match here.
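
The `LocalParams(**params)` suggestion leans on the dataclass constructor to validate keys in one call. The sketch below shows just that part with the standard library; the fields are hypothetical, `build_params` is an illustrative helper, and wrapping the result in `OmegaConf.structured(...)` as suggested above is left out so the example has no third-party dependency.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class LocalParams:
    # Hypothetical fields, for illustration only.
    timeout_min: int = 60
    gpus_per_task: int = 0


def build_params(params: Dict[str, Any]) -> LocalParams:
    """Build the typed params object in one call, without an update loop."""
    return LocalParams(**params)


ok = build_params({"timeout_min": 5})
print(asdict(ok))

rejected = ""
try:
    build_params({"timout_min": 5})  # misspelled key on purpose
except TypeError as err:
    # The dataclass-generated __init__ refuses unknown keyword arguments.
    rejected = str(err)
    print("rejected:", rejected)
```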

omry · 2020-06-17T15:55:43Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

            }
        )
+        init_keys = specific_init_keys | set(init_renamer)  # used config keys
+        executor = submitit.AutoExecutor(**init_params)


As you can tell, it takes some trickery to prepare the correct dictionary for the AutoExecutor.
The reason is that the input config is no longer lining up with AutoExecutor parameters.

If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.

You can still use auto, but at least in this case the logic in each of the launchers will need to initialize a known mode which will simplify things
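
To make the "trickery" concrete, here is a minimal sketch of the kind of key splitting and renaming the diff above hints at. The names `init_renamer`, `specific_init_keys`, `init_keys` and `init_params` come from the diff; the function `split_executor_args`, the example keys, and the mapping of `submitit_folder` to `folder` are assumptions made for illustration, not the plugin's actual code.

```python
from typing import Any, Dict, Set, Tuple


def split_executor_args(
    config: Dict[str, Any],
    init_renamer: Dict[str, str],
    specific_init_keys: Set[str],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a flat config into constructor kwargs and remaining parameters."""
    # Same expression as in the diff: every config key consumed by the init.
    init_keys = specific_init_keys | set(init_renamer)
    # Constructor kwargs, with config names translated to executor names.
    init_params = {
        init_renamer.get(key, key): value
        for key, value in config.items()
        if key in init_keys
    }
    # Everything else would be passed on as resource parameters.
    update_params = {
        key: value for key, value in config.items() if key not in init_keys
    }
    return init_params, update_params


config = {
    "submitit_folder": "/tmp/logs",
    "max_num_timeout": 3,
    "timeout_min": 60,
    "partition": "dev",
}
init_params, update_params = split_executor_args(
    config,
    init_renamer={"submitit_folder": "folder"},
    specific_init_keys={"max_num_timeout"},
)
print(init_params)
print(update_params)
```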

omry · 2020-06-17T15:56:02Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

+
+        # specify resources/parameters
+        baseparams = set(dataclasses.asdict(BaseParams()).keys())
+        print(baseparams)


Don't forget to remove the prints.

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

omry · 2020-06-18T15:41:34Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

        self.params = OmegaConf.structured(
-            LocalParams if params["executor"].value == "local" else SlurmParams
+            LocalParams if self._EXECUTOR == "local" else SlurmParams
        )


I suggest adding a base class to express that the slurm launcher is not REALLY a local launcher.

# your current local:
class BaseSubmititLauncher(Launcher):
    def _get_params_class(self) -> Type[BaseParams]:
        raise NotImplementedError

    ...  # the rest is the same.

class LocalSubmititLauncher(BaseSubmititLauncher):
    def _get_params_class(self) -> Type[BaseParams]:
        return LocalParams

class SlurmSubmititLauncher(BaseSubmititLauncher):
    def _get_params_class(self) -> Type[BaseParams]:
        return SlurmParams

You can go back to a DictConfig with something like this (and now you don't need the update loop):

self.params = OmegaConf.structured(self._get_params_class()(**params))
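
A self-contained sketch of this base-class pattern follows. It is an assumption-laden stand-in: the launcher here does not inherit from Hydra's `Launcher`, the params fields are hypothetical, and the typed dataclass is stored directly instead of being wrapped with `OmegaConf.structured`. Note that `_get_params_class` returns a class, which is then called with the parameters.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional, Type


@dataclass
class BaseParams:
    timeout_min: int = 60  # hypothetical shared field


@dataclass
class LocalParams(BaseParams):
    pass


@dataclass
class SlurmParams(BaseParams):
    partition: Optional[str] = None  # hypothetical Slurm-only field


class BaseSubmititLauncher:
    """Stand-in base class; the real one would subclass Hydra's Launcher."""

    _EXECUTOR = "abstract"

    def __init__(self, **params: Any) -> None:
        # Look up the params class for this launcher, then instantiate it.
        self.params = self._get_params_class()(**params)

    def _get_params_class(self) -> Type[BaseParams]:
        raise NotImplementedError


class LocalSubmititLauncher(BaseSubmititLauncher):
    _EXECUTOR = "local"

    def _get_params_class(self) -> Type[BaseParams]:
        return LocalParams


class SlurmSubmititLauncher(BaseSubmititLauncher):
    _EXECUTOR = "slurm"

    def _get_params_class(self) -> Type[BaseParams]:
        return SlurmParams


launcher = SlurmSubmititLauncher(partition="dev")
print(launcher._EXECUTOR, asdict(launcher.params))
```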

omry · 2020-06-18T15:42:47Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

        for key, val in params.items():
-            OmegaConf.update(self.params, key, val if key != "executor" else val.value)
+            OmegaConf.update(self.params, key, val)


I am surprised that worked in the previous version when you use the actual object and not a DictConfig.

regardless, you can get rid of this loop with the example I gave you above.

plugins/hydra_submitit_launcher/tests/test_submitit_launcher.py

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

jrapin · 2020-06-19T15:09:00Z

@omry it's now manually tested and working on SLURM cluster.
In my opinion it seems ready, let me know if it requires more updates.

omry · 2020-06-19T16:38:08Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

-    # cluster to use (currently either "slurm" or "local" are supported,
-    # None defaults to an available cluster)
-    cluster: str = "slurm"
+    submitit_folder: str = "${hydra.sweep.dir}/.submitit/%j"


Let's talk about this for a moment:
With Hydra, each job is getting a working directory inside the sweep dir.

If we can keep all the files for the job in that directory it would be really good (users will have an easier time debugging issues for a particular job).
This may take some submitit changes, but is something to consider for the future.

Right now, if a job is failing the user will need to:

Look at the log file in the job output directory.

Look in the submitit output folder, figure out which files belong to the job that caused the issue somehow and check the output/error files.

If we managed to use the working directory here it would be better.
I suspect currently this is not possible without changes to submitit.

An alternative solution to this problem is to symlink the stdout and stderr from the submitit folder into the job working directory. But this can also be tricky.

This is not a feature request for this PR, but something to think about to improve the user experience.
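
As a rough illustration of the symlink idea, the sketch below links every file whose name starts with a job id from a log folder into a job working directory. The helper `link_logs`, the folder names, and the `1234_0_log.out` file naming are assumptions for the example; it also leaves out when the links should be created, which is part of what makes this tricky.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def link_logs(submitit_folder: Path, job_id: str, workdir: Path) -> List[Path]:
    """Symlink files named '<job_id>_*' from submitit_folder into workdir."""
    links = []
    for src in sorted(submitit_folder.glob(f"{job_id}_*")):
        dst = workdir / src.name
        # Skip names that are already taken, including dangling symlinks.
        if not (dst.is_symlink() or dst.exists()):
            os.symlink(src.resolve(), dst)
        links.append(dst)
    return links


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    logs = root / ".submitit"  # stand-in for the submitit output folder
    work = root / "0"          # stand-in for one job's working directory
    logs.mkdir()
    work.mkdir()
    (logs / "1234_0_log.out").write_text("hello\n")
    (logs / "1234_0_log.err").write_text("")
    created = link_logs(logs, "1234", work)
    linked_names = [path.name for path in created]
    linked_text = (work / "1234_0_log.out").read_text()

print(linked_names)
```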

If we can keep all the files for the job in that directory it would be really good

Indeed (for now this is the same behavior as before, but I agree it is not convenient)

I suspect currently this is not possible without changes to submitit.

It would actually be possible if not using arrays, but that's not a good option. With arrays on the other hand we can't specify the exact folder of each run individually, it must be deduced from the parameters of the array. Hacking something is somehow possible but would require a lot of (mostly unhelpful) changes in submitit and would not be robust anyway.

The other option would be that hydra uses the folder provided by submitit (using the job_id instead of 0, 1, 2 etc). But that's probably not something you'd want either, right?

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

omry · 2020-06-19T16:44:10Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py

            }
        )
+        init_keys = specific_init_keys | {"submitit_folder"}
+        executor = submitit.AutoExecutor(cluster=self._EXECUTOR, **init_params)


this line is confusing to me:

The auto executor is expecting the strings "local" or "slurm" as options for cluster?

omry · 2020-06-19T16:48:28Z

website/docs/plugins/submitit_launcher.md

 ```

 Note that this plugin expects a valid environment in the target host. usually this means a shared file system between
 the launching host and the target host.

-Submitit supports 3 types of queues: auto, local and slurm. Its config looks like this
+Submitit actually implements 2 different launchers: `submitit_slurm` to run on a SLURM cluster, and `local` to test locally.


What do you think about the following?

Include links to config.py with dataclasses instead of copying them in.

include the outputs of

$ python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher
$ python foo.py hydra/launcher=submitit_local --cfg hydra -p hydra.launcher

When describing each of the two launchers.
I am still trying to discover what is the best way to document plugins with structured configs, and I think I like the above proposal.

Ah, I saw that you already added similar output.

Yes I find them the most interesting since they make it easier to discover settings than just stating the config classes.

and I recommend that instead of inlining the dataclasses we can just give a link.

I have mixed feelings about this, since the point of including the config classes there was the description of each parameter. Then again, since that often ends up outdated if we're not careful, it's probably better to just have the link.

So I just:

removed the config classes

added a link to the config classes

added the discovery of submitit_local parameters

added a commandline override example

At some point I will support field descriptions in the help or --cfg output. When that happens we will be able to give a better description here.

omry

Looks great. please respond to my suggestion to change the docs.

omry · 2020-06-19T19:49:08Z

website/docs/plugins/submitit_launcher.md

 ```

 Note that this plugin expects a valid environment in the target host. usually this means a shared file system between
 the launching host and the target host.

-Submitit supports 3 types of queues: auto, local and slurm. Its config looks like this
+Submitit actually implements 2 different launchers: `submitit_slurm` to run on a SLURM cluster, and `local` to test locally.


Ah, I saw that you already added similar output.
Also since the primary use case is SLURM we can focus on it.

This will output the current slurm parameters:

python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher

and I recommend that instead of inlining the dataclasses we can just give a link.

Also, an example of how to override the launcher parameters from the command line would be helpful:

python foo.py hydra/launcher=submitit_slurm hydra.launcher.params.partition=prod
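
To show what such a dotted override addresses, here is a toy sketch that applies a `key.path=value` string to a nested dictionary. It is not Hydra's override parser: `apply_override` is a hypothetical helper, values are kept as plain strings, and the starting config is made up.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Apply an 'a.b.c=value' override in place on a nested dict."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Walk down the tree, creating intermediate levels when missing.
        node = node.setdefault(part, {})
    node[leaf] = value


config = {"hydra": {"launcher": {"params": {"partition": None}}}}
apply_override(config, "hydra.launcher.params.partition=prod")
print(config["hydra"]["launcher"]["params"])
```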

jrapin · 2020-06-22T14:19:56Z

ci/circleci: py36_win seems broken :s

omry

reran the windows thing, probably transient infra issue.

omry · 2020-06-22T16:38:25Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

-    # cluster to use (currently either "slurm" or "local" are supported,
-    # None defaults to an available cluster)
-    cluster: str = "slurm"
+    submitit_folder: str = "${hydra.sweep.dir}/.submitit/%j"


The user can customize the sweep subdirectory through Hydra.
So yeah, I can't really use the submitit generated dir.

omry · 2020-06-22T16:41:15Z

plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py

+# finally, register two different choices:
+ConfigStore.instance().store(
+    group="hydra/launcher",
+    name="submitit_local",
+    node=LocalConf,
+    provider="submitit_launcher",
+)


I think we can lose LocalConf and SlurmConf class definitions without losing functionality or clarity.

ConfigStore.instance().store(..., node=ObjectConf(cls="XYZClass", params=XYZParams()))

I may have done something wrong but I get:
omegaconf.errors.UnsupportedValueType: Value 'LocalParams' is not a supported primitive type

Is LocalParams a dataclass?

It's a BaseParams, which is a dataclass. Does it need a decorator as well? I'm not really familiar with dataclasses.

Try to create a minimal example showing the problem with OmegaConf directly.
this seems to work and is very close to what you are doing.

In [1]: from dataclasses import dataclass

In [9]: from typing import Any

In [2]: from omegaconf import OmegaConf

In [3]: @dataclass
   ...: class Foo:
   ...:     num : int = 10
   ...:

In [4]: @dataclass
   ...: class LocalFoo(Foo):
   ...:     name: str = "name"
   ...:

In [5]: OmegaConf.structured(LocalFoo)
Out[5]: {'num': 10, 'name': 'name'}

In [6]: OmegaConf.structured(LocalFoo())
Out[6]: {'num': 10, 'name': 'name'}

In [7]: OmegaConf.structured({"add-hoc":LocalFoo()})
Out[7]: {'add-hoc': {'num': 10, 'name': 'name'}}

In [10]: @dataclass
    ...: class ObjectConf:
    ...:     cls: str = "abc"
    ...:     params: Any = None
    ...:

In [11]: OmegaConf.structured(ObjectConf(cls="xyz", params=LocalFoo()))
Out[11]: {'cls': 'xyz', 'params': {'num': 10, 'name': 'name'}}

Example from someone else (he didn't complain so I think it's working for him):

https://github.com/anthonytec2/pytorch-lightning/pull/1/files/6b8d535b6eef983184b95beb3aec7838f6df41ab..09c393501a2c7ba451f2fc15e8d91b608fe2e770
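
On the decorator question, the standard library alone shows the difference. This may or may not be the cause of the error reported above; the sketch only demonstrates that a subclass of a dataclass does not register its own new fields unless it is decorated too. The class names `Undecorated` and `Decorated` are made up for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseParams:
    num: int = 10


class Undecorated(BaseParams):
    # Annotated class attribute, but NOT registered as a dataclass field.
    name: str = "name"


@dataclass
class Decorated(BaseParams):
    # Registered as a field because the subclass is decorated as well.
    name: str = "name"


undecorated_fields = [f.name for f in fields(Undecorated)]
decorated_fields = [f.name for f in fields(Decorated)]
print(undecorated_fields)
print(decorated_fields)
```

Without the decorator, the subclass inherits the parent's generated `__init__`, so `Undecorated(name="x")` raises a `TypeError`.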

omry · 2020-06-22T18:27:22Z

I increased the no-output timeout for windows. can you rebase and try again?
also see my comment about simplifying config.py a bit.

omry · 2020-06-22T21:34:05Z

going to land it, I am having some test failures with the existing submitit plugin after making some core changes and it will be best to fix them on top of your changes.
I will try to simplify the config as I suggested.

omry · 2020-06-22T21:35:31Z

One thing I noticed that is missing is a news fragment. especially for breaking changes like this one we should add them.
I will do it for this one.

jrapin · 2020-06-23T13:18:52Z

One thing I noticed that is missing is a news fragment. especially for breaking changes like this one we should add them.
I will do it for this one.

Thanks for this and for merging ;)

[WIP] [Submitit] Use only auto executor

12fdd9f

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 15, 2020

jrapin commented Jun 15, 2020

Start using only autoexecutor (only local is working)

4346543

jrapin added 2 commits June 16, 2020 18:47

Prepare next steps

efd26f1

isort

1a3650f

jrapin added 3 commits June 17, 2020 13:35

Working local

4e059ee

isort

5269c7e

Merge branch 'only_auto_fail' into only_auto

4dbb681

jrapin commented Jun 17, 2020

removeprints

46e8edf

omry reviewed Jun 17, 2020

jrapin added 2 commits June 18, 2020 12:25

Split launcher class

9cba683

pytest

78285ad

omry reviewed Jun 18, 2020

jrapin added 3 commits June 19, 2020 16:17

nits

b13ff1a

Update doc

4200394

black

37e4bf5

jrapin changed the title ~~[WIP] [Submitit] Use only auto executor~~ [Submitit] Use only auto executor Jun 19, 2020

Add discoverability

f7a6ecb

omry reviewed Jun 19, 2020

omry approved these changes Jun 19, 2020

omry changed the title ~~[Submitit] Use only auto executor~~ [Submitit] Restructure the plugin configuration Jun 19, 2020

jrapin and others added 2 commits June 22, 2020 11:40

Update plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_l…

240fc8a

…auncher/submitit_launcher.py Co-authored-by: Omry Yadan <[email protected]>

Update doc

cf21e0f

trigger_tests

975a372

omry reviewed Jun 22, 2020

Merge branch 'master' into only_auto

1f9203c

omry merged commit 556bc93 into facebookresearch:master Jun 22, 2020

jrapin deleted the only_auto branch June 23, 2020 13:18
