Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration Profiles #1020

Open
quasiben opened this issue Oct 18, 2022 · 8 comments
Open

Configuration Profiles #1020

quasiben opened this issue Oct 18, 2022 · 8 comments

Comments

@quasiben
Copy link
Member

quasiben commented Oct 18, 2022

We've had several issues come in related to defaults/cluster configurations (#990 / #348 / ...) and a general request to support "Configuration Profiles". As there are many ways to configure a Dask Cluster: CLI, ENV Var, Yaml Config, etc -- users would like to have a higher level concept i.e. a Profile, which codifies are particular configuration. This might be a configuration related to performance, debugging, benchmarking, environment (cloud vs hpc, vs local), etc.

Earlier today, it was pointed out to me that Dask does in fact check a couple of locations for YAML config files and the location itself is also configurable via the DASK_CONFIG env var:

For example, we can create the following RAPIDS-TPCDS/distributed.yaml

# distributed.yaml
distributed:
  logging:
    distributed: info
  version: 2
rmm:
  pool-size: 30GB
ucx:
  cuda_copy: true
  infiniband: true
  nvlink: true
  reuse-endpoints: true
  tcp: true

Then load like the following:

$ DASK_CONFIG=~/RAPIDS-TPCDS/ python -c "import dask, distributed; print(dask.config.get('distributed.rmm'))"
{'pool-size': '30GB'}

This is quite useful and would unblock, I think, several folks. However, I don't think we are done yet.

  1. The above example shows us that the UCX configuration in Distiributed.yaml is lacking the all button
  2. We are missing a number of options: device-memory-limit, jit-unspill
  3. We need help building some pre-configured configurations

For problem 1) and 2), what would folks think about building a config system inside of dask-cuda. This would be similar to what dask-kubernetes and dask-sql do. If we had this, then would could also think moving/reducing the configuration inside distributed.

For 3) would could ask folks like @VibhuJawa @ayushdg and @randerzander for their input in building their defaults as highlighted options

Another thing we could also think about is pushing the DASK_CONFIG env var to a CLI option for workers and schedulers;

dask-worker --dask-config-path /foo/bar

@ayushdg
Copy link
Member

ayushdg commented Oct 18, 2022

Thanks a lot for starting this discussion @quasiben. These set of yaml files for pre-baked profiles is something I'm in favor of.

I had a few questions around the usage of configuration files:

  1. Do the yaml based configuration only get picked up when creating a cluster from a python client (
    from dask_cuda import LocalCUDACluster; cluster = LocalCUDACluster())
    or does it also apply to scheduler and worker processes started up from the command line as well? (which is how we setup most multi-node workflows today)

  2. Can the Yaml file be extended to also set certain ucx environment variables (like UCX_TCP_RX_SEG_SIZE) or would we still have to export those separately on the scheduler, worker and client processes?

  3. Specifically for things like rmm pool-size it might make sense to allow passing a memory fraction instead of an absolute value. Curious on your thoughts here @VibhuJawa.

Generally I'm happy to help providing input for some of these default options, and also curious to hear if others opinions on what high level defaults make the most sense.

@randerzander
Copy link
Contributor

Specifically for things like rmm pool-size it might make sense to allow passing a memory fraction instead of an absolute value. Curious on your thoughts here @VibhuJawa.

Agree with this. Would make profiles less specific to a particular device type

@pentschev
Copy link
Member

  • Do the yaml based configuration only get picked up when creating a cluster from a python client (
    from dask_cuda import LocalCUDACluster; cluster = LocalCUDACluster())
    or does it also apply to scheduler and worker processes started up from the command line as well? (which is how we setup most multi-node workflows today)

It seems that it would work for the scheduler as well, but maybe @quasiben can confirm this. This is nevertheless a good question, the client must also pick the same configs.

  • Can the Yaml file be extended to also set certain ucx environment variables (like UCX_TCP_RX_SEG_SIZE) or would we still have to export those separately on the scheduler, worker and client processes?

Theoretically we could pass this to Dask, but I'm generally -1 on that because we would need to create a new Dask config for every possible UCX configuration. I think we need to think of a more robust way to address this rather than individually adding a new config for every UCX option, but I don't know if we can add in Dask some kind of config translation, e.g., distributed.comm.ucx.some-config -> UCX_SOME_CONFIG. If that's possible then we could support it without any major issues I believe. Alternatively, maybe Dask could support specifying environment variables directly within the yaml file?

  • Specifically for things like rmm pool-size it might make sense to allow passing a memory fraction instead of an absolute value. Curious on your thoughts here @VibhuJawa.

That's a good point, addressing this now in #1021 .

@wence-
Copy link
Contributor

wence- commented Oct 19, 2022

Theoretically we could pass this to Dask, but I'm generally -1 on that because we would need to create a new Dask config for every possible UCX configuration. I think we need to think of a more robust way to address this rather than individually adding a new config for every UCX option, but I don't know if we can add in Dask some kind of config translation, e.g., distributed.comm.ucx.some-config -> UCX_SOME_CONFIG. If that's possible then we could support it without any major issues I believe. Alternatively, maybe Dask could support specifying environment variables directly within the yaml file?

We can, with something like this:

diff --git a/distributed/comm/ucx.py b/distributed/comm/ucx.py
index fc07d048..254918cc 100644
--- a/distributed/comm/ucx.py
+++ b/distributed/comm/ucx.py
@@ -126,7 +126,16 @@ def init_once():
 
     ucp = _ucp
 
-    ucp.init(options=ucx_config, env_takes_precedence=True)
+    def munge(key):
+        return "_".join(map(str.upper, key.split("-")))
+
+    environment = {
+        munge(k): v
+        for k, v in dask.config.get("distributed.comm.ucx.environment", {}).items()
+    }
+    # Specific ucx_config options should override general environment ones
+    environment.update(ucx_config)
+    ucp.init(options=environment, env_takes_precedence=True)
 
     pool_size_str = dask.config.get("distributed.rmm.pool-size")
 
diff --git a/distributed/distributed-schema.yaml b/distributed/distributed-schema.yaml
index 4e31840d..d8207e5a 100644
--- a/distributed/distributed-schema.yaml
+++ b/distributed/distributed-schema.yaml
@@ -961,7 +961,14 @@ properties:
                   additional variables for each transport, while ensuring optimal connectivity. When
                   ``True``, a CUDA context will be created on the first device listed in
                   ``CUDA_VISIBLE_DEVICES``.
-
+              environment:
+                type: [object, 'null']
+                description: |
+                  Mapping setting specified UCX environment variables.
+                  A name
+                  ``distributed.comm.ucx.environment.some-option=value``
+                  is equivalent to setting ``UCX_SOME_OPTION=value` in
+                  the environment.
           tcp:
             type: object
             properties:
diff --git a/distributed/distributed.yaml b/distributed/distributed.yaml
index 81033703..821232b1 100644
--- a/distributed/distributed.yaml
+++ b/distributed/distributed.yaml
@@ -229,7 +229,8 @@ distributed:
       infiniband: null  # enable Infiniband
       rdmacm: null  # enable RDMACM
       create-cuda-context: null  # create CUDA context before UCX initialization
-
+      environment:
+        max_copy_reg: 1
     zstd:
       level: 3      # Compression level, between 1 and 22.
       threads: 0    # Threads to use. 0 for single-threaded, -1 to infer from cpu count.

Then I can do:

In[3]: dask.config.get("distributed.comm.ucx.environment")
Out[3]: {'max_copy_reg': 1}

@pentschev
Copy link
Member

I'm 100% onboard with this idea @wence- , could you submit a PR to distributed? And thanks for one more neat solution Lawrence-style!

@VibhuJawa
Copy link
Member

For 3) would could ask folks like @VibhuJawa @ayushdg and @randerzander for their input in building their defaults as highlighted options

More than happy to help on this . We can couple of these one for pure ETL,RAPIDS+DL , RAPIDS+Graph etc.

Specifically for things like rmm pool-size it might make sense to allow passing a memory fraction instead of an absolute value. Curious on your thoughts here @VibhuJawa.

I think we are all in agreement about allowing memory fraction is a good idea for defaults

For example, we can create the following RAPIDS-TPCDS/distributed.yaml

I think we can think about creating one config in the cugraph repo too to help customers and users to the right configuration.
CC: @rlratzel , @jnke2016

rapids-bot bot pushed a commit that referenced this issue Oct 20, 2022
This is already supported by `memory_limit`/`device_memory_limit`, and this has been raised in #1020 during discussions on how to make Dask Profiles usable in Dask-CUDA.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1021
@pentschev
Copy link
Member

#1021 is now in and should allow specifying rmm_pool_size=0.95/--rmm-pool-size 0.95, for example. Please let me know if you find any issues with it.

@wence-
Copy link
Contributor

wence- commented Oct 20, 2022

Generic UCX environment setting via dask config is dask/distributed#7164

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

6 participants