Non-daemonic workers #2718

Closed
calebho opened this issue May 21, 2019 · 12 comments · Fixed by #2739

Comments

@calebho
Contributor

calebho commented May 21, 2019

Related to #2142, but the solution there doesn't apply in my case. I have a use case for workers running in separate processes, but as non-daemons, because the worker processes themselves need to use multiprocessing. Here's an example:

import torch
import torch.distributed as dist
import torchvision
import os
from distributed import Client, LocalCluster


def worker_fn(rank, world_size):
    print('worker', rank)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '8989'
    dist.init_process_group(
        backend=dist.Backend.NCCL,
        rank=rank,
        world_size=world_size,
    )
    print('initialized distributed', rank)

    if rank == 0:
        dataset = torchvision.datasets.MNIST(
            '../data/',
            train=True,
            download=True,
        )
    dist.barrier()
    if rank != 0:
        dataset = torchvision.datasets.MNIST(
            '../data/',
            train=True,
            download=False,
        )
    # load data, uses multiprocessing
    loader = torch.utils.data.DataLoader(
        dataset,
        sampler=torch.utils.data.distributed.DistributedSampler(
            dataset,
            rank=rank,
            num_replicas=world_size,
        ),
        num_workers=2,
    )
    print('constructed data loader', rank)

    # if cuda is available, initializes it as well
    assert torch.cuda.is_available()
    # do distributed training, but in this case it suffices to iterate
    for x, y in loader:
        pass


def main():
    world_size = 2
    cluster = LocalCluster(
        n_workers=world_size,
        processes=True,
        resources={
            'GPUS': 1,  # don't allow two tasks to run on the same worker
        },
    )
    cl = Client(cluster)
    futs = []
    for rank in range(world_size):
        futs.append(
            cl.submit(
                worker_fn,
                rank,
                world_size,
                resources={'GPUS': 1},
            ))

    for f in futs:
        f.result()


if __name__ == '__main__':
    main()

If processes=True, then we get an error about daemonic processes not being allowed to have children:

worker 0
worker 1
initialized distributed 1
initialized distributed 0
constructed data loader 0
constructed data loader 1
distributed.worker - WARNING -  Compute Failed
Function:  worker_fn
args:      (0, 2)
kwargs:    {}
Exception: AssertionError('daemonic processes are not allowed to have children',)

Traceback (most recent call last):
  File "scratch.py", line 152, in <module>
    main()
  File "scratch.py", line 148, in main
    f.result()
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/distributed/client.py", line 227, in result
    six.reraise(*result)
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "scratch.py", line 123, in worker_fn
    for x, y in loader:
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/multiprocessing/process.py", line 103, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
distributed.worker - WARNING -  Compute Failed
Function:  worker_fn
args:      (1, 2)
kwargs:    {}
Exception: AssertionError('daemonic processes are not allowed to have children',)

If processes=False, we get stuck at distributed initialization.
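
The error itself comes from the standard library rather than from dask: a daemonic multiprocessing.Process is not allowed to start children. A standalone illustration, independent of the example above:

import multiprocessing as mp


def child():
    pass


def parent():
    # Raises AssertionError: daemonic processes are not allowed to have children
    mp.Process(target=child).start()


if __name__ == '__main__':
    p = mp.Process(target=parent, daemon=True)
    p.start()
    p.join()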

@mrocklin
Member

You might try setting up dask-worker processes yourself, perhaps with the --no-nanny flag. If you come up with another, easier way to handle this, it would be great if you could share it here.
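
For reference, a minimal sketch of that suggestion (the port and address here are assumptions): launch the scheduler and nanny-less workers as ordinary subprocesses. Without the Nanny, the worker process is not daemonic, so tasks running on it may spawn children.

import subprocess

scheduler = subprocess.Popen(['dask-scheduler', '--port', '8786'])
workers = [
    subprocess.Popen(['dask-worker', 'tcp://127.0.0.1:8786', '--no-nanny'])
    for _ in range(2)
]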

@calebho
Contributor Author

calebho commented May 21, 2019

The first solution that comes to mind is to not run workers as daemons, but I suspect there's a reason why they are. So why are workers run as daemons?

@calebho
Contributor Author

calebho commented May 21, 2019

In particular, is there any reason why this isn't a parameter to the WorkerProcess constructor?

@mrocklin
Member

I have no problem with making that configurable. I might suggest a config option rather than piping it through all of the constructors.

https://docs.dask.org/en/latest/configuration.html

Any interest in submitting a Pull Request with that change?
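
For illustration, a rough sketch of what reading such a config value could look like in the worker-process setup (the key name distributed.worker.daemon matches the one discussed later in this thread; the helper function and its placement are assumptions):

import multiprocessing as mp

import dask


def make_worker_process(target):
    # Hypothetical helper: whether the child process is daemonic comes from
    # dask's config rather than from a new constructor parameter.
    p = mp.Process(target=target)
    p.daemon = dask.config.get('distributed.worker.daemon', default=True)
    return p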

@calebho
Contributor Author

calebho commented May 21, 2019

Yeah, I can work on a PR.

@mrocklin
Member

Checking in here @calebho . Is there anything I can do to help?

@calebho
Contributor Author

calebho commented May 28, 2019

@mrocklin Thanks for checking in. I started working on this last week, but I had to put it on hold because other things came up. I should have something by the end of the week or so.

@mrocklin
Member

mrocklin commented May 28, 2019 via email

@zhanghang1989

I am new to dask. Is it possible to set --no-nanny when using dask-ssh?

@mrocklin
Member

@zhanghang1989 I recommend raising a new issue rather than repeating your comment on multiple issues.

@calebho
Contributor Author

calebho commented May 30, 2019

Hey @mrocklin, a question about dask.config. I'm writing a test to see whether setting the config value actually does anything:

import multiprocessing as mp
from pprint import pprint

import dask
from distributed import Nanny
from distributed.utils_test import gen_cluster


@gen_cluster(ncores=[("127.0.0.1", 1)], client=True, Worker=Nanny)
def test_worker_no_daemon(c, s, a):
    def noop():
        pass

    def multiprocessing_worker():
        p = mp.Process(target=noop)
        p.start()
        p.join()

    with dask.config.set({"distributed.worker.daemon": False}):
        print('config in test')
        pprint(dask.config.config)
        yield c.submit(multiprocessing_worker)

The config value seems to be set in the test, but not in WorkerProcess.start, where I read the same config value to set self.process.daemon. The test fails with "daemonic processes are not allowed to have children". Sure enough, when I print the config inside of start, I don't see the config value at all, so it falls back to the default, which I set to be True. Am I using dask.config.set properly? I guess a workaround could be to create a temporary file in ~/.config/dask...

@mrocklin
Member

You are calling dask.config.set properly. However, at that stage the workers/nannies have already been created, so they will be daemon processes regardless. Presumably in testing you would want to use the config= keyword to the gen_cluster decorator (assuming that you've already made changes to respect the distributed.worker.daemon config value).
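
For example, the test above could pass the value through gen_cluster so it takes effect before the Nanny forks its worker process (a sketch, assuming WorkerProcess has been changed to respect distributed.worker.daemon):

import multiprocessing as mp

from distributed import Nanny
from distributed.utils_test import gen_cluster


@gen_cluster(
    ncores=[('127.0.0.1', 1)],
    client=True,
    Worker=Nanny,
    config={'distributed.worker.daemon': False},
)
def test_worker_no_daemon(c, s, a):
    def noop():
        pass

    def multiprocessing_worker():
        # Succeeds only if the worker process is non-daemonic.
        p = mp.Process(target=noop)
        p.start()
        p.join()

    yield c.submit(multiprocessing_worker)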
