[Bug] cannot pickle 'CudnnModule' object when using Joblib #1180
This seems like a bug between joblib and PyTorch.
# main.py
import hydra
from joblib import Parallel, delayed
import torch.backends.cudnn as cudnn

@hydra.main(config_name="cudnn")
def main(cfg):
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)())
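(For completeness: config_name="cudnn" implies a cudnn.yaml next to the script. Its contents are not shown in the issue; an assumed, effectively empty stand-in should do, since gpu is only injected from the command line.)
# cudnn.yaml -- hypothetical stand-in, the original config is not included in the report
# left empty on purpose; `gpu` is appended at the command line via +gpu=...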
>>> python main.py -m +gpu=0,1,2,3
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
obj_ = dumps(obj, reducers=reducers)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 13, in <module>
runs = Parallel(n_jobs=4)(delayed(main)())
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
self.retrieve()
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
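A side note on what the unpicklable object actually is (my reading, not stated explicitly in the thread): PyTorch replaces the torch.backends.cudnn entry in sys.modules with a CudnnModule instance, a subclass of types.ModuleType, so logic that only knows how to pickle plain modules does not recognize it. A quick check along these lines:
# Hedged sketch: confirm that torch.backends.cudnn is not a plain module.
import types
import torch.backends.cudnn as cudnn

print(type(cudnn))                          # something like <class 'torch.backends.cudnn.CudnnModule'>
print(type(cudnn) is types.ModuleType)      # False: it is a ModuleType subclass
print(isinstance(cudnn, types.ModuleType))  # True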
# main.py
import hydra

@hydra.main(config_name="cudnn")
def main(cfg):
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True
    return

if __name__ == "__main__":
    main()
>>> python main.py -m +gpu=0,1,2,3
[2020-12-03 10:18:24,770][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 4 jobs
[2020-12-03 10:18:24,770][HYDRA] Launching jobs, sweep output dir : multirun/2020-12-03/10-18-24
[2020-12-03 10:18:24,770][HYDRA] #0 : +gpu=0
[2020-12-03 10:18:24,770][HYDRA] #1 : +gpu=1
[2020-12-03 10:18:24,770][HYDRA] #2 : +gpu=2
[2020-12-03 10:18:24,770][HYDRA] #3 : +gpu=3
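The difference between the two variants, as far as I can tell (this explanation is mine, not from the thread), is what cloudpickle has to serialize: for a function defined in the script being run, it captures the globals the function references, so a module-level cudnn binding ends up in the pickle, while an import inside the function body is simply re-executed in the worker. A standalone illustration of that assumption:
# Hedged illustration: only the eager variant drags the CudnnModule object
# into the pickle, because cloudpickle serializes referenced globals when a
# function from the running script is pickled by value.
import cloudpickle
import torch.backends.cudnn as cudnn

def eager():
    cudnn.benchmark = False  # `cudnn` is a module-level global -> gets captured

def lazy():
    import torch.backends.cudnn as cudnn  # resolved inside the worker instead
    cudnn.benchmark = False

if __name__ == "__main__":
    cloudpickle.dumps(lazy)  # fine
    try:
        cloudpickle.dumps(eager)
    except TypeError as err:
        print(err)           # cannot pickle 'CudnnModule' object
Judging by the log below, the next snippet is the eager-import variant launched through the submitit launcher: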
# main.py
import hydra
from joblib import Parallel, delayed
import torch.backends.cudnn as cudnn

@hydra.main(config_name="cudnn")
def main(cfg):
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)())
>>> python main.py -m +gpu=0,1,2,3
[2020-12-03 10:20:24,271][HYDRA] Submitit 'local' sweep output dir : multirun/2020-12-03/10-20-23
[2020-12-03 10:20:24,271][HYDRA] #0 : +gpu=0
[2020-12-03 10:20:24,272][HYDRA] #1 : +gpu=1
[2020-12-03 10:20:24,272][HYDRA] #2 : +gpu=2
[2020-12-03 10:20:24,272][HYDRA] #3 : +gpu=3
Traceback (most recent call last):
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
lambda: hydra.multirun(
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
return sweeper.sweep(arguments=task_overrides)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 149, in launch
jobs = executor.map_array(self, *zip(*job_params))
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/core.py", line 630, in map_array
return self._internal_process_submissions(submissions)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/auto/auto.py", line 202, in _internal_process_submissions
return self._executor._internal_process_submissions(delayed_submissions)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/core.py", line 716, in _internal_process_submissions
delayed.dump(pickle_path)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/utils.py", line 131, in dump
cloudpickle_dump(self, filepath)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/utils.py", line 278, in cloudpickle_dump
cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 55, in dump
CloudPickler(
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
""" Using submitit and importing lazily works as well. Weirdly, this following code doesn't work even when you import lazily. import hydra
import torch
import os
import logging

# assumed from elsewhere in the original script (not shown in the issue):
import torchvision.models as models
logger = logging.getLogger(__name__)

@hydra.main(config_name="imagenetconf")
def main(cfg):
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True
    if cfg.gpu is not None:
        logger.info(f"Use GPU: {cfg.gpu} for training")
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    model = models.__dict__[cfg.arch]()
    torch.cuda.set_device(cfg.gpu)
    model.cuda(cfg.gpu)
    cfg.batch_size = int(cfg.batch_size / cfg.world_size)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[cfg.gpu])
Please take Hydra out of the equation in this one. Does it still happen without Hydra?
So importing lazily works around the issue? Is it actually functional?
So you are seeing the same behavior with submitit as well, correct?
This works:
from joblib import Parallel, delayed

def main():
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)() for i in range(4))

This doesn't work:
import torch.backends.cudnn as cudnn
from joblib import Parallel, delayed

def main():
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)() for i in range(4))
Same behavior as with submitit.
Minimal repro without joblib:
import torch.backends.cudnn
import cloudpickle

def foo():
    print(torch.backends.cudnn)

cloudpickle.dumps(foo)

This should probably be submitted as a bug report for pytorch and cloudpickle; it's something in between them.
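For reference, a local experiment one could try (my own sketch, not necessarily the fix suggested upstream) is to register a reducer for the CudnnModule type so that it is re-imported by name in the worker instead of being serialized; whether the pickler in use consults copyreg's dispatch table is an assumption here:
# Hedged sketch of a local workaround: reconstruct the cudnn module in the
# worker via a plain import instead of trying to serialize the object.
import copyreg
import importlib
import torch.backends.cudnn as cudnn

def _reduce_module_by_name(mod):
    # Re-import the module by name on the receiving side.
    return importlib.import_module, (mod.__name__,)

copyreg.pickle(type(cudnn), _reduce_module_by_name)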
@briankosw, did you open an issue against pytorch/cloudpickle?
Thanks!
@briankosw, there is a suggested fix from the cloudpickle folks. Would you be interested in trying to apply it to pytorch locally and seeing if that would allow us to proceed with the minimal example we are trying to build?
Given that this is a compatibility issue between cloudpickle and PyTorch, Hydra is not a part of this. |
🐛 Bug
Description
It seems that cuDNN does not play well with Hydra's Joblib plugin. See the repro below.
Checklist
To reproduce
System information
Additional context
This seems to be a problem with how the global cuDNN state is being set and its interaction with Joblib. This snippet works perfectly well with PyTorch and Python's distributed processing primitives.
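As a point of comparison (a sketch under my own assumptions, since that snippet is not included in the report), applying the same settings per worker with torch.multiprocessing pickles fine, because the standard pickler serializes the target function by reference rather than capturing its globals:
# Hedged sketch: the same cudnn settings applied per worker via
# torch.multiprocessing, which uses the standard pickle module and never
# tries to serialize the CudnnModule object.
import torch.multiprocessing as mp

def worker(rank):
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True
    print(f"worker {rank}: cudnn configured")

if __name__ == "__main__":
    mp.spawn(worker, nprocs=4)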