[Bug] cannot pickle 'CudnnModule' object when using Joblib #1180

Closed · briankosw opened this issue Dec 3, 2020 · 10 comments · 2 tasks done

Labels: bug (Something isn't working)

Comments

briankosw (Contributor) commented Dec 3, 2020

🐛 Bug

Description

It seems that cuDNN does not play well with Hydra's Joblib plugin. See the repro below.

Checklist

  • I checked on the latest version of Hydra
  • I created a minimal repro

To reproduce

import hydra
import torch.backends.cudnn as cudnn

@hydra.main(config_name="cudnn")
def main(cfg):
    cudnn.benchmark = False
    cudnn.deterministic = True


if __name__ == "__main__":
    main()
defaults:
    - hydra/launcher: joblib
>>> python main.py -m +gpu=0,1,2,3

[2020-12-03 08:10:44,560][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 4 jobs
[2020-12-03 08:10:44,560][HYDRA] Launching jobs, sweep output dir : multirun/2020-12-03/08-10-44
[2020-12-03 08:10:44,560][HYDRA]        #0 : +gpu=0
[2020-12-03 08:10:44,560][HYDRA]        #1 : +gpu=1
[2020-12-03 08:10:44,560][HYDRA]        #2 : +gpu=2
[2020-12-03 08:10:44,560][HYDRA]        #3 : +gpu=3
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
    lambda: hydra.multirun(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_joblib_launcher/joblib_launcher.py", line 45, in launch
    return _core.launch(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_joblib_launcher/_core.py", line 101, in launch
    runs = Parallel(**joblib_cfg)(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
    self.retrieve()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

System information

  • Hydra Version : 1.0.4
  • Python version : 3.8.5
  • Virtual environment type and version : conda 4.8.3
  • Operating system : Ubuntu 18.04.3 LTS

Additional context

This seems to be a problem with how the global cuDNN state is set and how it interacts with Joblib. This snippet works perfectly well with PyTorch and Python's distributed processing primitives.
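For context, here is a minimal sketch of the kind of PyTorch-native multiprocessing usage that does work; the worker function below is illustrative and not taken from the original training script:

# sketch: setting the cuDNN flags per worker process with torch.multiprocessing
import torch.multiprocessing as mp


def worker(rank):
    # importing cudnn inside the worker means nothing non-picklable has to be pickled
    import torch.backends.cudnn as cudnn

    cudnn.benchmark = False
    cudnn.deterministic = True
    print(f"worker {rank}: benchmark={cudnn.benchmark}, deterministic={cudnn.deterministic}")


if __name__ == "__main__":
    mp.spawn(worker, nprocs=4)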

briankosw added the bug label on Dec 3, 2020
omry (Collaborator) commented Dec 3, 2020

This seems like a bug between joblib and PyTorch.

  1. Can you try to repro by using joblib directly?
  2. As a workaround, can you try to import cudnn lazily inside your hydra.main() function?
  3. Can you test with the submitit launcher in local mode? it also runs locally and I am curious if it has the same issue.

briankosw (Contributor, Author):

1. Can you try to repro by using joblib directly?
# main.py
import hydra
from joblib import Parallel, delayed
import torch.backends.cudnn as cudnn


@hydra.main(config_name="cudnn")
def main():
    cudnn.benchmark = False
    cudnn.deterministic = True


if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)())
# cudnn.yaml

defaults:
    - hydra/launcher: joblib
>>> python main.py -m +gpu=0,1,2,3

joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 13, in <module>
    runs = Parallel(n_jobs=4)(delayed(main)())
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
    self.retrieve()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
2. As a workaround, can you try to import cudnn lazily inside your `hydra.main()` function?
# main.py
import hydra

@hydra.main(config_name="cudnn")
def main(cfg):
    import torch.backends.cudnn as cudnn

    cudnn.benchmark = False
    cudnn.deterministic = True
    return


if __name__ == "__main__":
    main()
# cudnn.yaml

defaults:
    - hydra/launcher: joblib
>>> python main.py -m +gpu=0,1,2,3

[2020-12-03 10:18:24,770][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 4 jobs
[2020-12-03 10:18:24,770][HYDRA] Launching jobs, sweep output dir : multirun/2020-12-03/10-18-24
[2020-12-03 10:18:24,770][HYDRA]        #0 : +gpu=0
[2020-12-03 10:18:24,770][HYDRA]        #1 : +gpu=1
[2020-12-03 10:18:24,770][HYDRA]        #2 : +gpu=2
[2020-12-03 10:18:24,770][HYDRA]        #3 : +gpu=3
3. Can you test with the submitit launcher in local mode? it also runs locally and I am curious if it has the same issue.
# main.py
import hydra
import torch.backends.cudnn as cudnn


@hydra.main(config_name="cudnn")
def main(cfg):
    cudnn.benchmark = False
    cudnn.deterministic = True


if __name__ == "__main__":
    main()
# cudnn.yaml

defaults:
    - hydra/launcher: submitit_local
>>> python main.py -m +gpu=0,1,2,3

[2020-12-03 10:20:24,271][HYDRA] Submitit 'local' sweep output dir : multirun/2020-12-03/10-20-23
[2020-12-03 10:20:24,271][HYDRA]        #0 : +gpu=0
[2020-12-03 10:20:24,272][HYDRA]        #1 : +gpu=1
[2020-12-03 10:20:24,272][HYDRA]        #2 : +gpu=2
[2020-12-03 10:20:24,272][HYDRA]        #3 : +gpu=3
Traceback (most recent call last):
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
    lambda: hydra.multirun(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 149, in launch
    jobs = executor.map_array(self, *zip(*job_params))
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/core.py", line 630, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/auto/auto.py", line 202, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/core.py", line 716, in _internal_process_submissions
    delayed.dump(pickle_path)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/utils.py", line 131, in dump
    cloudpickle_dump(self, filepath)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/utils.py", line 278, in cloudpickle_dump
    cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 55, in dump
    CloudPickler(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
"""

Using submitit and importing lazily works as well.

Weirdly, the following code doesn't work even when importing lazily.

import logging
import os

import hydra
import torch
import torchvision.models as models  # assumed source of the `models` lookup below

logger = logging.getLogger(__name__)


@hydra.main(config_name="imagenetconf")
def main(cfg):
    import torch.backends.cudnn as cudnn

    cudnn.benchmark = False
    cudnn.deterministic = True
    if cfg.gpu is not None:
        logger.info(f"Use GPU: {cfg.gpu} for training")
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"

    model = models.__dict__[cfg.arch]()
    torch.cuda.set_device(cfg.gpu)
    model.cuda(cfg.gpu)
    cfg.batch_size = int(cfg.batch_size / cfg.world_size)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[cfg.gpu])
>>> python main.py -m gpu=0,1,2,3

[2020-12-03 10:24:53,423][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 4 jobs
[2020-12-03 10:24:53,423][HYDRA] Launching jobs, sweep output dir : multirun/2020-12-03/10-24-52
[2020-12-03 10:24:53,423][HYDRA]        #0 : gpu=0
[2020-12-03 10:24:53,423][HYDRA]        #1 : gpu=1
[2020-12-03 10:24:53,423][HYDRA]        #2 : gpu=2
[2020-12-03 10:24:53,423][HYDRA]        #3 : gpu=3
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
    lambda: hydra.multirun(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
    return sweeper.sweep(arguments=task_overrides)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_joblib_launcher/joblib_launcher.py", line 45, in launch
    return _core.launch(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_joblib_launcher/_core.py", line 101, in launch
    runs = Parallel(**joblib_cfg)(
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
    self.retrieve()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

omry (Collaborator) commented Dec 3, 2020

  1. Can you try to repro by using joblib directly?

Please take Hydra out of the equation in this one. Does it still happen without @hydra.main()?

  2. As a workaround, can you try to import cudnn lazily inside your hydra.main() function?

So importing lazily works around the issue? Is it actually functional?

  3. Can you test with the submitit launcher in local mode? it also runs locally and I am curious if it has the same issue.

So you are seeing the same behavior with submitit as well, correct?
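Regarding whether the lazy-import workaround is actually functional, a minimal check (a sketch; the helper function name is illustrative) is to have each job report the flags it sees after setting them:

from joblib import Parallel, delayed


def set_and_report(rank):
    # lazy import inside the job, as in the workaround above
    import torch.backends.cudnn as cudnn

    cudnn.benchmark = False
    cudnn.deterministic = True
    return rank, cudnn.benchmark, cudnn.deterministic


if __name__ == "__main__":
    # expected output: [(0, False, True), (1, False, True), (2, False, True), (3, False, True)]
    print(Parallel(n_jobs=4)(delayed(set_and_report)(i) for i in range(4)))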

briankosw (Contributor, Author):

  1. Can you try to repro by using joblib directly?
  2. As a workaround, can you try to import cudnn lazily inside your hydra.main() function?

This works:

from joblib import Parallel, delayed


def main():
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True


if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)() for i in range(4))

This doesn't work:

import torch.backends.cudnn as cudnn
from joblib import Parallel, delayed


def main():
    cudnn.benchmark = False
    cudnn.deterministic = True


if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)() for i in range(4))
  3. Can you test with the submitit launcher in local mode? it also runs locally and I am curious if it has the same issue.

Same behavior with submitit.
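The difference between the two snippets above is presumably where the cudnn module ends up: with the module-level import, the task function's globals contain the CudnnModule instance and cloudpickle has to serialize it, whereas with the lazy import nothing non-picklable is reachable from the function. A quick way to see this (a sketch):

import cloudpickle
import torch.backends.cudnn as cudnn


def main():
    cudnn.benchmark = False


# the CudnnModule instance is reachable through the function's globals
print("cudnn" in main.__globals__)  # True
cloudpickle.dumps(main)             # TypeError: cannot pickle 'CudnnModule' object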

omry (Collaborator) commented Dec 5, 2020

Minimal repro without joblib:

import torch.backends.cudnn
import cloudpickle

def foo():
    print(torch.backends.cudnn)

cloudpickle.dumps(foo)

This should probably be submitted as a bug report for PyTorch and cloudpickle; it's something in between them.
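For what it's worth, the failure seems to hinge on torch.backends.cudnn being an instance of a custom module subclass rather than a plain module; cloudpickle appears to handle plain modules by reference but falls back to regular pickling for the subclass. A quick check (a sketch; exact behaviour depends on the cloudpickle version):

import types

import cloudpickle
import torch.backends.cudnn as cudnn


print(type(cudnn).__name__)             # CudnnModule, a subclass of types.ModuleType
print(type(cudnn) is types.ModuleType)  # False


def uses_plain_module():
    print(types)


# functions that only capture plain modules pickle fine
cloudpickle.dumps(uses_plain_module)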

omry (Collaborator) commented Dec 16, 2020

@briankosw, did you open an issue against pytorch/cloudpickle?

briankosw (Contributor, Author):

Opened issues against both:

  • PyTorch
  • cloudpickle

omry (Collaborator) commented Dec 18, 2020

Thanks!
Closing as this is not an issue with Hydra.
I subscribed to the two issues you opened and will keep an eye on them.

omry (Collaborator) commented Jan 7, 2021

@briankosw, there is a suggested fix from the cloudpickle folks. Would you be interested in trying to apply it to PyTorch locally and seeing if that would allow us to proceed with the minimal example we are trying to build?
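A minimal sketch of what trying such a fix locally might look like, assuming the idea is to make the custom module subclass picklable by reference (the reducer below is illustrative, not the actual patch suggested upstream):

import importlib

import torch.backends.cudnn as cudnn


def _reduce_module_by_name(mod):
    # re-import the module by name in the worker instead of serializing its state
    return importlib.import_module, (mod.__name__,)


# monkey-patch the CudnnModule class so pickling falls back to a by-reference reducer
type(cudnn).__reduce__ = _reduce_module_by_name

With something like this applied before launching the sweep, the joblib repro above could be retried to see whether the pickling error goes away.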

omry reopened this on Jan 7, 2021
omry (Collaborator) commented Jun 4, 2021

Given that this is a compatibility issue between cloudpickle and PyTorch, Hydra is not a part of this.
Closing.

omry closed this as completed on Jun 4, 2021