[Bug] cannot pickle 'CudnnModule' object when using Joblib #1180
This seems like a bug between joblib and PyTorch.
# main.py
import hydra
from joblib import Parallel, delayed
import torch.backends.cudnn as cudnn

@hydra.main(config_name="cudnn")
def main(cfg):
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)())
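(For completeness: config_name="cudnn" implies a cudnn.yaml next to the script. Its contents are not shown in the issue; an assumed, effectively empty stand-in should do, since gpu is only injected from the command line.)
# cudnn.yaml -- hypothetical stand-in, the original config is not included in the report
# left empty on purpose; `gpu` is appended at the command line via +gpu=...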
>>> python main.py -m +gpu=0,1,2,3
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
obj_ = dumps(obj, reducers=reducers)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 13, in <module>
runs = Parallel(n_jobs=4)(delayed(main)())
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
self.retrieve()
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
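A side note on what the unpicklable object actually is (my reading, not stated explicitly in the thread): PyTorch replaces the torch.backends.cudnn entry in sys.modules with a CudnnModule instance, a subclass of types.ModuleType, so logic that only knows how to pickle plain modules does not recognize it. A quick check along these lines:
# Hedged sketch: confirm that torch.backends.cudnn is not a plain module.
import types
import torch.backends.cudnn as cudnn

print(type(cudnn))                          # something like <class 'torch.backends.cudnn.CudnnModule'>
print(type(cudnn) is types.ModuleType)      # False: it is a ModuleType subclass
print(isinstance(cudnn, types.ModuleType))  # True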
# main.py
import hydra

@hydra.main(config_name="cudnn")
def main(cfg):
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True
    return

if __name__ == "__main__":
    main()
>>> python main.py -m +gpu=0,1,2,3
[2020-12-03 10:18:24,770][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 4 jobs
[2020-12-03 10:18:24,770][HYDRA] Launching jobs, sweep output dir : multirun/2020-12-03/10-18-24
[2020-12-03 10:18:24,770][HYDRA] #0 : +gpu=0
[2020-12-03 10:18:24,770][HYDRA] #1 : +gpu=1
[2020-12-03 10:18:24,770][HYDRA] #2 : +gpu=2
[2020-12-03 10:18:24,770][HYDRA] #3 : +gpu=3
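The difference between the two variants, as far as I can tell (this explanation is mine, not from the thread), is what cloudpickle has to serialize: for a function defined in the script being run, it captures the globals the function references, so a module-level cudnn binding ends up in the pickle, while an import inside the function body is simply re-executed in the worker. A standalone illustration of that assumption:
# Hedged illustration: only the eager variant drags the CudnnModule object
# into the pickle, because cloudpickle serializes referenced globals when a
# function from the running script is pickled by value.
import cloudpickle
import torch.backends.cudnn as cudnn

def eager():
    cudnn.benchmark = False  # `cudnn` is a module-level global -> gets captured

def lazy():
    import torch.backends.cudnn as cudnn  # resolved inside the worker instead
    cudnn.benchmark = False

if __name__ == "__main__":
    cloudpickle.dumps(lazy)  # fine
    try:
        cloudpickle.dumps(eager)
    except TypeError as err:
        print(err)           # cannot pickle 'CudnnModule' object
Judging by the log below, the next snippet is the eager-import variant launched through the submitit launcher: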
# main.py
import hydra
from joblib import Parallel, delayed
import torch.backends.cudnn as cudnn

@hydra.main(config_name="cudnn")
def main(cfg):
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)())
>>> python main.py -m +gpu=0,1,2,3
[2020-12-03 10:20:24,271][HYDRA] Submitit 'local' sweep output dir : multirun/2020-12-03/10-20-23
[2020-12-03 10:20:24,271][HYDRA] #0 : +gpu=0
[2020-12-03 10:20:24,272][HYDRA] #1 : +gpu=1
[2020-12-03 10:20:24,272][HYDRA] #2 : +gpu=2
[2020-12-03 10:20:24,272][HYDRA] #3 : +gpu=3
Traceback (most recent call last):
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
lambda: hydra.multirun(
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
return sweeper.sweep(arguments=task_overrides)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py", line 149, in launch
jobs = executor.map_array(self, *zip(*job_params))
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/core.py", line 630, in map_array
return self._internal_process_submissions(submissions)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/auto/auto.py", line 202, in _internal_process_submissions
return self._executor._internal_process_submissions(delayed_submissions)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/core.py", line 716, in _internal_process_submissions
delayed.dump(pickle_path)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/utils.py", line 131, in dump
cloudpickle_dump(self, filepath)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/submitit/core/utils.py", line 278, in cloudpickle_dump
cloudpickle.dump(obj, ofile, pickle.HIGHEST_PROTOCOL)
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 55, in dump
CloudPickler(
File "/home/vishwam/anaconda3/envs/hydra-torch/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'CudnnModule' object
""" Using submitit and importing lazily works as well. Weirdly, this following code doesn't work even when you import lazily. import hydra
import torch
import os
import logging

# assumed from elsewhere in the original script (not shown in the issue):
import torchvision.models as models
logger = logging.getLogger(__name__)

@hydra.main(config_name="imagenetconf")
def main(cfg):
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True
    if cfg.gpu is not None:
        logger.info(f"Use GPU: {cfg.gpu} for training")
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    model = models.__dict__[cfg.arch]()
    torch.cuda.set_device(cfg.gpu)
    model.cuda(cfg.gpu)
    cfg.batch_size = int(cfg.batch_size / cfg.world_size)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[cfg.gpu])
Please take Hydra out of the equation in this one. Does it still happen without Hydra?
So importing lazily works around the issue? Is it actually functional?
So you are seeing the same behavior with submitit as well, correct?
This works:
from joblib import Parallel, delayed

def main():
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)() for i in range(4))

This doesn't work:
import torch.backends.cudnn as cudnn
from joblib import Parallel, delayed

def main():
    cudnn.benchmark = False
    cudnn.deterministic = True

if __name__ == "__main__":
    runs = Parallel(n_jobs=4)(delayed(main)() for i in range(4))
Same behavior as with submitit.
Minimal repro without joblib:
import torch.backends.cudnn
import cloudpickle

def foo():
    print(torch.backends.cudnn)

cloudpickle.dumps(foo)

This should probably be submitted as a bug report for pytorch and cloudpickle; it's something in between them.
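For reference, a local experiment one could try (my own sketch, not necessarily the fix suggested upstream) is to register a reducer for the CudnnModule type so that it is re-imported by name in the worker instead of being serialized; whether the pickler in use consults copyreg's dispatch table is an assumption here:
# Hedged sketch of a local workaround: reconstruct the cudnn module in the
# worker via a plain import instead of trying to serialize the object.
import copyreg
import importlib
import torch.backends.cudnn as cudnn

def _reduce_module_by_name(mod):
    # Re-import the module by name on the receiving side.
    return importlib.import_module, (mod.__name__,)

copyreg.pickle(type(cudnn), _reduce_module_by_name)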
@briankosw, did you open an issue against pytorch/cloudpickle?
Thanks!
@briankosw, there is a suggested fix from the cloudpickle folks. Would you be interested in trying to apply it to pytorch locally and seeing if that would allow us to proceed with the minimal example we are trying to build?
Given that this is a compatibility issue between cloudpickle and PyTorch, Hydra is not a part of this. |
🐛 Bug
Description
It seems that cuDNN does not play well with Hydra's Joblib plugin. See the repro below.
Checklist
To reproduce
System information
Additional context
This seems to be a problem with how the global cuDNN state is being set and its interaction with Joblib. This snippet works perfectly well with PyTorch and Python's distributed processing primitives.
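As a point of comparison (a sketch under my own assumptions, since that snippet is not included in the report), applying the same settings per worker with torch.multiprocessing pickles fine, because the standard pickler serializes the target function by reference rather than capturing its globals:
# Hedged sketch: the same cudnn settings applied per worker via
# torch.multiprocessing, which uses the standard pickle module and never
# tries to serialize the CudnnModule object.
import torch.multiprocessing as mp

def worker(rank):
    import torch.backends.cudnn as cudnn
    cudnn.benchmark = False
    cudnn.deterministic = True
    print(f"worker {rank}: cudnn configured")

if __name__ == "__main__":
    mp.spawn(worker, nprocs=4)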