When training models with PyTorch, multi-worker data loaders are generally necessary for efficient data loading. Oríon's current multiprocess executor does not support spawning subprocesses inside its parallel workers, because those workers are themselves subprocesses created with Python's multiprocessing module and are marked as daemonic. This is very constraining and should be fixed.
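To illustrate, the failure can be reproduced with the standard library alone, without Oríon or PyTorch: multiprocessing.Pool marks its workers as daemonic, and Python forbids daemonic processes from starting children. A minimal sketch (function names here are illustrative, not from Oríon):

```python
import multiprocessing as mp


def spawn_child(_):
    # Pool workers are daemonic; starting a child process from one
    # raises "daemonic processes are not allowed to have children".
    p = mp.Process(target=sum, args=([1, 2, 3],))
    p.start()
    p.join()


def main():
    with mp.Pool(1) as pool:
        try:
            # The AssertionError raised in the worker is pickled and
            # re-raised here in the parent, just as in the traceback below.
            pool.apply(spawn_child, (None,))
        except AssertionError as exc:
            return str(exc)
    return "no error"


if __name__ == "__main__":
    print(main())
```

This mirrors the reported situation: Oríon's executor plays the role of the Pool, and the PyTorch DataLoader with num_workers > 0 plays the role of the child process.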
Example of stack trace reported:
Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 25, in _couldpickle_exec
    result = function(*args, **kwargs)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/client/runner.py", line 122, in _optimize
    return fct(**unflatten(kwargs))
  File "main.py", line 112, in run
    signal = task.run()
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/tasks/task.py", line 50, in run
    return self.trainer.train(
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/trainers/single_trainer.py", line 224, in train
    train_loader_iter = iter(self.loaders["train"])
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 227, in async_get
    results.append(AsyncResult(future, future.get()))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 54, in get
    r = self.future.get(timeout)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError: daemonic processes are not allowed to have children
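A common workaround for this class of error, sketched below, is a pool whose workers report themselves as non-daemonic so that they may spawn their own children (such as DataLoader workers). This is not Oríon's actual fix; the class names are illustrative, and it assumes Python 3.8+, where Pool.Process is a static method that receives the multiprocessing context as its first argument:

```python
import multiprocessing
import multiprocessing.pool


class NonDaemonProcess(multiprocessing.Process):
    # Always report daemon=False so this worker is allowed to have children.
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Ignore the Pool's attempt to mark workers as daemonic.
        pass


class NonDaemonPool(multiprocessing.pool.Pool):
    # A Pool whose workers are non-daemonic, so tasks running in them
    # may create subprocesses of their own.
    @staticmethod
    def Process(ctx, *args, **kwargs):
        return NonDaemonProcess(*args, **kwargs)


def spawn_child(_):
    # Inside a NonDaemonPool worker this succeeds instead of raising
    # "daemonic processes are not allowed to have children".
    p = multiprocessing.Process(target=sum, args=([1, 2, 3],))
    p.start()
    p.join()
    return "child finished"


if __name__ == "__main__":
    with NonDaemonPool(1) as pool:
        print(pool.apply(spawn_child, (None,)))
```

The trade-off is that non-daemonic workers are not killed automatically when the parent exits, so the executor must shut them down explicitly to avoid orphaned processes.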
bouthilx added the bug (Indicates an unexpected problem or unintended behavior) and medium (The bug breaks a feature but it can still be used or causes a confusing user experience) labels on Jan 10, 2023.