Good day,

Given how important wandb is for ablation studies, it would be very helpful to get it running without crashing the script. I understand from #1 that the problem does not reproduce on your side; however, it is also not an issue with MPI and wandb alone.
Running a test script like the following with mpirun -n 1 works fine:
import json
import os

import wandb

wandb_entity = "my-entity"
wandb_project = "my-project"
exclude = ["device"]  # config keys that should not be logged

# open() does not expand "~", so resolve the log directory explicitly
log_dir = os.path.expanduser('~/skill-chaining/log/table_lack_0825.gail.p0.123')

with open(os.path.join(log_dir, 'params.json'), "r") as fp:
    cdict = json.load(fp)

wandb.init(
    resume='table_lack_0825.gail.p0.123',
    project=wandb_project,
    config={k: v for k, v in cdict.items() if k not in exclude},
    dir=log_dir,
    entity=wandb_entity,
    notes='',
    mode="online",
)
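For reuse across runs, the config filtering in the script above could be factored into a small helper. This is only a sketch: the name `load_wandb_config` is hypothetical and not part of wandb or of this repository, and it assumes `params.json` holds a flat JSON object as in the script.

```python
import json
import os


def load_wandb_config(params_path, exclude=("device",)):
    """Read a params.json file and drop the keys that should not be logged.

    `load_wandb_config` is a hypothetical helper name used for illustration.
    """
    # expanduser() resolves a leading "~", which open() alone would not do
    with open(os.path.expanduser(params_path), "r") as fp:
        params = json.load(fp)
    # Same filter as in the test script: keep every key not listed in `exclude`
    return {k: v for k, v in params.items() if k not in exclude}
```

The returned dictionary can then be passed as the `config` argument of `wandb.init`, exactly as the dictionary comprehension is in the script.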
Using MPI with run.py and wandb enabled, however, crashes the script, so this is neither a resource issue nor an error inherent to the MPI + wandb combination:
$ mpirun -n 1 python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --gpu 0 --wandb True --max_global_step 100000000 --wandb_entity my-entity --wandb_project my-project
pybullet build time: Apr 21 2022 20:41:06
[DEBUG] Wandb Init Before
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:228: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
interpolation: int = Image.BILINEAR,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:295: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
interpolation: int = Image.NEAREST,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:328: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
interpolation: int = Image.BICUBIC,
wandb: Currently logged in as: my-team (use `wandb login --relogin` to force relogin)
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[digi2:2953274] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Problem at: ~/skill-chaining/method/robot_learning/main.py 133 _make_log_files
Traceback (most recent call last):
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
run = wi.init()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
backend.cleanup()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
self.interface.join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
super().join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
_ = self._communicate_shutdown()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
_ = self._communicate(record)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
run = wi.init()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
backend.cleanup()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
self.interface.join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
super().join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
_ = self._communicate_shutdown()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
_ = self._communicate(record)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "~/skill-chaining/run.py", line 44, in <module>
SkillChainingRun(parser).run()
File "~/skill-chaining/run.py", line 10, in __init__
super().__init__(parser)
File "~/skill-chaining/method/robot_learning/main.py", line 44, in __init__
self._make_log_files()
File "~/skill-chaining/method/robot_learning/main.py", line 133, in _make_log_files
mode="online" if config.wandb else "disabled",
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1033, in init
raise Exception("problem") from error_seen
Exception: problem
Any idea what could be the problem?
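The test script imports only json and wandb, so it never initializes MPI itself, whereas the failing run reports errors from MPI_Init_thread. One possible way to narrow this down, offered as a diagnostic sketch rather than a known fix, is to print the MPI-related environment variables right before the `wandb.init` call in both the test script and main.py and compare the two. The helper name `mpi_env_snapshot` is hypothetical, and the prefix list is an assumption about what the Open MPI launcher exports.

```python
import os

# Assumption: the Open MPI launcher exports its per-rank variables under these
# prefixes (e.g. OMPI_COMM_WORLD_RANK). Adjust the tuple for other MPI builds.
MPI_ENV_PREFIXES = ("OMPI_", "PMIX_")


def mpi_env_snapshot(environ=None):
    """Return the MPI-related environment variables, sorted by name.

    `mpi_env_snapshot` is a hypothetical helper name used for illustration.
    """
    environ = os.environ if environ is None else environ
    return {
        name: environ[name]
        for name in sorted(environ)
        if name.startswith(MPI_ENV_PREFIXES)
    }


if __name__ == "__main__":
    # Print one NAME=value pair per line so two runs can be diffed
    for name, value in mpi_env_snapshot().items():
        print(f"{name}={value}")
```

Any process that wandb starts inherits these variables from its parent, so a difference between the two outputs would at least indicate whether the environment plays a role.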