MPI fails when trainer has --wandb True #8

Open
feup-jmc opened this issue Jun 2, 2022 · 0 comments

Good day,

Given how important wandb is for ablation studies, it would be very helpful to be able to enable it without crashing the script. I understand from #1 that this does not seem to affect your side; however, it is also not an issue with MPI and wandb alone.

Running a test script like the following with mpirun -n 1 works fine:

import json
import os

import wandb

wandb_entity = "my-entity"
wandb_project = "my-project"

# Config keys that should not be sent to wandb.
exclude = ["device"]

# open() does not expand "~", so resolve the log directory explicitly.
log_dir = os.path.expanduser('~/skill-chaining/log/table_lack_0825.gail.p0.123')

with open(os.path.join(log_dir, 'params.json'), "r") as fp:
    cdict = json.load(fp)

wandb.init(
    resume='table_lack_0825.gail.p0.123',
    project=wandb_project,
    config={k: v for k, v in cdict.items() if k not in exclude},
    dir=log_dir,
    entity=wandb_entity,
    notes='',
    mode="online",
)

Running run.py under MPI with wandb enabled, however, crashes the script. It is not a resource issue, nor is it an error inherent to the MPI + wandb pair:

$ mpirun -n 1 python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --gpu 0 --wandb True --max_global_step 100000000 --wandb_entity my-entity --wandb_project my-project
pybullet build time: Apr 21 2022 20:41:06
[DEBUG] Wandb Init Before
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:228: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  interpolation: int = Image.BILINEAR,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:295: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  interpolation: int = Image.NEAREST,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:328: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  interpolation: int = Image.BICUBIC,
wandb: Currently logged in as: my-team (use `wandb login --relogin` to force relogin)
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[digi2:2953274] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Problem at: ~/skill-chaining/method/robot_learning/main.py 133 _make_log_files
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
    run = wi.init()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
    backend.cleanup()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
    self.interface.join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
    super().join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
    _ = self._communicate_shutdown()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
    _ = self._communicate(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
    run = wi.init()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
    backend.cleanup()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
    self.interface.join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
    super().join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
    _ = self._communicate_shutdown()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
    _ = self._communicate(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "~/skill-chaining/run.py", line 44, in <module>
    SkillChainingRun(parser).run()
  File "~/skill-chaining/run.py", line 10, in __init__
    super().__init__(parser)
  File "~/skill-chaining/method/robot_learning/main.py", line 44, in __init__
    self._make_log_files()
  File "~/skill-chaining/method/robot_learning/main.py", line 133, in _make_log_files
    mode="online" if config.wandb else "disabled",
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1033, in init
    raise Exception("problem") from error_seen
Exception: problem

Any idea what could be the problem?
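
To help narrow this down, here is a minimal diagnostic sketch. It assumes (and this is only a guess, not something the log confirms) that the difference between the test script and run.py lies in what a child process does with the environment exported by mpirun. The prefix list and the helper names mpi_env_vars and without_mpi_env are hypothetical, introduced only for illustration.

```python
import os
from contextlib import contextmanager

# Prefixes assumed to mark variables exported by Open MPI's mpirun.
# This list is an assumption; the exact set depends on the MPI build.
MPI_ENV_PREFIXES = ("OMPI_", "PMIX_", "ORTE_")


def mpi_env_vars(environ=None):
    """Return the entries of environ whose names start with an MPI prefix."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith(MPI_ENV_PREFIXES)}


@contextmanager
def without_mpi_env(environ=None):
    """Temporarily remove MPI launcher variables and restore them on exit."""
    environ = os.environ if environ is None else environ
    saved = mpi_env_vars(environ)
    for key in saved:
        del environ[key]
    try:
        yield saved
    finally:
        # Put the variables back so later code sees the original environment.
        environ.update(saved)


if __name__ == "__main__":
    # Print what the launcher exported, to compare the two invocations.
    for name, value in sorted(mpi_env_vars().items()):
        print(f"{name}={value}")
```

Printing mpi_env_vars() from both the test script and run.py under mpirun -n 1 would show whether the two see the same launcher environment. Wrapping the wandb.init(...) call in "with without_mpi_env():" is one experiment to try; whether it avoids the orte_init failure is untested.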
