Test failed #48
Comments
[07/20 19:56:19 ERROR:] [train worker 0] Encountered TypeError, exiting. [engine.py: 1858]
[07/20 19:56:19 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1467]
Unfortunately I have the same problem as the author; I'm also trying to run this in a Docker environment. Based on the error, I re-ran the container after adding
I'm running this container on a headless remote GPU cluster. Could you suggest something I can do?
Did you solve it?
Hi @herveyrobot and @nbqu, the error

[d != torch.device("cpu") and d >= 0 for d in devices]
TypeError: 'NoneType' object is not iterable
[engine.py: 1861]

means that devices is None inside stagewise_task_sampler_args. Can you try changing the parameter

task_sampler_params = TwoPhaseRGBBaseExperimentConfig.stagewise_task_sampler_args(
    stage="train", process_ind=0, total_processes=1,
)

to

task_sampler_params = TwoPhaseRGBBaseExperimentConfig.stagewise_task_sampler_args(
    stage="train", process_ind=0, total_processes=1, devices=[0]
)

and trying again?
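For context, a minimal standalone sketch (not the repo's exact code) of why passing devices=[0] avoids the TypeError: per the traceback below, when the X-display lookup fails and no devices were passed, the fallback path in rearrange_base.py iterates over devices, which is still None.

```python
import torch

# Sketch of the failing check (rearrange_base.py, line 308 in the traceback):
# on a headless machine with no devices passed, `devices` is None.
devices = None
try:
    [d != torch.device("cpu") and d >= 0 for d in devices]
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable

# Passing devices=[0] explicitly (as suggested above) gives the comprehension
# a real list to iterate over, so the check succeeds.
devices = [0]
print([d != torch.device("cpu") and d >= 0 for d in devices])  # [True]
```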
I have the same error on Ubuntu 18.04.
Hi, I have the same problem. It shows that Docker can't find an X display. So I added
Hi @Wallong, thanks for pointing out this fix! Can I ask what your entire
I can't run the test command
ai2thor-rearrangement# allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py
Thanks for any answers.
[07/20 19:14:38 INFO:] Running with args Namespace(experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', eval=False, config_kwargs=None, extra_tag='', output_dir='rearrange_out', save_dir_fmt=<SaveDirFormat.FLAT: 'FLAT'>, seed=None, experiment_base='.', checkpoint=None, infer_output_dir=False, approx_ckpt_step_interval=None, restart_pipeline=False, deterministic_cudnn=False, max_sampler_processes_per_worker=None, deterministic_agents=False, log_level='info', disable_tensorboard=False, disable_config_saving=False, collect_valid_results=False, valid_on_initial_weights=False, test_expert=False, distributed_ip_and_port='127.0.0.1:0', machine_id=0, callbacks='', enable_crash_recovery=False, test_date=None, approx_ckpt_steps_count=None, skip_checkpoints=0) [main.py: 452]
fatal: not a git repository (or any of the parent directories): .git
[07/20 19:14:39 WARNING:] Failed to get a git diff of the current project. Is it possible that /root/ai2thor-rearrangement is not under version control? [runner.py: 892]
[07/20 19:14:39 INFO:] Config files saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2023-07-20_19-14-39 [runner.py: 935]
[07/20 19:14:39 INFO:] Using 1 train workers on devices (device(type='cpu'),) [runner.py: 317]
[07/20 19:14:39 INFO:] Using local worker ids [0] (total 1 workers in machine 0) [runner.py: 326]
[07/20 19:14:39 INFO:] Started 1 train processes [runner.py: 595]
[07/20 19:14:39 INFO:] No processes allocated to validation, no validation will be run. [runner.py: 626]
[07/20 19:14:41 INFO:] train 0 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7f8809b19670>, 'callback_sensors': [], 'results_queue': <multiprocessing.queues.Queue object at 0x7f8809b196d0>, 'checkpoints_queue': None, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2023-07-20_19-14-39', 'seed': 1118467761, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.SpawnContext object at 0x7f86d3c559a0>, 'num_workers': 1, 'device': device(type='cpu'), 'distributed_ip': '127.0.0.1', 'distributed_port': 0, 'max_sampler_processes_per_worker': None, 'save_ckpt_after_every_pipeline_stage': True, 'initial_model_state_dict': '[SUPPRESSED]', 'first_local_worker_id': 0, 'distributed_preemption_threshold': 0.7, 'try_restart_after_task_error': False, 'mode': 'train', 'worker_id': 0} [runner.py: 416]
[07/20 19:14:41 ERROR:] [train worker 0] Encountered TypeError, exiting. [engine.py: 1858]
[07/20 19:14:41 ERROR:] Traceback (most recent call last):
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 292, in stagewise_task_sampler_args
x_displays = get_open_x_displays(throw_error_if_empty=True)
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact_plugins/ithor_plugin/ithor_util.py", line 88, in get_open_x_displays
raise IOError(
OSError: Could not find any open X-displays on which to run AI2-THOR processes. Please see the AI2-THOR installation instructions at https://allenact.org/installation/installation-framework/#installation-of-ithor-ithor-plugin for information as to how to start such displays.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1850, in train
self.run_pipeline(valid_on_initial_weights=valid_on_initial_weights)
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1506, in run_pipeline
self.initialize_storage_and_viz(
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 483, in initialize_storage_and_viz
observations = self.vector_tasks.get_observations()
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 327, in vector_tasks
sampler_fn_args=self.get_sampler_fn_args(seeds),
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 382, in get_sampler_fn_args
return [
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 383, in
fn(
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 374, in train_task_sampler_args
**cls.stagewise_task_sampler_args(
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 308, in stagewise_task_sampler_args
[d != torch.device("cpu") and d >= 0 for d in devices]
TypeError: 'NoneType' object is not iterable
[engine.py: 1861]
[07/20 19:14:41 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1467]
[07/20 19:14:41 ERROR:] Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[runner.py: 1468]
Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[07/20 19:14:41 INFO:] Terminating train 0 [runner.py: 1543]
[07/20 19:14:41 INFO:] Joining train 0 [runner.py: 1543]
[07/20 19:14:41 INFO:] Closed train 0 [runner.py: 1543]
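As a quick check for the root cause above (the OSError about missing X displays), the same helper that raises in the traceback can be called directly. This is only a diagnostic sketch for confirming the environment, not a fix:

```python
# Check whether any X display is visible to the process; this is the same
# helper (with the same keyword argument) that raises in the traceback above.
from allenact_plugins.ithor_plugin.ithor_util import get_open_x_displays

try:
    displays = get_open_x_displays(throw_error_if_empty=True)
    print(f"Open X displays: {displays}")
except IOError as e:
    # Same condition that sends stagewise_task_sampler_args down the failing path.
    print(f"No open X display found: {e}")
```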