
Dashboard failures with include_dashboard set to false #11940

Closed

roireshef opened this issue Nov 11, 2020 · 63 comments
Labels: bug (Something that is supposed to be working; but isn't), dashboard (Issues specific to the Ray Dashboard), fix-error-msg (This issue has a bad error message that should be improved), P1 (Issue that should be fixed within a few weeks)

Comments

roireshef (Contributor) commented Nov 11, 2020

Running Tune with A3C fails straight at the beginning with the following traceback:

2020-11-11 14:13:37,114	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False.
It might be related to #11943, but it shouldn't happen if this flag is set to False, so it's a different issue.

Ray version and other system information (Python version, TensorFlow version, OS):
Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only
on both latest master and releases/1.0.1

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

    import ray
    from ray import tune
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init(include_dashboard=False)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@roireshef roireshef added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 11, 2020
@roireshef (Contributor, Author)

Although this is documented in https://docs.ray.io/en/master/configure.html?highlight=ports#ports-configurations, when running ray.init() without "ray start" beforehand, the include_dashboard flag doesn't work well.

@roireshef roireshef reopened this Nov 11, 2020
@roireshef (Contributor, Author)

This actually happens even if I run "ray start" with "--include-dashboard=false"

@roireshef roireshef changed the title [tune] [rllib] grpc.experimental.aio._call.AioRpcError: failed to connect to all addresses Dashboard failures with include_dashboard set to false Nov 11, 2020
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 11, 2020
@rkooo567 rkooo567 added the dashboard Issues specific to the Ray Dashboard label Nov 11, 2020
@rkooo567 (Contributor)

This looks like a bad bug. @mfitton can you take a look at it?

mfitton (Contributor) commented Nov 11, 2020

Yep, I'll take a look. Thanks for reporting @roireshef, and apologies for the inconvenience. We're moving to a new dashboard backend that's currently in the nightly, so your bug report is really helpful for ironing out issues before we roll this out more broadly.

mfitton (Contributor) commented Nov 11, 2020

I've noticed this is because, in the new dashboard architecture, we start up the dashboard agent regardless of whether include_dashboard is specified. This is likely because the dashboard agent is the entity that receives Ray stats via gRPC for export to Prometheus.

@fyrestone I'm planning on creating a PR to make the dashboard agent not start when include_dashboard is false. Am I missing any issues that doing this could cause?
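The gate being proposed here can be sketched as a single predicate. This is a hypothetical sketch, not Ray's actual startup code (which lives in the raylet/agent wiring); it also shows why the decision cannot hinge on include_dashboard alone, since the agent doubles as the Prometheus metrics exporter:

```python
# Hypothetical sketch: decide whether the dashboard agent should start.
# Today the raylet launches the agent unconditionally; the PR discussed
# in this thread would add a check along these lines.
def should_start_dashboard_agent(include_dashboard: bool,
                                 export_metrics: bool) -> bool:
    # The agent serves two roles: feeding the web UI and exporting
    # stats to Prometheus, so either need keeps it alive.
    return include_dashboard or export_metrics
```

Gating on include_dashboard alone would return False in the second role too, which is exactly the metrics concern raised in the replies.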

@rkooo567 (Contributor)

The immediate issue I noticed is that people cannot export metrics?

mfitton (Contributor) commented Nov 11, 2020

Yep, that's true, they wouldn't be able to export metrics without running the dashboard.

@rkooo567 (Contributor)

I think that’s not ideal though. I can imagine users who want to export metrics while they don’t have the dashboard

mfitton (Contributor) commented Nov 11, 2020

Yeah. It's kind of a general problem that people might want to run certain backend dashboard modules without running the actual web ui. Especially when more APIs are introduced that are accessed via the dashboard.

That said, the main reason to want to not run the dashboard is performance-oriented, as it generally won't cause any other issues to have it run.

We might want to eventually move to an API where a user can specify which dashboard modules they want to run.

mfitton (Contributor) commented Nov 11, 2020

Not sure what the best thing to do is for now. I can't repro the issue yet because the repro script is incomplete, but I would be surprised if the dashboard warning message is actually linked to the training failing. I'm going to try to search for where the A3CTrainer comes from.

mfitton (Contributor) commented Nov 11, 2020

I found A3CTrainer in ray.rllib.agents.a3c. That said, the provided script fails with the following error:

Traceback (most recent call last):
  File "/Users/maxfitton/Development/ray/python/ray/tune/trial_runner.py", line 547, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/maxfitton/Development/ray/python/ray/tune/ray_trial_executor.py", line 484, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/maxfitton/Development/ray/python/ray/worker.py", line 1472, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::A3C.train() (pid=80578, ip=192.168.50.14)
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task
    worker.memory_monitor.raise_if_low_memory()
  File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
    task_exception = True
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
    task_exception = False
  File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task.function_executor
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer_template.py", line 106, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer.py", line 445, in __init__
    self._env_id = self._register_if_needed(env or config.get("env"))
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer.py", line 1179, in _register_if_needed
    "You can specify a custom env as either a class "
ValueError: None is an invalid env specification. You can specify a custom env as either a class (e.g., YourEnvCls) or a registered env id (e.g., "your_env").

@roireshef I'm happy to keep looking into this, but I need you to provide a script that is runnable as-is. I doubt that the dashboard error message you're seeing is impacting training, as the two processes should run separately and not overlap / crash one another.

@rkooo567 (Contributor)

@mfitton This issue should be reproducible without the Tune code, right?

rkooo567 (Contributor) commented Nov 11, 2020

Also, about the solution: we can probably collect stats only when include_dashboard is set to True. Otherwise, start only the agents, and we can stop collecting stats from the endpoints.

@fyrestone (Contributor)

@mfitton The traceback shows that the agent can't register with the raylet; it seems the raylet process has crashed. The dashboard process plays the head role. If the dashboard is not started, the agent just collects stats and doesn't report to the dashboard directly (it publishes to Redis instead).

@roireshef (Contributor, Author)

> I found A3CTrainer in ray.rllib.agents.a3c. That said, the provided script fails with the error [...]
> @roireshef I'm happy to keep looking into this, but I need you to provide a script that is runnable as-is.

@mfitton I can't attach the original script I'm using because it relies on a custom environment I developed, and I can't expose its code outside my company; I hope you understand. That said, I don't think this issue is related to the environment or the RL algorithm implementation at all. Try any environment you have at hand. If you run it inside a Docker container like I do, I'm pretty sure it will reproduce. The exception you currently see is because you didn't define any environment for Tune to run with.

@roireshef (Contributor, Author)

Guys, if I may, from my perspective having metrics written to the log files so they can be viewed in TensorBoard is crucial, regardless of whether the web dashboard is working or not. I would kindly ask you not to break this functionality; I believe it's heavily used by others as well...

@roireshef (Contributor, Author)

> I've noticed this is because in the new dashboard architecture we start up the dashboard agent regardless of whether include_dashboard is specified. [...]
> @fyrestone I'm planning on creating a PR to make the dashboard agent not start when include_dashboard is false. Am I missing any issues that doing this could cause?

@mfitton - could you please point me to where you are starting it so I can disable it as a local hotfix? If I do that, will it disable metrics as well? Or is there a way around it?

mfitton (Contributor) commented Nov 12, 2020

@roireshef I don't have any RLLib/Tune environments on hand, as I don't do any ML work.

The Tune metrics written to log files are not affected by whether the dashboard is running or not. You'll still be able to run TensorBoard by pointing it at the log directory that Tune writes to. We'll make sure not to break this functionality.

That said, what's crashing your program isn't the dashboard. The message you're seeing logged by the agent, as fyrestone said, isn't causing your Ray cluster to crash; rather, it is a symptom that the raylet (which handles scheduling tasks, among other things) has crashed.

Could you include your raylet.err and raylet.out log files? Those would be more helpful as far as getting to the bottom of this.

@roireshef (Contributor, Author)

@mfitton
A. Sure, I can send the raylet files, but where can I find them?
B. If you have Ray installed, you already have RLlib-ready environments at hand. See: https://github.com/ray-project/ray/tree/master/rllib/examples - I think you'll find it very useful to pick one and close a training loop for debugging...

roireshef (Contributor, Author) commented Nov 12, 2020

@mfitton - I'll try to provide some more information in the meantime:

This is how I set up the Ray cluster (handshake between nodes):

Head (inside docker):

    ray start --block --head --port=$redis_port --redis-password=$redis_password \
      --node-ip-address=$head_node_ip --gcs-server-port=6005 --dashboard-port=6006 \
      --node-manager-port=6007 --object-manager-port=6008 \
      --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 \
      --min-worker-port=6100 --max-worker-port=6299 --include-dashboard=false

Worker Nodes (inside docker, different machine(s)):

    ray start --block --address=$head_node_ip:$redis_port --redis-password=$redis_password \
      --node-ip-address=$worker_node_ip --node-manager-port=6007 --object-manager-port=6008 \
      --min-worker-port=6100 --max-worker-port=6299

After I do that, I call:

    ray.init(address=$head_node_ip:$redis_port, _redis_password=$redis_password)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )

The tune.run() part you can find in the examples, including environment implementations. Alternatively, this also reproduces without setting up a Ray cluster, as described in the body of this issue (above).

Observing the console shows a flow of exceptions, all similar (note that this time I captured a more informative one than the one attached to the body of this issue; the *** parts are redacted for security reasons):

2020-11-12 16:53:56,179	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
    c = cls(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

(pid=raylet, ip=***) Traceback (most recent call last):
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 308, in <module>
(pid=raylet, ip=***)     raise e
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
(pid=raylet, ip=***)     loop.run_until_complete(agent.run())
(pid=raylet, ip=***)   File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(pid=raylet, ip=***)     return future.result()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
(pid=raylet, ip=***)     modules = self._load_modules()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
(pid=raylet, ip=***)     c = cls(self)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
(pid=raylet, ip=***)     self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
(pid=raylet, ip=***)     namespace="ray", port=metrics_export_port)))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
(pid=raylet, ip=***)     options=option, gatherer=option.registry, collector=collector)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
(pid=raylet, ip=***)     self.serve_http()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
(pid=raylet, ip=***)     port=self.options.port, addr=str(self.options.address))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
(pid=raylet, ip=***)     httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
(pid=raylet, ip=***)     server = server_class((host, port), handler_class)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
(pid=raylet, ip=***)     self.server_bind()
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
(pid=raylet, ip=***)     HTTPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
(pid=raylet, ip=***)     socketserver.TCPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
(pid=raylet, ip=***)     self.socket.bind(self.server_address)
(pid=raylet, ip=***) OSError: [Errno 98] Address already in use
2020-11-12 16:53:56,392	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605218036.477366833","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605218036.477361267","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
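The root failure in the first traceback above is `OSError: [Errno 98] Address already in use`: something else is already bound to the agent's Prometheus metrics export port. A quick stdlib check for whether a given port already has a listener (port number to pass in depends on your setup, e.g. the metrics_export_port):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success, i.e. a listener answered.
        return s.connect_ex((host, port)) == 0
```

Running this against the metrics port before `ray start` (or using `lsof -i :<port>`) shows which host in the cluster has the conflicting process.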

@dHannasch (Contributor)

raylet.out and raylet.err are explained at https://docs.ray.io/en/master/configure.html#logging-and-debugging. (This should be linked from https://docs.ray.io/en/master/debugging.html#backend-logging, but the link is broken at the moment. (#11956))

@roireshef (Contributor, Author)

@fyrestone @dHannasch - Thanks for your thorough responses! It definitely gives me a better direction now. I'll investigate my node configuration further.

roireshef (Contributor, Author) commented Nov 29, 2020

@fyrestone @dHannasch - while this works great now when started with ray start, it seems that when started with ray.init(...) only, the arguments don't get passed through to the dashboard, creating a different issue at startup. Did you happen to test that setup as well...?

It's not a blocker for me at the moment, although I think it's a low hanging fruit to make this fix complete. Otherwise, this issue can be closed.

@rkooo567 (Contributor)

Hey @fyrestone What's the progress on this?

@rkooo567 (Contributor)

We should fix this ASAP

@fyrestone (Contributor)

> We should fix this ASAP

There are many environment variables that affect grpc: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md

I am not sure if it is a good idea to modify the default behavior of grpc. One solution is,

  • Disable proxy for dashboard / dashboard agent by default.

@fyrestone (Contributor)

Related PR: #12598

@rkooo567 (Contributor)

I am confused about why this is related to gRPC env variables. If so, it shouldn't happen when include_dashboard is True, right? Isn't it just that we should remove all RPCs to the dashboard head when include_dashboard is False (and the error comes from the fact that the head process is not running)?

fyrestone (Contributor) commented Jan 18, 2021

> I am confused why this is related to grpc env variables? [...]

According to the comments above, this issue is related to the proxy. The problem you mentioned is another one, but as far as I know, there is no gRPC request sent from the dashboard agent to the dashboard.

  • The dashboard is the controller; it calls RPCs to GCS / raylet / dashboard agent through gRPC.
  • The dashboard agent reports data to the dashboard through Redis publish; the dashboard subscribes to the data.

After we fix the proxy problem, the dashboard agent will register with the raylet successfully, and this issue will be fixed.

It's possible to make the dashboard agent not start when include_dashboard is set to false:

  1. Translate the include_dashboard option into the _system_config in ray.init().
  2. The raylet starts the dashboard agent only if include_dashboard in _system_config is True.

include_dashboard would then be a cluster config; a job calling ray.init() with include_dashboard takes no effect, because the cluster is already started.
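The two-step proposal boils down to a precedence rule for the flag. A toy sketch (function and dict names are mine, not Ray's actual _system_config handling):

```python
# Toy precedence sketch: the cluster-level setting, fixed at `ray start`
# time, wins over any per-job ray.init() value, as described above.
def effective_include_dashboard(cluster_config: dict, job_kwargs: dict) -> bool:
    if "include_dashboard" in cluster_config:
        # Cluster is already started: the job-level flag takes no effect.
        return cluster_config["include_dashboard"]
    # No running cluster: ray.init() starts one and its own flag applies.
    return job_kwargs.get("include_dashboard", True)
```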

@rkooo567 (Contributor)

I understood your point, but that still doesn't explain why it causes issues only when include_dashboard is False, right? I have the impression the reported problem around the proxy is a symptom of something else rather than the root cause.

@rkooo567 (Contributor)

RE: not starting agents when include_dashboard is False => I am not sure it is a good idea, given that the dashboard agent does other things like collecting metrics.

@fyrestone (Contributor)

> I understood your point, but that still didn't explain why it causes issues only when include_dashboard is False right? [...]

Because of the proxy issue, it will fail even if include_dashboard is True. According to the traceback:

  File "python3.8/site-packages/ray/new_dashboard/agent.py", line 169, in run
    await raylet_stub.RegisterAgent(

@rkooo567 (Contributor)

Sorry if I am missing some context; I couldn't read everything in this thread because it was too long.

So @roireshef says it happens only when include_dashboard is False, right? @roireshef, can you verify whether you can reproduce this when it is True as well?

@fyrestone (Contributor)

> RE: not starting agents when include_dashboard is False [...]

So, here are two fixes:

  1. Fix the gRPC proxy issue by setting no proxy by default. (It changes the default behavior of gRPC, but only for the internal connections between Ray components, so I think it's OK.)
  2. Not starting agents when include_dashboard is False. (I think it's unnecessary, but I can create a PR if needed.)
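Fix 1 amounts to passing a single gRPC channel option, which the fix shown later in this thread hard-codes when creating the channel. The shape of the option can be sketched without grpcio installed (the `with_no_proxy` helper is my name, added for illustration):

```python
# gRPC channel options are plain (name, value) pairs. This particular
# one disables the client-side HTTP proxy for a single channel, without
# touching the process-wide http_proxy/https_proxy environment variables.
NO_PROXY_OPTION = ("grpc.enable_http_proxy", 0)

def with_no_proxy(options=()):
    """Prepend the no-proxy option unless the caller already set it.
    Hypothetical helper; Ray's fix passes the option directly."""
    if any(name == "grpc.enable_http_proxy" for name, _ in options):
        return tuple(options)
    return (NO_PROXY_OPTION,) + tuple(options)
```

With grpcio, the result would be passed as the `options` argument of `grpc.insecure_channel` (or its aio equivalent), affecting only Ray's internal connections.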

@rkooo567 (Contributor)

I also think 2 is unnecessary if the issue is indeed happening regardless of the flag.

@rkooo567 (Contributor)

Btw, thanks for looking into this! Are you reviewing the existing PR now?

@fyrestone (Contributor)

> Btw, thanks for looking into this! Are you reviewing the existing PR now?

I have reviewed the PR. I think a gRPC option can fix the problem; we don't need a new API for creating a gRPC channel. We can discuss it in the PR.

kyillene commented Jan 19, 2021

I was having a similar issue for a couple of hours yesterday, and your comments enlightened me a lot. Thanks!
I am using a GPU server at the university, and the dashboard was not a priority for me, so I tried to set it to False, but as you already experienced, it didn't work. Setting http_proxy and https_proxy to correct values didn't do the trick either.
In the end, my tmux pane was spammed with dashboard-related warnings even though it was set to False.
In order to get rid of these nasty warning messages, I modified the file at envs/myenv/lib/python3.7/site-packages/grpc/aio/_call.py to stop it printing warnings. Precisely, I commented out the else branch at line 285, as below:

        if response is cygrpc.EOF:
            if self._cython_call.is_locally_cancelled():
                raise asyncio.CancelledError()
            #else:
                #raise _create_rpc_error(self._cython_call._initial_metadata, self._cython_call._status)
        else:
            return response

It resolved the cluttered tmux pane problem, and I doubt I will face any consequences from this dirty workaround. Could you confirm whether that is actually the case?
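For anyone wanting to hide this warning spam without patching grpc's installed sources, a logging filter on the emitting logger is a less invasive alternative. A sketch under assumptions: the logger name "ray" and the matched message text are guesses from the tracebacks above, and warnings printed directly to stderr by other processes won't be caught:

```python
import logging

class DropAgentWarnings(logging.Filter):
    """Drop log records containing the agent-failure warning text."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return "The agent on node" not in record.getMessage()

# Attach to the logger that emits the warnings (name is an assumption):
logging.getLogger("ray").addFilter(DropAgentWarnings())
```

Like the site-packages edit, this only hides the symptom; the underlying registration failure remains.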

@rkooo567 (Contributor)

@roireshef @kyillene The possible fix was merged in the latest master. Can you guys try?

@rkooo567 (Contributor)

Also, @fyrestone Do you mind answering @kyillene ?

@fyrestone (Contributor)

> I was having a similar issue for couple of hours yesterday and your comments enlightened me a lot. [...] Could you confirm if that is actually the case?

Thank you for contributing a solution. The proxy issue has been fixed on master, but it has not been released yet. Can you try these lines in dashboard/agent.py to create the aiogrpc_raylet_channel?

class DashboardAgent(object):
    def __init__(self, ...):
        # ...
        # Create a gRPC channel without proxy.
        options = (("grpc.enable_http_proxy", 0), )
        self.aiogrpc_raylet_channel = aiogrpc.insecure_channel(
            f"{self.ip}:{self.node_manager_port}", options=options)

roireshef (Contributor, Author) commented Jan 25, 2021

> @roireshef @kyillene The possible fix was merged in the latest master. Can you guys try?

Hi @rkooo567, what is it that you want me to test? Are you referring to this?

> @fyrestone @dHannasch - while this works great now when started with ray start, it seems when started with ray.init(...) only, the arguments don't pass well to the dashboard, creating a different issue at startup. Did you happen to test that setup as well...?

@rkooo567 (Contributor)

Oh I just want to make sure your issue has been fixed!

diman82 commented Feb 11, 2021

This doesn't work for a mini-cluster; it still tries to load the dashboard (I'm on Ray 1.1.0, Windows 10):

My code:

    def test_run_e2e_hyperparam_search_mini_cluster_ray_distributed(self):
        from ray.cluster_utils import Cluster

        # Starts a head-node for the cluster.
        cluster = Cluster(
            initialize_head=True,
            head_node_args={
                "num_cpus": 1,
            })

        ray.init(address=cluster.address, include_dashboard=False)

And this is the error:


2021-02-11 14:07:29,597 INFO View the Ray dashboard at http://127.0.0.1:8265
2021-02-11 14:07:30,732 INFO worker.py:656 -- Connecting to existing Ray cluster at address: 10.240.194.92:6379
2021-02-11 14:07:31,152 WARNING worker.py:1034 -- The actor or task with ID df5a1a828c9685d3ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
2021-02-11 14:07:40,106 WARNING worker.py:1034 -- The dashboard on node TLVCMEW001410 failed with the following error:
Traceback (most recent call last):
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 187, in <module>
    dashboard = Dashboard(
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 81, in __init__
    build_dir = setup_static_dir()
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 38, in setup_static_dir
    raise OSError(
FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard (cd python/ray/new_dashboard/client && npm install && npm ci && npm run build): 'C:\\Users\\dm57337\\.conda\\envs\\py38tf\\lib\\site-packages\\ray\\new_dashboard\\client\\build'
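One way to sidestep this on platforms whose wheels ship without the dashboard frontend is to check for the build directory before opting in. A hedged sketch (the helper name and the `new_dashboard/client/build` layout are assumptions inferred from the traceback above, not a Ray API):

```python
import importlib.util
import os

def dashboard_assets_present() -> bool:
    # Locate the installed ray package; returns False if ray is not installed.
    spec = importlib.util.find_spec("ray")
    if spec is None or spec.origin is None:
        return False
    # Mirror the path that setup_static_dir() complains about in the traceback.
    build_dir = os.path.join(
        os.path.dirname(spec.origin), "new_dashboard", "client", "build")
    return os.path.isdir(build_dir)
```

This could then gate the call site, e.g. `ray.init(..., include_dashboard=dashboard_assets_present())`.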

@fyrestone
Copy link
Contributor

> This doesn't work for a mini-cluster; it still tries to load the dashboard (I'm on ray version 1.1.0, Windows 10): […]

I have noticed that the dashboard frontend is not built in Windows CI:

install_npm_project() {
  if [ "${OSTYPE}" = msys ]; then
    # Not Windows-compatible: https://github.com/npm/cli/issues/558#issuecomment-584673763
    { echo "WARNING: Skipping NPM due to module incompatibilities with Windows"; } 2> /dev/null
  else
    npm ci -q
  fi
}

build_dashboard_front_end() {
  if [ "${OSTYPE}" = msys ]; then
    { echo "WARNING: Skipping dashboard due to NPM incompatibilities with Windows"; } 2> /dev/null
  else
    (
      cd ray/new_dashboard/client

      if [ -z "${BUILDKITE-}" ]; then
        set +x  # suppress set -x since it'll get very noisy here
        . "${HOME}/.nvm/nvm.sh"
        nvm use --silent node
      fi
      install_npm_project
      npm run -s build
    )
  fi
}

@mxz96102 Could you look into the problem?
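Given that CI skip, a platform-based default for `include_dashboard` avoids the startup failure entirely. A minimal sketch (`default_include_dashboard` is a hypothetical helper, not part of the Ray API):

```python
import sys

def default_include_dashboard() -> bool:
    # The dashboard frontend is not built for Windows wheels (see the
    # install_npm_project skip above), so only enable it by default elsewhere.
    return sys.platform != "win32"

# Usage sketch: ray.init(include_dashboard=default_include_dashboard())
```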

@rkooo567
Copy link
Contributor

rkooo567 commented Mar 3, 2021

The dashboard is not currently supported on Windows! I think this issue should have been resolved. Please reopen if you see any other issues!
