
Dashboard failures with include_dashboard set to false #11940

Closed

roireshef opened this issue Nov 11, 2020 · 63 comments
Labels: bug (Something that is supposed to be working; but isn't), dashboard (Issues specific to the Ray Dashboard), fix-error-msg (This issue has a bad error message that should be improved), P1 (Issue that should be fixed within a few weeks)

Comments

roireshef (Contributor) commented Nov 11, 2020

Running Tune with A3C fails straight at the beginning with the following traceback:

2020-11-11 14:13:37,114	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False.
It might be related to #11943, but it shouldn't happen if this flag is set to False, so it's a different issue.

Ray version and other system information (Python version, TensorFlow version, OS):
Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only
on both latest master and releases/1.0.1

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

    import ray
    from ray import tune
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init(include_dashboard=False)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@roireshef roireshef added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 11, 2020
@roireshef (Contributor, Author)

Although this is documented in https://docs.ray.io/en/master/configure.html?highlight=ports#ports-configurations, when running ray.init() without "ray start" beforehand, the include_dashboard flag doesn't work well.

@roireshef roireshef reopened this Nov 11, 2020
@roireshef (Contributor, Author)

This actually happens even if I run "ray start" with "--include-dashboard=false"

@roireshef roireshef changed the title [tune] [rllib] grpc.experimental.aio._call.AioRpcError: failed to connect to all addresses Dashboard failures with include_dashboard set to false Nov 11, 2020
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 11, 2020
@rkooo567 rkooo567 added the dashboard Issues specific to the Ray Dashboard label Nov 11, 2020
@rkooo567 (Contributor)

This looks like a bad bug. @mfitton can you take a look at it?

mfitton (Contributor) commented Nov 11, 2020

Yep, I'll take a look. Thanks for reporting @roireshef, and apologies for the inconvenience. We're moving to a new dashboard backend that's currently in the nightly, so your bug report is really helpful for ironing out issues before we roll this out more broadly.

mfitton (Contributor) commented Nov 11, 2020

I've noticed this is because, in the new dashboard architecture, we start up the dashboard agent regardless of whether include_dashboard is specified. This is likely because the dashboard agent is the entity that receives Ray stats via gRPC for export to Prometheus.

@fyrestone I'm planning on creating a PR to make the dashboard agent not start when include_dashboard is false. Am I missing any issues that doing this could cause?
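The gate being proposed here can be sketched as a single predicate. This is a hypothetical sketch, not Ray's actual startup code (which lives in the raylet/agent wiring); it also shows why the decision cannot hinge on include_dashboard alone, since the agent doubles as the Prometheus metrics exporter:

```python
# Hypothetical sketch: decide whether the dashboard agent should start.
# Today the raylet launches the agent unconditionally; the PR discussed
# in this thread would add a check along these lines.
def should_start_dashboard_agent(include_dashboard: bool,
                                 export_metrics: bool) -> bool:
    # The agent serves two roles: feeding the web UI and exporting
    # stats to Prometheus, so either need keeps it alive.
    return include_dashboard or export_metrics
```

Gating on include_dashboard alone would return False in the second role too, which is exactly the metrics concern raised in the replies.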

@rkooo567 (Contributor)

The immediate issue I noticed is that people cannot export metrics?

mfitton (Contributor) commented Nov 11, 2020

Yep, that's true, they wouldn't be able to export metrics without running the dashboard.

@rkooo567 (Contributor)

I think that’s not ideal though. I can imagine users who want to export metrics while they don’t have the dashboard

mfitton (Contributor) commented Nov 11, 2020

Yeah. It's kind of a general problem that people might want to run certain backend dashboard modules without running the actual web ui. Especially when more APIs are introduced that are accessed via the dashboard.

That said, the main reason to want to not run the dashboard is performance-oriented, as it generally won't cause any other issues to have it run.

We might want to eventually move to an API where a user can specify which dashboard modules they want to run.

mfitton (Contributor) commented Nov 11, 2020

Not sure what the best thing to do is for now. I can't repro the issue yet because the repro script is incomplete, but I would be surprised if the dashboard warning message is actually linked to the training failing. I'm going to try to search for where the A3CTrainer comes from.

mfitton (Contributor) commented Nov 11, 2020

I found A3CTrainer in ray.rllib.agents.a3c. That said, the provided script fails with the following error:

Traceback (most recent call last):
  File "/Users/maxfitton/Development/ray/python/ray/tune/trial_runner.py", line 547, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/maxfitton/Development/ray/python/ray/tune/ray_trial_executor.py", line 484, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/maxfitton/Development/ray/python/ray/worker.py", line 1472, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::A3C.train() (pid=80578, ip=192.168.50.14)
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task
    worker.memory_monitor.raise_if_low_memory()
  File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
    task_exception = True
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
    task_exception = False
  File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task.function_executor
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer_template.py", line 106, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer.py", line 445, in __init__
    self._env_id = self._register_if_needed(env or config.get("env"))
  File "/Users/maxfitton/Development/ray/python/ray/rllib/agents/trainer.py", line 1179, in _register_if_needed
    "You can specify a custom env as either a class "
ValueError: None is an invalid env specification. You can specify a custom env as either a class (e.g., YourEnvCls) or a registered env id (e.g., "your_env").

@roireshef I'm happy to keep looking into this, but I need you to provide a script that is runnable as-is. I doubt that the dashboard error message you're seeing is impacting training, as the two processes should run separately and not overlap / crash one another.

@rkooo567 (Contributor)

@mfitton This issue should be reproducible without the Tune code, right?

rkooo567 (Contributor) commented Nov 11, 2020

Also, about the solution: we can probably collect stats only when include_dashboard is set to True. Otherwise, start only the agents, and we can stop collecting stats from the endpoints.

@fyrestone (Contributor)

@mfitton The traceback shows that the agent can't register with the raylet; it seems the raylet process has crashed. The dashboard process plays the head role. If the dashboard is not started, the agent just collects stats and doesn't report to the dashboard directly (it publishes to Redis instead).

@roireshef (Contributor, Author)

> I found A3CTrainer in ray.rllib.agents.a3c. That said, the provided script fails with the error [...]
> @roireshef I'm happy to keep looking into this, but I need you to provide a script that is runnable as-is.

@mfitton I can't attach the original script I'm using because it relies on a custom environment I developed, and I can't expose its code outside my company; I hope you understand. That said, I don't think this issue is related to the environment or the RL algorithm implementation at all. Try any environment you have at hand. If you run it inside a Docker container like I do, I'm pretty sure it will reproduce. The exception you currently see is because you didn't define any environment for Tune to run with.

@roireshef (Contributor, Author)

Guys, if I may, from my perspective having metrics written to the log files so they can be viewed in TensorBoard is crucial, regardless of whether the web dashboard is working or not. I would kindly ask you not to break this functionality; I believe it's heavily used by others as well...

@roireshef (Contributor, Author)

> I've noticed this is because in the new dashboard architecture we start up the dashboard agent regardless of whether include_dashboard is specified. [...]
> @fyrestone I'm planning on creating a PR to make the dashboard agent not start when include_dashboard is false. Am I missing any issues that doing this could cause?

@mfitton - could you please point me to where you are starting it so I can disable it as a local hotfix? If I do that, will it disable metrics as well? Or is there a way around it?

mfitton (Contributor) commented Nov 12, 2020

@roireshef I don't have any RLLib/Tune environments on hand, as I don't do any ML work.

The Tune metrics written to log files are not affected by whether the dashboard is running or not. You'll still be able to run TensorBoard by pointing it at the log directory that Tune writes to. We'll make sure not to break this functionality.

That said, what's crashing your program isn't the dashboard. The message you're seeing logged by the agent, as fyrestone said, isn't causing your Ray cluster to crash; rather, it is a symptom that the raylet (which handles scheduling tasks, among other things) has crashed.

Could you include your raylet.err and raylet.out log files? Those would be more helpful as far as getting to the bottom of this.

@roireshef (Contributor, Author)

@mfitton
A. Sure, I can send the raylet files, but where can I find them?
B. If you have Ray installed, you already have RLlib-ready environments at hand. See: https://github.com/ray-project/ray/tree/master/rllib/examples - I think you'll find it very useful to pick one and close a training loop for debugging...

roireshef (Contributor, Author) commented Nov 12, 2020

@mfitton - I'll try to provide some more information in the meantime:

This is how I set up the Ray cluster (handshake between nodes):

Head (inside docker):

    ray start --block --head --port=$redis_port --redis-password=$redis_password \
      --node-ip-address=$head_node_ip --gcs-server-port=6005 --dashboard-port=6006 \
      --node-manager-port=6007 --object-manager-port=6008 \
      --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 \
      --min-worker-port=6100 --max-worker-port=6299 --include-dashboard=false

Worker Nodes (inside docker, different machine(s)):

    ray start --block --address=$head_node_ip:$redis_port --redis-password=$redis_password \
      --node-ip-address=$worker_node_ip --node-manager-port=6007 --object-manager-port=6008 \
      --min-worker-port=6100 --max-worker-port=6299

After I do that, I call:

    ray.init(address=$head_node_ip:$redis_port, _redis_password=$redis_password)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )

The tune.run() part you can find in the examples, including environment implementations. Alternatively, this also reproduces without setting up a Ray cluster, as described in the body of this issue (above).

Observing the console shows a flow of exceptions, all similar (note that this time I captured a more informative one than the one attached to the body of this issue; the *** parts are redacted for security reasons):

2020-11-12 16:53:56,179	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
    c = cls(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

(pid=raylet, ip=***) Traceback (most recent call last):
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 308, in <module>
(pid=raylet, ip=***)     raise e
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
(pid=raylet, ip=***)     loop.run_until_complete(agent.run())
(pid=raylet, ip=***)   File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(pid=raylet, ip=***)     return future.result()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
(pid=raylet, ip=***)     modules = self._load_modules()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
(pid=raylet, ip=***)     c = cls(self)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
(pid=raylet, ip=***)     self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
(pid=raylet, ip=***)     namespace="ray", port=metrics_export_port)))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
(pid=raylet, ip=***)     options=option, gatherer=option.registry, collector=collector)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
(pid=raylet, ip=***)     self.serve_http()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
(pid=raylet, ip=***)     port=self.options.port, addr=str(self.options.address))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
(pid=raylet, ip=***)     httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
(pid=raylet, ip=***)     server = server_class((host, port), handler_class)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
(pid=raylet, ip=***)     self.server_bind()
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
(pid=raylet, ip=***)     HTTPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
(pid=raylet, ip=***)     socketserver.TCPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
(pid=raylet, ip=***)     self.socket.bind(self.server_address)
(pid=raylet, ip=***) OSError: [Errno 98] Address already in use
2020-11-12 16:53:56,392	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605218036.477366833","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605218036.477361267","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
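The root failure in the first traceback above is `OSError: [Errno 98] Address already in use`: something else is already bound to the agent's Prometheus metrics export port. A quick stdlib check for whether a given port already has a listener (port number to pass in depends on your setup, e.g. the metrics_export_port):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success, i.e. a listener answered.
        return s.connect_ex((host, port)) == 0
```

Running this against the metrics port before `ray start` (or using `lsof -i :<port>`) shows which host in the cluster has the conflicting process.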

@dHannasch (Contributor)

raylet.out and raylet.err are explained at https://docs.ray.io/en/master/configure.html#logging-and-debugging. (This should be linked from https://docs.ray.io/en/master/debugging.html#backend-logging, but the link is broken at the moment. (#11956))

@roireshef (Contributor, Author)

@fyrestone @dHannasch - Thanks for your thorough responses! It definitely gives me a better direction now. I'll investigate my node configuration further.

roireshef (Contributor, Author) commented Nov 29, 2020

@fyrestone @dHannasch - while this works great now when started with ray start, it seems that when started with ray.init(...) only, the arguments don't get passed through to the dashboard, creating a different issue at startup. Did you happen to test that setup as well...?

It's not a blocker for me at the moment, although I think it's a low hanging fruit to make this fix complete. Otherwise, this issue can be closed.

@rkooo567 (Contributor)

Hey @fyrestone What's the progress on this?

@rkooo567 (Contributor)

We should fix this ASAP

@fyrestone (Contributor)

> We should fix this ASAP

There are many environment variables that affect grpc: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md

I am not sure if it is a good idea to modify the default behavior of grpc. One solution is,

  • Disable proxy for dashboard / dashboard agent by default.

@fyrestone (Contributor)

Related PR: #12598

@rkooo567 (Contributor)

I am confused about why this is related to gRPC env variables. If so, it shouldn't happen when include_dashboard is True, right? Isn't it just that we should remove all RPCs to the dashboard head when include_dashboard is False (and the error comes from the fact that the head process is not running)?

fyrestone (Contributor) commented Jan 18, 2021

> I am confused why this is related to grpc env variables? [...]

According to the comments above, this issue is related to the proxy. The problem you mentioned is another one, but as far as I know, there is no gRPC request sent from the dashboard agent to the dashboard.

  • The dashboard is the controller; it calls RPCs to GCS / raylet / dashboard agent through gRPC.
  • The dashboard agent reports data to the dashboard through Redis publish; the dashboard subscribes to the data.

After we fix the proxy problem, the dashboard agent will register with the raylet successfully, and this issue will be fixed.

It's possible to make the dashboard agent not start when include_dashboard is set to false:

  1. Translate the include_dashboard option into the _system_config in ray.init().
  2. The raylet starts the dashboard agent only if include_dashboard in _system_config is True.

include_dashboard would then be a cluster config; a job calling ray.init() with include_dashboard takes no effect, because the cluster is already started.
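The two-step proposal boils down to a precedence rule for the flag. A toy sketch (function and dict names are mine, not Ray's actual _system_config handling):

```python
# Toy precedence sketch: the cluster-level setting, fixed at `ray start`
# time, wins over any per-job ray.init() value, as described above.
def effective_include_dashboard(cluster_config: dict, job_kwargs: dict) -> bool:
    if "include_dashboard" in cluster_config:
        # Cluster is already started: the job-level flag takes no effect.
        return cluster_config["include_dashboard"]
    # No running cluster: ray.init() starts one and its own flag applies.
    return job_kwargs.get("include_dashboard", True)
```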

@rkooo567 (Contributor)

I understood your point, but that still doesn't explain why it causes issues only when include_dashboard is False, right? I have the impression the reported problem around the proxy is a symptom of something else rather than the root cause.

@rkooo567 (Contributor)

RE: not starting agents when include_dashboard is False => I am not sure it is a good idea, given that the dashboard agent does other things like collecting metrics.

@fyrestone (Contributor)

> I understood your point, but that still didn't explain why it causes issues only when include_dashboard is False right? [...]

Because of the proxy issue, it will fail even if include_dashboard is True. According to the traceback:

  File "python3.8/site-packages/ray/new_dashboard/agent.py", line 169, in run
    await raylet_stub.RegisterAgent(

@rkooo567 (Contributor)

Sorry if I am missing some context; I couldn't read everything in this thread because it was too long.

So @roireshef says it happens only when include_dashboard is False, right? @roireshef, can you verify whether you can reproduce this when it is True as well?

@fyrestone (Contributor)

> RE: not starting agents when include_dashboard is False [...]

So, here are two fixes:

  1. Fix the gRPC proxy issue by setting no proxy by default. (It changes the default behavior of gRPC, but only for the internal connections between Ray components, so I think it's OK.)
  2. Not starting agents when include_dashboard is False. (I think it's unnecessary, but I can create a PR if needed.)
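Fix 1 amounts to passing a single gRPC channel option, which the fix shown later in this thread hard-codes when creating the channel. The shape of the option can be sketched without grpcio installed (the `with_no_proxy` helper is my name, added for illustration):

```python
# gRPC channel options are plain (name, value) pairs. This particular
# one disables the client-side HTTP proxy for a single channel, without
# touching the process-wide http_proxy/https_proxy environment variables.
NO_PROXY_OPTION = ("grpc.enable_http_proxy", 0)

def with_no_proxy(options=()):
    """Prepend the no-proxy option unless the caller already set it.
    Hypothetical helper; Ray's fix passes the option directly."""
    if any(name == "grpc.enable_http_proxy" for name, _ in options):
        return tuple(options)
    return (NO_PROXY_OPTION,) + tuple(options)
```

With grpcio, the result would be passed as the `options` argument of `grpc.insecure_channel` (or its aio equivalent), affecting only Ray's internal connections.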

@rkooo567 (Contributor)

I also think 2 is unnecessary if the issue is indeed happening regardless of the flag.

@rkooo567 (Contributor)

Btw, thanks for looking into this! Are you reviewing the existing PR now?

@fyrestone (Contributor)

> Btw, thanks for looking into this! Are you reviewing the existing PR now?

I have reviewed the PR. I think a gRPC option can fix the problem; we don't need a new API for creating a gRPC channel. We can discuss it in the PR.

kyillene commented Jan 19, 2021

I was having a similar issue for a couple of hours yesterday, and your comments enlightened me a lot. Thanks!
I am using a GPU server at the university, and the dashboard was not a priority for me, so I tried to set it to False, but as you already experienced, it didn't work. Setting http_proxy and https_proxy to correct values didn't do the trick either.
In the end, my tmux pane was spammed with dashboard-related warnings even though it was set to False.
In order to get rid of these nasty warning messages, I modified the file at envs/myenv/lib/python3.7/site-packages/grpc/aio/_call.py to stop it printing warnings. Precisely, I commented out the else branch at line 285, as below:

        if response is cygrpc.EOF:
            if self._cython_call.is_locally_cancelled():
                raise asyncio.CancelledError()
            #else:
                #raise _create_rpc_error(self._cython_call._initial_metadata, self._cython_call._status)
        else:
            return response

It resolved the cluttered tmux pane problem, and I doubt I will face any consequences from this dirty workaround. Could you confirm whether that is actually the case?
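For anyone wanting to hide this warning spam without patching grpc's installed sources, a logging filter on the emitting logger is a less invasive alternative. A sketch under assumptions: the logger name "ray" and the matched message text are guesses from the tracebacks above, and warnings printed directly to stderr by other processes won't be caught:

```python
import logging

class DropAgentWarnings(logging.Filter):
    """Drop log records containing the agent-failure warning text."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return "The agent on node" not in record.getMessage()

# Attach to the logger that emits the warnings (name is an assumption):
logging.getLogger("ray").addFilter(DropAgentWarnings())
```

Like the site-packages edit, this only hides the symptom; the underlying registration failure remains.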

@rkooo567 (Contributor)

@roireshef @kyillene The possible fix was merged in the latest master. Can you guys try?

@rkooo567 (Contributor)

Also, @fyrestone Do you mind answering @kyillene ?

@fyrestone (Contributor)

> I was having a similar issue for couple of hours yesterday and your comments enlightened me a lot. [...] Could you confirm if that is actually the case?

Thank you for contributing a solution. The proxy issue has been fixed on master, but it has not been released yet. Can you try these lines in dashboard/agent.py to create the aiogrpc_raylet_channel?

class DashboardAgent(object):
    def __init__(self, ...):
        # ...
        # Create a gRPC channel without proxy.
        options = (("grpc.enable_http_proxy", 0), )
        self.aiogrpc_raylet_channel = aiogrpc.insecure_channel(
            f"{self.ip}:{self.node_manager_port}", options=options)

roireshef (Contributor, Author) commented Jan 25, 2021

> @roireshef @kyillene The possible fix was merged in the latest master. Can you guys try?

Hi @rkooo567, what is it that you want me to test? Are you referring to this?

> @fyrestone @dHannasch - while this works great now when started with ray start, it seems when started with ray.init(...) only, the arguments don't pass well to the dashboard, creating a different issue at startup. Did you happen to test that setup as well...?

@rkooo567 (Contributor)

Oh I just want to make sure your issue has been fixed!

diman82 commented Feb 11, 2021

This doesn't work for a mini-cluster; it still tries to load the dashboard (I'm on Ray 1.1.0, Windows 10):

My code:

    def test_run_e2e_hyperparam_search_mini_cluster_ray_distributed(self):
        from ray.cluster_utils import Cluster

        # Starts a head-node for the cluster.
        cluster = Cluster(
            initialize_head=True,
            head_node_args={
                "num_cpus": 1,
            })

        ray.init(address=cluster.address, include_dashboard=False)

And this is the error:


2021-02-11 14:07:29,597 INFO View the Ray dashboard at http://127.0.0.1:8265
2021-02-11 14:07:30,732 INFO worker.py:656 -- Connecting to existing Ray cluster at address: 10.240.194.92:6379
2021-02-11 14:07:31,152 WARNING worker.py:1034 -- The actor or task with ID df5a1a828c9685d3ffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
2021-02-11 14:07:40,106 WARNING worker.py:1034 -- The dashboard on node TLVCMEW001410 failed with the following error:
Traceback (most recent call last):
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 187, in <module>
    dashboard = Dashboard(
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 81, in __init__
    build_dir = setup_static_dir()
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\new_dashboard\dashboard.py", line 38, in setup_static_dir
    raise OSError(
FileNotFoundError: [Errno 2] Dashboard build directory not found. If installing from source, please follow the additional steps required to build the dashboard (cd python/ray/new_dashboard/client && npm install && npm ci && npm run build): 'C:\\Users\\dm57337\\.conda\\envs\\py38tf\\lib\\site-packages\\ray\\new_dashboard\\client\\build'
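One way to sidestep this on platforms whose wheels ship without the dashboard frontend is to check for the build directory before opting in. A hedged sketch (the helper name and the `new_dashboard/client/build` layout are assumptions inferred from the traceback above, not a Ray API):

```python
import importlib.util
import os

def dashboard_assets_present() -> bool:
    # Locate the installed ray package; returns False if ray is not installed.
    spec = importlib.util.find_spec("ray")
    if spec is None or spec.origin is None:
        return False
    # Mirror the path that setup_static_dir() complains about in the traceback.
    build_dir = os.path.join(
        os.path.dirname(spec.origin), "new_dashboard", "client", "build")
    return os.path.isdir(build_dir)
```

This could then gate the call site, e.g. `ray.init(..., include_dashboard=dashboard_assets_present())`.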

@fyrestone
Copy link
Contributor

> This doesn't work for a mini-cluster; it still tries to load the dashboard (I'm on ray version 1.1.0, Windows 10): […]

I have noticed that the dashboard frontend is not built in Windows CI:

install_npm_project() {
  if [ "${OSTYPE}" = msys ]; then
    # Not Windows-compatible: https://github.com/npm/cli/issues/558#issuecomment-584673763
    { echo "WARNING: Skipping NPM due to module incompatibilities with Windows"; } 2> /dev/null
  else
    npm ci -q
  fi
}

build_dashboard_front_end() {
  if [ "${OSTYPE}" = msys ]; then
    { echo "WARNING: Skipping dashboard due to NPM incompatibilities with Windows"; } 2> /dev/null
  else
    (
      cd ray/new_dashboard/client

      if [ -z "${BUILDKITE-}" ]; then
        set +x  # suppress set -x since it'll get very noisy here
        . "${HOME}/.nvm/nvm.sh"
        nvm use --silent node
      fi
      install_npm_project
      npm run -s build
    )
  fi
}

@mxz96102 Could you look into the problem?
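Given that CI skip, a platform-based default for `include_dashboard` avoids the startup failure entirely. A minimal sketch (`default_include_dashboard` is a hypothetical helper, not part of the Ray API):

```python
import sys

def default_include_dashboard() -> bool:
    # The dashboard frontend is not built for Windows wheels (see the
    # install_npm_project skip above), so only enable it by default elsewhere.
    return sys.platform != "win32"

# Usage sketch: ray.init(include_dashboard=default_include_dashboard())
```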

@rkooo567
Copy link
Contributor

rkooo567 commented Mar 3, 2021

The dashboard is not currently supported on Windows! I think this issue should have been resolved. Please reopen if you see any other issues!
