Dashboard failures with include_dashboard set to false #11940
Although this is documented in https://docs.ray.io/en/master/configure.html?highlight=ports#ports-configurations, this actually happens even if I run "ray start" with "--include-dashboard=false".
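For reference, a minimal sketch of the programmatic counterpart of that flag (include_dashboard is the documented ray.init parameter):

import ray

# Start Ray locally without the dashboard; the programmatic counterpart
# of `ray start --include-dashboard=false`.
ray.init(include_dashboard=False)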
This looks like a bad bug. @mfitton, can you take a look at it?
Yep, I'll take a look. Thanks for reporting @roireshef, and apologies for the inconvenience. We're moving to a new dashboard backend that's currently in the nightly, so your bug report is really helpful as far as helping iron out issues before we roll this out more broadly.
I've noticed this is because in the new dashboard architecture we start up the dashboard agent regardless of whether the dashboard itself is included. @fyrestone, I'm planning on creating a PR to make the dashboard agent not start when include_dashboard is False.
The immediate issue I noticed is that people cannot export metrics, right?
Yep, that's true, they wouldn't be able to export metrics without running the dashboard.
I think that's not ideal though. I can imagine users who want to export metrics while they don't have the dashboard running.
Yeah. It's kind of a general problem that people might want to run certain backend dashboard modules without running the actual web UI, especially as more APIs are introduced that are accessed via the dashboard. That said, the main reason to not run the dashboard is performance-oriented, as having it run generally won't cause any other issues. We might eventually want to move to an API where a user can specify which dashboard modules they want to run.
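To make the idea concrete, such an API might look like the following sketch (purely hypothetical; dashboard_modules is not an actual Ray parameter):

import ray

# Hypothetical parameter, for illustration only: run selected dashboard
# backend modules without serving the web UI itself.
ray.init(
    include_dashboard=False,
    dashboard_modules=["reporter", "stats_collector"],  # hypothetical
)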
Not sure what the best thing to do is for now. I can't repro the issue yet because the repro script is incomplete, but I would be surprised if the dashboard warning message is actually linked to the training failing. I'm going to try to search for where that message is coming from.
@roireshef I'm happy to keep looking into this, but I need you to provide a script that is runnable as-is. I doubt that the dashboard error message you're seeing is impacting training, as the two processes should run separately and not overlap / crash one another.
@mfitton This issue should be reproducible without the Tune code, right?
Also, about the solution: we can probably collect stats only when include_dashboard is set to True. Otherwise, start only the agents, and we can stop collecting stats from the endpoints.
@mfitton The traceback shows that the agent can't register with the raylet; it seems that the raylet process has crashed. The dashboard process has the head role; if the dashboard is not started, then the agent just collects stats and does not report to the dashboard directly (but publishes them to Redis).
@mfitton I can't attach the original script I'm using because it relies on a custom environment I developed whose code I can't expose outside of my company; I hope you understand. That said, I don't think this issue is related to the environment or RL algorithm implementation at all. Try to use any environment you have on hand. If you run it inside a Docker container like I do, I'm pretty sure it will reproduce. The exception you currently see is because you didn't define any environment for Tune to run with.
Guys, if I may, from my perspective, having metrics written to the log files so they can be viewed in TensorBoard is crucial, regardless of whether the web dashboard is working or not. I would kindly ask that this functionality not be broken; I believe it's being heavily used by others as well...
@mfitton - could you please point me to where you are starting it so I could disable it as a local hotfix? If I do that, will it disable metrics as well? Or is there any way around it?
@roireshef I don't have any RLlib/Tune environments on hand, as I don't do any ML work. The Tune metrics written to log files are not affected by whether the dashboard is running or not. You'll still be able to run TensorBoard by starting it with the log directory that Tune writes to. We'll make sure not to break this functionality. That said, what's crashing your program isn't the dashboard. The message you're seeing logged by the agent, like fyrestone said, isn't causing your Ray cluster to crash, but is rather a symptom that the raylet (which handles scheduling tasks, among other things) has crashed. Could you include your raylet.err and raylet.out log files? Those would be more helpful as far as getting to the bottom of this.
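As a concrete illustration (assuming Tune's default results directory, ~/ray_results), TensorBoard can be started on the Tune logs without the dashboard:

import os
from tensorboard import program

# Launch TensorBoard programmatically on Tune's default output directory;
# this works whether or not the Ray dashboard is running.
tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", os.path.expanduser("~/ray_results")])
print("TensorBoard listening at", tb.launch())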
@mfitton
@mfitton - I'll try to provide some more information in the meantime. This is how I set up the Ray cluster (handshakes between nodes):
Head (inside Docker):
Worker nodes (inside Docker, different machine(s)):
After I do that, I call:
You can find the tune.run() part in the examples, including environment implementations. Alternatively, this also reproduces without setting up a Ray cluster, as I described in the body of this issue (above). Observing the console produces a stream of exceptions that all look similar (note that this time I captured a more informative one than the one attached to the body of this issue; the *** parts are redacted for security reasons):
raylet.out and raylet.err are explained at https://docs.ray.io/en/master/configure.html#logging-and-debugging. (This should be linked from https://docs.ray.io/en/master/debugging.html#backend-logging, but the link is broken at the moment. (#11956))
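For example, with the default configuration the raylet logs can be located like this (a sketch; /tmp/ray/session_latest is the symlink Ray keeps to the current session directory):

import glob

# List the raylet log files of the current Ray session.
for path in glob.glob("/tmp/ray/session_latest/logs/raylet.*"):
    print(path)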
@fyrestone @dHannasch - Thanks for your thorough response! It definitely gives me a better direction now. I'll investigate my node configuration further.
@fyrestone @dHannasch - While this works great now when started with ray start, it seems that when started with ray.init(...) only, the arguments aren't passed correctly to the dashboard, creating a different issue at startup. Did you happen to test that setup as well...? It's not a blocker for me at the moment, although I think it's low-hanging fruit to make this fix complete. Otherwise, this issue can be closed.
Hey @fyrestone, what's the progress on this?
We should fix this ASAP |
There are many environment variables that affect gRPC: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md. I am not sure it is a good idea to modify the default behavior of gRPC. One solution is to disable the HTTP proxy on the agent's channel via a gRPC channel option (see the snippet later in this thread).
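A local workaround sketch, assuming the proxy environment variables are what is redirecting the agent's channel: clear them before starting Ray instead of changing gRPC's defaults.

import os

# Unset the proxy variables gRPC consults (per the document linked above)
# so channels between local Ray processes are not routed through a proxy.
for var in ("grpc_proxy", "https_proxy", "http_proxy",
            "GRPC_PROXY", "HTTPS_PROXY", "HTTP_PROXY"):
    os.environ.pop(var, None)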
Related PR: #12598 |
I am confused about why this is related to gRPC env variables. If so, it shouldn't happen only when include_dashboard is False.
According to the comments above, this issue is related to the proxy. The problem you mentioned is another one, but as far as I know, there is no gRPC request sent from the dashboard agent to the dashboard.
After we fix the proxy problem, the dashboard agent will register with the raylet successfully, and this issue will be fixed. It's possible to make the dashboard agent not be activated when include_dashboard is False.
I understood your point, but that still didn't explain why it causes issues only when include_dashboard is False.
RE: not starting agents when include_dashboard is False => I am not sure it is a good idea, given the dashboard agent does other things, like collecting metrics.
Because of the proxy issue, it will fail even when include_dashboard is True.
Sorry if I am missing some context; I couldn't read everything in this thread because it was too long. So @roireshef says it happens only when include_dashboard is False, right? @roireshef Can you verify whether you can reproduce this when it is True as well?
So, here are two fixes: (1) fix the gRPC proxy issue so the agent can register with the raylet, and (2) don't start the dashboard agent when include_dashboard is False.
I also think 2 is unnecessary if the issue is indeed happening regardless of the flag. |
Btw, thanks for looking into this! Are you reviewing the existing PR now? |
I have reviewed the PR. I think a gRPC option can fix the problem; we don't need a new API for creating a gRPC channel. We can discuss it in the PR.
I was having a similar issue for a couple of hours yesterday, and your comments enlightened me a lot. Thanks! It resolved the cluttered tmux pane problem, and I doubt that I will have any consequences due to this dirty workaround. Could you confirm whether that is actually the case?
@roireshef @kyillene The possible fix was merged in the latest master. Can you guys try? |
Also, @fyrestone, do you mind answering @kyillene?
Thank you for contributing a solution. The proxy issue has been fixed in master, but it has not been released yet. Can you try these lines in class DashboardAgent?

class DashboardAgent(object):
    def __init__(self, ...):
        # ...
        # Create the gRPC channel to the raylet with the HTTP proxy
        # disabled, so the connection is not routed through a proxy.
        options = (("grpc.enable_http_proxy", 0), )
        self.aiogrpc_raylet_channel = aiogrpc.insecure_channel(
            f"{self.ip}:{self.node_manager_port}", options=options)
Hi @rkooo567, what is it that you want me to test? Are you referring to this?
Oh I just want to make sure your issue has been fixed! |
Doesn't work for the mini cluster; it's still trying to load the dashboard (I'm on Ray version 1.1.0, Windows 10). My code:
And this is the error:
I have noticed that the dashboard frontend is not built in Windows CI:

install_npm_project() {
if [ "${OSTYPE}" = msys ]; then
# Not Windows-compatible: https://github.com/npm/cli/issues/558#issuecomment-584673763
{ echo "WARNING: Skipping NPM due to module incompatibilities with Windows"; } 2> /dev/null
else
npm ci -q
fi
}
build_dashboard_front_end() {
if [ "${OSTYPE}" = msys ]; then
{ echo "WARNING: Skipping dashboard due to NPM incompatibilities with Windows"; } 2> /dev/null
else
(
cd ray/new_dashboard/client
if [ -z "${BUILDKITE-}" ]; then
set +x # suppress set -x since it'll get very noisy here
. "${HOME}/.nvm/nvm.sh"
nvm use --silent node
fi
install_npm_project
npm run -s build
)
fi
}

@mxz96102 Could you look into the problem?
The dashboard is not currently supported on Windows! I think this issue should've been resolved. Please reopen if you see any other issues!
Running Tune with A3C fails straight at the beginning with the following traceback:
This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False.
It might be related to #11943, but it shouldn't happen if this flag is set to False, so it's a different issue.
Ray version and other system information (Python version, TensorFlow version, OS):
Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only
on both latest master and releases/1.0.1
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
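No runnable script survived in this report; the following is only an illustrative sketch in the spirit of the issue, substituting RLlib's built-in CartPole environment for the reporter's private one:

import ray
from ray import tune

ray.init(include_dashboard=False)

# One training iteration of A3C is enough to surface startup-time
# dashboard/agent errors like the one reported above.
tune.run(
    "A3C",
    config={"env": "CartPole-v0", "num_workers": 1},
    stop={"training_iteration": 1},
)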