Request sent to wrong server #2227
Does it enter a crash loop where it keeps on failing with this assert and hence isn't able to become the leader?
Ok, this happens during a query. From the logs, the request was sent to a node serving group 2, and it thought that the predicate was not served by it. Do you know if there could have been a drop of the data or of that predicate around that time?
We do not drop any items from the graph whilst the pipeline is operational; we just drop everything when redeploying to run a fresh test. In a recent crash around 2018/03/14 20:30:34, we see that the affected server recovers as leader of its group, but all further predicate moves are met by "DeadlineExceeded" errors: (server-0: logs_server-0.txt)
(zero: logs_zero.txt)
Other restarts before this seem to recover fine. Logs from other servers:
We were wondering also whether the following is an issue?
The assert here happens after a predicate move in all cases.
What could happen is that the predicate move is still in progress for that predicate when the query arrives.
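To make the race concrete, here is a schematic sketch in Go of the pattern being described, not Dgraph's actual code: if a query arrives for a predicate the node no longer serves (for example because a tablet move is in progress), the node can log a warning and return a retriable error instead of hitting an assert that crashes the process. The predicate names and the `served` map are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// served is a toy stand-in for this node's view of which predicates
// (tablets) it currently serves; the entries are made up for the example.
var served = map[string]bool{"name": true}

// errWrongServer signals a retriable routing problem rather than a fatal one.
var errWrongServer = errors.New("request sent to wrong server; retry the query")

func handleQuery(pred string) error {
	if !served[pred] {
		// Instead of asserting (and crashing), surface a warning and let
		// the caller retry once the predicate move has settled.
		log.Printf("warning: predicate %q is not served by this node (move in progress?)", pred)
		return errWrongServer
	}
	// ... run the query against locally served data ...
	return nil
}

func main() {
	if err := handleQuery("age"); err != nil {
		fmt.Println("caller should retry:", err)
	}
}
```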
The message you were wondering about is not the reason for the crash; it should typically be a temporary issue. I will modify the logs to print it as a warning to make that clear.
Interestingly, not all further predicate moves were met by the error; after that, predicate moves were successful for a while, until it seems the connection somehow got broken and wasn't reconnected. I am checking this.
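On the observation that the connection got broken and was not reconnected: below is a general-purpose gRPC client sketch in Go, not Dgraph's internal code, showing how keepalive pings detect dead connections so the client can transparently re-dial, and how WaitForReady makes calls block until the connection is usable again instead of failing immediately. The address and timing values are illustrative assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // plaintext, for illustration only
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping when idle this long
			Timeout:             10 * time.Second, // wait this long for the ack
			PermitWithoutStream: true,
		}),
		// Queue RPCs until the connection is READY rather than failing fast.
		grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	)
}

func main() {
	conn, err := dial("localhost:9080") // hypothetical server address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Issue RPCs on conn with a bounded context so a broken-but-undetected
	// connection surfaces as DeadlineExceeded rather than hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = ctx
}
```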
I have pushed a docker image at dgraph/dgraph:dev. It would be great if you could use it to load some data and share the logs. Otherwise, if you can share some steps to reproduce, like some sample data/scripts, then I am happy to bring up a Kubernetes cluster and debug this at my end. Keen to resolve these issues as soon as possible.
I see, it would be good if we could just retry the query and avoid having the server crash (a client-side retry sketch follows after this comment).
Ah yes, I was not initially clear: there were multiple server crashes beforehand, but that particular crash at 20:30:34 seemed to be what broke things.
We have pulled this image for testing; here are some logs surrounding a crash:
What new log types should there be? We could not spot anything new, but it could be subtle.
We are interested in creating a new sample rig; the current issue is that our pipeline now spans 3 pod types separated by message queues. We will try to come up with a single-pod approximation when possible so no additional requirements are needed.
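Regarding the point above about retrying the query rather than letting the server crash: here is a hypothetical client-side retry wrapper in Go; `queryFn` and the backoff values are assumptions for the sketch, not part of any Dgraph client API. Transient errors (such as a predicate move still being in progress) are retried with exponential backoff.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryFn stands in for whatever function actually runs the query;
// it is an assumption for this sketch, not a real Dgraph client method.
type queryFn func(ctx context.Context, q string) ([]byte, error)

func queryWithRetry(ctx context.Context, run queryFn, q string, attempts int) ([]byte, error) {
	backoff := 100 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := run(ctx, q)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		}
	}
	return nil, fmt.Errorf("query failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Fake query function that fails twice before succeeding, to show usage.
	calls := 0
	fake := func(ctx context.Context, q string) ([]byte, error) {
		calls++
		if calls < 3 {
			return nil, errors.New("predicate is being moved, retry")
		}
		return []byte(`{"data":{}}`), nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := queryWithRetry(ctx, fake, "{ q(func: has(name)) { name } }", 5)
	fmt.Println(string(resp), err)
}
```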
Zero: dgraph-zero-0_logs.txt
Could you share full logs from all the servers, please, like before? Specifically, I am looking for a log saying this in the server logs.
Since your crash is for that predicate, I want to see which server got to serve it.
I have also updated the image (dgraph/dgraph:dev) to print more logs for Dgraph Zero (i.e. it logs all tablet proposals that it gets). If you can force-pull it again before doing another run, that would be helpful. I think I have a good idea of what is happening; the logs will help confirm.
Hi, we ran the updated image today. We see server crashes, but we no longer see the cause of the crash in the logs. Here is everything we have:
The only thing I find odd is that the memory usage of dgraph-server-2 spikes (29GB out of 32GB for that node) at around 17:56, and it crashes soon after.
Since there isn't anything in the logs, I suspect these crashes are because of OOM, though the exact reason can only be found out by inspecting the pods. The crashes that you were seeing earlier were because of a race condition; maybe you don't see them now because that condition isn't getting triggered yet. If you run it again, I suspect it should trigger. Also, is it possible to take a heap profile when the memory usage is high and share it? Thanks for the logs; I will see what I can find and update here.
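For the heap profile request: Go servers that mount the standard net/http/pprof handlers expose a `/debug/pprof/heap` endpoint; the sketch below fetches it over HTTP and writes it to a file for `go tool pprof`. The host and port are assumptions; adjust them to wherever the server's HTTP endpoint is reachable.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Binary heap profile, consumable by `go tool pprof heap.pprof`.
	url := "http://localhost:8080/debug/pprof/heap" // hypothetical address
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote heap.pprof; inspect with: go tool pprof heap.pprof")
}
```

Taking the profile while memory usage is high (e.g. when dgraph-server-2 approaches its limit) gives the most useful snapshot.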
A fix for this (Request sent to the wrong server) has been merged to master and is also part of the updated dgraph/dgraph:dev image.
As suggested, creating a new issue to track overflow from: #2172
Summary: