Unbalanced server load #2172
Most probably your queries are using the predicates present on group 1. Zero balances based on size. Can you please share the logs of zero? (Maybe the predicate move failed.)
The zero logs are attached in the first post.
I have been observing the logs here. One strange thing that I see is
Another thing is that the server which gets a schema update, mutation, or query for a predicate ends up serving it. I see that server 1 received a bunch of schema updates initially and hence ended up serving all those predicates. So you might want to randomize that a bit to distribute the load equally at the start.
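A minimal sketch of the randomization suggested above, assuming a Go client: each predicate's schema update is sent through a randomly chosen server, so the predicates are not all served by the first server that happens to see an alter. The server addresses, predicate names, and dgo import paths are assumptions, not taken from this issue, and may differ for your setup and client version.

```go
package main

import (
	"context"
	"log"
	"math/rand"
	"time"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	rand.Seed(time.Now().UnixNano())

	// One gRPC connection per dgraph server pod (addresses are assumptions).
	addrs := []string{
		"dgraph-server-0-specific:9080",
		"dgraph-server-1-specific:9080",
		"dgraph-server-2-specific:9080",
	}
	var clients []*dgo.Dgraph
	for _, addr := range addrs {
		conn, err := grpc.Dial(addr, grpc.WithInsecure())
		if err != nil {
			log.Fatalf("dial %s: %v", addr, err)
		}
		defer conn.Close()
		clients = append(clients, dgo.NewDgraphClient(api.NewDgraphClient(conn)))
	}

	// Send each predicate's schema through a randomly chosen server so that
	// the predicates are initially spread across groups instead of all
	// landing on whichever server saw the first schema updates.
	schemas := []string{
		"name: string @index(term) .",
		"age: int @index(int) .",
		"friend: uid .",
	}
	ctx := context.Background()
	for _, s := range schemas {
		c := clients[rand.Intn(len(clients))]
		if err := c.Alter(ctx, &api.Operation{Schema: s}); err != nil {
			log.Printf("alter failed: %v", err)
		}
	}
}
```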
Hi @pawanrawal, thanks for coming back. I am using the following deployment in Kubernetes:
Sometimes the predicate move is unsuccessful (I am not getting the error you highlighted above very often, but I do get this):
Having said that, there are situations when the predicates move fine, but the server is still stuck (see #2054).
I am going to spin up a cluster using Kubernetes on AWS and give this a try today.
This error is possible if you had mutations for the predicate going on at that point. Is that always the case? I have not seen a successful predicate move in any of your logs. Do you have a consistent write workload, or could it be that the client you are using is not aborting transactions on an error?
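For reference, a hedged sketch of the client-side pattern being asked about here: always discarding a transaction when a mutation or commit fails, so failed transactions are aborted server-side rather than left pending. The address, payload, and dgo import paths are assumptions and may differ between client versions.

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func upsertPerson(ctx context.Context, dg *dgo.Dgraph) error {
	txn := dg.NewTxn()
	// Discard is a no-op after a successful Commit, but guarantees the
	// transaction is aborted if anything below fails.
	defer txn.Discard(ctx)

	_, err := txn.Mutate(ctx, &api.Mutation{
		SetJson: []byte(`{"name": "Alice", "age": 30}`),
	})
	if err != nil {
		return err
	}
	return txn.Commit(ctx)
}

func main() {
	conn, err := grpc.Dial("dgraph-server-0-specific:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	if err := upsertPerson(context.Background(), dg); err != nil {
		log.Printf("mutation failed (caller decides whether to retry): %v", err)
	}
}
```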
This and the other issue are definitely related to the connection being broken between the nodes. I am investigating why and how that happens. I can see the following logs after
I tried it on a cluster brought up using kops on AWS, with the same machine specs as you have. I did see the intermittent issue with the predicate move not completing and context deadline exceeded, which is most probably a connection issue, but mostly things went smoothly and predicate moves completed. Is your Kubernetes setup in-house? Are you sure that your networking is properly set up? I am just trying to find a setup in which the problem occurs more frequently. I am also updating the nightly binaries to update the size of the tablet being moved, among other things. I would recommend trying with
Hi @pawanrawal:
Hi again. Regarding the predicate moves, it seems that our clients are indeed waiting for zero to move them. However, zero never seems to succeed with its moves (we get more conflicts than deadline exceeded errors in recent tests). We tried using GODEBUG, but we got ~500 MB files for just 10 minutes of log data and couldn't see anything telling around the time of a move: (zero)
(clients)
We then implemented a way to randomise how the predicates are distributed. However, it seems like our group sizes are still very unbalanced: (zero)
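For context, a small sketch of one way to inspect how zero has assigned tablets to groups when checking balance, assuming zero's HTTP /state endpoint (typically on port 6080) is available in the running version. The address is an assumption and the JSON layout can vary by release, so the sketch only pretty-prints the response for eyeballing.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Address is an assumption; adjust to wherever zero's HTTP port is exposed.
	resp, err := http.Get("http://dgraph-zero-0:6080/state")
	if err != nil {
		log.Fatalf("fetching zero state: %v", err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading zero state: %v", err)
	}

	// Pretty-print so the per-group tablet assignments and sizes are easy
	// to compare when deciding whether the groups are balanced.
	var out bytes.Buffer
	if err := json.Indent(&out, body, "", "  "); err != nil {
		log.Fatalf("state response was not JSON: %v", err)
	}
	log.Println(out.String())
}
```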
We have found that our dgraph servers sometimes crash. We initially thought this might be due to resource issues, but we have since upgraded both the CPU and memory specifications for our nodes to no avail. When this does happen, we tend to see only one of the three go down. Interestingly, in a recent test, the pipeline recovered from an initial crash but not from a later one. (zero)
After this last restart, the pipeline gets stuck (the other two servers will still accept alterations and queries, but none of the three will accept mutations; these result in timeouts).
^ This server getting stuck is incredibly frustrating, as we can't get the whole system to run for more than ~1 hr.
Do you have logs before the crash or could you check the
Do you have logs after the restart when this server couldn't become the leader? Since this is the only node serving the group it should be able to become the leader.
Since you have three servers, each serving a different group, if one of them goes down then all mutations which touch the predicates on that server will get stuck. What we have to figure out is why the server could not become a leader after the restart. You could also mitigate this problem by having replicas.
So right now during a predicate move, if there are pending transactions, those are aborted and we do not go ahead with the predicate move. We should go ahead with the predicate move after aborting the pending transactions. I will make that change; it should help rebalance the load for your cluster.
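While that change lands, a hedged sketch of what a client-side retry loop could look like so a continuous ingestion stream tolerates transactions being aborted (for example, to let a predicate move through). The payload, backoff policy, address, and dgo import paths are assumptions, not part of this issue.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

// mutateWithRetry runs the mutation in a fresh transaction on each attempt,
// discarding the failed transaction and backing off before retrying.
func mutateWithRetry(ctx context.Context, dg *dgo.Dgraph, mu *api.Mutation, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		txn := dg.NewTxn()
		if _, err = txn.Mutate(ctx, mu); err == nil {
			if err = txn.Commit(ctx); err == nil {
				return nil
			}
		}
		// Aborted transactions (e.g. during a predicate move) are safe to retry.
		_ = txn.Discard(ctx)
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
	}
	return err
}

func main() {
	conn, err := grpc.Dial("dgraph-server-0-specific:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	mu := &api.Mutation{SetJson: []byte(`{"name": "Bob"}`)}
	if err := mutateWithRetry(context.Background(), dg, mu, 5); err != nil {
		log.Fatalf("giving up after retries: %v", err)
	}
}
```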
Thanks for getting back to us on this. We have managed to save the logs for a server pre-crash. Server event history:
Logs from other servers and zero: logs-server-0.txt
We will next look at adding replicas as suggested. We would also be very interested in the feature that allows predicate moves to be forced through despite our continuous ingestion stream.
Sure, I am on it and will have something for you soon. Could you share details about how you spin up your Kubernetes cluster so that I can replicate this issue?
Deployment file below (slightly different from the one above, as we now have a way to connect to specific servers - dgraph-server-[0-2]-specific - so we can do the load balancing manually as per Lloyd's explanation above). This is spun up in our own private datacentre, on CentOS 7 machines with 8 cores and 16 GB of RAM for each server (of which there are three, as per the deployment). The Kubernetes networking layer is Calico.
Note that before the crash Lloyd mentioned above, memory consumption and CPU usage were low (nowhere near the available resources). The error is included again for your reference, and all four logs are attached above.
Are there any instructions that I can follow to set up this networking layer using Calico? I am just trying to reduce the number of variables here, Kubernetes being one.
Sure: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
It’s step 3.
We see successful predicate moves after #2215. We'll keep testing and see if the crashes or the infinite loops mentioned in #2054 stop as well.
That is good to know. I am interested in this issue (seems like some sort of race condition) and would suggest creating a separate issue for it so that it can be tracked separately.
Version: v1.0.3-dev
Hi,
We are having trouble with our 3-server cluster. In a recent test (started ~18:00 on 27/02/2018), it seems that one of our servers (server-1) used far more resources than the others. They all run on separate nodes (4 cores, 16 GB RAM) that are managed through Kubernetes.
dgraph-server-0 graphs:
dgraph-server-0 logs:
dgraph-server-0_logs.txt
dgraph-server-1 graphs:
dgraph-server-1 logs:
dgraph-server-1_logs.txt
dgraph-server-2 graphs:
dgraph-server-2 logs:
dgraph-server-2_logs.txt
dgraph-zero-0 logs:
dgraph-zero-0_logs.txt
What I notice is that server-1 seems to start at 18:05, server-0 at 18:25, and server-2 at 18:50 (even though they were all deployed at the same time). Is it possible that server-1 took all the load initially and that zero could not rebalance? (I see predicate move errors in zero's logs.)