
Unbalanced server load #2172

Closed
Levatius opened this issue Feb 28, 2018 · 21 comments
Labels: kind/bug Something is broken.

Levatius commented Feb 28, 2018

Version: v1.0.3-dev

Hi,
We are having trouble with our 3-server cluster. In a recent test (started ~18:00 27/02/2018), it seems that one of our servers (server-1) used far more resources than the others. They all run on separate nodes (4 cores, 16 GB RAM) that are managed through Kubernetes.

dgraph-server-0 graphs: [image]
dgraph-server-0 logs: dgraph-server-0_logs.txt

dgraph-server-1 graphs: [image]
dgraph-server-1 logs: dgraph-server-1_logs.txt

dgraph-server-2 graphs: [image]
dgraph-server-2 logs: dgraph-server-2_logs.txt

dgraph-zero-0 logs: dgraph-zero-0_logs.txt

What I notice is that server-1 seems to start at 18:05, server-0 at 18:25, and server-2 at 18:50 (even though they were all deployed at the same time). Is it possible that server-1 took all the load initially and that Zero could not rebalance? (I see predicate move errors in Zero's logs.)

Levatius (Author):

@jimanvlad

janardhan1993 (Contributor):

Most probably your queries are using the predicates served by group 1. Zero balances based on size.
Can you please share the logs of Zero? (Maybe a predicate move failed.)
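
For reference, pulling those logs straight from the pod might look like this (a sketch; the pod name dgraph-zero-0 follows from the StatefulSet naming used in this thread):

kubectl logs dgraph-zero-0 > dgraph-zero-0_logs.txt   # save Zero's stdout/stderr to a file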

jimanvlad commented Feb 28, 2018 via email

pawanrawal (Contributor):

I have been observing the logs here. One strange thing that I see is:

Groups sorted by size: [{gid:1 size:57783} {gid:3 size:38972087} {gid:2 size:221618040}]

2018/02/27 19:41:17 tablet.go:170: size_diff 221560257
2018/02/27 19:41:17 tablet.go:87: Going to move predicate _predicate_ from 2 to 1
2018/02/27 20:01:17 tablet.go:91: Error while trying to move predicate _predicate_ from 2 to 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded

So _predicate_ could have had a max size of around 221 MB and couldn't be moved in 20 mins (the timeout for a predicate move), which makes me wonder whether your nodes can communicate with each other. Are your queries which touch predicates on multiple machines working fine? How and where is your Kubernetes cluster set up? I have tried replicating this in a cluster but haven't been able to.
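
As an aside, a quick TCP-level sanity check between the server pods might look like this (a sketch; the pod and service names come from the manifests in this thread, and it assumes the image ships bash and timeout, which the manifests' bash commands suggest it does):

# From server-0, try the internal gRPC port (7080) on server-1; bash's
# /dev/tcp gives a plain TCP connect test without extra tooling in the image.
kubectl exec dgraph-server-0 -- bash -c \
  'timeout 2 bash -c "</dev/tcp/dgraph-server-1.dgraph-server.default.svc.cluster.local/7080" && echo reachable || echo unreachable'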

Another thing is that the server which first gets a schema update, mutation, or query for a predicate ends up serving it. I see that server 1 received a bunch of schema updates initially and hence ended up serving all those predicates. So you might want to randomize that a bit to distribute the load equally initially.
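
A hedged sketch of that randomization from the shell, assuming per-server HTTP endpoints like the "specific" services posted later in this thread (the node address and nodePorts here are placeholders, not from the original):

# Hypothetical: send each initial /alter call to a randomly chosen server,
# so no single server ends up owning all the predicates.
ports=(30014 30015 30016)        # per-server HTTP nodePorts (placeholders)
port=${ports[$RANDOM % 3]}
curl -s "http://<node-address>:${port}/alter" -d 'name: string @index(term) .'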

@pawanrawal pawanrawal added the investigate Requires further investigation label Mar 1, 2018
@janardhan1993 janardhan1993 self-assigned this Mar 1, 2018
jimanvlad:

Hi @pawanrawal, thanks for coming back.

I am using the following deployment in Kubernetes:

########## Services
##### Public
# Zero Public
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
spec:
  type: LoadBalancer
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
    nodePort: 30006
  - port: 6080
    targetPort: 6080
    name: zero-http
    nodePort: 30005
  selector:
    app: dgraph-zero
---
# Server Public
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-public
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: server-http
    nodePort: 30003
  - port: 9080
    targetPort: 9080
    name: server-grpc
    nodePort: 30004
  selector:
    app: dgraph-server
---
# Ratel Public
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel-public
  labels:
    app: dgraph-ratel
spec:
  type: LoadBalancer
  ports:
  - port: 8000
    targetPort: 8000
    name: ratel-http
    nodePort: 30007
  selector:
    app: dgraph-ratel
---
##### Headless
# Zero Headless
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  clusterIP: None
  selector:
    app: dgraph-zero
---
# Server Headless
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server
  labels:
    app: dgraph-server
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: server-grpc-int
  clusterIP: None
  selector:
    app: dgraph-server
---
##### Specific
# Server 0
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-0-http-public
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-http
    nodePort: 30011
  selector:
    statefulset.kubernetes.io/pod-name: dgraph-server-0
---
# Server 1
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-1-http-public
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-http
    nodePort: 30012
  selector:
    statefulset.kubernetes.io/pod-name: dgraph-server-1
---
# Server 2
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-2-http-public
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-http
    nodePort: 30013
  selector:
    statefulset.kubernetes.io/pod-name: dgraph-server-2
---
########## StatefulSets
# Zero - StatefulSet
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  selector:
    matchLabels:
      app: dgraph-zero
  serviceName: "dgraph-zero"
  replicas: 1
  template:
    metadata:
      labels:
        app: dgraph-zero
        type: zero
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: type
                operator: In
                values:
                - zero
            topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:latest
        ports:
        - containerPort: 6080
          name: zero-http
        - containerPort: 5080
          name: zero-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph zero --my=$(hostname -f):5080
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        hostPath:
          path: /opt/dgraph
---
# Server - StatefulSet
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-server
spec:
  selector:
    matchLabels:
      app: dgraph-server
  serviceName: "dgraph-server"
  replicas: 3
  template:
    metadata:
      labels:
        app: dgraph-server
        type: server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: type
                operator: In
                values:
                - server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: dgraph/dgraph:latest
        ports:
        - containerPort: 7080
          name: server-grpc-int
        - containerPort: 8080
          name: server-http
        - containerPort: 9080
          name: server-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph server --my=$(hostname -f):7080 --memory_mb 8192 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        hostPath:
          path: /opt/dgraph
---
# Ratel Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dgraph-ratel-deployment
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  replicas: 1
  template:
    metadata:
      labels:
        app: dgraph-ratel
    spec:
      containers:
      - name: dgraph-ratel
        image: dgraph/dgraph:latest
        ports:
        - containerPort: 8000
          name: ratel-http
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph-ratel -port 8000 -addr gb1-li-cortex-001.io.thehut.local:30003

Sometimes the predicate move is unsuccessful. I'm not getting the error you highlighted above very often, but I do get this:

2018/03/01 17:45:01 tablet.go:178: size_diff 2445967
2018/03/01 17:45:01 tablet.go:178: size_diff 1468433
2018/03/01 17:45:01 tablet.go:71: Going to move predicate _ip_xid from 1 to 3
2018/03/01 17:45:01 tablet.go:64: Error while trying to move predicate _ip_xid from 1 to 3: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.
2018/03/01 17:53:01 tablet.go:173: 

Having said that, there are situations where the predicates move fine but the server is still stuck (see #2054).

pawanrawal (Contributor):

I am going to spin up a cluster using Kubernetes on AWS and give this a try today.

2018/03/01 17:45:01 tablet.go:64: Error while trying to move predicate _ip_xid from 1 to 3: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.

This error is possible if you had mutations for the predicate going on at that point. Is that always the case? I have not seen a successful predicate move in any of your logs. Do you have a constant write workload, or could it be that the client you are using is not aborting transactions on error?

pawanrawal (Contributor):

This and the other issue are definitely related to the connection being broken between the nodes. I am investigating why and how that happens. I can see the following logs after dgraph live has been running for a bit on a Kubernetes cluster.

Total Txns done:      339 RDFs per second:    8932 Time Elapsed: 6m24s, Aborts: 221
Total Txns done:      339 RDFs per second:    8886 Time Elapsed: 6m26s, Aborts: 221
Total Txns done:      339 RDFs per second:    8840 Time Elapsed: 6m28s, Aborts: 221
2018/03/02 11:30:22 batch.go:133: Error while mutating rpc error: code = Unavailable desc = transport is closing
2018/03/02 11:30:22 batch.go:133: Error while mutating rpc error: code = Unavailable desc = transport is closing
Total Txns done:      339 RDFs per second:    8821 Time Elapsed: 6m30s, Aborts: 223
2018/03/02 11:30:23 batch.go:133: Error while mutating rpc error: code = Unknown desc = No connection exists
2018/03/02 11:30:24 batch.go:133: Error while mutating rpc error: code = Unknown desc = No connection exists
2018/03/02 11:30:24 batch.go:133: Error while mutating rpc error: code = Unknown desc = No connection exists

pawanrawal commented Mar 2, 2018

I tried it on a cluster brought up using kops on AWS, with the same machine specs as yours. I did see the intermittent issue with the predicate move not completing (context deadline exceeded), which is most probably a connection issue, but mostly things went smoothly and predicate moves completed.

Is your Kubernetes setup in-house? Are you sure that your networking is properly set up? I am just trying to find a setup in which the problem occurs more frequently. I am also updating the nightly binaries to report the size of the tablet being moved, among other things. I would recommend trying with dgraph/dgraph:master then.
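
If it helps, the running StatefulSets can be switched to the master image without re-applying the whole manifest (a sketch; the container names server and zero come from the manifests in this thread, and with apps/v1beta1 StatefulSets the default update strategy is OnDelete, so the pods may need deleting to pick up the new image):

kubectl set image statefulset/dgraph-server server=dgraph/dgraph:master
kubectl set image statefulset/dgraph-zero zero=dgraph/dgraph:master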

jimanvlad:

Hi @pawanrawal:

  1. Yes, we are continuously writing to the graph; that's the whole point of what we're trying to build. I would expect Dgraph to pause ingestion for a bit until it moves predicates around, and then continue operation. Is this not the case?

  2. For the below error, do we know the cause?

2018/03/01 19:45:06 oracle.go:381: Error while fetching minTs from group 1, err: rpc error: code = Unavailable desc = transport is closing

  3. You're saying you do see intermittent issues with a similar set-up in AWS. Is that normal? Why would it happen?

  4. I will try with :master.

  5. Yes, we spin up our Kubernetes cluster in our own data centre. What other tests can I perform, or what other information can I give you, to confirm whether this is a networking issue? What I don't understand is how it works 'some of the time'... If there were a fundamental networking issue, you'd think that it wouldn't work at all.

@pawanrawal
Copy link
Contributor

pawanrawal commented Mar 5, 2018

  1. Yeah, that is exactly what happens.

  2. This error would happen if the grpc connection between Zero and group 1 server was closed.

  3. I saw the issue with a server which had 4 GB RAM; the cause was that some containers had restarted after going OOM. I tried this on a larger machine (16 GB) and didn't face any issues.

  4. Ok.

  5. What is interesting is that the predicate move never worked (from your logs), but the servers are able to communicate otherwise (we have an Echo GRPC which is working fine). Can you add the following environment variable to your dgraph server pods and share the logs? I am interested in a log from when the predicate move fails with the context deadline exceeded error.

        env:
        - name: GODEBUG
          value: http2debug=2
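
If editing the manifest is inconvenient, the same variable can be set on the StatefulSet directly (a sketch; note again that with apps/v1beta1 the default update strategy is OnDelete, so the pods may need to be deleted and recreated to pick it up):

kubectl set env statefulset/dgraph-server GODEBUG=http2debug=2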

Levatius commented Mar 7, 2018

Hi again,

Regarding the predicate moves, it seems that our clients are indeed waiting for Zero to move them. However, Zero never seems to succeed with its moves (we get more conflicts than deadline-exceeded errors in recent tests). We tried using GODEBUG, but we got ~500 MB of log files for just 10 minutes and couldn't see anything telling around the time of a move:

(zero)

Groups sorted by size: [{gid:3 size:276244} {gid:1 size:1071332} {gid:2 size:9993521}]

2018/03/07 15:14:00 tablet.go:170: size_diff 9717277
2018/03/07 15:14:00 tablet.go:87: Going to move predicate _predicate_ from 2 to 3
2018/03/07 15:14:00 oracle.go:84: purging below ts:5317, len(o.commits):135, len(o.aborts):12
2018/03/07 15:14:00 tablet.go:91: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.

(clients)

Predicate is being moved, please retry later


We then implemented a way to randomise how the predicates are distributed. However, it seems that our group sizes are still very unbalanced:

(zero)

Groups sorted by size: [{gid:3 size:4320255} {gid:1 size:20623424} {gid:2 size:140097589}]


We have found that our Dgraph servers sometimes crash. We initially thought this might be due to resource issues, but we have since upgraded both the CPU and memory specifications of our nodes, to no avail. When this does happen, we tend to see only one of the three go down.

Interestingly, in a recent test, the pipeline recovered from an initial crash but not from a later one.
In the second crash, it seems that the recovering server could not become the leader of the group it had before (that group being gid:2, the largest):

(zero)

2018/03/07 16:16:15 zero.go:322: Got connection request: id:2 addr:"dgraph-server-1.dgraph-server.default.svc.cluster.local:7080"

2018/03/07 16:16:15 zero.go:419: Connected
2018/03/07 16:18:00 tablet.go:165:

Groups sorted by size: [{gid:3 size:4320255} {gid:1 size:20623424} {gid:2 size:140097589}]

2018/03/07 16:18:00 tablet.go:170: size_diff 135777334
2018/03/07 16:18:00 tablet.go:87: Going to move predicate _predicate_ from 2 to 3
2018/03/07 16:18:00 tablet.go:91: Error while trying to move predicate _predicate_ from 2 to 3: rpc error: code = Unknown desc = Server is not leader of this group

[image]

After this last restart, the pipeline gets stuck: the other two servers will still accept alterations and queries, but none of the three will accept mutations (these result in timeouts).

jimanvlad:

^ This server getting stuck is incredibly frustrating, as we can't get the whole system to run for more than ~1 hour.

@manishrjain manishrjain added kind/bug Something is broken. and removed investigate Requires further investigation labels Mar 7, 2018
pawanrawal (Contributor):

We have found that our dgraph servers sometimes crash.

Do you have logs from before the crash, or could you check the pod for the reason for the crash?
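
For example (a sketch, assuming the crashed pod is dgraph-server-1):

kubectl logs dgraph-server-1 --previous    # output of the crashed container
kubectl describe pod dgraph-server-1       # Last State / Reason (e.g. OOMKilled)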

In the second crash, it seems that the recovering server could not become the leader of the group it had before (that group being gid:2, the largest):

Do you have logs from after the restart, when this server couldn't become the leader? Since this is the only node serving the group, it should be able to become the leader.

After this last restart, the pipeline gets stuck (the other two servers will still accept alterations and queries but none will accept mutations [results in timeouts]).

Since you have three servers all serving different groups, if one of them goes down then all mutations which touch the predicates on that server will get stuck. What we have to figure out is why the server could not become a leader after a restart. You could also mitigate this problem by having replicas (see the sketch below).
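
For instance (a minimal sketch, not tuned for your setup): with three servers, starting Zero with --replicas 3 would make all three serve the same group, so one server restarting no longer stalls mutations for that group's predicates.

# In the Zero StatefulSet command:
dgraph zero --my=$(hostname -f):5080 --replicas 3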

pawanrawal (Contributor):

So right now, during a predicate move, if there are pending transactions they are aborted and we do not go ahead with the move. We should instead go ahead with the predicate move after aborting the pending transactions. I will make that change; it should help rebalance the load on your cluster.

Levatius commented Mar 8, 2018

Thanks for getting back to us on this. We have managed to save the logs for a server pre-crash:

logs-server-1.txt

Server event history:

  • Initially starts at 15:54:23 and becomes leader of group 2.
  • Crashes at 16:24:25; what we see in the logs:

2018/03/08 16:24:25 attr: "_programme_created" groupId: 1 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef
/home/travis/gopath/src/github.com/dgraph-io/dgraph/x/error.go:67
github.com/dgraph-io/dgraph/worker.(*grpcWorker).ServeTask
/home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/task.go:1250
github.com/dgraph-io/dgraph/protos/intern._Worker_ServeTask_Handler
/home/travis/gopath/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2563
google.golang.org/grpc.(*Server).processUnaryRPC
/home/travis/gopath/src/google.golang.org/grpc/server.go:900
google.golang.org/grpc.(*Server).handleStream
/home/travis/gopath/src/google.golang.org/grpc/server.go:1122
google.golang.org/grpc.(*Server).serveStreams.func1.1
/home/travis/gopath/src/google.golang.org/grpc/server.go:617
runtime.goexit
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/runtime/asm_amd64.s:2337

  • Restarts at 16:24:29 and seems to become a follower of group 2, even though the group has no leader.

Logs from other servers and zero:

logs-server-0.txt
logs-server-2.txt
logs-zero.txt

We will next look at adding replicas, as suggested. We would be very interested in the feature that allows predicate moves to be forced through despite our continuous ingestion stream.

pawanrawal (Contributor):

We would be very interested in this feature that allows the predicate moves to be forced through our continuous ingestion stream.

Sure, I am on it and will have something for you soon. Could you share details about how you spin up your kubernetes cluster so that I could replicate this issue?

jimanvlad:

Deployment file below (slightly different from the one above, as we now have a way to connect to specific servers via dgraph-server-[0-2]-specific, so we can do the load balancing manually as per Lloyd's explanation above).

This is spun up in our own private datacentre, on CentOS 7 machines with 8 cores and 16 GB RAM for each server (of which there are three, as per the deployment). Kubernetes' networking layer is Calico.

########## Services
##### Public
# Zero Public
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
spec:
  type: LoadBalancer
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
    nodePort: 30006
  - port: 6080
    targetPort: 6080
    name: zero-http
    nodePort: 30005
  selector:
    app: dgraph-zero
---
# Server Public
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-public
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: server-http
    nodePort: 30003
  - port: 9080
    targetPort: 9080
    name: server-grpc
    nodePort: 30004
  selector:
    app: dgraph-server
---
# Ratel Public
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel-public
  labels:
    app: dgraph-ratel
spec:
  type: LoadBalancer
  ports:
  - port: 8000
    targetPort: 8000
    name: ratel-http
    nodePort: 30007
  selector:
    app: dgraph-ratel
---
##### Headless
# Zero Headless
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  clusterIP: None
  selector:
    app: dgraph-zero
---
# Server Headless
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server
  labels:
    app: dgraph-server
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: server-grpc-int
  clusterIP: None
  selector:
    app: dgraph-server
---
##### Specific
# Server 0
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-0-specific
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-grpc
    nodePort: 30011
  - port: 8080
    targetPort: 8080
    name: server-http
    nodePort: 30014
  selector:
    statefulset.kubernetes.io/pod-name: dgraph-server-0
---
# Server 1
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-1-specific
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-grpc
    nodePort: 30012
  - port: 8080
    targetPort: 8080
    name: server-http
    nodePort: 30015
  selector:
    statefulset.kubernetes.io/pod-name: dgraph-server-1
---
# Server 2
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-2-specific
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-grpc
    nodePort: 30013
  - port: 8080
    targetPort: 8080
    name: server-http
    nodePort: 30016
  selector:
    statefulset.kubernetes.io/pod-name: dgraph-server-2
---
########## StatefulSets
# Zero - StatefulSet
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  selector:
    matchLabels:
      app: dgraph-zero
  serviceName: "dgraph-zero"
  replicas: 1
  template:
    metadata:
      labels:
        app: dgraph-zero
        type: zero
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "kubernetes.io/hostname"
                operator: In
                values:
                - gb1-li-cortex-007
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: type
                operator: In
                values:
                - zero
            topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:master
        ports:
        - containerPort: 6080
          name: zero-http
        - containerPort: 5080
          name: zero-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph zero --my=$(hostname -f):5080 |& tee -a /dgraph/logs_zero.txt
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        hostPath:
          path: /opt/dgraph
---
# Server - StatefulSet
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-server
spec:
  selector:
    matchLabels:
      app: dgraph-server
  serviceName: "dgraph-server"
  replicas: 3
  template:
    metadata:
      labels:
        app: dgraph-server
        type: server
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "kubernetes.io/hostname"
                operator: In
                values:
                - gb1-li-cortex-003
                - gb1-li-cortex-004
                - gb1-li-cortex-005
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: type
                operator: In
                values:
                - server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: dgraph/dgraph:master
        ports:
        - containerPort: 7080
          name: server-grpc-int
        - containerPort: 8080
          name: server-http
        - containerPort: 9080
          name: server-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph server --my=$(hostname -f):7080 --memory_mb 16384 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 --posting_tables memorymap |& tee -a /dgraph/logs_server.txt
        resources:
          requests:
            memory: 24Gi
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        hostPath:
          path: /opt/dgraph
---
# Ratel Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dgraph-ratel-deployment
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  replicas: 1
  template:
    metadata:
      labels:
        app: dgraph-ratel
    spec:
      containers:
      - name: dgraph-ratel
        image: dgraph/dgraph:master
        ports:
        - containerPort: 8000
          name: ratel-http
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph-ratel -port 8000 -addr gb1-li-cortex-001.io.thehut.local:30003

Note that before the crash that Lloyd mentioned above, memory consumption and CPU usage were low (nowhere near the available resources).

The error again for your reference; all 4 logs are attached above.

2018/03/08 16:24:25 attr: "_programme_created" groupId: 1 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef
/home/travis/gopath/src/github.com/dgraph-io/dgraph/x/error.go:67
github.com/dgraph-io/dgraph/worker.(*grpcWorker).ServeTask
/home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/task.go:1250
github.com/dgraph-io/dgraph/protos/intern._Worker_ServeTask_Handler
/home/travis/gopath/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2563
google.golang.org/grpc.(*Server).processUnaryRPC
/home/travis/gopath/src/google.golang.org/grpc/server.go:900
google.golang.org/grpc.(*Server).handleStream
/home/travis/gopath/src/google.golang.org/grpc/server.go:1122
google.golang.org/grpc.(*Server).serveStreams.func1.1
/home/travis/gopath/src/google.golang.org/grpc/server.go:617
runtime.goexit
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/runtime/asm_amd64.s:2337

pawanrawal (Contributor):

Are there any instructions that I can follow to set up this networking layer using Calico? I am just trying to reduce the number of variables here, Kubernetes being one.

jimanvlad commented Mar 9, 2018 via email

jimanvlad commented Mar 13, 2018

We see successful predicate moves after #2215. We'll keep testing and see if the crashes or the infinite loops mentioned in #2054 stop as well.

Groups sorted by size: [{gid:3 size:24968897} {gid:1 size:32250160} {gid:2 size:35334754}]

2018/03/13 10:35:02 tablet.go:188: size_diff 10365857
2018/03/13 10:35:02 tablet.go:78: Going to move predicate: [_elysium_account_xid], size: [95 kB] from group 2 to 3
2018/03/13 10:35:03 tablet.go:113: Predicate move done for: [_elysium_account_xid] from group 2 to 3
2018/03/13 10:38:12 raft.go:556: While applying proposal: Tablet is already being served
2018/03/13 10:38:12 raft.go:556: While applying proposal: Tablet is already being served
2018/03/13 10:43:02 tablet.go:183: 

pawanrawal (Contributor):

That is good to know. I am also interested in the crash below (it seems like some sort of race condition) and would suggest creating a separate issue for it so that it can be tracked separately:

2018/03/08 16:24:25 attr: "_programme_created" groupId: 1 Request sent to wrong server.
github.com/dgraph-io/dgraph/x.AssertTruef
/home/travis/gopath/src/github.com/dgraph-io/dgraph/x/error.go:67
github.com/dgraph-io/dgraph/worker.(*grpcWorker).ServeTask
/home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/task.go:1250
github.com/dgraph-io/dgraph/protos/intern._Worker_ServeTask_Handler
/home/travis/gopath/src/github.com/dgraph-io/dgraph/protos/intern/internal.pb.go:2563
google.golang.org/grpc.(*Server).processUnaryRPC
/home/travis/gopath/src/google.golang.org/grpc/server.go:900
google.golang.org/grpc.(*Server).handleStream
/home/travis/gopath/src/google.golang.org/grpc/server.go:1122
google.golang.org/grpc.(*Server).serveStreams.func1.1
/home/travis/gopath/src/google.golang.org/grpc/server.go:617
runtime.goexit
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/runtime/asm_amd64.s:2337

@manishrjain manishrjain added the kind/bug Something is broken. label Mar 21, 2018