Excessive memory consumption? #112

Closed
looztra opened this issue Apr 8, 2020 · 14 comments

looztra commented Apr 8, 2020

We are currently experimenting with sloop.

We find it very useful, but it turns out to be very greedy with memory: after less than a full day, it is using 5 GB of memory :(

Is this the expected behaviour?

The last 3 hours:
[screenshot: sloop-last-3hours-2020 04 08-16_19_41]

The last 24 hours:
[screenshot: sloop-last-24hours-2020 04 08-16_20_06]

Here is our current configuration (no memory limit on purpose, to see how much is needed without getting OOM-killed):

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sloop
  labels:
    app.kubernetes.io/name: sloop
spec:
  serviceName: sloop
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sloop
  template:
    metadata:
      labels:
        app.kubernetes.io/name: sloop
    spec:
      containers:
        - args:
            - --config=/sloop-config/sloop.json
          command:
            - /sloop
          image: FIXME/sloop
          name: sloop
          ports:
            - containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          resources:
            limits: {}
            requests:
              memory: 1.5Gi
              cpu: 50m
          volumeMounts:
            - mountPath: /data
              name: sloop-data
            - mountPath: /sloop-config
              name: sloop-config
            - mountPath: /tmp
              name: sloop-tmp
          securityContext:
            allowPrivilegeEscalation: false
            privileged: false
            runAsNonRoot: true
            runAsUser: 100
            runAsGroup: 1000
            readOnlyRootFilesystem: true
      securityContext:
        fsGroup: 1000
      volumes:
        - name: sloop-config
          configMap:
            name: sloop-config
        - name: sloop-tmp
          emptyDir:
            sizeLimit: 100Mi
      serviceAccountName: sloop
      terminationGracePeriodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: sloop-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi

looztra commented Apr 8, 2020

I've restarted the pod and removed the existing data.
Every 30 minutes (the default value for kube-watch-resync-interval), the container consumes another 300 MB of memory.

[screenshot: sloop-1h-2020 04 08-18_28_20]


looztra commented Apr 10, 2020

After some investigation, it seems to be related to Badger.

Running pprof on a local (workstation) sloop process gives:

[Screenshot from 2020-04-10 15-11-05]

Profiling a sloop process running inside a container in the k8s cluster gives:

[Screenshot from 2020-04-10 15-11-58]
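
For reference, here is roughly how such a heap profile can be collected (a sketch only; it assumes the sloop web server exposes the standard net/http/pprof endpoints on its HTTP port, 8080 in the StatefulSet above):

# In one terminal: forward the sloop port from the cluster to the workstation
# (skip this for the local process and point pprof at it directly).
kubectl port-forward statefulset/sloop 8080:8080

# In another terminal: fetch a heap profile and open the pprof web UI on :8081.
go tool pprof -http=:8081 http://localhost:8080/debug/pprof/heap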

This seems related to these Badger issues:

sana-jawad (Collaborator) commented:

Thanks @looztra for raising the issue. We know about this issue and it's related to garbage collection. We are currently working on a fix, which is almost ready. A PR with the fix should be coming next week.

jarifibrahim commented:

Hey @sana-jawad and @looztra, I work on badger and I'm trying to reduce the memory consumption. I have a PR, dgraph-io/badger#1308, which I expect will reduce the memory used by decompression, but I haven't been able to reproduce the high memory usage issue.

It would be very kind of you if you could test my PR in sloop and confirm whether the memory usage is reduced. Alternatively, if you have steps I can follow to reproduce the high memory usage, I'd be happy to do that.


looztra commented Apr 17, 2020

It's hard for me to provide a way to reproduce this without a running Kubernetes cluster.

I'd be happy to test this PR inside sloop (and run it against the cluster I used previously), but as I'm not a Go developer, I'm not sure how to produce a sloop binary that integrates the badger version associated with this PR.

Any hints on the steps needed to do that?

jarifibrahim commented:

@looztra, I can help with that. Please look at https://github.com/salesforce/sloop#build-from-source. Follow all the steps mentioned there, but before you run make, you need to make two changes.

  1. Run
go get -v -u github.com/dgraph-io/badger/v2@0edfe98dbc31621145f8bfe3e7af86bde04bdbb5

This will update the badger version in sloop. If it runs successfully, you should see changes in the go.mod and go.sum files.

  2. We have changed one of the APIs in badger, so make the following change in sloop:
diff --git a/pkg/sloop/store/untyped/store.go b/pkg/sloop/store/untyped/store.go
index 7bb098e..eaa8e7a 100644
--- a/pkg/sloop/store/untyped/store.go
+++ b/pkg/sloop/store/untyped/store.go
@@ -9,11 +9,12 @@ package untyped
 
 import (
 	"fmt"
+	"os"
+	"time"
+
 	badger "github.com/dgraph-io/badger/v2"
 	"github.com/golang/glog"
 	"github.com/salesforce/sloop/pkg/sloop/store/untyped/badgerwrap"
-	"os"
-	"time"
 )
 
 type Config struct {
@@ -51,10 +52,6 @@ func OpenStore(factory badgerwrap.Factory, config *Config) (badgerwrap.DB, error
 		opts = badger.DefaultOptions(config.RootPath)
 	}
 
-	if config.BadgerEnableEventLogging {
-		opts = opts.WithEventLogging(true)
-	}
-
 	if config.BadgerMaxTableSize != 0 {
 		opts = opts.WithMaxTableSize(config.BadgerMaxTableSize)
 	}

After this, you can run make and you will have the latest sloop binary in your $GOPATH/bin.
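
For completeness, the whole sequence might look something like this (a sketch only; it assumes a fresh clone of the repository and that, as described above, make installs the binary into $GOPATH/bin):

# Clone sloop and move into the module.
git clone https://github.com/salesforce/sloop.git
cd sloop

# Pull in the badger commit from the PR (updates go.mod and go.sum).
go get -v -u github.com/dgraph-io/badger/v2@0edfe98dbc31621145f8bfe3e7af86bde04bdbb5

# Apply the store.go change shown above, then build.
make

# Run the resulting binary (the config path here is just a placeholder).
$GOPATH/bin/sloop --config=/path/to/sloop.json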


looztra commented Apr 17, 2020

Thank you very much for the instructions, I was able to build a sloop version with the fix, and it is currently running.
A first memory profile taken 30 minutes after the start is very promising:

[Screenshot from 2020-04-17 16-54-37]

I will wait a few more hours and post new results after that.


looztra commented Apr 17, 2020

Looks good, the memory consumption stays low!

[Screenshot from 2020-04-17 19-15-03]


sana-jawad commented Apr 19, 2020

@jarifibrahim I have tested the PR and it has reduced the memory consumption, thanks for the pointer. I have noticed that the memory consumption is directly proportional to the rate of incoming data. I am going to try setting the badger-keep-l0-in-memory flag to false. Any other pointers that could help reduce memory?

@looztra, try the following values for the sloop flags to reduce memory consumption:

badger-use-lsm-only-options: false
badger-keep-l0-in-memory: false

The PR for keeping sloop's disk size in check when the garbage collection limit is hit is also in review; that is another factor that helps reduce memory consumption.
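
For example, in the StatefulSet from earlier, those options could be passed as extra arguments to the sloop container (a sketch only; it assumes these settings can be supplied as command-line flags alongside, or instead of, the sloop.json config file):

# Excerpt of spec.template.spec from the StatefulSet above.
containers:
  - name: sloop
    command:
      - /sloop
    args:
      - --config=/sloop-config/sloop.json
      - --badger-use-lsm-only-options=false
      - --badger-keep-l0-in-memory=false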


looztra commented Apr 20, 2020

We are especially monitoring the value of container_memory_working_set_bytes, as it is the value watched by the OOM killer.

Without the patch, that value was growing by 300 MB every 2 hours, up to 6 GB.

Now we observe values staying around 300 MB (with the same watchable update count), so we are pretty happy without having to play with the flags you mentioned.

[Screenshot from 2020-04-20 16-51-44]

On the graph, the usage value (container_memory_usage_bytes) is the one that tracks process_resident_memory_bytes most closely.

jarifibrahim commented:

Hey @looztra and @sana-jawad, thank you for testing my PR. It was definitely helpful.

However, in my change I do this:
https://github.com/dgraph-io/badger/blob/0edfe98dbc31621145f8bfe3e7af86bde04bdbb5/table/table.go#L643-L651
which means I take a byte slice from a pool and reduce its length to zero (its capacity stays the same). This zero-length buffer is then passed to snappy. If you look at the code below, you'll notice that snappy allocates a new buffer whenever the destination's length is smaller than the decoded length, and we've given it a zero-length buffer:
https://github.com/golang/snappy/blob/ff6b7dc882cf4cfba7ee0b9f7dcc1ac096c554aa/decode.go#L62-L67
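
A small standalone Go illustration of that snappy behaviour (not sloop or badger code; the payload and buffer size are made up):

package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	payload := []byte("some payload that gets compressed and then decompressed")
	compressed := snappy.Encode(nil, payload)

	// A pooled-style buffer: plenty of capacity, but length reset to zero.
	buf := make([]byte, 0, 4096)

	// len(buf) == 0 is smaller than the decoded length, so snappy.Decode
	// ignores buf's capacity and allocates a fresh slice.
	out1, _ := snappy.Decode(buf, compressed)

	// Passing the full capacity (len == cap) lets snappy reuse the buffer.
	out2, _ := snappy.Decode(buf[:cap(buf)], compressed)

	fmt.Println("zero-length dst reused the buffer:", &out1[0] == &buf[:1][0])   // false
	fmt.Println("full-capacity dst reused the buffer:", &out2[0] == &buf[:1][0]) // true
}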

So my PR shouldn't cause any reduction in memory usage. The reduction in memory was because of commit dgraph-io/badger@c3333a5, which disabled compression by default in badger.

I noticed that the go.mod in sloop is using badger v2.0.0. We've released v2.0.3, which disables compression by default. So the code that @looztra and @sana-jawad tested isn't using compression, hence the low memory usage.

I would suggest updating badger in sloop to the latest version.
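
Something along the lines of the earlier go get step should do it (a sketch only; v2.0.3 is the release mentioned above, so substitute whatever the latest badger v2 tag is at the time):

go get github.com/dgraph-io/badger/v2@v2.0.3
go mod tidy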

sana-jawad (Collaborator) commented:

Thanks @jarifibrahim. Yes, the upgrade to 2.0.2 was already in review; I will update it to move to 2.0.3.


looztra commented May 24, 2020

For the record, the latest information in the README about memory tuning for the most recently published version was really useful: we can now run sloop within the memory limit we chose (1Gi) without having to lower maxLookBack to 1h.
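
For reference, this is roughly what the resources block of the StatefulSet above looks like with that limit in place (a sketch; the request value is just an example and should not exceed the limit):

resources:
  limits:
    memory: 1Gi
  requests:
    memory: 1Gi
    cpu: 50m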

sana-jawad (Collaborator) commented:

That's great to know, @looztra!
