Excessive memory consumption? #112

Closed
looztra opened this issue Apr 8, 2020 · 14 comments

looztra commented Apr 8, 2020

We are currently experimenting with sloop.

We find it very useful, but it turns out to be very greedy with memory: after less than a full day, it is using 5 GB of memory :(

Is this the expected behaviour?

The last 3 hours:
[screenshot: sloop-last-3hours-2020 04 08-16_19_41]

The last 24 hours:
[screenshot: sloop-last-24hours-2020 04 08-16_20_06]

Here is our current configuration (no memory limit on purpose, to see how much is needed without getting OOM-killed):

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sloop
  labels:
    app.kubernetes.io/name: sloop
spec:
  serviceName: sloop
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sloop
  template:
    metadata:
      labels:
        app.kubernetes.io/name: sloop
    spec:
      containers:
        - args:
            - --config=/sloop-config/sloop.json
          command:
            - /sloop
          image: FIXME/sloop
          name: sloop
          ports:
            - containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          resources:
            limits: {}
            requests:
              memory: 1.5Gi
              cpu: 50m
          volumeMounts:
            - mountPath: /data
              name: sloop-data
            - mountPath: /sloop-config
              name: sloop-config
            - mountPath: /tmp
              name: sloop-tmp
          securityContext:
            allowPrivilegeEscalation: false
            privileged: false
            runAsNonRoot: true
            runAsUser: 100
            runAsGroup: 1000
            readOnlyRootFilesystem: true
      securityContext:
        fsGroup: 1000
      volumes:
        - name: sloop-config
          configMap:
            name: sloop-config
        - name: sloop-tmp
          emptyDir:
            sizeLimit: 100Mi
      serviceAccountName: sloop
      terminationGracePeriodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: sloop-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi

looztra commented Apr 8, 2020

I've restarted the pod and removed the existing data.
Every 30 minutes (the default value for kube-watch-resync-interval), the container consumes another 300 MB of memory.

[screenshot: sloop-1h-2020 04 08-18_28_20]


looztra commented Apr 10, 2020

After some investigation, it seems to be related to Badger.

Running pprof on a local (workstation) sloop process gives:

[Screenshot from 2020-04-10 15-11-05]

Profiling a sloop process running inside a container in the k8s cluster gives:

[Screenshot from 2020-04-10 15-11-58]
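
For reference, here is roughly how such a heap profile can be collected (a sketch only; it assumes the sloop web server exposes the standard net/http/pprof endpoints on its HTTP port, 8080 in the StatefulSet above):

# In one terminal: forward the sloop port from the cluster to the workstation
# (skip this for the local process and point pprof at it directly).
kubectl port-forward statefulset/sloop 8080:8080

# In another terminal: fetch a heap profile and open the pprof web UI on :8081.
go tool pprof -http=:8081 http://localhost:8080/debug/pprof/heap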

This seems related to these Badger issues:

sana-jawad (Collaborator) commented:

Thanks @looztra for raising the issue. We know about this issue and it's related to garbage collection. We are currently working on a fix, which is almost ready. A PR with the fix should be coming next week.

jarifibrahim commented:

Hey @sana-jawad and @looztra, I work on badger and I'm trying to reduce the memory consumption. I have a PR, dgraph-io/badger#1308, which I expect will reduce the memory used by decompression, but I haven't been able to reproduce the high memory usage issue.

It would be very kind of you if you could test my PR in sloop and confirm whether the memory usage is reduced. Alternatively, if you have steps I can follow to reproduce the high memory usage, I'd be happy to do that.


looztra commented Apr 17, 2020

It's hard for me to provide a way to reproduce this without a running Kubernetes cluster.

I'd be happy to test this PR inside sloop (and run it against the cluster I used previously), but as I'm not a Go developer, I'm not sure how to produce a sloop binary that integrates the badger version associated with this PR.

Any hints on the steps needed to do that?

jarifibrahim commented:

@looztra, I can help with that. Please look at https://github.com/salesforce/sloop#build-from-source. Follow all the steps mentioned there, but before you run make, you need to make two changes.

  1. Run
go get -v -u github.com/dgraph-io/badger/v2@0edfe98dbc31621145f8bfe3e7af86bde04bdbb5

This will update the badger version in sloop. If it runs successfully, you should see changes in the go.mod and go.sum files.

  2. We have changed one of the APIs in badger, so make the following change in sloop:
diff --git a/pkg/sloop/store/untyped/store.go b/pkg/sloop/store/untyped/store.go
index 7bb098e..eaa8e7a 100644
--- a/pkg/sloop/store/untyped/store.go
+++ b/pkg/sloop/store/untyped/store.go
@@ -9,11 +9,12 @@ package untyped
 
 import (
 	"fmt"
+	"os"
+	"time"
+
 	badger "github.com/dgraph-io/badger/v2"
 	"github.com/golang/glog"
 	"github.com/salesforce/sloop/pkg/sloop/store/untyped/badgerwrap"
-	"os"
-	"time"
 )
 
 type Config struct {
@@ -51,10 +52,6 @@ func OpenStore(factory badgerwrap.Factory, config *Config) (badgerwrap.DB, error
 		opts = badger.DefaultOptions(config.RootPath)
 	}
 
-	if config.BadgerEnableEventLogging {
-		opts = opts.WithEventLogging(true)
-	}
-
 	if config.BadgerMaxTableSize != 0 {
 		opts = opts.WithMaxTableSize(config.BadgerMaxTableSize)
 	}

After this, you can run make and you will have the latest sloop binary in your $GOPATH/bin.
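
For completeness, the whole sequence might look something like this (a sketch only; it assumes a fresh clone of the repository and that, as described above, make installs the binary into $GOPATH/bin):

# Clone sloop and move into the module.
git clone https://github.com/salesforce/sloop.git
cd sloop

# Pull in the badger commit from the PR (updates go.mod and go.sum).
go get -v -u github.com/dgraph-io/badger/v2@0edfe98dbc31621145f8bfe3e7af86bde04bdbb5

# Apply the store.go change shown above, then build.
make

# Run the resulting binary (the config path here is just a placeholder).
$GOPATH/bin/sloop --config=/path/to/sloop.json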


looztra commented Apr 17, 2020

Thank you very much for the instructions, I was able to build a sloop version with the fix, and it is currently running.
A first memory profile taken 30 minutes after the start is very promising:

[Screenshot from 2020-04-17 16-54-37]

I will wait a few more hours and post new results after that.


looztra commented Apr 17, 2020

Looks good, the memory consumption stays low!

[Screenshot from 2020-04-17 19-15-03]


sana-jawad commented Apr 19, 2020

@jarifibrahim I have tested the PR and it has reduced the memory consumption, thanks for the pointer. I have noticed that the memory consumption is directly proportional to the rate of incoming data. I am going to try setting the badger-keep-l0-in-memory flag to false. Any other pointers that could help reduce memory?

@looztra, try the following values for the sloop flags to reduce memory consumption:

badger-use-lsm-only-options: false
badger-keep-l0-in-memory: false

The PR for keeping sloop's disk size in check when the garbage collection limit is hit is also in review; that is another factor that helps reduce memory consumption.
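
For example, in the StatefulSet from earlier, those options could be passed as extra arguments to the sloop container (a sketch only; it assumes these settings can be supplied as command-line flags alongside, or instead of, the sloop.json config file):

# Excerpt of spec.template.spec from the StatefulSet above.
containers:
  - name: sloop
    command:
      - /sloop
    args:
      - --config=/sloop-config/sloop.json
      - --badger-use-lsm-only-options=false
      - --badger-keep-l0-in-memory=false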


looztra commented Apr 20, 2020

We are especially monitoring the value of container_memory_working_set_bytes, as it is the value watched by the OOM killer.

Without the patch, that value was growing by 300 MB every 2 hours, up to 6 GB.

Now we observe values staying around 300 MB (with the same watchable update count), so we are pretty happy without having to play with the flags you mentioned.

[Screenshot from 2020-04-20 16-51-44]

On the graph, the usage value (container_memory_usage_bytes) is the one that tracks process_resident_memory_bytes most closely.

jarifibrahim commented:

Hey @looztra and @sana-jawad, thank you for testing my PR. It was definitely helpful.

However, in my change I do this:
https://github.com/dgraph-io/badger/blob/0edfe98dbc31621145f8bfe3e7af86bde04bdbb5/table/table.go#L643-L651
which means I take a byte slice from a pool and reduce its length to zero (its capacity stays the same). This zero-length buffer is then passed to snappy. If you look at the code below, you'll notice that snappy allocates a new buffer whenever the destination's length is smaller than the decoded length, and we've given it a zero-length buffer:
https://github.com/golang/snappy/blob/ff6b7dc882cf4cfba7ee0b9f7dcc1ac096c554aa/decode.go#L62-L67
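
A small standalone Go illustration of that snappy behaviour (not sloop or badger code; the payload and buffer size are made up):

package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	payload := []byte("some payload that gets compressed and then decompressed")
	compressed := snappy.Encode(nil, payload)

	// A pooled-style buffer: plenty of capacity, but length reset to zero.
	buf := make([]byte, 0, 4096)

	// len(buf) == 0 is smaller than the decoded length, so snappy.Decode
	// ignores buf's capacity and allocates a fresh slice.
	out1, _ := snappy.Decode(buf, compressed)

	// Passing the full capacity (len == cap) lets snappy reuse the buffer.
	out2, _ := snappy.Decode(buf[:cap(buf)], compressed)

	fmt.Println("zero-length dst reused the buffer:", &out1[0] == &buf[:1][0])   // false
	fmt.Println("full-capacity dst reused the buffer:", &out2[0] == &buf[:1][0]) // true
}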

So my PR shouldn't cause any reduction in memory usage. The reduction in memory was because of commit dgraph-io/badger@c3333a5, which disabled compression by default in badger.

I noticed that the go.mod in sloop is using badger v2.0.0. We've released v2.0.3, which disables compression by default. So the code that @looztra and @sana-jawad tested isn't using compression, hence the low memory usage.

I would suggest updating badger in sloop to the latest version.
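
Something along the lines of the earlier go get step should do it (a sketch only; v2.0.3 is the release mentioned above, so substitute whatever the latest badger v2 tag is at the time):

go get github.com/dgraph-io/badger/v2@v2.0.3
go mod tidy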

sana-jawad (Collaborator) commented:

Thanks @jarifibrahim. Yes, the upgrade to 2.0.2 was already in review; I will update it to move to 2.0.3.


looztra commented May 24, 2020

For the record, the latest information in the README about memory tuning for the most recently published version was really useful: we can now run sloop within the memory limit we chose (1Gi) without having to lower maxLookBack to 1h.
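
For reference, this is roughly what the resources block of the StatefulSet above looks like with that limit in place (a sketch; the request value is just an example and should not exceed the limit):

resources:
  limits:
    memory: 1Gi
  requests:
    memory: 1Gi
    cpu: 50m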

sana-jawad (Collaborator) commented:

That's great to know, @looztra!
