
Unbounded memory growth / leak in pipelines #209

Closed
derekperkins opened this issue Jan 13, 2023 · 4 comments
Labels: bug, dmap, investigation, release:v0.5.x

Comments

@derekperkins (Contributor)

Here's the RAM usage over the course of 2.5 hours, showing unbounded memory growth. This happens on every pod until it gets OOMKilled. There are about 100 RPS to the gRPC API across 30 separate pods, and each request makes on average one pipeline Exec call with about 100 keys per call. I've run pprof repeatedly, and it gives different results than what my OS metrics are telling me. I'm only seeing this when using the pipeline; our other service just does single Get/Put calls and maintains steady memory usage just a little over the DMap allocation.

There are 2 DMaps, with the following config.

	// create a new Olric configuration
	cfg := config.New("lan") // default configuration
	cfg.ServiceDiscovery = map[string]any{
		"plugin": k8sDisc,
	}
	cfg.ReplicationMode = config.AsyncReplicationMode
	cfg.LogLevel = "WARN"
	cfg.LogVerbosity = 1

	cfg.DMaps.Custom[name] = config.DMap{
		EvictionPolicy:  config.LRUEviction,
		MaxIdleDuration: 24 * time.Hour,
		MaxInuse:        50_000_000,
	}
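
For context, each gRPC request drives the cache through a single pipeline with roughly 100 queued keys per Exec call. The sketch below illustrates that access pattern; it is a minimal illustration, not our production code, and the DMap name, key slice, and exact Pipeline/FutureGet signatures are assumptions to be checked against the v0.5 client API.

	// Rough shape of the per-request access pattern (illustrative only; the
	// pipeline method names/signatures here follow this thread's discussion
	// and are not copied from production code).
	func lookupBatch(ctx context.Context, e *olric.EmbeddedClient, keys []string) error {
		// One pipeline per request, ~100 keys per Exec call.
		pipe, err := e.Pipeline("example-dmap") // assumed signature: Pipeline(dmapName)
		if err != nil {
			return err
		}
		defer pipe.Discard()

		// Queue a Get for every key; results come back as futures.
		futures := make(map[string]*olric.FutureGet, len(keys))
		for _, key := range keys {
			futures[key] = pipe.Get(ctx, key)
		}

		// Flush all queued commands to the cluster in one Exec call.
		if err := pipe.Exec(ctx); err != nil {
			return err
		}

		for key, fut := range futures {
			resp, err := fut.Result()
			if err != nil {
				continue // e.g. key not found
			}
			_ = key
			_ = resp // decode the response as needed
		}
		return nil
	}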

[screenshots: RAM usage graphs]

pprof -inuse_objects

go tool pprof -inuse_objects http://localhost:9090/debug/pprof/heap

Type: inuse_objects
Time: Jan 13, 2023 at 12:03am (MST)

Showing nodes accounting for 2137797, 71.48% of 2990871 total
Dropped 129 nodes (cum <= 14954)
Showing top 10 nodes out of 128
      flat  flat%   sum%        cum   cum%
    460136 15.38% 15.38%     460136 15.38%  github.com/buraksezer/olric.mapToRoutingTable
    415070 13.88% 29.26%     415070 13.88%  github.com/go-redis/redis/v8/internal/proto.(*Reader).readStringReply
    218460  7.30% 36.57%     218460  7.30%  net.sockaddrToTCP
    213004  7.12% 43.69%     213004  7.12%  github.com/go-redis/redis/v8/internal/proto.NewWriter (inline)
    210928  7.05% 50.74%     210928  7.05%  bufio.NewReaderSize (inline)
    162753  5.44% 56.18%     162753  5.44%  time.NewTicker
    136300  4.56% 60.74%     136300  4.56%  github.com/go-redis/redis/v8/internal/pool.NewConnPool
    131075  4.38% 65.12%     131075  4.38%  syscall.anyToSockaddr
     98315  3.29% 68.41%      98315  3.29%  net.newFD
     91756  3.07% 71.48%     636946 21.30%  github.com/go-redis/redis/v8/internal/pool.NewConn

inuse_objects

pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz

pprof -alloc_objects

go tool pprof -alloc_objects http://localhost:9090/debug/pprof/heap

Type: alloc_objects
Time: Jan 13, 2023 at 12:05am (MST)

(pprof) top
Showing nodes accounting for 2447758068, 86.92% of 2816051715 total
Dropped 1298 nodes (cum <= 14080258)
Showing top 10 nodes out of 117
      flat  flat%   sum%        cum   cum%
1412298640 50.15% 50.15% 2140089146 76.00%  github.com/go-redis/redis/v8.sliceParser
 412719257 14.66% 64.81% 2143579060 76.12%  github.com/go-redis/redis/v8/internal/proto.(*Reader).ReadReply
 318266246 11.30% 76.11%  318266246 11.30%  github.com/go-redis/redis/v8/internal/proto.(*Reader).readStringReply
 150051797  5.33% 81.44%  150051797  5.33%  github.com/buraksezer/olric.mapToRoutingTable
  49612998  1.76% 83.20%   49612998  1.76%  strconv.formatBits
  28112132     1% 84.20%   77594057  2.76%  strconv.AppendInt
  22158716  0.79% 84.98%   42573491  1.51%  github.com/go-redis/redis/v8.(*Client).Pipeline
  20414775  0.72% 85.71%   20414775  0.72%  github.com/go-redis/redis/v8.(*Pipeline).init (inline)
  17542143  0.62% 86.33%   33190294  1.18%  vitess.io/vitess/go/vt/vtgate/evalengine.ToNative
  16581364  0.59% 86.92%   48094774  1.71%  github.com/buraksezer/olric/internal/discovery.NewMemberFromMetadata

alloc_objects

pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.011.pb.gz

pprof -inuse_space

go tool pprof -inuse_space http://localhost:9090/debug/pprof/heap

Type: inuse_space
Time: Jan 13, 2023 at 12:11am (MST)

Showing nodes accounting for 1.11GB, 91.93% of 1.21GB total
Dropped 166 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 115
      flat  flat%   sum%        cum   cum%
    0.38GB 31.36% 31.36%     0.38GB 31.36%  bufio.NewReaderSize (inline)
    0.26GB 21.61% 52.96%     0.26GB 21.61%  github.com/tidwall/redcon.AppendBulk
    0.23GB 19.30% 72.26%     0.23GB 19.30%  bufio.NewWriterSize
    0.08GB  6.92% 79.18%     0.19GB 15.57%  github.com/tidwall/redcon.NewReader
    0.05GB  4.39% 83.57%     0.05GB  4.39%  github.com/go-redis/redis/v8/internal/pool.NewConnPool
    0.04GB  3.44% 87.01%     0.04GB  3.44%  github.com/buraksezer/olric.mapToRoutingTable
    0.02GB  2.02% 89.03%     0.02GB  2.02%  runtime.malg
    0.01GB  1.01% 90.03%     0.01GB  1.01%  net.newFD
    0.01GB  0.97% 91.00%     0.01GB  0.97%  github.com/go-redis/redis/v8/internal/proto.NewWriter
    0.01GB  0.93% 91.93%     0.01GB  0.93%  github.com/buraksezer/olric/internal/kvstore/table.New

inuse_space

pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.014.pb.gz

pprof -alloc_space

go tool pprof -alloc_space http://localhost:9090/debug/pprof/heap

Type: alloc_space
Time: Jan 13, 2023 at 12:09am (MST)

Showing nodes accounting for 125.66GB, 77.90% of 161.31GB total
Dropped 1268 nodes (cum <= 0.81GB)
Showing top 10 nodes out of 152
      flat  flat%   sum%        cum   cum%
   42.78GB 26.52% 26.52%    56.50GB 35.03%  github.com/go-redis/redis/v8.sliceParser
   36.81GB 22.82% 49.34%    36.81GB 22.82%  github.com/buraksezer/olric.mapToRoutingTable
    8.84GB  5.48% 54.82%    10.01GB  6.21%  github.com/tidwall/redcon.appendPrefix
    7.66GB  4.75% 59.57%    13.63GB  8.45%  github.com/tidwall/redcon.AppendBulkString
    7.31GB  4.53% 64.10%     7.31GB  4.53%  github.com/go-redis/redis/v8/internal/proto.(*Reader).readStringReply
    6.53GB  4.05% 68.14%    56.63GB 35.10%  github.com/go-redis/redis/v8/internal/proto.(*Reader).ReadReply
    6.35GB  3.94% 72.08%     6.40GB  3.97%  github.com/tidwall/redcon.AppendBulk
    4.75GB  2.94% 75.02%     4.75GB  2.94%  bufio.NewReaderSize (inline)
    2.36GB  1.46% 76.49%     2.36GB  1.46%  bufio.NewWriterSize
    2.28GB  1.41% 77.90%     4.62GB  2.86%  github.com/tidwall/redcon.NewReader

alloc_space

pprof.main_linux64.alloc_objects.alloc_space.inuse_objects.inuse_space.013.pb.gz

@derekperkins (Contributor, Author)

Here are some screenshots from our continuous profiler showing 24 hours of data (averaged over 250 samples): yesterday with Olric running, and the same day last week without Olric.

In Use Heap Profiles

Before Olric

[screenshot]

After Olric

[screenshot]

Alloc Heap Profiles

Before Olric

[screenshot]

After Olric

[screenshot]

CPU Profiles

I expected an increase in CPU, given that there would be serialization during lookups compared to the local cache with ristretto, but I wasn't really expecting the ~5% of service CPU to turn into > 80% of the whole service's CPU being dedicated to Olric. A decent chunk of that is increased GC from all the allocations.

Before Olric

[screenshot]

After Olric

[screenshot]

@derekperkins (Contributor, Author)

The good news is that this also supports the conclusion that it's primarily an issue with pipelines. In a separate service that only does single Gets/Puts with no pipelines, and which has been running in production longer, there is definitely an increase in heap usage and GC, but it doesn't completely dominate the entire service, of which Olric is just a small part.

In Use Heap Profiles

Before Olric

[screenshot]

After Olric

[screenshot]

Alloc Heap Profiles

Before Olric

[screenshot]

After Olric

[screenshot]

CPU Profiles

Before Olric

[screenshot]

After Olric

[screenshot]

@buraksezer (Owner)

@derekperkins thank you for the detailed bug report. I applied a fix to overcome the problem. Previously, EmbeddedClient.Pipeline created a new cluster client for every new pipeline. Reusing pipelines can partly overcome this problem, but all pipelines should share the same underlying cluster client.

The other problem was the routing table fetch interval. The ClusterClient implementation retrieves the routing table periodically in order to route requests to the partition owner. The client can work without this "smart pick" logic, but that increases latency. The interval was too short: 15 seconds. I set a new default value (1 minute) and added the WithRoutingTableFetchInterval function for setting an interval appropriate to the environment.

Now you can use EmbeddedClient.Pipeline to create a new pipeline; the EmbeddedClient implementation creates only one ClusterClient and reuses it. You can also use ClusterClient.Pipeline to create a new pipeline; it's what the EmbeddedClient.Pipeline function already uses under the hood.

So my theory is that you had many pipeline instances at runtime, each with its own ClusterClient instance. Even worse, all of these cluster clients sent thousands of requests to retrieve the same routing table, and parsing and installing a new routing table in the client consumed a lot of memory.

There is a lot of room to improve the pipeline implementation but the latest version should fix the memory leak problem. Could you upgrade to Olric v0.5.2 and observe the system?
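
Roughly, the intended usage now looks like the sketch below. It is a minimal sketch: the addresses and DMap name are placeholders, and the exact NewClusterClient, WithRoutingTableFetchInterval, and Pipeline signatures should be checked against the v0.5.2 docs.

	// Build one ClusterClient at startup and reuse it for every pipeline.
	client, err := olric.NewClusterClient(
		[]string{"olric-0.olric:3320"}, // placeholder address
		// New default fetch interval is 1 minute; tune it for your environment.
		olric.WithRoutingTableFetchInterval(time.Minute),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Each request creates a short-lived pipeline from the shared client, so
	// there is no extra ClusterClient (and no extra routing table polling)
	// per pipeline.
	pipe, err := client.Pipeline("example-dmap") // dmap name is a placeholder
	if err != nil {
		log.Fatal(err)
	}
	defer pipe.Discard()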

@derekperkins (Contributor, Author)

v0.5.2 has been running for about 10 hours now and looks solid. I can confirm that these fixes solved the memory leak.

[screenshot]

In Use Heap Profile

[screenshot]

Alloc Heap Profile

[screenshot]

CPU Profile

[screenshot]

As you mentioned, there is definitely still room for improvement, but at least this is stable enough to keep running.

I'm seeing an increased number of i/o timeout errors, and I need to tweak the error handling, because as it stands it's taking up a lot of resources inside our code as well.
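
Something along these lines is probably a reasonable starting point for that: classify timeouts explicitly so they can be retried or logged at lower severity instead of bubbling up as hard failures. This is a generic sketch using only the standard library (assuming "context", "errors", and "net" imports), not something specific to Olric's error types.

	// isTimeout reports whether err looks like a transient i/o timeout.
	func isTimeout(err error) bool {
		if errors.Is(err, context.DeadlineExceeded) {
			return true
		}
		var netErr net.Error
		return errors.As(err, &netErr) && netErr.Timeout()
	}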

Thanks for the quick fix!
