Large allocations when value size << table size, possible memory leak #106

Closed · JohnStarich opened this issue Jul 7, 2021 · 12 comments
Labels: bug, investigation, release:v0.3.x, release:v0.4.x

@JohnStarich (Contributor)

We're hitting out-of-memory errors with high memory usage for only ~200 keys: 26.64 GiB! The keys are fairly small, no more than 8 KiB, so this was pretty surprising. Further, the total sum of all allocated slabs returned by Stats() only shows a small fraction of the memory actually in use.

For one sample: 40 MiB slab total, 18 GiB container total with 134 keys on olric v0.3.7.

Of course, some of this could be our app – so I dug in more and captured profiles. 😄

Data collection

Here's the top of the heap profile:

Type: inuse_space
Time: Jun 29, 2021 at 12:16pm (CDT)
Active filters:
   focus=github\.com/buraksezer/olric/internal/storage\.newTable
Showing nodes accounting for 7199.49MB, 99.87% of 7208.76MB total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                         7156.94MB 99.59% |   github.com/buraksezer/olric/internal/storage.(*Storage).Put /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:98 (inline)
                                           29.56MB  0.41% |   github.com/buraksezer/olric/internal/storage.New /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:52 (inline)
 7186.49MB 99.69% 99.69%  7186.49MB 99.69%                | github.com/buraksezer/olric/internal/storage.newTable /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/table.go:63
----------------------------------------------------------+-------------
                                            8.50MB   100% |   github.com/buraksezer/olric/internal/storage.(*Storage).Put /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:98 (inline)
    8.50MB  0.12% 99.81%     8.50MB  0.12%                | github.com/buraksezer/olric/internal/storage.newTable /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/table.go:51
----------------------------------------------------------+-------------
                                            4.50MB   100% |   github.com/buraksezer/olric/internal/storage.(*Storage).Put /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:98 (inline)
    4.50MB 0.062% 99.87%     4.50MB 0.062%                | github.com/buraksezer/olric/internal/storage.newTable /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/table.go:52
----------------------------------------------------------+-------------

Which points to these lines:

nt := newTable(s.Inuse() * 2)
s.tables = append(s.tables, nt)

I also ran a CPU profile for good measure, and found storage.Inuse() was taking up a large percentage of time compared to the typical HTTP traffic this app expected:

Type: cpu
Time: Jun 29, 2021 at 12:21pm (CDT)
Duration: 30.11s, Total samples = 30.52s (101.37%)
Showing nodes accounting for 30.52s, 100% of 30.52s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            28.26s   100% |   github.com/buraksezer/olric/internal/storage.(*Storage).Put /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:98
    28.23s 92.50% 92.50%     28.26s 92.60%                | github.com/buraksezer/olric/internal/storage.(*Storage).Inuse /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:333
                                             0.03s  0.11% |   runtime.asyncPreempt /usr/local/go/src/runtime/preempt_amd64.s:8
----------------------------------------------------------+-------------
                                             1.40s   100% |   github.com/buraksezer/olric/internal/storage.(*Storage).Put /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:98
     1.40s  4.59% 97.08%      1.40s  4.59%                | github.com/buraksezer/olric/internal/storage.(*Storage).Inuse /go/pkg/mod/github.com/buraksezer/olric@v0.3.7/internal/storage/storage.go:332
----------------------------------------------------------+-------------
                                             0.07s   100% |   runtime.netpoll /usr/local/go/src/runtime/netpoll_epoll.go:126
     0.07s  0.23% 97.31%      0.07s  0.23%                | runtime.epollwait /usr/local/go/src/runtime/sys_linux_amd64.s:725
----------------------------------------------------------+-------------

(attached: pprof call graph screenshot)

Steps to reproduce

I made a test and benchmark that cause this same issue (the heap profile shows the same lines affected).

Each of them has several tuning knobs (constants) you can use to experiment, if you like. I found that a large table size and a small value size triggered the issue most easily.

In dmap_put_test.go:

const mebibyte = 1 << 20

func TestDMap_PutMany(t *testing.T) {
	const (
		keyCount  = 1000 // Number of keys to set in 1 round.
		valueSize = 10   // Value size in bytes.
		// Cumulative value memory = keyCount * valueSize
		tableSize = 10 * mebibyte

		sampleInterval = 5 * time.Second
		statusInterval = 10 * time.Second
	)

	cfg := testSingleReplicaConfig()
	cfg.TableSize = tableSize
	db, err := newDB(cfg)
	if err != nil {
		t.Fatalf("Expected nil. Got: %v", err)
	}
	dm, err := db.NewDMap("mymap")
	if err != nil {
		t.Fatalf("Expected nil. Got: %v", err)
	}

	before := readMem()
	t.Logf("Before Memory: %.2f MiB", toMiB(before.Alloc))
	expectedMemDiff := toMiB(keyCount * valueSize)
	t.Logf("Expected memory increase of %.2f MiB", expectedMemDiff)

	ctx, cancel := context.WithCancel(context.Background())
	t.Cleanup(cancel)
	go runMemStatusTicker(ctx, t, sampleInterval, statusInterval)

	value := make([]byte, valueSize)
	for i := 0; i < keyCount; i++ {
		err := dm.Put(bkey(i), value)
		if err != nil {
			panic(err)
		}
	}
	runtime.GC()
	after := readMem()
	writePprofHeap(t)

	t.Logf("After Memory: %.2f MiB", toMiB(after.Alloc))
	actualMemDiff := toMiB(after.Alloc) - toMiB(before.Alloc)
	t.Logf("Expected memory increase of %.2f MiB, got %.2f MiB. actual/expected ratio: %.2f", expectedMemDiff, actualMemDiff, actualMemDiff/expectedMemDiff)
}

func toMiB(i uint64) float64 {
	return float64(i) / mebibyte
}

func readMem() runtime.MemStats {
	var mem runtime.MemStats
	runtime.ReadMemStats(&mem)
	return mem
}

func writePprofHeap(t *testing.T) {
	f, err := ioutil.TempFile("", "")
	if err != nil {
		t.Fatal(err)
	}
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		t.Fatal(err)
	}
	_ = f.Close()
	t.Log("Pprof written to:", f.Name())
}

func max(values ...uint64) uint64 {
	maxVal := values[0]
	for _, val := range values[1:] {
		if val > maxVal {
			maxVal = val
		}
	}
	return maxVal
}

func runMemStatusTicker(ctx context.Context, t *testing.T, sampleInterval, statusInterval time.Duration) {
	sampleTick := time.NewTicker(sampleInterval)
	defer sampleTick.Stop()
	statusTick := time.NewTicker(statusInterval)
	defer statusTick.Stop()

	const maxSamples = 10
	samples := make([]uint64, maxSamples)
	i := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-sampleTick.C:
			mem := readMem()
			samples[i%maxSamples] = mem.Alloc
			i++
		case <-statusTick.C:
			t.Logf("Current memory: %.2f MiB", toMiB(max(samples...)))
		}
	}
}

And the benchmark:

func BenchmarkDMap_PutMany(b *testing.B) {
	b.ReportAllocs()
	const (
		valueSize = 10 // Value size in bytes.
		tableSize = 10 * mebibyte

		sampleInterval = 5 * time.Second
		statusInterval = 10 * time.Second
	)

	cfg := testSingleReplicaConfig()
	cfg.TableSize = tableSize
	cfg.LogOutput = ioutil.Discard
	db, err := newDB(cfg)
	if err != nil {
		b.Fatalf("Expected nil. Got: %v", err)
	}
	dm, err := db.NewDMap("mymap")
	if err != nil {
		b.Fatalf("Expected nil. Got: %v", err)
	}

	value := make([]byte, valueSize)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		err := dm.Put(bkey(i), value)
		if err != nil {
			panic(err)
		}
	}
	b.StopTimer()
	runtime.GC()
	var mem runtime.MemStats
	runtime.ReadMemStats(&mem)
	b.ReportMetric(toMiB(mem.Alloc), "heap-MiB")
}

Which produced this output for me with a 1m bench time:

go test . -run '^$' -bench BenchmarkDMap_PutMany -v -benchtime=1m

goos: darwin
goarch: amd64
pkg: github.com/buraksezer/olric
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
BenchmarkDMap_PutMany
BenchmarkDMap_PutMany-8         22833872              4389 ns/op              3410 heap-MiB         1640 B/op         26 allocs/op
PASS
ok      github.com/buraksezer/olric     106.409s

The 3 GiB heap is telling me there's effectively a leak in there. I imagine compaction could help take care of it, but some of the memory persists long-term.

Theories

It seems this is most easily triggered when the table size is much larger than the average value size.

I've explored some possibilities so far:

  • Race condition when determining if a new table should be allocated
    • I attempted to find a data race on the affected lines but I can't seem to trigger it with go test -race. It might still be a race in the protocol somehow, which the race detector isn't catching, but I tend to trust the race detector.
  • Incorrect formula for determining the next table size. This one seems most likely to me, though I don't know enough about olric's table allocation to say with any certainty.

Hope that helps. I'll try and keep this issue updated if I discover anything new.

@buraksezer self-assigned this on Jul 7, 2021
@buraksezer added the bug, investigation, release:v0.3.x, and release:v0.4.x labels on Jul 7, 2021
@buraksezer (Owner)

Hi @JohnStarich

Thank you for the detailed analysis. I ran TestDMap_PutMany and BenchmarkDMap_PutMany and tried to debug the storage engine. It seems that the storage engine works as expected. Admittedly, this allocation behavior looks a bit problematic to me too, but it can be caused by a misconfiguration.

tableSize = 10 * mebibyte

This is too big if you have ~200 keys with an 8 KiB max value size. tableSize determines the initial allocation for a DMap fragment on each partition. So if you have 271 partitions on a single node, Olric will pre-allocate ~2.7 GiB on the heap with a 10 MiB initial table size. So what is your partition count?

If your key count and total data size are relatively small, keeping tableSize small is a good idea. Olric will allocate more memory if you need it. There is only one downside: if you insert too many keys (say 1M) while your tableSize is small, Olric is going to create new byte slices on the heap very frequently, and that also triggers compaction. So a small tableSize decreases the throughput of the cluster.

This behavior has to be documented. It's my bad. Sorry for the inconvenience. I'm also open to discussing the storage engine design.

The engine uses Go's built-in map to store the index, and a byte slice for the entries.

Memory layout of a storage entry. It includes 21 bytes of metadata:

KEY-LENGTH(uint8) | KEY(bytes) | TTL(uint64) | Timestamp(uint64) | VALUE-LENGTH(uint32) | VALUE(bytes)

Index:

map[hash(key)<uint64>]offset<uint64>
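
For illustration, here is a simplified sketch of how an entry with this layout could be appended to a table's byte slice and indexed by its offset. This is only a sketch (the hash function and byte order are arbitrary choices here), not the actual engine code:

// Sketch only: encodes one entry using the layout above
// (1 + 8 + 8 + 4 = 21 bytes of metadata) and records its offset in the index.
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

type table struct {
	memory []byte
	index  map[uint64]uint64 // hash(key) -> offset into memory
}

func (t *table) put(key string, ttl, timestamp uint64, value []byte) {
	offset := uint64(len(t.memory))

	t.memory = append(t.memory, byte(len(key)))                            // KEY-LENGTH (uint8)
	t.memory = append(t.memory, key...)                                    // KEY
	t.memory = binary.BigEndian.AppendUint64(t.memory, ttl)                // TTL (uint64)
	t.memory = binary.BigEndian.AppendUint64(t.memory, timestamp)          // Timestamp (uint64)
	t.memory = binary.BigEndian.AppendUint32(t.memory, uint32(len(value))) // VALUE-LENGTH (uint32)
	t.memory = append(t.memory, value...)                                  // VALUE

	h := fnv.New64a()
	h.Write([]byte(key))
	t.index[h.Sum64()] = offset
}

func main() {
	t := &table{index: make(map[uint64]uint64)}
	t.put("mykey", 0, 1625673600, []byte("myvalue"))
	fmt.Printf("entry bytes: %d (21 metadata + %d key + %d value)\n",
		len(t.memory), len("mykey"), len([]byte("myvalue")))
}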

@JohnStarich (Contributor, Author)

Thank you @buraksezer for looking into this!

Our real table size uses the default (of 1 MiB, I believe), but I exaggerated various parts of the test and benchmark to make the problem more evident.

I'm actually in the process of testing a much smaller table size than the default to see if it helps with the leak. Unfortunately, I think that will only slightly mitigate the underlying issue. Hoping that will at least slow down the OOM errors.

@JohnStarich (Contributor, Author)

This behavior has to be documented. It's my bad. Sorry for the inconvenience. I'm also open to discussing the storage engine design.

No worries, I appreciate your help! Thanks for the data layout, that's good to know 👍

Could you go over how tables are sized and allocated? That may help us tune Olric to make this much less likely.

@JohnStarich (Contributor, Author)

So what is your partition count?

Just re-read and noticed your question: I'm actually not sure what our partition count is. We do have 5 embedded olric nodes, does that correlate with 5 partitions? I'll try and extract more metrics during my pre-production tests.

@buraksezer (Owner) commented Jul 8, 2021

Just re-read and noticed your question: I'm actually not sure what our partition count is. We do have 5 embedded olric nodes, does that correlate with 5 partitions? I'll try and extract more metrics during my pre-production tests.

Partition count is one of the key concepts of Olric. Olric divides the key space into partitions and maintains the partitions among the nodes. The partitions are distributed among the servers by a consistent hashing algorithm; there is no other measure to determine partition ownership. The partition count is 271 by default, taken directly from Hazelcast. You can choose any other number as the partition count, but prime numbers are better for a fair distribution of keys among partitions.

So if you have 5 nodes and your partition count is 271, every node hosts ~54 partitions. If you add new nodes, some of the partitions and their replicas are going to move to the new nodes. The flow is almost the same for replicas.

When you start to insert keys into the cluster, Olric allocates 1 MiB (the default tableSize) of heap memory for every partition. So ~54 MiB of memory should be allocated initially by each node in your 5-node cluster.
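
As a rough back-of-the-envelope check (assuming one table of tableSize bytes per hosted partition, which is a simplification):

// Rough estimate of the initial per-node pre-allocation, using the numbers
// from this thread: 271 partitions, 5 nodes, 1 MiB default table size.
package main

import "fmt"

func main() {
	const (
		partitionCount = 271
		nodeCount      = 5
		tableSize      = 1 << 20 // 1 MiB default tableSize
	)
	partitionsPerNode := partitionCount / nodeCount // ~54
	perNodeBytes := partitionsPerNode * tableSize
	fmt.Printf("~%d partitions per node, ~%d MiB pre-allocated per node\n",
		partitionsPerNode, perNodeBytes/(1<<20))

	// With the 10 MiB tableSize from the benchmark on a single node:
	fmt.Printf("single node, 10 MiB tables: ~%.1f GiB pre-allocated\n",
		float64(partitionCount*(10<<20))/(1<<30))
}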

I ran the benchmark many times and I still think it works as expected.

goos: darwin
goarch: amd64
pkg: github.com/buraksezer/olric
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkDMap_PutMany
BenchmarkDMap_PutMany-8         70533109              1482 ns/op              9194 heap-MiB           9225 heap-inuse-MiB            566 B/op          9 allocs/op
PASS
ok      github.com/buraksezer/olric     107.986s

It inserts ~70M keys into a single node. There are 7 tables initially. Olric tries to expand them a few times: the engine creates double-sized byte slices and moves everything into the new ones. If everything is okay, the old tables should be freed by the GC. But that takes time, and the runtime may reuse the allocated memory. So the heap size grows over and over if you insert 70M keys in a few minutes.
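
A simplified illustration of that expand-and-copy cycle (a sketch only, not the actual engine code; the real engine also maintains an index and runs compaction in the background):

// Simplified illustration: when the current table cannot fit the next entry,
// allocate a table with double the capacity, copy the live entries over, and
// drop the old slice so the GC can reclaim it eventually.
package main

import "fmt"

type simpleTable struct {
	memory []byte
	used   int
}

func put(t *simpleTable, entry []byte) *simpleTable {
	if t.used+len(entry) > len(t.memory) {
		// Expand: double-sized byte slice, then move everything over.
		bigger := &simpleTable{memory: make([]byte, 2*len(t.memory))}
		bigger.used = copy(bigger.memory, t.memory[:t.used])
		t = bigger // the old table is now garbage, freed by the GC later
	}
	copy(t.memory[t.used:], entry)
	t.used += len(entry)
	return t
}

func main() {
	t := &simpleTable{memory: make([]byte, 1<<20)} // 1 MiB initial table
	entry := make([]byte, 31)                      // 10-byte value + 21 bytes of metadata
	for i := 0; i < 1_000_000; i++ {
		t = put(t, entry)
	}
	fmt.Printf("final table: %.1f MiB capacity, %.1f MiB used\n",
		float64(len(t.memory))/(1<<20), float64(t.used)/(1<<20))
}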

This is my observation, but it doesn't explain your initial case. I tried to reproduce it with olricd and olric-load but everything looks normal: ~50 MB of memory usage is reported by macOS. It looks like there is a subtle bug in the engine code. I'm on it.

Is it possible to share your Olric configuration?

@JohnStarich (Contributor, Author)

@buraksezer Ah, that helps, thank you! We're using the default partition count then, and we're spreading them across 6 Kubernetes pods (I misremembered; I thought it was 5 earlier) with the following config. (These comments are my interpretation of the config options, apologies if they're incorrect 😄)

cfg := olricConfig.New("lan")
cfg.ReplicaCount = 2                                                 // Ensure data is on at least 2 nodes, so it has a chance to survive a pod shutdown.
cfg.ReadRepair = true                                                // If DB is inconsistent (e.g. wrong replica count), attempt to repair.
cfg.ReplicationMode = olricConfig.AsyncReplicationMode               // Run write and delete operations in the background.
cfg.MemberlistConfig.ProtocolVersion = memberlist.ProtocolVersionMax // Since all clients use the same code, use the maximum protocol.
cfg.MaxJoinAttempts = int(5 * time.Minute / cfg.JoinRetryInterval)   // Retry joining for up to 5 minutes.

cfg.LogOutput = logger.Writer() // for some reason, setting Logger doesn't capture logs from memberlist
cfg.ServiceDiscovery = map[string]interface{}{"plugin": provider}

I've also tested adding:

cfg.TableSize = 1 << 18     // 256 KiB

But this doesn't seem to have any effect, sadly.

I ran the benchmark many times and I still think it works as expected.

That's good to hear. I'm struggling to reproduce these same conditions in a controlled fashion, but they occur very consistently in our real environments. I'll keep trying to reproduce it without the abnormal table sizes.

One of the odd things I've noticed in this environment is the container's memory leaps upward in large jumps:
(attached: container memory usage graph)

Whenever I've taken a memory profile on those pods, the profiles show the vast majority of in-use memory was allocated by newTable(s.Inuse() * 2) 🤷

@buraksezer (Owner)

Configuration looks good to me.

One of the odd things I've noticed in this environment is the container's memory leaps upward in large jumps:

The current design of the storage engine is read-optimized. It allocates large chunks of memory on the heap and inserts the key/value pairs into them. It never tries to allocate memory gradually.

Unfortunately, compaction is an expensive and slow operation. I have prototyped a new design and it performs better than the current one. It is also significantly better for write-intensive workloads. It may be a major topic for v0.5.0.

I have also fixed a few bugs in the compaction code. These bugs should not have an effect on your case. I wonder how many keys you inserted to draw that graph. The average value size is 8 KiB, right?

For one sample: 40 MiB slab total, 18 GiB container total with 134 keys on olric v0.3.7.

I never managed to reproduce that. I have been inserting 1M keys (value size of 10 bytes) and it consumes ~350 MB. Are you sure that you don't have a few very large k/v pairs?

@buraksezer (Owner)

Whenever I've taken a memory profile on those pods, the profiles show the vast majority of in-use memory was allocated by newTable(s.Inuse() * 2) 🤷

newTable is Olric's malloc, but we don't have a free function. Compaction creates significant pressure on memory and the GC. We need to improve it, but that is a significant task for a minor version bump.

By the way, I have tried using mmap to allocate memory outside the heap, but compaction is still too expensive and slow. I have also tried jemalloc, but it didn't make any difference.

@JohnStarich (Contributor, Author) commented Jul 9, 2021

Are you sure that you don't have a few very large k/v pairs?

It turns out many of these values were much bigger than I had measured before. To find the value sizes, I wrapped a new gob serializer with metrics:

func main() {
	...
	collector := metrics.NewCollector(prometheus.DefaultRegisterer)
	cfg.Serializer = wrapSerializerMetrics(cfg.Serializer, collector)
	...
}

type metricsSerializer struct {
	serializer serializer.Serializer
	collector  *metrics.Collector
}

func wrapSerializerMetrics(s serializer.Serializer, collector *metrics.Collector) serializer.Serializer {
	if s == nil {
		s = serializer.NewGobSerializer()
	}
	return &metricsSerializer{
		serializer: s,
		collector:  collector,
	}
}

func (m *metricsSerializer) Marshal(v interface{}) ([]byte, error) {
	data, err := m.serializer.Marshal(v)
	m.collector.ObserveSerializerMarshalSize(len(data))
	return data, err
}

func (m *metricsSerializer) Unmarshal(data []byte, v interface{}) error {
	err := m.serializer.Unmarshal(data, v)
	m.collector.ObserveSerializerUnmarshalSize(len(data))
	return err
}
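
(metrics.Collector above is our own internal package. For anyone wanting to do something similar, a rough, hypothetical equivalent using plain Prometheus histograms could look like the sketch below; the Observe* methods are just thin wrappers around Histogram.Observe.)

// Hypothetical sketch of the collector used above; assumes
// github.com/prometheus/client_golang and exposes value sizes as histograms.
package metrics

import "github.com/prometheus/client_golang/prometheus"

type Collector struct {
	marshalSize   prometheus.Histogram
	unmarshalSize prometheus.Histogram
}

func NewCollector(reg prometheus.Registerer) *Collector {
	c := &Collector{
		marshalSize: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    "olric_serializer_marshal_bytes",
			Help:    "Size of values marshaled for Olric, in bytes.",
			Buckets: prometheus.ExponentialBuckets(64, 4, 12), // 64 B up to ~256 MiB
		}),
		unmarshalSize: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    "olric_serializer_unmarshal_bytes",
			Help:    "Size of values unmarshaled from Olric, in bytes.",
			Buckets: prometheus.ExponentialBuckets(64, 4, 12),
		}),
	}
	reg.MustRegister(c.marshalSize, c.unmarshalSize)
	return c
}

func (c *Collector) ObserveSerializerMarshalSize(n int)   { c.marshalSize.Observe(float64(n)) }
func (c *Collector) ObserveSerializerUnmarshalSize(n int) { c.unmarshalSize.Observe(float64(n)) }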

With this, I now have a better picture of the data size per key.
Under continuous load, my 90th-percentile value size is >256 KiB per key. I imagine it's much higher than this in production, too.

Critically, my assumption for key size was completely flawed. 😓 I'll take this as a lesson learned, and sorry for the wild goose chase to find a non-existent leak. This just looks like we're tossing way more data into olric than we have resources for. I'll find a way to prune everything we're not using to get more reasonable sizes.

Would it be worthwhile to collaborate on a Prometheus integration or similar (like more Stats())? I imagine value size metrics would be invaluable data for newcomers to easily spot issues like this.

@JohnStarich (Contributor, Author)

By the way, I have tried using mmap to allocate memory outside the heap, but compaction is still too expensive and slow. I have also tried jemalloc, but it didn't make any difference.

Interesting, I hadn't heard of off-heap libraries like this before; very cool! As an aside, it may be harder to debug if it's outside the Go heap, since it could be invisible to Go and pprof.
One alternative I've seen is zap's buffer pool implementation: essentially keeping tabs on large byte slices and reusing them on demand.
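
Something along these lines, as a minimal sketch of the pooled-buffer idea using sync.Pool (zap's real implementation lives in go.uber.org/zap/buffer and is more sophisticated):

// Minimal sketch of the pooled-buffer idea: reuse large byte slices instead
// of allocating fresh ones, so the heap stays bounded under churn.
package bufpool

import "sync"

const defaultCap = 1 << 20 // pre-size buffers at 1 MiB; tune for your workload

var pool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, defaultCap)
		return &b
	},
}

// Get returns an empty buffer with a large pre-allocated capacity.
func Get() *[]byte {
	return pool.Get().(*[]byte)
}

// Put resets the buffer and hands it back to the pool for reuse.
func Put(b *[]byte) {
	*b = (*b)[:0]
	pool.Put(b)
}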

@buraksezer (Owner)

Hi @JohnStarich

I have just released Olric v0.3.9. It includes a minor fix in the compaction code to clean stale tables more effectively.

Critically, my assumption for key size was completely flawed. 😓 I'll take this as a lesson learned, and sorry for the wild goose chase to find a non-existent leak. This just looks like we're tossing way more data into olric than we have resources for. I'll find a way to prune everything we're not using to get more reasonable sizes.

No worries. 😉 I discovered new things during this research.

Would it be worthwhile to collaborate on a Prometheus integration or similar (like more Stats())?

Absolutely. I had started an exporter several months ago, but I was not able to finish it due to a lack of spare time. It works as a sidecar, but I'm open to an easier integration with Prometheus. Please open an issue for your feature request. 😉

I imagine value size metrics would be invaluable data for newcomers to easily spot issues like this.

Olric v0.4.x has network statistics. ReadBytesTotal would be useful for inspecting this kind of problem. We may also add more stats for DMaps, such as MeanValueSize or something similar. By the way, the latest beta of v0.4.x is much more stable than the previous ones, so you may want to upgrade for the new Stats().

One alternative I've seen is zap's buffer pool implementation: essentially keeping tabs on large byte slices and reusing them on demand.

I'll take a look at this. Thank you!

@JohnStarich (Contributor, Author)

Thanks @buraksezer! 👏

We're still sitting on v0.3.7 with our smaller-values fix in place. I discovered a separate issue when upgrading to v0.3.8, where Gets/Puts took way longer than expected (a 5s max went up to 1m before we canceled our contexts), but I'll open an issue if we come back around to it. For now, it's working! 😄

Absolutely. I had started an exporter several months ago, but I was not able to finish it due to a lack of spare time. It works as a sidecar, but I'm open to an easier integration with Prometheus. Please open an issue for your feature request. 😉

Can do 👍 Seems like #57 may cover it, so I'll keep tabs on it there.

Olric v0.4.x has network statistics

Perfect 💯 I'll take a look when I circle back to improvements again, hopefully soon.

Thanks again for digging into the problem with me, greatly appreciated. I'll close, since we've managed to fix it on our end too.

I'll probably need to take a deeper look at the encoder magic: it looks like gob is encoding much larger values than I initially suspected, which could cause future mishaps. But you've made it very easy to swap out the Serializer, so it should be no trouble 🙂
