Large allocations when value size << table size, possible memory leak #106
Comments
Hi @JohnStarich, thank you for the detailed analysis. I ran the test with tableSize = 10 * mebibyte. This is too big if you have ~200 keys with a max value size of 8k; if your key count and sizes are relatively small, keeping the table size small is the better choice. This behavior has to be documented. It's my bad, sorry for the inconvenience. I'm also open to discussing the storage engine design. The engine uses Go's built-in map to store the index, and a byte slice is used for the entries. The memory layout of a storage entry carries 21 bytes of metadata alongside the key and value. |
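A sketch consistent with that description (21 bytes of metadata per entry, a byte slice holding the entries, a built-in map as the index); the field names, ordering, and individual sizes below are assumptions rather than Olric's actual source:
// Sketch only: layout details are inferred from the description above.
//
// Each entry is appended to the table's byte slice as:
//   key length   uint8   (1 byte)
//   key          []byte  (key-length bytes)
//   TTL          int64   (8 bytes)
//   timestamp    int64   (8 bytes)
//   value length uint32  (4 bytes)
//   value        []byte  (value-length bytes)
//
// That is 1 + 8 + 8 + 4 = 21 bytes of metadata per entry.
type table struct {
	hkeys  map[uint64]int // index: hashed key -> entry offset in memory
	memory []byte         // entries appended back to back
	offset int            // next write position
}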
Thank you @buraksezer for looking into this! Our real table size uses the default (of 1 MiB, I believe), but I exaggerated various parts of the test and benchmark to make the problem more evident. I'm actually in the process of testing a much smaller table size than the default to see if it helps with the leak. Unfortunately, I think that will only slightly mitigate the underlying issue. Hoping that will at least slow down the OOM errors. |
No worries, I appreciate your help! Thanks for the data layout, that's good to know 👍 Could you go over how tables are sized and allocated? That may help us tune Olric so this is much less likely. |
Just re-read and noticed your question: I'm actually not sure what our partition count is. We do have 5 embedded olric nodes, does that correlate with 5 partitions? I'll try and extract more metrics during my pre-production tests. |
Partition count is one of the key concepts of Olric. Olric divides the key space into partitions and maintains those partitions among the nodes. The partitions are distributed among the servers by a consistent hashing algorithm; there is no other measure to determine partition ownership. The partition count is 271 by default, taken directly from Hazelcast. You can choose any other number, but prime numbers are good for a fair distribution of keys among partitions. So if you have 5 nodes and your partition count is 271, every node hosts ~54 partitions. If you add new nodes, some of the partitions and their replicas will move to the new nodes. This flow is almost the same for the replicas. When you start to insert keys into the cluster, Olric allocates 1 MiB (the default tableSize) of heap memory for every partition, so ~54 MiB of memory should be allocated initially by each node in your 5-node cluster. I ran the benchmark many times and I still think it works as expected.
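A quick back-of-envelope check of those numbers (271 partitions, the default 1 MiB tableSize, 5 nodes):
package main

import "fmt"

func main() {
	const (
		partitionCount = 271     // Olric default, as described above
		tableSize      = 1 << 20 // default tableSize: 1 MiB
		nodes          = 5
	)
	partitionsPerNode := float64(partitionCount) / nodes
	initialMiB := partitionsPerNode * tableSize / (1 << 20)
	// Prints: ~54 partitions/node, ~54 MiB of initial table memory per node
	fmt.Printf("~%.0f partitions/node, ~%.0f MiB of initial table memory per node\n",
		partitionsPerNode, initialMiB)
}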
It inserts ~70M keys into a single node. There are 7 tables initially, and Olric tries to expand them a few times: the engine creates double-sized byte slices and moves everything to the new ones. If everything is okay, the old tables should be freed by the GC, but that takes time and the runtime may reuse the allocated memory. So the heap size grows over and over if you insert 70M keys in a few minutes. This is my observation, but it doesn't explain your initial case. I tried to reproduce it with olricd and olric-load but everything looks normal: ~50 MB of memory usage is reported by macOS. It looks like there is a subtle bug in the engine code. I'm on it. Is it possible to share your Olric configuration? |
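A sketch of that expansion, reusing the table struct from the earlier sketch; it only illustrates why heap usage spikes during a burst of inserts and is not Olric's actual code:
// expand doubles the table's backing byte slice and moves the entries over.
// Until the old table is garbage collected, both slices are live, so heap
// usage briefly doubles; under sustained insert load the runtime may also
// hold on to freed memory for reuse instead of returning it to the OS.
func expand(old *table) *table {
	nt := &table{
		hkeys:  make(map[uint64]int, len(old.hkeys)),
		memory: make([]byte, 2*len(old.memory)),
		offset: old.offset,
	}
	// Copy the used region verbatim so the recorded entry offsets stay valid.
	copy(nt.memory, old.memory[:old.offset])
	for hkey, off := range old.hkeys {
		nt.hkeys[hkey] = off
	}
	return nt // the old table becomes garbage once nothing references it
}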
@buraksezer Ah, that helps, thank you! We're using the default partition count then, and we're spreading them across 6 (I misremembered – thought it was 5 earlier) kubernetes pods with the following config. (These comments were my interpretation of the config options, apologies if they're incorrect 😄)
cfg := olricConfig.New("lan")
cfg.ReplicaCount = 2 // Ensure data is on at least 2 nodes, so it has a chance to survive a pod shutdown.
cfg.ReadRepair = true // If DB is inconsistent (e.g. wrong replica count), attempt to repair.
cfg.ReplicationMode = olricConfig.AsyncReplicationMode // Run write and delete operations in the background.
cfg.MemberlistConfig.ProtocolVersion = memberlist.ProtocolVersionMax // Since all clients use the same code, use the maximum protocol.
cfg.MaxJoinAttempts = int(5 * time.Minute / cfg.JoinRetryInterval) // Retry joining for up to 5 minutes.
cfg.LogOutput = logger.Writer() // for some reason, setting Logger doesn't capture logs from memberlist
cfg.ServiceDiscovery = map[string]interface{}{"plugin": provider}
I've also tested adding:
cfg.TableSize = 1 << 18 // 256 KiB
But this doesn't seem to have any effect, sadly.
That's good to hear. I'm struggling to reproduce these same conditions in a controlled fashion, but they occur very consistently in the real environments. I'll keep trying to reproduce without the abnormal table sizes. One of the odd things I've noticed in this environment is that the container's memory leaps upward in large jumps. Whenever I've taken a memory profile on those pods, the vast majority of in-use memory was allocated by the storage table allocations described in the issue. |
Configuration looks good to me.
The current design of the storage engine is read-optimized. It allocates large chunks of memory on the heap and inserts the key/value pairs into them; it never tries to allocate memory gradually. Unfortunately, compaction is an expensive and slow operation. I have prototyped a new design and it performs better than the current one; it is also significantly better for write-intensive tasks. It may be a major topic for v0.5.0. I have also fixed a few bugs in the compaction code, but these bugs should not have an effect on your case. I wonder how many keys you inserted to draw that graph. Average value size is 8k, right?
I never managed to reproduce that. I have been inserting 1M keys (value size is 10 bytes) and it consumes ~350 MB. Are you sure that you don't have a few very large k/v pairs? |
By the way, I have used an off-heap memory library for this kind of problem before. |
Turns out many of these values were much bigger than I had measured before. To find the value sizes, I've wrapped a new gob serializer with metrics:
func main() {
...
collector := metrics.NewCollector(prometheus.DefaultRegisterer)
cfg.Serializer = wrapSerializerMetrics(cfg.Serializer, collector)
...
}
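// metricsSerializer wraps another serializer.Serializer and records the byte
// size of every value it marshals or unmarshals.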
type metricsSerializer struct {
serializer serializer.Serializer
collector *metrics.Collector
}
func wrapSerializerMetrics(s serializer.Serializer, collector *metrics.Collector) serializer.Serializer {
if s == nil {
s = serializer.NewGobSerializer()
}
return &metricsSerializer{
serializer: s,
collector: collector,
}
}
func (m *metricsSerializer) Marshal(v interface{}) ([]byte, error) {
data, err := m.serializer.Marshal(v)
m.collector.ObserveSerializerMarshalSize(len(data))
return data, err
}
func (m *metricsSerializer) Unmarshal(data []byte, v interface{}) error {
err := m.serializer.Unmarshal(data, v)
m.collector.ObserveSerializerUnmarshalSize(len(data))
return err
}
With this, I now have a better picture of the data size per key. Critically, my assumption for key size was completely flawed. 😓 I'll take this as a lesson learned, and sorry for the wild goose chase to find a non-existent leak. This just looks like we're tossing way more data into olric than we have resources for. I'll find a way to prune everything we're not using to get more reasonable sizes. Would it be worthwhile to collaborate on a Prometheus integration or similar (like exposing more stats)? |
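For completeness, a sketch of what a collector behind that wrapper could look like using prometheus/client_golang; the real metrics.Collector in the snippet above isn't shown in the thread, so the metric names and buckets here are stand-ins:
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Collector records serializer payload sizes. Illustrative stand-in only.
type Collector struct {
	marshalSize   prometheus.Histogram
	unmarshalSize prometheus.Histogram
}

func NewCollector(reg prometheus.Registerer) *Collector {
	c := &Collector{
		marshalSize: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    "olric_serializer_marshal_bytes",
			Help:    "Size of values passed through Serializer.Marshal.",
			Buckets: prometheus.ExponentialBuckets(64, 4, 10), // 64 B .. ~16 MiB
		}),
		unmarshalSize: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    "olric_serializer_unmarshal_bytes",
			Help:    "Size of values passed through Serializer.Unmarshal.",
			Buckets: prometheus.ExponentialBuckets(64, 4, 10),
		}),
	}
	reg.MustRegister(c.marshalSize, c.unmarshalSize)
	return c
}

func (c *Collector) ObserveSerializerMarshalSize(n int)   { c.marshalSize.Observe(float64(n)) }
func (c *Collector) ObserveSerializerUnmarshalSize(n int) { c.unmarshalSize.Observe(float64(n)) }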
Interesting, I hadn't heard of off-heap libraries like this before - very cool! As an aside: It may be harder to debug if it's outside the Go heap too, since it could be invisible to Go & pprof. |
Hi @JohnStarich I have just released Olric v0.3.9. It includes a minor fix in the compaction code to clean stale tables more effectively.
Never mind. 😉 I discovered new things in this research.
Absolutely. I had started an exporter several months ago but I was not able to finish it due to lack of spare time. It works as a sidecar, but I'm open to an easier integration with Prometheus. Please open an issue for your feature request. 😉
Olric v0.4.x has network statistics.
I'll take a look at this. Thank you! |
Thanks @buraksezer! 👏 We're still sitting on v0.3.7 with our smaller-values fix in place. I discovered a separate issue when upgrading to v0.3.8 where Gets/Puts took far longer than expected (the max went from 5s up to 1m before we canceled our contexts), but I'll open an issue if we come back around to it. For now, it's working! 😄
Can do 👍 Seems like #57 may cover it, so I'll keep tabs on it there.
Perfect 💯 I'll take a look when I circle back to improvements again, hopefully soon. Thanks again for digging into the problem with me, greatly appreciated. I'll close this, since we've managed to fix it on our end too. I'll probably need to take a deeper look at the encoder magic when I do. |
We're hitting out of memory errors with high memory usage for only ~200 keys: 26.64 GiB! The keys are fairly small, no more than 8k max, so this was pretty surprising. Further, the total sum of all allocated slabs returned by Stats() only shows a small fraction of the actually in-use memory. For one sample: 40 MiB slab total, 18 GiB container total, with 134 keys on olric@v0.3.7. Of course, some of this could be our app, so I dug in more and captured profiles. 😄
Data collection
The top of the heap profile points to these lines: olric/internal/storage/storage.go, lines 98 to 99 at 7e13bd0.
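As an aside, heap profiles like the one referenced above are usually gathered with the standard net/http/pprof handler; a minimal sketch (the listen address is arbitrary):
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints, then inspect the heap with:
	//   go tool pprof -top http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}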
I also ran a CPU profile for good measure, and found that storage.Inuse() was taking up a large percentage of time compared to the typical HTTP traffic this app expected.
Steps to reproduce
I made a test and benchmark that cause this same issue (the heap profile shows the same lines affected).
Each of them has several tuning knobs (constants) you can use to experiment with, if you like. I found that a high table size and a small value byte size triggered the issue most easily.
The test lives in dmap_put_test.go, with the benchmark alongside it.
Running the benchmark for a 1m bench time produced roughly a 3 GiB heap, which is telling me there's effectively a leak in there. I imagine compaction could help take care of it, but some of the memory persists long-term.
Theories
It seems this is most easily triggered when the table size is much larger than the average value size.
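As a rough back-of-envelope for that theory, assuming one tableSize-sized table is pre-allocated per partition (as described elsewhere in the thread) and the default 271 partitions:
package main

import "fmt"

func main() {
	const (
		partitionCount = 271      // Olric default
		tableSize      = 10 << 20 // the exaggerated 10 MiB table size from the test
		keys           = 200      // roughly the real workload
		avgValueSize   = 8 << 10  // ~8 KiB values
	)
	tablesGiB := float64(partitionCount) * tableSize / (1 << 30)
	dataMiB := float64(keys) * avgValueSize / (1 << 20)
	// Prints: ~2.6 GiB of pre-allocated tables vs. ~1.6 MiB of actual data
	fmt.Printf("~%.1f GiB of pre-allocated tables vs. ~%.1f MiB of actual data\n",
		tablesGiB, dataMiB)
}
Under those assumptions the pre-allocated tables dwarf the stored data by three orders of magnitude, before any leak enters the picture.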
I've explored some possibilities so far, including a data race, which I checked for with go test -race. It might still be a race in the protocol somehow that the race detector isn't catching, but I tend to trust the race detector. Hope that helps. I'll try and keep this issue updated if I discover anything new.