Fix loading huge series into RAM when points are overwritten #6556
Conversation
By analyzing the blame information on this pull request, we identified @mark-rushakoff and @joelegasse to be potential reviewers.
a[i] = b[j]
i++
j++
}
Minor nit, but we should probably save UnixNano() since we're calling it twice. Maybe atime & btime?
Oh, wait, I was thinking those were time.Time values. OK, it probably doesn't make a big difference if they're just grabbing int64 values underneath.
Might be a small savings to avoid the repeated slice-indexing calls.
minor nit: you could save a few lines with:

if a[i].UnixNano() > b[j].UnixNano() {
    a[i], b[j] = b[j], a[i]
} else if a[i].UnixNano() == b[j].UnixNano() {
    a[i] = b[j]
    j++
}
i++
@e-dard That's nicer. I'll update it to use:

var i, j int
for ; i < len(a) && j < len(b); i++ {
    av, bv := a[i].UnixNano(), b[j].UnixNano()
    if av > bv {
        a[i], b[j] = b[j], a[i]
    } else if av == bv {
        a[i] = b[j]
        j++
    }
}
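For context, here is how that loop could look as a complete, runnable sketch. The Value type, the merge function, and the demo inputs are hypothetical stand-ins for the tsm1 types rather than the actual implementation, and the swap trick is not a general-purpose sorted merge: it parks displaced values in b, under the caller's invariants, to be merged on a later pass.

package main

import "fmt"

// Value is a hypothetical stand-in for the tsm1 value type; only the
// timestamp matters for the merge.
type Value struct {
    T int64   // UnixNano timestamp
    V float64 // point value
}

func (v Value) UnixNano() int64 { return v.T }

// merge combines two individually sorted slices in place. When
// timestamps collide, b's value wins (it is the newer write). When
// a[i] is newer than b[j], the elements are swapped so a keeps the
// earlier value and the displaced one is parked in b. The first
// len(a) merged values end up in a; the remainder stays in b.
func merge(a, b []Value) (head, rest []Value) {
    var i, j int
    for ; i < len(a) && j < len(b); i++ {
        av, bv := a[i].UnixNano(), b[j].UnixNano()
        if av > bv {
            a[i], b[j] = b[j], a[i]
        } else if av == bv {
            a[i] = b[j] // overwritten point: take b's value
            j++
        }
    }
    return a, b[j:]
}

func main() {
    a := []Value{{1, 1.0}, {3, 3.0}, {5, 5.0}}
    b := []Value{{3, 30.0}, {4, 4.0}}
    head, rest := merge(a, b)
    fmt.Println(head) // [{1 1} {3 30} {4 4}]
    fmt.Println(rest) // [{5 5}] (displaced; merged on a later pass)
}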
Overall it lgtm. Just a few minor comments.
LGTM 👍
If a large series contains a point that is overwritten, the compactor would load the whole series into RAM during a full compaction. If the series was large, it could cause very large RAM spikes and OOMs. The change reworks the compactor to merge blocks more incrementally, similar to the fix done in #6556. Fixes #6557
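That change lives in a separate PR, but the idea of merging incrementally can be illustrated with a deliberately simplified sketch. Everything below (block, compact, mergeBlocks) is hypothetical, not the compactor's actual code, and it glosses over real-world details such as a merged block still overlapping the next one, or output blocks being re-encoded at a target size. The point is only that non-overlapping blocks pass through undecoded, so memory stays bounded by a couple of blocks rather than the whole series.

package main

import "fmt"

// block is a hypothetical, simplified stand-in for an encoded TSM
// block: a sorted run of timestamps plus the time range it covers.
type block struct {
    min, max int64
    points   []int64
}

// compact streams blocks from two sorted inputs. Blocks whose time
// ranges don't overlap are passed through untouched (no decode);
// only overlapping blocks are decoded and merged.
func compact(a, b []block) []block {
    var out []block
    for len(a) > 0 && len(b) > 0 {
        switch {
        case a[0].max < b[0].min: // a's block is entirely earlier
            out = append(out, a[0])
            a = a[1:]
        case b[0].max < a[0].min: // b's block is entirely earlier
            out = append(out, b[0])
            b = b[1:]
        default: // ranges overlap: merge just these two blocks
            out = append(out, mergeBlocks(a[0], b[0]))
            a, b = a[1:], b[1:]
        }
    }
    out = append(out, a...)
    return append(out, b...)
}

// mergeBlocks does a sorted merge of two blocks' points, with y (the
// newer write) winning when timestamps collide.
func mergeBlocks(x, y block) block {
    merged := make([]int64, 0, len(x.points)+len(y.points))
    i, j := 0, 0
    for i < len(x.points) && j < len(y.points) {
        switch {
        case x.points[i] < y.points[j]:
            merged = append(merged, x.points[i])
            i++
        case x.points[i] > y.points[j]:
            merged = append(merged, y.points[j])
            j++
        default: // overwritten point: keep y's copy
            merged = append(merged, y.points[j])
            i, j = i+1, j+1
        }
    }
    merged = append(merged, x.points[i:]...)
    merged = append(merged, y.points[j:]...)
    return block{min: merged[0], max: merged[len(merged)-1], points: merged}
}

func main() {
    a := []block{{1, 10, []int64{1, 5, 10}}, {11, 20, []int64{11, 20}}}
    b := []block{{5, 7, []int64{5, 7}}}
    for _, blk := range compact(a, b) {
        fmt.Println(blk.points) // [1 5 7 10] then [11 20]
    }
}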
In some query scenarios, if there are a lot of points on disk for the same series spread across many blocks in TSM files and a point is overwritten near the beginning of the shard's time range, the full series could be loaded into RAM, triggering OOMs and huge allocations. I believe this same problem exists in the compactor, but this PR only fixes the query path. A separate fix will be needed for compactions.
The issue was that the KeyCursor code that handles overwriting points had a simple implementation that just deduped the whole series in this case. This falls over when the series is quite large.

Instead, the KeyCursor has been changed to only decode blocks with updated points. It then keeps track of which sections of the blocks have been read so they are not re-read when the later points are decoded.
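A minimal sketch of that bookkeeping, assuming a hypothetical blockCursor type rather than the actual KeyCursor API: each decoded block carries a read marker, so later reads resume where earlier ones stopped instead of re-reading (or re-deduplicating) the whole series.

package main

import "fmt"

// blockCursor is a hypothetical simplification of the KeyCursor
// bookkeeping: a decoded block plus a marker recording how far into
// it the cursor has already read.
type blockCursor struct {
    values []int64 // decoded timestamps, always sorted
    read   int     // index of the first unread value
}

// next returns the unread values up to and including maxTime and
// advances the marker, so those points are never returned (or
// re-decoded) again on later calls.
func (b *blockCursor) next(maxTime int64) []int64 {
    start := b.read
    for b.read < len(b.values) && b.values[b.read] <= maxTime {
        b.read++
    }
    return b.values[start:b.read]
}

func main() {
    c := &blockCursor{values: []int64{1, 2, 5, 9, 12}}
    fmt.Println(c.next(5))  // [1 2 5]
    fmt.Println(c.next(5))  // [] (already read; not returned again)
    fmt.Println(c.next(20)) // [9 12]
}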
Since the points in a block are always sorted, the code was also changed to remove the Deduplicate calls, since they end up reallocating the slice (as well as creating an equally sized map). Instead, we do a sorted merge and re-use the slice as much as we can.

To reproduce this issue, I wrote 50M points to a single series and overwrote the first point a few times until the TSM file blocks were in the state where this issue surfaced. Then I ran a select count(value) from cpu query to count every point in the series.

Some before and after system stats:
Before
After
Query time reduced from 40s to 11.3s and RSS from 6GB to 137MB.