storage: disk space not reclaimed even though live bytes are deleted #90149
Comments
I took a look at the
There are multiple stores on that one node, each with ~45-50 GB of zombie tables. Unfortunately, the observability story here isn't as good as it could be. The last time we investigated something like this, the goroutine dumps were useful (typically one or more blocked goroutines in Pebble iterator codepaths). Here's an issue tracking observability improvements. |
maybe my request for KV to investigate was premature: there are zero zombie memtables and no open sstable iterators? that’s unexpected… i suppose it’s possible from an iterator leak if there’s been no flushes and the leaked iterator hasn’t been positioned? |
Interestingly, the table cache size is more or less monotonically increasing. In the LSM printout above, we have 38K tables in the cache, yet only 195 + 8603 tables in the LSM (live + zombie). I find this surprising, but it may not be causal. Gist of table cache size over time:

Edit: This was a red herring. The nodes are running with four stores, and the table cache is shared. Adding the |
I took another |
Nothing stood out from the goroutine dumps. Per Jackson's comment, it's also a bit weird that the zombie memtable count and the table iterator counts are both zero. There are also no Pebble snapshots open. It might be good for someone from KV to take a quick look at this from the TTL perspective, just as a sense check. As a next step for Storage, it's probably worth trying to run a patched binary that will give us some more insight into where we're leaking the refcounts on the tables, which (I assume) is preventing the tables from being deleted. |
There's definitely something off with the ztbls. I found this even before running the workload. Fresh, stock cluster, nothing to do with TTL. To repro:

# New cluster.
$ ./bin/roachprod create travers-test -n 8 -c gce --gce-machine-type n1-standard-16
$ ./bin/roachprod stage travers-test release v22.2.0-beta.3
$ ./bin/roachprod start travers-test
# Fetch ztbls from each node.
$ date; for i in $(seq 1 8); do echo $i; curl -s http://travers-test-000$i.roachprod.crdb.io:26258/debug/lsm | grep ztbl; done
Tue Oct 18 21:02:05 PDT 2022
1
ztbl 15 36 K
2
ztbl 220 368 K
3
ztbl 210 357 K
4
ztbl 170 419 K
5
ztbl 170 536 K
6
ztbl 170 516 K
7
ztbl 180 595 K
8
ztbl 170 480 K

I would not expect these to be non-zero for an extended period of time. Here's n2's LSM:
Only 4 tables in the live set, but 165 tables that are already zombied. This is without even having run any DDL or DML. When I then start writing data (I adapted the Python script from the Slack thread into a little Go workload), I start to see the ztbl count increase (albeit very slowly).

$ date; for i in $(seq 1 8); do echo $i; curl -s http://travers-test-000$i.roachprod.crdb.io:26258/debug/lsm | grep ztbl; done
Tue Oct 18 21:07:52 PDT 2022
1
ztbl 35 7.2 M
2
ztbl 220 368 K
3
ztbl 220 2.5 M
4
ztbl 180 11 M
5
ztbl 185 4.4 M
6
ztbl 195 6.5 M
7
ztbl 205 6.2 M
8
ztbl 195 7.5 M

Not saying this is related to the TTL issue, but this feels odd enough to dig into. Getting late here. Will pick this up again tomorrow in the AM. |
Any chance this has to do with the iterator reuse changes in 22.2? Or is it just a plain ol' leak? |
Verified this is not an issue in 22.1.9. I don't think this is particularly surprising (the regression was in 22.2 development). Working on tracking it down. |
22.2 also had substantial development on how the ttljob works, but those changes were all at the SQL level. Just throwing in an idea in case it could help: do we know for sure that the problem is limited to row-level-TTL? Would it help to write a test that has the same setup, but then tries to delete rows with an external script, rather than with row-level-TTL? |
Based on what I'm seeing, there's at least one regression around ref counting of Pebble SSTables. I'm hoping that whatever that issue may be, it is being tickled even more by row-level-TTL / GC, which is resulting in the issue documented here (large, monotonically increasing ztbl counts). Can't rule out row-level-TTL just yet. Best case, we fix that and it fixes the issue as observed with row-level-TTL. I'm going to keep posting my findings on the ztbl leak here for now. Open to spinning up a new issue for that, if that's easier.
My attempts at reproducing this with a similar workload and setup to the one provided in the Slack thread didn't yield anything fruitful. I will revisit that once I find out what's going on with this "non-zero ztbls immediately on startup" line of investigation. |
I was able to use some nightly builds to bisect down to a smaller range of commits to check.
First bad: This translates to the following two Cockroach SHAs: There's about a day's worth of commits in there to pick through:

fc7ae33 Merge #83361

There are a few |
Ok - I'm confident that the issue was introduced in #82799, back in June. Going to pick through that again now for iterator leaks. cc: @erikgrinaker |
Thanks for bisecting this down Nick! That's rather surprising, since that PR only deals with infrastructure for external SST files, and doesn't even really enable any of it in production code. I'll have a closer look tomorrow unless you beat me to it. |
I'll take a stab while you 💤, and post back here when I log off. It could also be the case that we started using some new Pebble API here, and that's where the leak is coming from. |
Last update for the evening ... I wasn't able to make too much more progress. There's something holding onto Pebble read states for long periods of time. I've seen cases where the read state stuck around for 60+ seconds, while the database was under import load, so there would have been plenty of compactions causing updates to the manifest. A read state hanging around would prevent the dead tables from being deleted. In the case of one of these read states, I instrumented Pebble to print the stacks of goroutines that had incremented the ref count on the read state, and then those that decremented the count. I saw two goroutines performing replica consistency checks. These threads took over a minute to complete. I wonder if it is expected behavior for those threads to run for that long. Maybe? This made me suspect that this is a red herring. Digging into the commit I bisected down to probably makes the most sense. I'll leave that to you, Erik. I have a feeling you may be right about it being the re-use / cloning. It's fairly easy to spot this issue - spin up a few nodes with roachprod, wipe, then start. Hitting |
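For reference, here is a minimal sketch of the kind of ref-count instrumentation described above. It is not the actual patch; the `refTracker` type and the way it would be wired into Pebble's read state are hypothetical, but the idea is the same: record a stack per `Ref` so that any reference still outstanding later can be attributed to a code path.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// refTracker is a hypothetical helper: it records the stack of every Ref
// call so that, when a read state (or version) lingers, the outstanding
// references can be attributed to the goroutines that took them.
type refTracker struct {
	mu     sync.Mutex
	refs   int
	next   int
	stacks map[int][]byte // ref id -> stack captured at Ref time
}

func newRefTracker() *refTracker {
	return &refTracker{stacks: make(map[int][]byte)}
}

// Ref increments the count and remembers who took the reference.
func (t *refTracker) Ref() (id int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.refs++
	id = t.next
	t.next++
	t.stacks[id] = debug.Stack()
	return id
}

// Unref drops the reference taken under id.
func (t *refTracker) Unref(id int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.refs--
	delete(t.stacks, id)
}

// Dump prints the stacks of all references that are still outstanding.
func (t *refTracker) Dump() {
	t.mu.Lock()
	defer t.mu.Unlock()
	fmt.Printf("outstanding refs: %d\n", t.refs)
	for id, stack := range t.stacks {
		fmt.Printf("--- ref %d ---\n%s\n", id, stack)
	}
}

func main() {
	t := newRefTracker()
	id := t.Ref()
	t.Dump() // one outstanding reference, with the stack that took it
	t.Unref(id)
	t.Dump() // zero outstanding references
}
```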
Seems to be caused by the following:

diff --git a/pkg/storage/sst_writer.go b/pkg/storage/sst_writer.go
index be4e4c7d28..f920b99bf6 100644
--- a/pkg/storage/sst_writer.go
+++ b/pkg/storage/sst_writer.go
@@ -136,7 +136,7 @@ func (fw *SSTWriter) ClearRawRange(start, end roachpb.Key, pointKeys, rangeKeys
return err
}
}
- if rangeKeys && fw.supportsRangeKeys {
+ if rangeKeys && fw.supportsRangeKeys && false {
fw.DataSize += int64(len(start)) + int64(len(end))
if err := fw.fw.RangeKeyDelete(fw.scratch, endRaw); err != nil {
return err

Or in the original cf162f9:

diff --git a/pkg/storage/sst_writer.go b/pkg/storage/sst_writer.go
index f7c3a16d98..84f2e9e9d8 100644
--- a/pkg/storage/sst_writer.go
+++ b/pkg/storage/sst_writer.go
@@ -206,7 +206,8 @@ func (fw *SSTWriter) ExperimentalClearAllRangeKeys(start roachpb.Key, end roachp
// SST, to avoid dropping unnecessary range tombstones. However, this may not
// be safe, because the caller may want to ingest the SST including the range
// tombstone into an engine that does have range keys that should be cleared.
- return fw.fw.RangeKeyDelete(EncodeMVCCKeyPrefix(start), EncodeMVCCKeyPrefix(end))
+ //return fw.fw.RangeKeyDelete(EncodeMVCCKeyPrefix(start), EncodeMVCCKeyPrefix(end))
+ return nil
} |
Thanks for digging into it! Now the question: why does that tombstone result in zombied tables? |
Posting some more data. Looking at the contents of the tables that aren't being deleted, all of the ones I've looked at have range keys present. For example, in this LSM, there are only three tables in L6 in the live set:

EOF
--- L0 ---
--- L1 ---
--- L2 ---
--- L3 ---
--- L4 ---
--- L5 ---
--- L6 ---
000547:331157<#0-#0>[/Local/RangeID/1/r/RangeGCThreshold/0,0#0,SET-/Table/39/1/"\xc5\xfc,wv\xdfMw\x8a\xccCu\x93\xce\"\xc9"/0/1666278048.039970776,0#0,SET]
000199:2030<#0-#0>[/Table/47/1/""/0/1666278047.240654686,0#0,SET-/Table/47/1/"\xbc"/0/1666278047.240654686,0#0,SET]
000284:1018<#0-#0>[/Table/48/1/0/0/1666278047.069367782,0#0,SET-/Table/48/1/0/0/1666278047.069367782,0#0,SET]

Yet we have 263 tables on disk:

$ find ~/local/travers-test-2/data -name '*.sst' | wc -l
263

Here's the lifecycle of some of the tables that aren't in the working set but are still on disk.

Table

169/4
next-file-num: 12
last-seq-num: 21
added: L6 000004:1446<#16-#16>[/Local/RangeID/33/r""/0,0#16,RANGEKEYDEL-/Local/RangeID/33/s""/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
added: L0 000009:1169<#17-#17>[/Local/RangeID/33/u""/0,0#17,RANGEDEL-/Local/RangeID/33/v""/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
added: L0 000005:1527<#18-#18>[/Local/Range/Table/32/0,0#18,RANGEKEYDEL-/Local/Range/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
added: L6 000006:1200<#19-#19>[/Local/Lock/Intent/Local/Range/Table/32/0,0#19,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
added: L6 000007:1175<#20-#20>[/Local/Lock/Intent/Table/32/0,0#20,RANGEKEYDEL-/Local/Lock/Intent/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
added: L6 000008:1151<#21-#21>[/Table/32/0,0#21,RANGEKEYDEL-/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)

Created in an ingestion into L7:

I221020 15:00:48.123766 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 16 [JOB 5] ingesting: sstable created 000004
I221020 15:00:48.123792 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 17 [JOB 5] ingesting: sstable created 000009
I221020 15:00:48.123802 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 18 [JOB 5] ingesting: sstable created 000005
I221020 15:00:48.123814 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 19 [JOB 5] ingesting: sstable created 000006
I221020 15:00:48.123824 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 20 [JOB 5] ingesting: sstable created 000007
I221020 15:00:48.123833 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 21 [JOB 5] ingesting: sstable created 000008
I221020 15:00:48.131327 464 3@pebble/event.go:665 ⋮ [n2,s2,pebble] 28 [JOB 5] ingested L6:000004 (1.4 K), L0:000009 (1.1 K), L0:000005 (1.5 K), L6:000006 (1.2 K), L6:000007 (1.1 K), L6:000008 (1.1 K)

Removed in an elision-only compaction:

I221020 15:00:48.137726 654 3@pebble/event.go:625 ⋮ [n2,s2,pebble] 32 [JOB 11] compacting(elision-only) L6 [000004] (1.4 K) + L6 [] (0 B)
I221020 15:00:48.137851 654 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 35 [JOB 11] compacting: sstable created 000013
I221020 15:00:48.141138 654 3@pebble/event.go:629 ⋮ [n2,s2,pebble] 36 [JOB 11] compacted(elision-only) L6 [000004] (1.4 K) + L6 [] (0 B) -> L6 [000013] (1.1 K), in 0.0s (0.0s total), output rate 553 K/s

The resulting table was stripped of the range del and the range key dels:

696/6
next-file-num: 14
last-seq-num: 27
deleted: L6 000004
added: L6 000013:1172<#0-#0>[/Local/RangeID/33/r/RangeGCThreshold/0,0#0,SET-/Local/RangeID/33/r/RangeVersion/0,0#0,SET] (2022-10-20T15:00:48Z)

From that initial ingestion, all of the tables that contain

$ find ~/local/travers-test-2/data -name '*.sst' | sort | head -n 10
/home/nickt/local/travers-test-2/data/000004.sst
/home/nickt/local/travers-test-2/data/000005.sst
/home/nickt/local/travers-test-2/data/000006.sst
/home/nickt/local/travers-test-2/data/000007.sst
/home/nickt/local/travers-test-2/data/000008.sst
/home/nickt/local/travers-test-2/data/000014.sst
/home/nickt/local/travers-test-2/data/000015.sst
/home/nickt/local/travers-test-2/data/000016.sst
/home/nickt/local/travers-test-2/data/000017.sst
/home/nickt/local/travers-test-2/data/000018.sst

Table

627/5
next-file-num: 13
last-seq-num: 24
deleted: L0 000005
deleted: L0 000009
deleted: L6 000011
added: L6 000012:1754<#0-#0>[/Local/RangeID/33/u/RaftHardState/0,0#0,SET-/Local/Store"uptm"/0,0#0,SET] (2022-10-20T15:00:48Z)

I221020 15:00:48.131407 650 3@pebble/event.go:625 ⋮ [n2,s2,pebble] 29 [JOB 10] compacting(default) L0 [000009 000005] (2.6 K) + L6 [000011] (1.5 K)
I221020 15:00:48.131530 650 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 30 [JOB 10] compacting: sstable created 000012
I221020 15:00:48.137659 650 3@pebble/event.go:629 ⋮ [n2,s2,pebble] 31 [JOB 10] compacted(default) L0 [000009 000005] (2.6 K) + L6 [000011] (1.5 K) -> L6 [000012] (1.7 K), in 0.0s (0.0s total), output rate 951 K/s
I221020 15:00:48.137797 653 3@pebble/event.go:661 ⋮ [n2,s2,pebble] 33 [JOB 10] sstable deleted 000009
I221020 15:00:48.137854 653 3@pebble/event.go:661 ⋮ [n2,s2,pebble] 34 [JOB 10] sstable deleted 000011

Putting this all together, the only tables that remain in the working set have no range keys. All of the tables that are present on disk that are not in the working set do have range keys.

$ for table in $(find ~/local/travers-test-2/data/ -name '*.sst' | sort); do if [[ $(./cockroach debug pebble sstable scan $table | grep RANGEKEYDEL | wc -l) == 0 ]]; then echo "$table: NO RANGE KEYS"; fi; done
/home/nickt/local/travers-test-2/data/000199.sst: NO RANGE KEYS
/home/nickt/local/travers-test-2/data/000284.sst: NO RANGE KEYS
/home/nickt/local/travers-test-2/data/000547.sst: NO RANGE KEYS |
A manifest version uses reference counting to ensure that obsolete SSTables are only removed once it is safe to do so (i.e. no iterators or snapshots are open referencing tables in an older manifest version).

In cockroachdb#1771, a new slice of `LevelMetadata` was added to support range keys. A slice already existed for point keys (plus range dels). Currently, when a version's ref count reaches zero, only the files present in the point key `LevelMetadata` slice have their ref counts decremented. If a file also contains range keys, the reference count never becomes zero and the file becomes "zombied", and can never be deleted from disk.

Decrement the ref counts on files present in the range keys `LevelMetadata` slice, allowing the ref counts to be correctly zeroed.

Add a regression test to exercise the bug observed in cockroachdb/cockroach#90149.

Fix cockroachdb/cockroach#90149.
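To make the mechanism in the commit message above concrete, here is a simplified toy model (my own sketch, not Pebble's actual `manifest` types): a file that appears in both the point-key and range-key level slices holds one reference per appearance, so a release that only walks the point-key levels leaves the file's count stuck above zero — a zombie table.

```go
package main

import "fmt"

// fileMetadata stands in for an SSTable's metadata; refs counts the
// per-level references held by versions that include the file.
type fileMetadata struct {
	num  int
	refs int
}

// version references files through two sets of per-level metadata: one for
// point keys (and range dels), one for range keys. A file containing both
// kinds of keys appears in both sets.
type version struct {
	pointLevels [7][]*fileMetadata // every file appears here
	rangeLevels [7][]*fileMetadata // only files containing range keys
}

func ref(levels [7][]*fileMetadata) {
	for _, level := range levels {
		for _, f := range level {
			f.refs++
		}
	}
}

// release drops the version's references. Before the fix, only pointLevels
// was walked, so a file that also appeared in rangeLevels never reached
// zero refs and could never be deleted from disk.
func (v *version) release(walkRangeKeyLevels bool) (obsolete []*fileMetadata) {
	unref := func(levels [7][]*fileMetadata) {
		for _, level := range levels {
			for _, f := range level {
				f.refs--
				if f.refs == 0 {
					obsolete = append(obsolete, f)
				}
			}
		}
	}
	unref(v.pointLevels)
	if walkRangeKeyLevels {
		unref(v.rangeLevels) // the fix: also unref files with range keys
	}
	return obsolete
}

func main() {
	f := &fileMetadata{num: 4} // e.g. an ingested SST containing a RANGEKEYDEL
	v := &version{}
	v.pointLevels[6] = append(v.pointLevels[6], f)
	v.rangeLevels[6] = append(v.rangeLevels[6], f)
	ref(v.pointLevels)
	ref(v.rangeLevels)

	v.release(false)                                  // buggy behavior: range-key levels are skipped
	fmt.Println("refs after buggy release:", f.refs)  // 1: the file stays zombie

	f.refs = 2                                        // reset to the same starting state
	obsolete := v.release(true)                       // fixed behavior
	fmt.Println("refs after fixed release:", f.refs, "obsolete:", len(obsolete)) // 0 and 1
}
```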
90406: vendor: bump Pebble to cc0dd372b6d1 r=jbowens a=nicktrav

```
cc0dd372 manifest: unref files with range keys when version refcnt is zero
653147d0 metamorphic: Flush before close if DisableWAL is true
ff933d45 scripts: default to a 30m timeout in non-stress run-crossversion-meta.sh
24f67a4b db: fix FormatPrePebblev1Marked migration
1424c12d metamorphic: Copy data/wal dirs from initial state
5ed983e5 sstable: adjust sizeEstimate logic for forthcoming value blocks
```

Release note: None. Epic: None. Touches #90149.

Co-authored-by: Nick Travers <[email protected]>
```
0d8cae91 manifest: unref files with range keys when version refcnt is zero
8ed3a9e4 metamorphic: Flush before close if DisableWAL is true
4f0ec6bd db: fix FormatPrePebblev1Marked migration
ab1e28f6 metamorphic: Copy data/wal dirs from initial state
```

Release note: None. Release justification: Fixes a high severity issue. Epic: None. Informs cockroachdb#90149.
Fix is now on master and the 22.2 / 22.2.0 release branches. Going to keep this open until we verify that the original problem (observed with the TTL workload) is fixed. |
Can we modify the ClearRange roachtest to also verify that Pebble compacts away the garbage? |
We could - thoughts on what the assertion would be? |
That the used store space is below some reasonable threshold, or that it dropped some share from peak. |
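As a rough sketch of what that check could look like (the `storeBytesUsed` helper, the polling interval, and the 50%-of-peak threshold are all placeholders, not the actual roachtest API):

```go
package main

import (
	"fmt"
	"time"
)

// storeBytesUsed is a placeholder for however the test would sample a
// store's on-disk size (e.g. scraping a metric or running du over the
// store directory); it is not a real roachtest helper.
func storeBytesUsed(node int) (int64, error) {
	return 0, fmt.Errorf("not implemented: sample store size for node %d", node)
}

// waitForSpaceReclamation polls until the store has dropped to at most
// fraction of its peak observed size, failing after the timeout. This is
// the shape of the assertion discussed above; an absolute ceiling on store
// size would work equally well.
func waitForSpaceReclamation(node int, peak int64, fraction float64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		used, err := storeBytesUsed(node)
		if err != nil {
			return err
		}
		if float64(used) <= float64(peak)*fraction {
			return nil // the garbage was compacted away
		}
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("store on n%d did not drop below %.0f%% of peak within %s",
		node, fraction*100, timeout)
}

func main() {
	// Example: require the store to shrink to 50% of peak within an hour.
	if err := waitForSpaceReclamation(1, 100<<30, 0.5, time.Hour); err != nil {
		fmt.Println(err)
	}
}
```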
a relevant tbg TODO:
|
We're satisfied that the issues we were seeing in the original description have been addressed. |
Describe the problem
@glennfawcett ran an experiment with row-level TTL. Even though rows are deleted by the job, and the KV layer seemingly thinks the space is being freed, disk usage continues to increase steadily.
To Reproduce
See instructions in https://cockroachlabs.slack.com/archives/C028YAH6Y3V/p1665959426411949
Expected behavior
Disk space should be freed.
Additional data / screenshots
Environment:
Jira issue: CRDB-20616