
storage: disk space not reclaimed even though live bytes are deleted #90149

Closed
rafiss opened this issue Oct 18, 2022 · 29 comments · Fixed by cockroachdb/pebble#2022
Assignees
Labels
A-row-level-ttl A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@rafiss
Collaborator

rafiss commented Oct 18, 2022

Describe the problem

@glennfawcett ran an experiment with row-level TTL. Even though rows are deleted by the job, and the KV layer seemingly thinks the space is being freed, the disk usage continues steadily increasing.

To Reproduce

See instructions in https://cockroachlabs.slack.com/archives/C028YAH6Y3V/p1665959426411949

Expected behavior

Disk space should be freed.

Additional data / screenshots

(screenshot omitted)

Environment:

  • CockroachDB version v22.2.0-beta.3

Jira issue: CRDB-20616

@rafiss rafiss added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. GA-blocker A-row-level-ttl branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 branch-release-22.2.0 labels Oct 18, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Oct 18, 2022
@nicktrav
Collaborator

I took a look at the /debug/lsm output for one node, and it looks like there are a large number of zombie tables, ztbl (tables that are no longer in the working set of the LSM, but "something" is holding a reference to them). This prevents the tables from being physically deleted from the filesystem:

Store 1:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    16 M       -    83 G       -       -       -       -    85 G       -       -       -     1.0
      0         0     0 B    0.00    85 G   5.6 G     534     0 B       0    10 G   8.4 K     0 B       0     0.1
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      4        16    54 M    0.90    15 G   288 M      36   499 M     132   112 G    35 K   112 G       1     7.3
      5        54   295 M    0.94    14 G   1.5 G     438   1.5 G     520    66 G    12 K    67 G       1     4.6
      6       125   1.6 G       -    21 G   2.5 G      30   272 M      39    79 G   6.3 K    89 G       1     3.8
  total       195   2.0 G       -    95 G   9.9 G   1.0 K   2.2 G     691   362 G    62 K   268 G       3     3.8
  flush      1784
compact     14035     0 B     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype     12972     129     231     691      12       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl      8603    46 G
 bcache     495 K   7.3 G   96.0%  (score == hit-rate)
 tcache      38 K    24 M   99.9%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -   97.6%  (score == utility)

There are multiple stores on that one node, each with ~45-50 GB of zombie tables.

Unfortunately, the observability story here isn't as good as it could be. The last time we investigated something like this, the goroutine dumps were useful (typically one or more blocked goroutines in Pebble iterator codepaths).

Here's an issue tracking observability improvements.
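
To make the "something is holding a reference" part concrete, here's a conceptual sketch of how a pinned version keeps obsolete sstables on disk. The type and method names are simplified stand-ins for Pebble's internals (versions, read states, iterators), not its actual API:

```go
// Conceptual sketch only: a version references its tables; an open iterator
// pins a version, so tables replaced by compactions linger as "zombies" until
// the iterator is closed and the last reference is dropped.
package main

import "fmt"

type table struct{ num, refs int }

type version struct{ tables []*table }

func (v *version) ref() {
	for _, t := range v.tables {
		t.refs++
	}
}

func (v *version) unref() {
	for _, t := range v.tables {
		t.refs--
		if t.refs == 0 {
			fmt.Printf("deleting %06d.sst\n", t.num)
		}
	}
}

func main() {
	old := &version{tables: []*table{{num: 4}, {num: 5}}}
	old.ref() // references held by the LSM's current version

	// An open iterator (or snapshot) pins the version it was created against.
	pinnedByIterator := old
	pinnedByIterator.ref()

	// A compaction replaces these tables and the superseded version is unref'd...
	old.unref()
	// ...but nothing is deleted yet: the tables are now zombies (counted under
	// ztbl) until the iterator is closed and releases its reference.
	pinnedByIterator.unref() // now the files are actually removed
}
```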

@nicktrav nicktrav self-assigned this Oct 18, 2022
@jbowens
Collaborator

jbowens commented Oct 18, 2022

Maybe my request for KV to investigate was premature: there are zero zombie memtables and no open sstable iterators? That's unexpected… I suppose it's possible from an iterator leak, if there have been no flushes and the leaked iterator hasn't been positioned?

@nicktrav
Collaborator

nicktrav commented Oct 18, 2022

Interestingly, the table cache size is more or less monotonically increasing. In the LSM printout above, we have 38K tables in the cache, yet only 195 + 8603 tables in the LSM (live + zombie). I find this surprising, but it may not be causal.

Gist of table cache size over time:
https://gist.github.com/nicktrav/8843e073b7497c1cb8e294fec964878b

Edit: This was a red herring. The nodes are running with four stores, and the table cache is shared across them. Adding the ztbl counts and the live table counts across the stores gives the table cache count (roughly 4 × (195 + 8603) ≈ 35K for this node, which lines up with the ~38K cache entries). Gist.

@nicktrav
Collaborator

I took another debug.zip and tsdump and put them both here.

@nicktrav
Collaborator

Nothing stood out from the goroutine dumps. Per Jackson's comment, it's also a bit weird that the zombie memtable count and the table iterator count are both zero. There are also no Pebble snapshots open.

It might be good for someone from KV to take a quick look at this from the TTL perspective, just as a sense check.

As a next step for Storage, it's probably worth trying to run a patched binary that will give us some more insight into where we're leaking the refcounts on the tables, which (I assume) is what's preventing the tables from being deleted.

@nicktrav
Collaborator

There's definitely something off with the ztbls. I found this even before running the workload. Fresh, stock cluster, nothing to do with TTL.

To repro:

# New cluster.
$ ./bin/roachprod create travers-test -n 8 -c gce --gce-machine-type n1-standard-16
$ ./bin/roachprod stage travers-test release v22.2.0-beta.3
$ ./bin/roachprod start travers-test

# Fetch ztbls from each node.
$ date; for i in $(seq 1 8); do echo $i; curl -s http://travers-test-000$i.roachprod.crdb.io:26258/debug/lsm | grep ztbl; done
Tue Oct 18 21:02:05 PDT 2022
1
   ztbl        15    36 K
2
   ztbl       220   368 K
3
   ztbl       210   357 K
4
   ztbl       170   419 K
5
   ztbl       170   536 K
6
   ztbl       170   516 K
7
   ztbl       180   595 K
8
   ztbl       170   480 K

I would not expect these to be non-zero for an extended period of time.

Here's n2's LSM:

Store 2:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   8.3 K       -   507 K       -       -       -       -   520 K       -       -       -     1.0
      0         0     0 B    0.00   512 K   243 K     143     0 B       0   143 K      33     0 B       0     0.3
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      4         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      5         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      6         4   131 K       -   469 K    84 K      55   1.5 K       1   3.6 M      79   3.9 M       1     7.9
  total         4   131 K       -   848 K   328 K     198   1.5 K       1   4.6 M     112   3.9 M       1     5.5
  flush        33
compact       121     0 B     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype        65       0      55       1       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl       165   290 K
 bcache     1.0 K   678 K   82.3%  (score == hit-rate)
 tcache       167   111 K   93.5%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -   47.5%  (score == utility)

Only 4 tables in the live set, but 165 tables that are already zombied. This is without even having run any DDL or DML.

Once I start writing data (I adapted the Python script from the Slack thread into a little Go workload), I start to see the ztbl count increase (albeit very slowly).

$ date; for i in $(seq 1 8); do echo $i; curl -s http://travers-test-000$i.roachprod.crdb.io:26258/debug/lsm | grep ztbl; done
Tue Oct 18 21:07:52 PDT 2022
1
   ztbl        35   7.2 M
2
   ztbl       220   368 K
3
   ztbl       220   2.5 M
4
   ztbl       180    11 M
5
   ztbl       185   4.4 M
6
   ztbl       195   6.5 M
7
   ztbl       205   6.2 M
8
   ztbl       195   7.5 M

Not saying this is related to the TTL issue, but this feels odd enough to dig into.

Getting late here. Will pick this up again tomorrow in the AM.

@erikgrinaker
Contributor

Any chance this has to do with the iterator reuse changes in 22.2? Or is it just a plain ol' leak?

@nicktrav
Collaborator

Verified this is not an issue in 22.1.9. I don't think this is particularly surprising (the regression was in 22.2 development). Working on tracking it down.

@rafiss
Collaborator Author

rafiss commented Oct 19, 2022

22.2 also had substantial development on how the ttljob works, but those changes were all at the SQL level.

Just throwing in an idea in case it could help: do we know for sure that the problem is limited to row-level-TTL? Would it help to write a test that has the same setup, but then tries to delete rows with an external script, rather than with row-level-TTL?

@nicktrav
Collaborator

do we know for sure that the problem is limited to row-level-TTL?

Based on what I'm seeing, there's at least one regression around ref counting of Pebble SSTables present on master (and the 22.2 release branch), and this is not related to row-level-TTL.

I'm hoping that whatever that issue may be, it is being tickled even more by row-level-TTL / GC, which is resulting in the issue documented here (large, monotonically increasing ztbl counts). Can't rule out row-level-TTL just yet. Best case we fix that, and it fixes the issue as observed with row-level-TTL.

I'm going to keep posting my findings on the ztbl leak here for now. Open to spinning up a new issue for that, if that's easier.

Would it help to write a test that has the same setup, but then tries to delete rows with an external script, rather than with row-level-TTL?

My attempts at reproducing this with a similar workload and setup to the one provided in the Slack thread didn't yield anything fruitful. I will revisit that once I find out what's going on with this "non-zero ztbls immediately on startup" line of investigation.

@nicktrav
Collaborator

nicktrav commented Oct 19, 2022

I was able to use some nightly builds to bisect down to a smaller range of commits to check.

cockroach-v22.2.0-alpha.1-172-g448352ce4e.linux-amd64 - BAD
cockroach-v22.2.0-alpha.00000000-2096-g748b89a3fb.linux-amd64 - BAD
cockroach-v22.2.0-alpha.00000000-1181-g4e81340338.linux-amd64 - BAD
cockroach-v22.2.0-alpha.00000000-358-g5a54758ce8.linux-amd64 - GOOD
cockroach-v22.2.0-alpha.00000000-765-g47df1bc3a2.linux-amd64 - GOOD
cockroach-v22.2.0-alpha.00000000-834-g13cb2f6c40.linux-amd64 - GOOD
cockroach-v22.2.0-alpha.00000000-984-ga611410e3b.linux-amd64 - BAD
cockroach-v22.2.0-alpha.00000000-968-gfc7ae339c3.linux-amd64 - BAD
cockroach-v22.2.0-alpha.00000000-895-g7a1ced4a49.linux-amd64 - GOOD
cockroach-v22.2.0-alpha.00000000-935-g7af6c0d3e2.linux-amd64 - GOOD

First bad: cockroach-v22.2.0-alpha.00000000-968-gfc7ae339c3.linux-amd64

This translates to a range of Cockroach SHAs, from the last good build (7af6c0d) to the first bad one (fc7ae33).

There's about a day's worth of commits in there to pick through:

fc7ae33 Merge #83361
68bd121 Merge #82248
12462e8 Merge #83188 #83191 #83339 #83349
b4fecad Merge #83340
324c837 Merge #83010
b75ebd1 Merge #82873 #83282 🔴
d95c088 Merge #82616 #83153 🔴
5a74621 Merge #82927
851e329 Merge #82799 🔴
c09d7fc Merge #83031 🟢
4ff65df Merge #82695 🟢
7af6c0d Merge #82938 #82992 #83008 #83290 #83299 🟢

There are a few pkg/storage changes in there. More to follow ...

@nicktrav
Collaborator

Ok - I'm confident that the issue was introduced in #82799, back in June. Going to pick through that again now for iterator leaks. cc: @erikgrinaker

@erikgrinaker
Contributor

Thanks for bisecting this down Nick! That's rather surprising, since that PR only deals with infrastructure for external SST files, and doesn't even really enable any of it in production code. I'll have a closer look tomorrow unless you beat me to it.

@nicktrav
Collaborator

nicktrav commented Oct 19, 2022

I'll take a stab while you 💤, and post back here when I log off.

It could also be the case that we started using some new Pebble API here, and that's where the leak is coming from.

@nicktrav
Collaborator

Last update for the evening ... I wasn't able to make too much more progress.

There's something holding onto Pebble read states for long periods of time. I've seen cases where the read state stuck around for 60+ seconds, while the database was under import load, so there would have been plenty of compactions causing updates to the manifest. A read state hanging around would prevent the dead tables from being deleted.

In the case of one of these read states, I instrumented Pebble to print the stacks of goroutines that had incremented the ref count on the read state, and then those that decremented the count. I saw two goroutines performing replica consistency checks. Those threads took over a minute to complete.

I wonder if it is expected behavior for those threads to run for that long. Maybe? This made me suspect that this is a red herring.
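
For reference, the instrumentation mentioned above followed roughly this pattern: a reference count that records the stack of every increment and decrement, so a leaked reference can be traced back to its caller. This is an illustrative sketch, not the actual patch applied to Pebble:

```go
// Sketch of refcount tracing: record a goroutine stack on every ref/unref so
// an unbalanced reference can be attributed to its caller.
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

type tracedRefcount struct {
	mu     sync.Mutex
	refs   int
	stacks []string
}

func (r *tracedRefcount) ref() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs++
	r.stacks = append(r.stacks, "REF:\n"+string(debug.Stack()))
}

func (r *tracedRefcount) unref() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs--
	r.stacks = append(r.stacks, "UNREF:\n"+string(debug.Stack()))
	if r.refs == 0 {
		r.stacks = nil // balanced; drop the history
	}
}

// dumpIfLeaked prints the recorded stacks if references are still outstanding.
func (r *tracedRefcount) dumpIfLeaked() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs > 0 {
		fmt.Printf("leaked %d ref(s); ref/unref history:\n", r.refs)
		for _, s := range r.stacks {
			fmt.Println(s)
		}
	}
}

func main() {
	var rs tracedRefcount
	rs.ref()
	rs.ref()
	rs.unref()
	rs.dumpIfLeaked() // one reference still outstanding; prints both stacks
}
```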

Digging into the commit I bisected down to probably makes the most sense. I'll leave that to you Erik. I have a feeling you may be right about it being the re-use / cloning.

It's fairly easy to spot this issue: spin up a few nodes with roachprod, wipe, then start. Hitting /debug/lsm on a few nodes, you should see non-zero ztbls (with an 8-node cluster, I see around 150-200 ztbls on various nodes). Note that you shouldn't need to do anything other than start the cluster.

@erikgrinaker
Contributor

Seems to be caused by the RangeKeyDelete range key tombstones that we lay down at the bottom of Raft snapshots. This patch "fixes" the leak:

diff --git a/pkg/storage/sst_writer.go b/pkg/storage/sst_writer.go
index be4e4c7d28..f920b99bf6 100644
--- a/pkg/storage/sst_writer.go
+++ b/pkg/storage/sst_writer.go
@@ -136,7 +136,7 @@ func (fw *SSTWriter) ClearRawRange(start, end roachpb.Key, pointKeys, rangeKeys
                        return err
                }
        }
-       if rangeKeys && fw.supportsRangeKeys {
+       if rangeKeys && fw.supportsRangeKeys && false {
                fw.DataSize += int64(len(start)) + int64(len(end))
                if err := fw.fw.RangeKeyDelete(fw.scratch, endRaw); err != nil {
                        return err

Or in the original cf162f9:

diff --git a/pkg/storage/sst_writer.go b/pkg/storage/sst_writer.go
index f7c3a16d98..84f2e9e9d8 100644
--- a/pkg/storage/sst_writer.go
+++ b/pkg/storage/sst_writer.go
@@ -206,7 +206,8 @@ func (fw *SSTWriter) ExperimentalClearAllRangeKeys(start roachpb.Key, end roachp
        // SST, to avoid dropping unnecessary range tombstones. However, this may not
        // be safe, because the caller may want to ingest the SST including the range
        // tombstone into an engine that does have range keys that should be cleared.
-       return fw.fw.RangeKeyDelete(EncodeMVCCKeyPrefix(start), EncodeMVCCKeyPrefix(end))
+       //return fw.fw.RangeKeyDelete(EncodeMVCCKeyPrefix(start), EncodeMVCCKeyPrefix(end))
+       return nil
 }

@nicktrav
Collaborator

Thanks for digging into it! Now the question: why does that tombstone result in zombied tables?

@nicktrav
Collaborator

Posting some more data.

Looking at the contents of the tables that aren't being deleted, all of the ones I've looked at have range keys present.

For example, in this LSM, there are only three tables in L6 in the live set:

EOF
--- L0 ---
--- L1 ---
--- L2 ---
--- L3 ---
--- L4 ---
--- L5 ---
--- L6 ---
  000547:331157<#0-#0>[/Local/RangeID/1/r/RangeGCThreshold/0,0#0,SET-/Table/39/1/"\xc5\xfc,wv\xdfMw\x8a\xccCu\x93\xce\"\xc9"/0/1666278048.039970776,0#0,SET]
  000199:2030<#0-#0>[/Table/47/1/""/0/1666278047.240654686,0#0,SET-/Table/47/1/"\xbc"/0/1666278047.240654686,0#0,SET]
  000284:1018<#0-#0>[/Table/48/1/0/0/1666278047.069367782,0#0,SET-/Table/48/1/0/0/1666278047.069367782,0#0,SET]

Yet we have 263 tables on disk:

$ find ~/local/travers-test-2/data -name '*.sst' | wc -l
263

Here's the lifecycle of some of the tables that aren't in the working set but are still on disk.

Table 000004.

169/4
  next-file-num: 12
  last-seq-num:  21
  added:         L6 000004:1446<#16-#16>[/Local/RangeID/33/r""/0,0#16,RANGEKEYDEL-/Local/RangeID/33/s""/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
  added:         L0 000009:1169<#17-#17>[/Local/RangeID/33/u""/0,0#17,RANGEDEL-/Local/RangeID/33/v""/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
  added:         L0 000005:1527<#18-#18>[/Local/Range/Table/32/0,0#18,RANGEKEYDEL-/Local/Range/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
  added:         L6 000006:1200<#19-#19>[/Local/Lock/Intent/Local/Range/Table/32/0,0#19,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
  added:         L6 000007:1175<#20-#20>[/Local/Lock/Intent/Table/32/0,0#20,RANGEKEYDEL-/Local/Lock/Intent/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)
  added:         L6 000008:1151<#21-#21>[/Table/32/0,0#21,RANGEKEYDEL-/Table/33/0,0#72057594037927935,RANGEDEL] (2022-10-20T15:00:48Z)

Created in an ingestion:

I221020 15:00:48.123766 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 16  [JOB 5] ingesting: sstable created 000004
I221020 15:00:48.123792 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 17  [JOB 5] ingesting: sstable created 000009
I221020 15:00:48.123802 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 18  [JOB 5] ingesting: sstable created 000005
I221020 15:00:48.123814 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 19  [JOB 5] ingesting: sstable created 000006
I221020 15:00:48.123824 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 20  [JOB 5] ingesting: sstable created 000007
I221020 15:00:48.123833 464 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 21  [JOB 5] ingesting: sstable created 000008
I221020 15:00:48.131327 464 3@pebble/event.go:665 ⋮ [n2,s2,pebble] 28  [JOB 5] ingested L6:000004 (1.4 K), L0:000009 (1.1 K), L0:000005 (1.5 K), L6:000006 (1.2 K), L6:000007 (1.1 K), L6:000008 (1.1 K)

Removed in an elision-only compaction:

I221020 15:00:48.137726 654 3@pebble/event.go:625 ⋮ [n2,s2,pebble] 32  [JOB 11] compacting(elision-only) L6 [000004] (1.4 K) + L6 [] (0 B)
I221020 15:00:48.137851 654 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 35  [JOB 11] compacting: sstable created 000013
I221020 15:00:48.141138 654 3@pebble/event.go:629 ⋮ [n2,s2,pebble] 36  [JOB 11] compacted(elision-only) L6 [000004] (1.4 K) + L6 [] (0 B) -> L6 [000013] (1.1 K), in 0.0s (0.0s total), output rate 553 K/s

The resulting table was stripped of the range del and the range key dels:

696/6
  next-file-num: 14
  last-seq-num:  27
  deleted:       L6 000004
  added:         L6 000013:1172<#0-#0>[/Local/RangeID/33/r/RangeGCThreshold/0,0#0,SET-/Local/RangeID/33/r/RangeVersion/0,0#0,SET] (2022-10-20T15:00:48Z)

From that initial ingestion, all of the tables that contain RANGEKEYDELs are still present on disk. The file 000009 (which only had a RANGEDEL) was deleted:

$ find ~/local/travers-test-2/data -name '*.sst' | sort | head -n 10
/home/nickt/local/travers-test-2/data/000004.sst
/home/nickt/local/travers-test-2/data/000005.sst
/home/nickt/local/travers-test-2/data/000006.sst
/home/nickt/local/travers-test-2/data/000007.sst
/home/nickt/local/travers-test-2/data/000008.sst
/home/nickt/local/travers-test-2/data/000014.sst
/home/nickt/local/travers-test-2/data/000015.sst
/home/nickt/local/travers-test-2/data/000016.sst
/home/nickt/local/travers-test-2/data/000017.sst
/home/nickt/local/travers-test-2/data/000018.sst

Table 000009 being deleted:

627/5
  next-file-num: 13
  last-seq-num:  24
  deleted:       L0 000005
  deleted:       L0 000009
  deleted:       L6 000011
  added:         L6 000012:1754<#0-#0>[/Local/RangeID/33/u/RaftHardState/0,0#0,SET-/Local/Store"uptm"/0,0#0,SET] (2022-10-20T15:00:48Z)
I221020 15:00:48.131407 650 3@pebble/event.go:625 ⋮ [n2,s2,pebble] 29  [JOB 10] compacting(default) L0 [000009 000005] (2.6 K) + L6 [000011] (1.5 K)
I221020 15:00:48.131530 650 3@pebble/event.go:657 ⋮ [n2,s2,pebble] 30  [JOB 10] compacting: sstable created 000012
I221020 15:00:48.137659 650 3@pebble/event.go:629 ⋮ [n2,s2,pebble] 31  [JOB 10] compacted(default) L0 [000009 000005] (2.6 K) + L6 [000011] (1.5 K) -> L6 [000012] (1.7 K), in 0.0s (0.0s total), output rate 951 K/s
I221020 15:00:48.137797 653 3@pebble/event.go:661 ⋮ [n2,s2,pebble] 33  [JOB 10] sstable deleted 000009
I221020 15:00:48.137854 653 3@pebble/event.go:661 ⋮ [n2,s2,pebble] 34  [JOB 10] sstable deleted 000011

Putting this all together, the only tables that remain in the working set have no range keys. All of the tables that are present on disk but not in the working set do have range keys.

$ for table in $(find ~/local/travers-test-2/data/ -name '*.sst' | sort); do if [[ $(./cockroach debug pebble sstable scan $table | grep RANGEKEYDEL | wc -l) == 0 ]]; then echo "$table: NO RANGE KEYS"; fi; done
/home/nickt/local/travers-test-2/data/000199.sst: NO RANGE KEYS
/home/nickt/local/travers-test-2/data/000284.sst: NO RANGE KEYS
/home/nickt/local/travers-test-2/data/000547.sst: NO RANGE KEYS

@nicktrav nicktrav reopened this Oct 20, 2022
@rafiss rafiss reopened this Oct 20, 2022
nicktrav added a commit to nicktrav/pebble that referenced this issue Oct 20, 2022
A manifest version uses reference counting to ensure that obsolete
SSTables are only removed once it is safe to do so (i.e. no iterators or
snapshots are open referencing tables in an older manifest version).

In cockroachdb#1771, a new slice of `LevelMetadata` was added to support range
keys. A slice already existed for point keys (plus range dels).
Currently, when a version's ref count reaches zero, only the files
present in the point key `LevelMetadata` slice have their ref counts
decremented. If a file also contains range keys, the reference count
never becomes zero and the file becomes "zombied", and can never be
deleted from disk.

Decrement the ref counts on files present in the range keys
`LevelMetadata` slice, allowing the ref counts to be correctly zeroed.

Add regression test to exercise the bug observed in
cockroachdb/cockroach#90149.

Fix cockroachdb/cockroach#90149.
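
In schematic form, the bug described in the commit message looks like the following. The types and fields here are simplified stand-ins for Pebble's manifest internals, not its actual API:

```go
// Schematic sketch of the refcount bug: a version tracks files in two sets of
// per-level metadata (point keys and range keys), and a file carrying both key
// kinds is referenced from both. Before the fix, only the point-key side was
// unreferenced when the version was released, so such files never hit zero.
package main

import "fmt"

type fileMetadata struct {
	fileNum int
	refs    int
}

func (f *fileMetadata) unref() {
	f.refs--
	if f.refs == 0 {
		fmt.Printf("obsolete: %06d.sst can be deleted\n", f.fileNum)
	}
}

type version struct {
	pointLevels    [7][]*fileMetadata // point keys (and range dels)
	rangeKeyLevels [7][]*fileMetadata // range keys (added in pebble#1771)
}

func (v *version) unrefFiles() {
	for _, level := range v.pointLevels {
		for _, f := range level {
			f.unref()
		}
	}
	// The missing piece before the fix: files also present in the range-key
	// metadata kept their extra reference forever and became zombie tables.
	for _, level := range v.rangeKeyLevels {
		for _, f := range level {
			f.unref()
		}
	}
}

func main() {
	// A file containing both point keys and range keys is referenced twice,
	// once per metadata slice it appears in.
	f := &fileMetadata{fileNum: 4, refs: 2}
	v := &version{}
	v.pointLevels[6] = []*fileMetadata{f}
	v.rangeKeyLevels[6] = []*fileMetadata{f}

	v.unrefFiles() // with the fix, refs reaches zero and the file is deletable
}
```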
craig bot pushed a commit that referenced this issue Oct 20, 2022
90406: vendor: bump Pebble to cc0dd372b6d1 r=jbowens a=nicktrav

```
cc0dd372 manifest: unref files with range keys when version refcnt is zero
653147d0 metamorphic: Flush before close if DisableWAL is true
ff933d45 scripts: default to a 30m timeout in non-stress run-crossversion-meta.sh
24f67a4b db: fix FormatPrePebblev1Marked migration
1424c12d metamorphic: Copy data/wal dirs from initial state
5ed983e5 sstable: adjust sizeEstimate logic for forthcoming value blocks
```

Release note: None.

Epic: None.

Touches #90149.

Co-authored-by: Nick Travers <[email protected]>
nicktrav added a commit to cockroachdb/pebble that referenced this issue Oct 20, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue Oct 20, 2022
```
0d8cae91 manifest: unref files with range keys when version refcnt is zero
8ed3a9e4 metamorphic: Flush before close if DisableWAL is true
4f0ec6bd db: fix FormatPrePebblev1Marked migration
ab1e28f6 metamorphic: Copy data/wal dirs from initial state
```

Release note: None.
Release justification: Fixes a high severity issue.
Epic: None.
Informs cockroachdb#90149.
@nicktrav
Collaborator

Fix is now on master and the 22.2 / 22.2.0 release branches. Going to keep this open until we verify that the original problem (observed with the TTL workload) is fixed.

@erikgrinaker
Contributor

Can we modify the ClearRange roachtest to also verify that Pebble compacts away the garbage?

@nicktrav
Collaborator

We could - thoughts on what the assertion would be?

@erikgrinaker
Contributor

That the used store space is below some reasonable threshold, or that it dropped some share from peak.
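
A minimal sketch of the "dropped some share from peak" check, written as standalone logic rather than against the roachtest harness (how the disk-usage samples would be gathered is left out and is an assumption of this example):

```go
// Sketch of the assertion discussed above: fail unless the final disk-usage
// sample has dropped to at most some share of the peak observed usage.
package main

import "fmt"

// assertReclaimed returns an error unless the final sample is at most
// maxShareOfPeak of the peak usage (e.g. 0.5 means usage must have dropped to
// half of its peak or less).
func assertReclaimed(usageBytes []int64, maxShareOfPeak float64) error {
	if len(usageBytes) == 0 {
		return fmt.Errorf("no samples")
	}
	var peak int64
	for _, u := range usageBytes {
		if u > peak {
			peak = u
		}
	}
	final := usageBytes[len(usageBytes)-1]
	if float64(final) > float64(peak)*maxShareOfPeak {
		return fmt.Errorf("disk usage not reclaimed: final %d B > %.0f%% of peak %d B",
			final, maxShareOfPeak*100, peak)
	}
	return nil
}

func main() {
	samples := []int64{10 << 30, 60 << 30, 80 << 30, 30 << 30} // bytes over time
	if err := assertReclaimed(samples, 0.5); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("ok: disk usage dropped below 50% of peak")
}
```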

@jbowens
Collaborator

jbowens commented Oct 21, 2022

a relevant tbg TODO:

		// TODO(tschottdorf): verify that disk space usage drops below to <some small amount>, but that
		// may not actually happen (see https://github.com/cockroachdb/cockroach/issues/29290).

@nicktrav
Collaborator

We're satisfied that the issues we were seeing in the original description have been addressed.

@nicktrav nicktrav changed the title sql/ttljob: disk space not reclaimed even though live bytes are deleted by ttljob storage: disk space not reclaimed even though live bytes are deleted Oct 24, 2022