Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: investigate SingleDelete for raft log truncation #8979

Open
dt opened this issue Aug 31, 2016 · 10 comments
Open

kvserver: investigate SingleDelete for raft log truncation #8979

dt opened this issue Aug 31, 2016 · 10 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. o-perf-efficiency Related to performance efficiency T-kv KV Team

Comments

@dt
Copy link
Member

dt commented Aug 31, 2016

@tbg on 2023-04-10:

Re-using this issue to focus exclusively on SingleDelete for the raft log, see #8979 (comment).


On #4321, @Simkha mentioned SingleDelete and its handling in compactions:

The following link points to a comment in CompactionIterator::NextFromInput(), the comment explains when presence of SingleDelete would compact out (i.e. clear) the stored key and the SingleDelete record. This iterator is used during meltable flush.

https://github.com/facebook/rocksdb/blob/2fc2fd92a94dba8fbad5679fce4522c4769aab27/db/compaction_iterator.cc#L227-L260

If you can adapt your code to meet the restrictions of SingleDelete, then I think, it could be useful.

https://github.com/facebook/rocksdb/wiki/Single-Delete

Jira issue: CRDB-6156

Epic CRDB-39898

@dt dt added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Aug 31, 2016
@tbg
Copy link
Member

tbg commented Aug 31, 2016

See
#4196 (comment)

On Wed, Aug 31, 2016 at 10:45 AM David Taylor [email protected]
wrote:

On #4321 #4321, @Simkha
https://github.com/simkha mentioned SingleDelete:

This info could be helpful for you. I was studying rocksdb sources and
thought I could share this little info.

The following link points to a comment in
CompactionIterator::NextFromInput(), the comment explains when presence of
SingleDelete would compact out (i.e. clear) the stored key and the
SingleDelete record. This iterator is used during meltable flush.

https://github.com/facebook/rocksdb/blob/2fc2fd92a94dba8fbad5679fce4522c4769aab27/db/compaction_iterator.cc#L227-L260

If you can adapt your code to meet the restrictions of SingleDelete, then
I think, it could be useful.

https://github.com/facebook/rocksdb/wiki/Single-Delete


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#8979, or mute the thread
https://github.com/notifications/unsubscribe-auth/AE135HPcrF3fzqoBMh68n2XHzck8DzB4ks5qlZOTgaJpZM4JxsgK
.

-- Tobias

@petermattis
Copy link
Collaborator

I think we might be able to use SingleDelete for non-versioned keys such as the transaction keys. Currently, those keys are written multiple times (via Put) and then deleted a single time. But we can change that so that we call SingleDelete before overwriting. This would result in a series of Put, SingleDelete, Put, SingleDelete, Put, SingleDelete operations for the key which I believe is kosher.

@ghost
Copy link

ghost commented Sep 1, 2016

Interestingly, according to CompactionIterator, currently, the keys deleted with DB::Delete() will only be deleted during compaction, they will not be deleted during memtable flush (unlike SingleDelete). You can tell this by compaction_ != nullptr check.

For completeness, here's the call stack leading to CompactionIterator during memtable flush:

- DBImpl::BackgroundFlush in db/db_impl.cc
- DBImpl::FlushMemTableToOutputFile
- FlushJob::Run -- db/flush_job.cc
- FlushJob::WriteLevel0Table
- BuildTable -- db/builder.cc
- CompactionIterator::Next
- CompactionIterator::NextFromInput -- db/compaction_iterator.cc

@ghost
Copy link

ghost commented Sep 2, 2016

Note that DB::Delete()d keys are probably deleted somewhere else in the code because body of the kTypeDeletion case in CompactionIterator::NextFromInput only drops the deletion entry (by skipping over it), not the key-value itself.

@ghost
Copy link

ghost commented Oct 19, 2016

The following text from here seems to be pointing to what is the real difference between SingleDelete and Delete:

if you never overwrite existing keys, you can try to use DB::SingleDelete() instead of Delete() to kill tombstones sooner. Tombstones will be dropped after it meets the original keys, rather than compacted to the last level.

@petermattis petermattis added this to the Later milestone Feb 22, 2017
@nvanbenschoten nvanbenschoten added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Aug 17, 2018
@nvanbenschoten
Copy link
Member

Expanding on @petermattis's comment #8979 (comment), we should explore whether SingleDelete can be used for Raft log entries during log truncations. The impetus for this is that Raft log entries should almost never get very deep into the LSM because log truncations will often follow in close succession. Following this logic, it would be ideal if the deletion tombstones laid down by log truncations also avoided getting very deep into the LSM.

At the moment, this is not the case because log truncations use Delete operations to clean up log entry keys. The effect of this is that even if log entries are removed from the LSM quickly, log entry tombstones stay in the LSM all the way until they hit the bottom. There are a few reasons why this is undersirable. One is that it contributes to write amplification and causes RocksDB to perform extra work. Another is that we've seen that if we're not careful, these tombstones can get in the way of other operations and cause significant slowdown. SingleDelete operations would allow us to avoid both of these problems to some degree.

In order to use SingleDelete operations for Raft log truncation, we'd have to ensure that no log entry is ever written to RocksDB twice without an intermediate SingleDelete. This seems reasonable, as we have all the necessary information (e.g. prevLastIndex) to enforce this invariant in Replica.append. In fact, we already have an optimization that depends on us making the distinction between whether a log entry is new or whether it's overwriting an existing entry:

if ent.Index > prevLastIndex {
err = engine.MVCCBlindPut(ctx, batch, &diff, key, hlc.Timestamp{}, value, nil /* txn */)
} else {
err = engine.MVCCPut(ctx, batch, &diff, key, hlc.Timestamp{}, value, nil /* txn */)
}

We'd also have to be careful about how SingleDelete interacts with existing Delete and RangeDelete tombstones. For instance, at the moment it looks like SingleDelete will actually remove Delete tombstones, which is absolutely not what we want and something we'd have to be careful about if migrating to the use of SingleDelete.

It's also possible that recent changes to DeleteRange have made it the ideal choice for raft log truncation instead. We should experiment with both options.

@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@nvanbenschoten
Copy link
Member

Another interesting use of this is in cleaning up MVCCMetadata keys that are removed when resolving intents. If we were careful about updating these keys then we could use SingleDelete operations to remove them.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 2, 2019
Informs cockroachdb#8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 2, 2019
Informs cockroachdb#8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 2, 2019
Informs cockroachdb#8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None
craig bot pushed a commit that referenced this issue Jan 2, 2019
33436: cli/interactive_tests: unskip test_url_db_override r=knz a=knz

Fixes #30796.

... and use a more clearly guaranteed-bad DNS name for the expected
failure. Suggested by @petermattis.

Release note: None

33440: libroach/engine: plumb support for SingleDelete operation r=nvanbenschoten a=nvanbenschoten

Informs #8979.

This commit doesn't actually use the operation yet, but any experiments
will require this plumbing, so it made sense to pull it out and properly
test it.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@sumeerbhola
Copy link
Collaborator

The separated lock table uses SingleDelete for locks (MVCCMetadata) when possible. The remaining place where this could be used is for the raft log.

@jlinder jlinder added the T-storage Storage Team label Jun 16, 2021
@tbg tbg changed the title storage: investigate SingleDelete kvserver: investigate SingleDelete for raft log truncation Apr 10, 2023
@tbg tbg added T-kv-replication and removed T-storage Storage Team labels Apr 10, 2023
@blathers-crl
Copy link

blathers-crl bot commented Apr 10, 2023

cc @cockroachdb/replication

@tbg tbg added A-kv-replication Relating to Raft, consensus, and coordination. and removed A-storage Relating to our storage engine (Pebble) on-disk storage. labels Apr 10, 2023
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
@tbg tbg added the o-perf-efficiency Related to performance efficiency label Nov 5, 2024
@pav-kv
Copy link
Collaborator

pav-kv commented Nov 6, 2024

Slack thread

Ran a little experiment with #134355 using SingleDelete for raft log compactions. Benchmark: kv0/enc=false/nodes=3.

Before After
Screenshot 2024-11-06 at 00 20 48 Screenshot 2024-11-06 at 00 20 42
Screenshot 2024-11-06 at 00 21 11 Screenshot 2024-11-06 at 00 21 03
Screenshot 2024-11-06 at 00 21 44 Screenshot 2024-11-06 at 00 21 33
KV QPS: Screenshot 2024-11-06 at 00 30 29 KV QPS: Screenshot 2024-11-06 at 00 30 13

There is no much gain in KV QPS on this run, but there is a tangible difference in tombstones and compaction/flush metrics. Don't know if it's within an expected margin, or shows a real improvement. Would be interesting to run some more experiments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. o-perf-efficiency Related to performance efficiency T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

7 participants