
storage: ensure application of sideloaded entries is durable on followers before truncation #38566

Closed
ajwerner opened this issue Jun 28, 2019 · 2 comments · Fixed by #114191


ajwerner commented Jun 28, 2019

This is a theoretical issue stumbled upon during the implementation of #37426 that seems scary enough to warrant a discussion. After an entry's write batch which updates the truncated state has been applied to the storage engine, we truncate the sideloaded storage up to the index implied by the truncated state update. This code comes with a note (which could probably be clearer) that we assume the entry corresponding to this index has been applied durably. For the leaseholder this is certainly the case, as it will not propose a truncation with an index that it has not applied durably.

It seems generally likely that a follower will have synced some new entry between the application of the index to be truncated and the application of this truncating command, but what enforces that property on followers? If, for whatever reason, raft messages were slow to be received and processed on some follower, could the truncation entry and the index to which it refers end up in the same Ready? Is it the case that the truncation won't be issued until all followers have applied up to that index?

// Truncate the sideloaded storage. Note that this is safe only if the new truncated state
// is durably on disk (i.e.) synced. This is true at the time of writing but unfortunately
// could rot.
{
	log.Eventf(ctx, "truncating sideloaded storage up to (and including) index %d", newTruncState.Index)
	if size, _, err := r.raftMu.sideloaded.TruncateTo(ctx, newTruncState.Index+1); err != nil {
		// We don't *have* to remove these entries for correctness. Log a
		// loud error, but keep humming along.
		log.Errorf(ctx, "while removing sideloaded files during log truncation: %s", err)
	} else {
		rResult.RaftLogDelta -= size
	}
}

cc @tbg

Jira issue: CRDB-5626

ajwerner added the A-kv-replication (Relating to Raft, consensus, and coordination.) label Jun 28, 2019

tbg commented Jul 12, 2019

Yes, this is a real concern (I filed this previously in #36414, but I'm going to close that issue).

This is annoying, but we should fix it somehow. One way to do it is to sniff whether TruncateTo will actually do anything (i.e. is there even a nonempty sideloaded storage to truncate, ideally without incurring disk I/O beyond what is already incurred today) and, if so, throw in an extra Sync before the actual TruncateTo. The potential perf penalty of this would only matter during operations that ingest SSTs, and it's unclear whether it would constitute a perf hit.
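
Roughly, I'm imagining something like the sketch below. All the names here (engine, sideloadStorage, HasEntriesUpTo) are made-up placeholders rather than the real interfaces; the point is only the ordering: sync first, truncate second, and skip the sync entirely when there is nothing sideloaded to remove.

```go
package main

import "context"

// Hypothetical minimal interfaces standing in for the real engine and
// sideloaded-storage types; only the ordering of operations matters here.
type engine interface {
	Sync() error // make everything written so far durable
}

type sideloadStorage interface {
	// HasEntriesUpTo reports whether any sideloaded files at or below index exist.
	HasEntriesUpTo(ctx context.Context, index uint64) (bool, error)
	// TruncateTo removes sideloaded files below index and returns the bytes freed,
	// mirroring the snippet quoted above.
	TruncateTo(ctx context.Context, index uint64) (freed, retained int64, _ error)
}

// maybeTruncateSideloaded sketches the "sync before truncate" idea: only when
// there is actually something to remove do we pay for an extra engine sync,
// which guarantees the new truncated state is durable before the sideloaded
// payloads it covers are deleted.
func maybeTruncateSideloaded(
	ctx context.Context, eng engine, sl sideloadStorage, truncIndex uint64,
) (freed int64, _ error) {
	has, err := sl.HasEntriesUpTo(ctx, truncIndex)
	if err != nil || !has {
		return 0, err
	}
	if err := eng.Sync(); err != nil {
		return 0, err
	}
	freed, _, err = sl.TruncateTo(ctx, truncIndex+1)
	return freed, err
}
```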

The optimal thing we could do is pretty much out there. You'd want to offload the actual sideload truncations to some other goroutine, and that goroutine would wait until it has observed a sync (unclear how it would do that; we could rework the batching mechanism to allow joining (creating) a batch as a non-leader), at least for a bit, and then do its work.

Or, more easily, in the context of @nvanbenschoten's latest proposal in #38322, the offload goroutine would poll the persisted applied index and only do its job once that index is past the truncation index.
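
As a rough sketch of that polling variant (persistedAppliedIndex, the pending channel, and the truncate callback are all placeholders for whatever the real plumbing would look like):

```go
package main

import (
	"context"
	"log"
	"time"
)

// pendingTruncation describes a log truncation whose sideloaded files may
// only be removed once the applied state covering them is durable.
type pendingTruncation struct {
	index uint64 // remove sideloaded entries at or below this index
}

// sideloadTruncationWorker is a sketch of the offloaded variant: it polls a
// (hypothetical) persistedAppliedIndex accessor and performs the sideloaded
// truncation only once the durably applied index has caught up.
func sideloadTruncationWorker(
	ctx context.Context,
	pending <-chan pendingTruncation,
	persistedAppliedIndex func() uint64,
	truncate func(ctx context.Context, index uint64) error,
) {
	for {
		select {
		case <-ctx.Done():
			return
		case t := <-pending:
			// Poll until the truncation index is known to be durably applied.
			for persistedAppliedIndex() < t.index {
				select {
				case <-ctx.Done():
					return
				case <-time.After(10 * time.Millisecond): // interval chosen arbitrarily
				}
			}
			if err := truncate(ctx, t.index); err != nil {
				log.Printf("sideloaded truncation at index %d failed: %v", t.index, err)
			}
		}
	}
}
```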

This would all be more straightforward if the truncation decisions were made locally: then we'd just generally make sure that we only truncate the persistently applied part of the log, and the only remaining thing to ensure is that the memtable gets flushed at some minimum frequency (i.e. if load stops, we still persist the applied state somewhat eagerly, at least after a minute or so).
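
Something along these lines for that periodic nudge (syncApplied is again just a placeholder for whatever would flush/sync the engine):

```go
package main

import (
	"context"
	"log"
	"time"
)

// periodicAppliedSync is a sketch of the "persist eagerly even when idle"
// piece: on a fixed interval it forces the applied state to become durable,
// so a quiescent range still catches up within a bounded time.
func periodicAppliedSync(ctx context.Context, interval time.Duration, syncApplied func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := syncApplied(); err != nil {
				log.Printf("periodic applied-state sync failed: %v", err)
			}
		}
	}
}
```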

tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) label Jul 12, 2019
nvanbenschoten commented

We may have just hit this in a testing cluster.

> This is annoying, but we should fix it somehow. One way to do it is to sniff whether TruncateTo will actually do anything (i.e. is there even a nonempty sideloaded storage to truncate, ideally without incurring disk I/O beyond what is already incurred today) and, if so, throw in an extra Sync before the actual TruncateTo. The potential perf penalty of this would only matter during operations that ingest SSTs, and it's unclear whether it would constitute a perf hit.

After re-familiarizing myself with all of this, I agree that this strikes a good balance between being targeted and being easy to implement. Since it doesn't look like #38322 is going to land any time soon, I think this is our best bet for fixing the problem in the interim.
