storage: ensure application of sideloaded entries is durable on followers before truncation #38566
Comments
Yes, this is a real concern (I filed this previously in #36414, but I'm going to close that issue). This is annoying, but we should fix it somehow. One way to do it is to sniff whether [...]. The optimal thing we could do is pretty much out there: you want to offload the actual sideload truncations to some other goroutine, and that goroutine will wait until it has observed a sync (unclear how it would do that; we could rework the batching mechanism to allow joining or creating a batch as a non-leader), at least for a bit, and then do its work.

Or, easier, in the context of @nvanbenschoten's latest proposal in #38322, the offload goroutine would poll the persisted applied index and only do its job once that index is past the truncation index. This would all be more straightforward if the truncation decisions were made locally; then we'd just generally make sure that we only truncate the persistently applied part, and the only thing to make sure of is that the memtable gets dumped at least at some frequency (i.e. if load stops, we still persist the applications somewhat eagerly, say at least after a minute or so).
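A minimal sketch of the poll-based offload idea described above, assuming hypothetical names and types (this is not the actual CockroachDB code): truncations are queued, and a background goroutine only carries one out once the persisted applied index has caught up to the truncation index.

```go
package sideload

import "time"

// sideloadTruncator sketches the "offload truncations to another goroutine"
// idea: truncation of the sideloaded storage is deferred until the
// persistently applied index has caught up to the truncation index. All
// names and types here are hypothetical.
type sideloadTruncator struct {
	pending chan uint64 // truncation indexes queued by the apply loop

	// persistedAppliedIndex reports the highest applied index known to be
	// durable on this replica (e.g. read back from the storage engine).
	persistedAppliedIndex func() uint64

	// truncateSideloaded removes sideloaded entries at indexes <= idx.
	truncateSideloaded func(idx uint64)
}

// run drains queued truncations, polling until each one is durably covered.
func (t *sideloadTruncator) run(pollInterval time.Duration) {
	for truncIdx := range t.pending {
		for t.persistedAppliedIndex() < truncIdx {
			time.Sleep(pollInterval)
		}
		t.truncateSideloaded(truncIdx)
	}
}
```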
We may have just hit this in a testing cluster.
After re-familiarizing myself with all of this, I agree that this strikes a good balance between being targeted and being easy to implement. Since it doesn't look like #38322 is going to land any time soon, I think it's our best bet for fixing this in the interim period.
This is a theoretical issue stumbled upon during the implementation of #37426 that seems scary enough to warrant a discussion. After an entry's write batch which updates the truncated state has been applied to the storage engine, we truncate the sideloaded storage up to the index implied by the truncated state update. This code comes with a note (which could probably be clearer) that we assume the entry corresponding to this index has been applied durably. For the leaseholder this is certainly the case, as it will not propose a truncation with an index that it has not applied durably.

It seems generally likely that a follower will have synced some new entry between the application of the index to be truncated and the application of the truncating command, but what enforces that property on followers? If, for whatever reason, raft messages were slow to be received and processed on some follower, could the truncation entry and the index to which it refers end up in the same Ready? Or is it the case that the truncation won't be issued until all followers have applied up to that index?
cockroach/pkg/storage/replica_proposal.go
Lines 606 to 618 in dd9784d
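For orientation, the ordering in question looks roughly like the following paraphrased sketch (illustrative names only, not the code at the lines referenced above):

```go
package sideload

// applyTruncation sketches the ordering the issue is about: the write batch
// carrying the new truncated state is applied to the engine (on a follower
// this is not necessarily synced), and the sideloaded storage is then purged
// up to the implied index. Function names are illustrative only; see the
// referenced lines in replica_proposal.go for the real code.
func applyTruncation(
	truncatedIndex uint64,
	applyBatch func() error, // applies the entry's write batch, possibly without sync
	purgeSideloadedUpTo func(idx uint64) error, // deletes sideloaded payloads <= idx
) error {
	if err := applyBatch(); err != nil {
		return err
	}
	// This step assumes the entries at or below truncatedIndex were applied
	// *durably*. The leaseholder guarantees that for itself before proposing
	// a truncation, but nothing obviously enforces it on followers.
	return purgeSideloadedUpTo(truncatedIndex)
}
```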
cc @tbg
Jira issue: CRDB-5626