
storage: ensure application of sideloaded entries is durable on followers before truncation #38566

Closed
ajwerner opened this issue Jun 28, 2019 · 2 comments · Fixed by #114191


ajwerner commented Jun 28, 2019

This is a theoretical issue stumbled upon during the implementation of #37426 that seems scary enough to warrant a discussion. After an entry's write batch which updates the truncated state has been applied to the storage engine, we truncate the sideloaded storage up to the index implied by the truncated state update. This code comes with a note (which could probably be clearer) that we assume the entry corresponding to this index has been applied durably. For the leaseholder this is certainly the case, as it will not propose a truncation with an index that it has not applied durably.

It seems generally likely that a follower will have synced some new entry between the application of the index to be truncated and the application of this truncating command, but what enforces that property on followers? If, for whatever reason, raft messages were slow to be received and processed on some follower, could the truncation entry and the index to which it refers end up in the same Ready? Is it the case that the truncation won't be issued until all followers have applied up to that index?

// Truncate the sideloaded storage. Note that this is safe only if the new truncated state
// is durably on disk (i.e.) synced. This is true at the time of writing but unfortunately
// could rot.
{
	log.Eventf(ctx, "truncating sideloaded storage up to (and including) index %d", newTruncState.Index)
	if size, _, err := r.raftMu.sideloaded.TruncateTo(ctx, newTruncState.Index+1); err != nil {
		// We don't *have* to remove these entries for correctness. Log a
		// loud error, but keep humming along.
		log.Errorf(ctx, "while removing sideloaded files during log truncation: %s", err)
	} else {
		rResult.RaftLogDelta -= size
	}
}

cc @tbg

Jira issue: CRDB-5626

ajwerner added the A-kv-replication (Relating to Raft, consensus, and coordination.) label Jun 28, 2019

tbg commented Jul 12, 2019

Yes, this is a real concern (I filed this previously in #36414, but I'm going to close that issue).

This is annoying, but we should fix it somehow. One way to do it is to sniff whether TruncateTo will actually do anything (i.e. is there even a nonempty sideloaded storage to truncate, ideally without incurring disk I/O beyond what is already incurred today) and, if so, throw in an extra Sync before the actual TruncateTo. The potential perf penalty of this would only matter during operations that ingest SSTs, and it's unclear whether it would constitute a perf hit.
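
Roughly, I'm imagining something like the sketch below. All the names here (engine, sideloadStorage, HasEntriesUpTo) are made-up placeholders rather than the real interfaces; the point is only the ordering: sync first, truncate second, and skip the sync entirely when there is nothing sideloaded to remove.

```go
package main

import "context"

// Hypothetical minimal interfaces standing in for the real engine and
// sideloaded-storage types; only the ordering of operations matters here.
type engine interface {
	Sync() error // make everything written so far durable
}

type sideloadStorage interface {
	// HasEntriesUpTo reports whether any sideloaded files at or below index exist.
	HasEntriesUpTo(ctx context.Context, index uint64) (bool, error)
	// TruncateTo removes sideloaded files below index and returns the bytes freed,
	// mirroring the snippet quoted above.
	TruncateTo(ctx context.Context, index uint64) (freed, retained int64, _ error)
}

// maybeTruncateSideloaded sketches the "sync before truncate" idea: only when
// there is actually something to remove do we pay for an extra engine sync,
// which guarantees the new truncated state is durable before the sideloaded
// payloads it covers are deleted.
func maybeTruncateSideloaded(
	ctx context.Context, eng engine, sl sideloadStorage, truncIndex uint64,
) (freed int64, _ error) {
	has, err := sl.HasEntriesUpTo(ctx, truncIndex)
	if err != nil || !has {
		return 0, err
	}
	if err := eng.Sync(); err != nil {
		return 0, err
	}
	freed, _, err = sl.TruncateTo(ctx, truncIndex+1)
	return freed, err
}
```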

The optimal thing we could do is pretty much out there. You'd want to offload the actual sideload truncations to some other goroutine, and that goroutine would wait until it has observed a sync (unclear how it would do that; we could rework the batching mechanism to allow joining (creating) a batch as a non-leader), at least for a bit, and then do its work.

Or, more easily, in the context of @nvanbenschoten's latest proposal in #38322, the offload goroutine would poll the persisted applied index and only do its job once that index is past the truncation index.
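
As a rough sketch of that polling variant (persistedAppliedIndex, the pending channel, and the truncate callback are all placeholders for whatever the real plumbing would look like):

```go
package main

import (
	"context"
	"log"
	"time"
)

// pendingTruncation describes a log truncation whose sideloaded files may
// only be removed once the applied state covering them is durable.
type pendingTruncation struct {
	index uint64 // remove sideloaded entries at or below this index
}

// sideloadTruncationWorker is a sketch of the offloaded variant: it polls a
// (hypothetical) persistedAppliedIndex accessor and performs the sideloaded
// truncation only once the durably applied index has caught up.
func sideloadTruncationWorker(
	ctx context.Context,
	pending <-chan pendingTruncation,
	persistedAppliedIndex func() uint64,
	truncate func(ctx context.Context, index uint64) error,
) {
	for {
		select {
		case <-ctx.Done():
			return
		case t := <-pending:
			// Poll until the truncation index is known to be durably applied.
			for persistedAppliedIndex() < t.index {
				select {
				case <-ctx.Done():
					return
				case <-time.After(10 * time.Millisecond): // interval chosen arbitrarily
				}
			}
			if err := truncate(ctx, t.index); err != nil {
				log.Printf("sideloaded truncation at index %d failed: %v", t.index, err)
			}
		}
	}
}
```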

This would all be more straightforward if the truncation decisions were made locally: then we'd just generally make sure that we only truncate the persistently applied part of the log, and the only remaining thing to ensure is that the memtable gets flushed at some minimum frequency (i.e. if load stops, we still persist the applied state somewhat eagerly, at least after a minute or so).
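
Something along these lines for that periodic nudge (syncApplied is again just a placeholder for whatever would flush/sync the engine):

```go
package main

import (
	"context"
	"log"
	"time"
)

// periodicAppliedSync is a sketch of the "persist eagerly even when idle"
// piece: on a fixed interval it forces the applied state to become durable,
// so a quiescent range still catches up within a bounded time.
func periodicAppliedSync(ctx context.Context, interval time.Duration, syncApplied func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := syncApplied(); err != nil {
				log.Printf("periodic applied-state sync failed: %v", err)
			}
		}
	}
}
```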

tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) label Jul 12, 2019
nvanbenschoten commented

We may have just hit this in a testing cluster.

> This is annoying, but we should fix it somehow. One way to do it is to sniff whether TruncateTo will actually do anything (i.e. is there even a nonempty sideloaded storage to truncate, ideally without incurring disk I/O beyond what is already incurred today) and, if so, throw in an extra Sync before the actual TruncateTo. The potential perf penalty of this would only matter during operations that ingest SSTs, and it's unclear whether it would constitute a perf hit.

After re-familiarizing myself with all of this, I agree that this strikes a good balance between being targeted and being easy to implement. Since it doesn't look like #38322 is going to land any time soon, I think this is our best bet for fixing the problem in the interim.
