-
Notifications
You must be signed in to change notification settings - Fork 533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vsr: fix vopr crash on view_durable and view discrepancy #1619
Conversation
Looking at view_durable here is correct. Looking only at a view is optimization. I _think_ that should be a correct optimization, but so far VOPR disagrees. I want to dig deeper into that seed, but let's apply this quick fix in order to unmask potentially more interesting failures. Seed: 15663507130681778245
assert(self.view > self.view_durable()); | ||
assert(self.view_durable_updating()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, it can be the case that view > view_durable, but we are not updating the view.
For example, we we concurrently bump the view twice, the second update might be lost:
23a2aae#diff-e95e4dc6c1aae349e798cfeed22ad01f830cce0e654e6b4b765b3e7b0c677b6aR4969
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm confused... the second update is lost b/c we were are status=recovering_head?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Referring to
tigerbeetle/src/vsr/replica.zig
Line 6901 in 6defd9d
if (self.status == .recovering_head) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I apply the following diff to always chain-update view_durable:
diff --git a/src/vsr/replica.zig b/src/vsr/replica.zig
index d6bb4523..eae27635 100644
--- a/src/vsr/replica.zig
+++ b/src/vsr/replica.zig
@@ -6846,6 +6846,7 @@ pub fn ReplicaType(
assert(self.log_view > self.log_view_durable() or self.view > self.view_durable());
// The primary must only persist the SV headers after repairs are done.
// Otherwise headers could be nacked, truncated, then restored after a crash.
+ assert(!((self.log_view == self.view and self.replica == self.primary_index(self.view) and self.status == .view_change)));
assert(self.log_view < self.view or self.replica != self.primary_index(self.view) or
self.status == .normal or self.status == .recovering);
assert(self.view_headers.array.count() > 0);
@@ -6908,7 +6909,7 @@ pub fn ReplicaType(
(self.replica != self.primary_index(self.view) or self.status == .normal);
assert(!(update_dvc and update_sv));
- if (update_dvc or update_sv) self.view_durable_update();
+ if (update) self.view_durable_update();
// Reset SVC timeout in case the view-durable update took a long time.
if (self.view_change_status_timeout.ticking) self.view_change_status_timeout.reset();
I get a crash on 15663507130681778245
on the newly added assert, which is a specialisation of the assert on the next line.
So what happens is:
- something triggers a view change and replica goes and starts writing its new view to disk
- by the time update finishes, another view jump happned
- but for this view, the replica is a primary
- and it would be incorrect for the primary to update view_durable right now, because that would also write view headers, but would-be primary doesn't necessary have all the headers required yet.
- that's why in this case the update is "lost", and is re-done explicitly in
transition_to_normal_from_view_change_status
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raced! I am looking rather at the following line:
tigerbeetle/src/vsr/replica.zig
Line 6911 in 6defd9d
if (update_dvc or update_sv) self.view_durable_update(); |
I've convinced myself that that's the whole fix we need to do here. It is indeed the case that view_durable update might lag behind view update, and be triggered asynchronously in
|
Looking at view_durable here is correct. Looking only at a view is optimization. I think that should be a correct optimization, but so far VOPR disagrees.
I want to dig deeper into that seed, but let's apply this quick fix in order to unmask potentially more interesting failures.
Seed: 15663507130681778245