Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vsr: fix vopr crash on view_durable and view discrepancy #1619

Merged
merged 1 commit into from
Feb 28, 2024

Conversation

matklad
Copy link
Member

@matklad matklad commented Feb 28, 2024

Looking at view_durable here is correct. Looking only at a view is optimization. I think that should be a correct optimization, but so far VOPR disagrees.

I want to dig deeper into that seed, but let's apply this quick fix in order to unmask potentially more interesting failures.

Seed: 15663507130681778245

Looking at view_durable here is correct. Looking only at a view is
optimization. I _think_ that should be a correct optimization, but so
far VOPR disagrees.

I want to dig deeper into that seed, but let's apply this quick fix in
order to unmask potentially more interesting failures.

Seed: 15663507130681778245
Comment on lines -8408 to -8409
assert(self.view > self.view_durable());
assert(self.view_durable_updating());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, it can be the case that view > view_durable, but we are not updating the view.

For example, we we concurrently bump the view twice, the second update might be lost:

23a2aae#diff-e95e4dc6c1aae349e798cfeed22ad01f830cce0e654e6b4b765b3e7b0c677b6aR4969

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm confused... the second update is lost b/c we were are status=recovering_head?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Referring to

if (self.status == .recovering_head) return;
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I apply the following diff to always chain-update view_durable:

diff --git a/src/vsr/replica.zig b/src/vsr/replica.zig
index d6bb4523..eae27635 100644
--- a/src/vsr/replica.zig
+++ b/src/vsr/replica.zig
@@ -6846,6 +6846,7 @@ pub fn ReplicaType(
             assert(self.log_view > self.log_view_durable() or self.view > self.view_durable());
             // The primary must only persist the SV headers after repairs are done.
             // Otherwise headers could be nacked, truncated, then restored after a crash.
+            assert(!((self.log_view == self.view and self.replica == self.primary_index(self.view) and self.status == .view_change)));
             assert(self.log_view < self.view or self.replica != self.primary_index(self.view) or
                 self.status == .normal or self.status == .recovering);
             assert(self.view_headers.array.count() > 0);
@@ -6908,7 +6909,7 @@ pub fn ReplicaType(
                 (self.replica != self.primary_index(self.view) or self.status == .normal);
             assert(!(update_dvc and update_sv));
 
-            if (update_dvc or update_sv) self.view_durable_update();
+            if (update) self.view_durable_update();
 
             // Reset SVC timeout in case the view-durable update took a long time.
             if (self.view_change_status_timeout.ticking) self.view_change_status_timeout.reset();

I get a crash on 15663507130681778245 on the newly added assert, which is a specialisation of the assert on the next line.

So what happens is:

  • something triggers a view change and replica goes and starts writing its new view to disk
  • by the time update finishes, another view jump happned
  • but for this view, the replica is a primary
  • and it would be incorrect for the primary to update view_durable right now, because that would also write view headers, but would-be primary doesn't necessary have all the headers required yet.
  • that's why in this case the update is "lost", and is re-done explicitly in transition_to_normal_from_view_change_status.

Copy link
Member Author

@matklad matklad Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raced! I am looking rather at the following line:

if (update_dvc or update_sv) self.view_durable_update();

@matklad
Copy link
Member Author

matklad commented Feb 28, 2024

I've convinced myself that that's the whole fix we need to do here. It is indeed the case that view_durable update might lag behind view update, and be triggered asynchronously in

transition_to_normal_from_view_change_status()
``

`view_durable_update` updates _both_ view number and view_headers, and the latter is sensitive and is not always possible. 

@matklad matklad added this pull request to the merge queue Feb 28, 2024
Merged via the queue into main with commit 9d9ec49 Feb 28, 2024
27 checks passed
@matklad matklad deleted the matklad/rely-on-view-durable branch February 28, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants