Additional AppendEntries Requests in HandleAppendEntriesRequest Reduce TLC Model Checking Efficiency #16
Hi @liang636600, thanks for your detailed and thoughtful reports (this and #17). It was my intent when writing this spec that the steps be relatively fine-grained. If the spec with fine-grained steps were proven correct, then implementations could confidently group steps together into larger atomic operations, but they wouldn't be required to for correctness. This gives implementations some level of flexibility. I discussed this in Section 8.1 of my dissertation.

Your goal here seems to be running TLC efficiently over this spec. BTW, that's something I abandoned after early attempts during my PhD work (mentioned in Section 8.2), but that was long ago. You seem to be more capable, and TLC may have improved since, so good luck! Having a spec/model that TLC can run over seems valuable for working out issues in the spec/model and for extending the spec with any confidence. However, getting TLC to run efficiently seems to conflict with my goal of having fine-grained, atomic steps in the spec. Do you have a sense of how many changes you'd want to make for TLC's benefit? My hunch is that it's more than these two. See also the discussions in issue #1 and PR #4.

I don't know of a good way to resolve this tension. Even if we satisfy both goals, I don't want to make major changes to this somewhat published and somewhat official spec at this point. The options seem to be:
Forking comes with a maintenance burden, but it seems like the least bad option here. I'd be a poor maintainer for the TLC model (since I don't plan to run it). Do you want to fork the spec into a true-to-the-paper but TLC-optimized model? We could link to it from this repo's README for anyone interested in using TLC, though I'd treat the fork as a "community" effort instead of an "official" thing (at least initially).
I’d be happy to fork the spec into a TLC-optimized model that stays true to the paper. Once it's ready, I'll submit a PR.
Hi @ongardie, I found that the action HandleAppendEntriesRequest(i, j, m) generates many unnecessary intermediate states when processing AERequest messages, which reduces the efficiency of TLC model checking. I will explain the details below.
An example of generating unnecessary states
Scenario: As shown in Fig. 1, suppose the follower and leader have the same log entry at index 2. The leader now plans to replicate the log entry at index 3 to the follower using AppendEntries.
Fig 1. The leader replicates the log entry at index 3 to the follower.
HandleAppendEntriesRequest in this project: If HandleAppendEntriesRequest (abbreviated HandleAERequest) encounters no conflict, the follower simply appends the log entry at index 3 to its log. However, no response is sent back to the leader, which is the main source of the unnecessary states. The leader must then send a second AERequest, and only then does the follower respond, since by that point it has already received this log entry.
We can conclude that replicating a log entry from the leader to the follower requires the leader to send two AERequest messages, one of which is redundant: the process could be completed with a single AERequest message.
raft.tla/raft.tla, lines 383 to 387 at commit 5f55322:
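For readers without the file at hand, the no-conflict branch referenced above looks roughly like the following (a sketch reconstructed from the spec's conventions; the linked lines in raft.tla are authoritative). The key point is that `messages` is left unchanged, so no response goes back to the leader:

```tla
\* No conflict: append the incoming entry to the follower's log.
\* Note that messages is listed as UNCHANGED, so no
\* AppendEntriesResponse is sent to the leader in this step.
\/ /\ m.mentries /= << >>
   /\ Len(log[i]) = m.mprevLogIndex
   /\ log' = [log EXCEPT ![i] = Append(log[i], m.mentries[1])]
   /\ UNCHANGED <<serverVars, commitIndex, messages>>
```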
Harmful impact: Due to the incorrect handling logic in HandleAERequest, a no-conflict log replication is not completed properly, causing the leader to send another redundant AERequest message during execution. This results in numerous unnecessary states in TLC model checking, reducing its efficiency.
For example, Figure 2 shows part of the state graph generated by TLC. Node 1 represents the leader preparing to send an AERequest to the follower to replicate the log entry at index 3. Node 2 shows the follower completing the log replication but failing to respond, while Node 3 represents the follower finally responding after processing another AERequest. Since the follower didn’t respond promptly after the first AERequest, Node 2 is an unnecessary node in the state graph.
The situation worsens when the Next action from Node 2 leads to unnecessary transitions, such as a Timeout (Node 5) or a Restart (Node 4). Both Node 4 and Node 5 can further generate additional states through the Next action, potentially creating even more redundant states. These are unnecessary because Node 3 can generate all relevant states, semantically covering those from Node 2. Yet, TLC checks both, leading to inefficiencies when only Node 3’s states are needed.
Fig 2. HandleAERequest generates numerous unnecessary states.
Similar scenario: This issue also occurs in HandleAERequest when handling the conflict scenario.
Suggested fix: We propose a simple modification: update HandleAERequest to call Reply with an AEResponse message in both the no-conflict and conflict cases.
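As an illustration of the proposed change for the no-conflict case, the branch could append the entry and reply in the same atomic step (a hedged sketch: the `Reply` operator and message field names such as `mtype`, `msuccess`, and `mmatchIndex` follow the spec's existing conventions, but the exact patch should be validated against the spec and checked with TLC):

```tla
\* Sketch only: append the entry AND reply immediately, so the
\* leader learns of success without sending a second AERequest.
\/ /\ m.mentries /= << >>
   /\ Len(log[i]) = m.mprevLogIndex
   /\ log' = [log EXCEPT ![i] = Append(log[i], m.mentries[1])]
   /\ Reply([mtype       |-> AppendEntriesResponse,
             mterm       |-> currentTerm[i],
             msuccess    |-> TRUE,
             mmatchIndex |-> m.mprevLogIndex + Len(m.mentries),
             msource     |-> i,
             mdest       |-> j], m)
   /\ UNCHANGED <<serverVars, commitIndex>>
```

The conflict case would need an analogous change, replying with `msuccess |-> FALSE` instead of staying silent.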
Conclusion: For a no-conflict log entry replication, the current spec requires the leader to send an additional AERequest message. This behavior is inconsistent with the description in the Raft paper and with the implementations of many systems. It also leads to unnecessary, redundant states during TLC model checking, reducing its efficiency.
I'm looking forward to your confirmation, and would be happy to help fix the issue if needed.