
Additional AppendEntries Requests in HandleAppendEntriesRequest Reduce TLC Model Checking Efficiency #16

Open
liang636600 opened this issue Oct 12, 2024 · 2 comments

Comments

@liang636600

Hi @ongardie, I found that the action HandleAppendEntriesRequest(i, j, m) generates many unnecessary intermediate states when processing AppendEntries request (AERequest) messages, which reduces the efficiency of TLC model checking. I will explain the details below.

An example of generating unnecessary states

Scenario: As shown in Fig 1, suppose the follower and the leader have the same log entry at index 2. The leader now plans to replicate the log entry at index 3 to the follower using AppendEntries.

Fig 1. The leader replicates the log entry at index 3 to the follower.

HandleAppendEntriesRequest in this project: If HandleAppendEntriesRequest (abbreviated HandleAERequest) encounters no conflict, the follower simply appends the log entry at index 3 to its log. However, no response is sent back to the leader, and this is the main source of the unnecessary states: the leader must send a second AERequest, and only after processing it does the follower respond for an entry it has already received.
In other words, replicating one log entry from the leader to the follower takes two AERequest messages, one of which is redundant; the replication could complete with a single AERequest message.

raft.tla, lines 383 to 387 (commit 5f55322):

    \/ \* no conflict: append entry
       /\ m.mentries /= << >>
       /\ Len(log[i]) = m.mprevLogIndex
       /\ log' = [log EXCEPT ![i] =
                      Append(log[i], m.mentries[1])]

Harmful impact: Because HandleAERequest does not reply in this case, a no-conflict log replication is not completed in a single step, forcing the leader to send a redundant AERequest during execution. This produces numerous unnecessary states in TLC model checking and reduces its efficiency.
For example, Fig 2 shows part of the state graph generated by TLC. Node 1 represents the leader preparing to send an AERequest to the follower to replicate the log entry at index 3. Node 2 shows the follower completing the log replication without responding, while Node 3 represents the follower finally responding after processing a second AERequest. Since the follower did not respond immediately after the first AERequest, Node 2 is unnecessary in the state graph.
The situation worsens when the Next action from Node 2 leads to further transitions, such as a Timeout (Node 5) or a Restart (Node 4). Both Node 4 and Node 5 can in turn generate additional states through the Next action, creating even more redundant states. These are unnecessary because Node 3 can generate all the relevant states, semantically covering those reachable from Node 2; yet TLC explores both, even though only the states reachable from Node 3 are needed.


Fig 2. HandleAERequest generates numerous unnecessary states.
Similar scenario: The same issue also occurs in the HandleAERequest action when handling the conflict case.

Suggested fix: We propose a simple modification: update HandleAERequest to call Reply with an AEResponse message in both the no-conflict and conflict cases.
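For the no-conflict branch, a minimal sketch of this change might look like the following. The response fields are modeled on the AppendEntriesResponse messages the spec already sends in its "already done with request" branch; the exact mmatchIndex expression and any UNCHANGED conjuncts are assumptions that would need to be checked against the surrounding action, so treat this as an illustration rather than a drop-in patch:

    \/ \* no conflict: append entry and acknowledge it immediately
       /\ m.mentries /= << >>
       /\ Len(log[i]) = m.mprevLogIndex
       /\ log' = [log EXCEPT ![i] =
                      Append(log[i], m.mentries[1])]
       \* assumed reply, mirroring the spec's existing AppendEntriesResponse
       /\ Reply([mtype       |-> AppendEntriesResponse,
                 mterm       |-> currentTerm[i],
                 msuccess    |-> TRUE,
                 mmatchIndex |-> m.mprevLogIndex + Len(m.mentries),
                 msource     |-> i,
                 mdest       |-> j], m)

With the reply sent in the same step as the append, the leader can advance matchIndex after a single AERequest, which is what removes the redundant intermediate states from the TLC state graph.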

Conclusion: For a no-conflict log entry replication, the leader needs to send an additional AERequest message. This behavior is inconsistent with the description in the Raft paper and the implementation in many systems. It also leads to the generation of unnecessary and redundant states during TLC model checking, reducing its efficiency.
I'm looking forward to your confirmation, and would be happy to help fix the issue if needed.

@ongardie
Owner

Hi @liang636600, thanks for your detailed and thoughtful reports (this and #17). It was my intent when writing this spec that the steps be relatively fine-grained. If the spec with fine-grained steps was proven to be correct, then implementations could confidently group steps together into larger atomic operations, but they wouldn't be required to for correctness. This gives implementations some level of flexibility. I discussed this in Section 8.1 of my dissertation.

Your goal here seems to be running TLC efficiently over this spec. BTW, that's something I abandoned after early attempts during my PhD work (mentioned in section 8.2), but this was long ago. You seem to be more capable and TLC may have improved since, so good luck! Having a spec/model that TLC can run over seems valuable for working out issues in the spec/model and for extending the spec with any confidence.

However, getting TLC to run efficiently seems to conflict with my goal of having fine-grained, atomic steps in the spec. Do you have a sense for how many changes you'd want to make for TLC's benefit? My hunch is it's more than these two. See also the discussions in Issue #1 and PR #4.

I don't know of a good way to resolve this tension. Even if we satisfy both goals, I don't want to make major changes to this somewhat published and somewhat official spec at this point. The options seem to be:

  • drop my goal of fine-grained steps (but younger Diego's rationale seems valid),
  • try to make one spec serve both purposes (but that may make the spec much harder to understand), or
  • fork/branch/copy the spec to use as a TLC model.

Forking comes with a maintenance burden, but it seems like the least bad option here. I'd be a poor maintainer for the TLC model (since I don't plan to run it). Do you want to fork the spec into a true-to-the-paper but TLC-optimized model? We could link to it from this repo's README for anyone interested in using TLC, though I'd treat the fork as a "community" effort instead of an "official" thing (at least initially).

@liang636600
Author

I’d be happy to fork the spec into a TLC-optimized model that stays true to the paper. Once it's ready, I'll submit a PR.
