closedts: raft closed timestamp regression #70894
Comments
Sentry issue: COCKROACHDB-3HX
I just managed to reproduce this, just running this file:
and this diff, applied on top of e84001d:
diff --git a/pkg/util/hlc/hlc.go b/pkg/util/hlc/hlc.go
index 075b461ecb..34abfd342d 100644
--- a/pkg/util/hlc/hlc.go
+++ b/pkg/util/hlc/hlc.go
@@ -15,6 +15,7 @@ import (
"sync/atomic"
"time"
+ "github.com/cockroachdb/cockroach/pkg/util/envutil"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/syncutil"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -186,11 +187,13 @@ func (m *HybridManualClock) Resume() {
m.mu.Unlock()
}
+var hackClockJump = envutil.EnvOrDefaultDuration("COCKROACH_CLOCK_JUMP", 0)
+
// UnixNano returns the local machine's physical nanosecond
// unix epoch timestamp as a convenience to create a HLC via
// c := hlc.NewClock(hlc.UnixNano, ...).
func UnixNano() int64 {
- return timeutil.Now().UnixNano()
+ return timeutil.Now().UnixNano() + hackClockJump.Nanoseconds()
}
// NewClock creates a new hybrid logical clock associated with the given
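For illustration, here is a self-contained Go sketch of what the diff above does (the standalone clockJump/unixNano helpers are invented for this sketch; in the patch the logic lives in pkg/util/hlc): every physical clock reading is shifted by a fixed offset read from COCKROACH_CLOCK_JUMP, so a negative value makes a restarted node hand out timestamps from the past.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// clockJump stands in for the diff's envutil.EnvOrDefaultDuration call: it
// reads a duration from COCKROACH_CLOCK_JUMP, defaulting to zero.
func clockJump() time.Duration {
	if v := os.Getenv("COCKROACH_CLOCK_JUMP"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 0
}

var hackClockJump = clockJump()

// unixNano mirrors the patched hlc.UnixNano: the physical clock reading,
// shifted by the configured jump. A negative value such as
// COCKROACH_CLOCK_JUMP=-1h makes a restarted process observe timestamps
// earlier than anything it handed out before the restart.
func unixNano() int64 {
	return time.Now().UnixNano() + hackClockJump.Nanoseconds()
}

func main() {
	fmt.Println(time.Unix(0, unixNano()))
}
```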
Here's n1's log: https://gist.github.com/tbg/1dfc6ee1d324a2971307e3165448275d. The other two nodes survived.
Got this again so I think this repro is pretty good. In fact I'm going to have to disable the assertion so that I can go about what I was actually looking into (#74909) :-)
Note, when I disable the assertions it still crashes:
In other words, something pretty bad is going on.
Before this patch, the following scenario was possible:
- a node is stopped
- the clock jumps backwards
- the node is restarted
- a replica from the node proposes a command with a closed timestamp lower than timestamps closed before the restart

This causes an assertion to fire on application.

The problem is that, after the restart, the propBuf, which is in charge of assigning closed timestamps to proposals, doesn't have info on what had been closed prior to the restart. The propBuf maintains the b.assignedClosedTimestamp field, which is supposed to be in advance of r.mu.state.RaftClosedTimestamp on leaseholder replicas, but nobody initializes that field on restart. For ranges with epoch-based leases, I believe we don't have a problem because the range will need to acquire a new lease after restart before proposing any commands - and lease acquisitions initialize b.assignedClosedTimestamp. But for expiration-based leases, I believe a lease from before the restart might be used (*).

(*) This is probably a bad thing which we should discuss separately. I think we don't want leases to be used after a restart for multiple reasons, and we have the r.minLeaseProposedTS [1] guard that is supposed to protect against using them (in addition to the epoch protection we have for epoch-based leases, I think). But I believe this protection can be elided by a backwards clock jump - we refuse to use leases acquired *before* minLeaseProposedTS, and minLPTS is assigned to time.Now() on start; if the clock went backwards, the leases will appear to be good.

[1] https://github.com/cockroachdb/cockroach/blob/6664d0c34df0fea61de4fff1e97987b7de609b9e/pkg/kv/kvserver/replica.go#L468

Release note: A bug causing nodes to repeatedly crash with a "raft closed timestamp regression" was fixed. The bug occurred when the node in question had been stopped and the machine clock jumped backwards before the node was restarted.

Fixes cockroachdb#70894
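To make the minLeaseProposedTS point concrete, here is a minimal Go sketch (invented types, not the actual replica code) of how a backwards clock jump defeats a guard of the form "reject leases proposed before minLeaseProposedTS":

```go
package main

import (
	"fmt"
	"time"
)

// lease is a stand-in for a replica lease; only its proposal time matters here.
type lease struct {
	proposedAt time.Time
}

// usable mirrors the shape of the guard described above: a lease proposed
// before the replica's minLeaseProposedTS must not be used.
func usable(l lease, minLeaseProposedTS time.Time) bool {
	return !l.proposedAt.Before(minLeaseProposedTS)
}

func main() {
	// A lease proposed shortly before the node was stopped.
	oldLease := lease{proposedAt: time.Now()}

	// On restart, minLeaseProposedTS is taken from the local clock. If the
	// machine clock jumped backwards across the restart, "now" is earlier
	// than the old lease's proposal time, so the pre-restart lease still
	// looks usable and its stale closed-timestamp state can leak through.
	minLeaseProposedTS := time.Now().Add(-2 * time.Hour) // simulated backwards jump

	fmt.Println(usable(oldLease, minLeaseProposedTS)) // true: the guard is defeated
}
```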
Pinging this issue since we had a few more sentry reports of this error.
I think #75298 has something to say about this. Tobi can I pass it to you? :P
Tried repro #70894 (comment) a few times, no luck catching it so far. Also tried with a negative time shift. The description of #75298 mentions that only expiration-based leases are affected, so I tried also making (almost) all leases expiration-based to increase chances for the repro to catch this. No luck either. Looking closer at the failure report #90682, the lease is epoch-based, so maybe both types are affected.
Sentry issue: COCKROACHDB-6BJ
I can reproduce this using #97173, looking into it.
Looking at what I wrote there again, I don't know why that would avoid the issue. We can still have the same problem:
So I don't know what shifted to hide this problem in my reproductions, but it should still exist (at least when invalid LAIs are introduced).
Ah - I know. By sorting by closed timestamp, the LAI=1234 block would now pick the smallest possible closed timestamp for slot LAI=1234. One of them had to be smaller than whatever we assigned to LAI=1235 (or higher). So we avoided the assertion, but not for any sound reason. We just reduced the chances of it a ton. Which explains why I haven't been seeing it any more except that one time. I should've looked into that one repro more but didn't: it might've been a different mode, perhaps more relevant to the issue at hand.
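A purely illustrative sketch of that argument, with made-up closed-timestamp values (the real numbers are in the repro, not here):

```go
package main

import (
	"fmt"
	"slices"
)

func main() {
	// Invented closed timestamps (as plain integers) of three proposals that
	// the test's LAI override forced into the same slot, LAI=1234.
	ctsAtLAI1234 := []int64{105, 101, 103}
	// Invented closed timestamp of the next proposal, at LAI=1235.
	ctAtLAI1235 := int64(104)

	// Sorting the LAI=1234 block by closed timestamp means the smallest value
	// is the one that ends up applied at slot 1234.
	applied1234 := slices.Min(ctsAtLAI1234)

	// 101 <= 104, so the application-time assertion sees no regression --
	// not because the reordering is sound, only because the odds changed.
	fmt.Println(applied1234 <= ctAtLAI1235) // true
}
```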
@tbg How about trying to catch the regression earlier, at proposal time? Is the closed timestamp assigned once, at the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?
Earlier assertions sound good.
It's currently not reassigned, but just because that ended up being slightly more convenient in the current code, not because there's some reason not to. (But there also isn't a reason to; the less change across proposal reuse, the better.) I'm going to unassign for now; we'll pick this back up in https://cockroachlabs.atlassian.net/browse/CRDB-25287.
We didn't get around to getting to the bottom of this. The comment [1] explains the understanding I have: there is a monotonicity requirement between MLAI and closed timestamp assignment - the assigned (MLAI, CT) pairs need to be totally ordered by the component-wise order (i.e. if MLAI > MLAI', then CT > CT'). This wasn't true in the experiment I ran (I was artificially introducing backward jumps in MLAI) and I suspect it may not always be true in practice either, though I'm hard-pressed to find an example.

One thing to look into could be special commands like lease requests, etc., which don't fully use the LAI mechanism. Could these cause this reversion? I don't think so, but still worth double checking. Similarly, I don't think rejected commands (illegal lease index, invalid proposer lease) can trigger closed timestamp updates.

Pavel's suggestion above has merit: we can verify the total order before commands enter raft, and this could give us an idea of where the problem arises. It's also totally possible that the problem occurs in practice only due to some corruption in the system ("Ex falso sequitur quodlibet") or that we've since "accidentally" fixed it in the course of simplifying [...]. One immediate thing we might do is to adopt [...].
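A rough sketch of the proposal-time verification suggested above (invented types; the real check would have to live in the propBuf): remember the last assigned (MLAI, closed timestamp) pair and flag any new pair that breaks the component-wise order before the command enters raft.

```go
package main

import "fmt"

// assignment is an invented pair type for the sketch: the max lease index and
// closed timestamp assigned to a proposal as it is flushed into raft.
type assignment struct {
	mlai   uint64
	closed int64
}

// orderChecker asserts the total-order requirement described above: if a new
// proposal's MLAI is higher than a previously assigned one, its closed
// timestamp must not be lower.
type orderChecker struct {
	last assignment
}

func (c *orderChecker) check(next assignment) error {
	if next.mlai > c.last.mlai && next.closed < c.last.closed {
		return fmt.Errorf(
			"closed timestamp regression at proposal time: (mlai=%d, ct=%d) after (mlai=%d, ct=%d)",
			next.mlai, next.closed, c.last.mlai, c.last.closed)
	}
	if next.mlai >= c.last.mlai {
		c.last = next
	}
	return nil
}

func main() {
	var c orderChecker
	fmt.Println(c.check(assignment{mlai: 10, closed: 100})) // <nil>
	fmt.Println(c.check(assignment{mlai: 11, closed: 90}))  // regression detected
}
```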
From a recent repro:
All proposals seem to be made under the same leadership and lease. Observations on the proposals:
Probably self-inflicted by the test-only LAI override that returns 1 in some cases when it wanted to return a LAI > 1. Probably the original ordering had entries 12 and 14 submitted at LAI 3 (since their closed timestamps are the biggest of all), but the test interceptor randomly assigned LAI 1 to them. This is a bug in the test harness. The failures reported via Sentry fail at higher indices and LAIs though, so the issue is real. The repro is irrelevant.
Latest occurrence of this panic in
Possibly same test harness bug as #70894 (comment). Entries 12 and 14 were probably intended at MLI 3, but the test assigned MLI=1 to them, and one accidentally got submitted before entry 12 (which originally had MLI=1). The fix would be to disallow the MLI=1 injection until the LAI has definitely passed 1. Then we'll avoid the unintended command reordering. Once we have this fix, nothing stops us from generalizing it: the MLI can be overridden to any value <= submitted LAI, and the command should not be applied in this case. |
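A sketch of that guard with made-up names (the real interceptor lives in the test harness): only apply an injected MLI when it is at or below an LAI the range has already reached, so the overridden command is guaranteed to be rejected rather than applied out of order.

```go
package main

import "fmt"

// maybeOverrideMLI is a hypothetical stand-in for the test interceptor: it
// returns the MLI to use for a proposal, applying the injected override only
// when it is guaranteed to be rejected (override <= highest submitted LAI).
func maybeOverrideMLI(intendedMLI, override, submittedLAI uint64) uint64 {
	if override != 0 && override <= submittedLAI {
		// Safe: a command with MLI at or below an already-submitted LAI will
		// be rejected at application time, so it cannot reorder real commands.
		return override
	}
	return intendedMLI
}

func main() {
	// Too early: the range's LAI hasn't passed 1 yet, so injecting MLI=1
	// could actually apply and reorder commands; keep the intended MLI.
	fmt.Println(maybeOverrideMLI(3, 1, 0)) // 3

	// Later: the LAI has definitely passed 1, so the MLI=1 injection is safe.
	fmt.Println(maybeOverrideMLI(7, 1, 5)) // 1
}
```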
Likely closed as part of #123442. Sentry will tell us if / when we see this again. |
#123442 saw a regression in closed timestamp sidetransport, while this issue is about a regression in the log. Not sure these are the same. |
cc: @arulajmani - didn't we decide these were similar enough to warrant closing? |
Re-opening just in case. |
Sentry is getting crash-loop reports from a handful of clusters about closed timestamp regressions.
Example
The assertion includes the tail of the Raft log, which should be very valuable in understanding what is going wrong, but that's redacted in Sentry.
Until version 21.1.4, it was possible to hit this assertion on lease acquisitions. But that was fixed (and backported) through #66512 (comment).
Indeed, some crashes from older versions appear to be on lease acquisitions. But we also have crashes on newer versions coming from commands proposed by other types of requests.
I've looked at reports from a couple of clusters, and the only commonality I see, which is also weird in itself, is that in multi-node clusters, only 2 nodes report the crash, when you'd expect all the (3) replicas to crash just the same. But, to make it extra confusing, there are also single-node clusters where this happens, so it both looks and doesn't look like a replication issue.
For analyzing the Sentry reports, one thing that helps is going to the very first set of crashes from a cluster (i.e. the first crash from every node). Out of those, one crash will correspond to the leaseholder, and that one will have extra information in it: the request that corresponds to the command with the regression. Getting that first crash in Sentry is tricky: you have to get to a view that has a paginated list of all the events for one cluster (example) and then you have to go to the last page by guessing the "cursor" URL argument. The argument is encoded as 0%3A followed by the actual index.

Looking at a few of these requests causing the regression, I don't see a pattern. I've seen EndTxn and I've seen a batch of Merges. But one thing I do see and find curious is that I've looked at a few cases, and in every one, the regression was for hours. I don't know what to make of that. I'm thinking that a clock jump perhaps has something to do with it, although I don't see how.
Jira issue: CRDB-10274