closedts: raft closed timestamp regression #70894

Open
andreimatei opened this issue Sep 29, 2021 · 26 comments · May be fixed by #75298
Labels
A-kv-closed-timestamps Relating to closed timestamps C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@andreimatei
Contributor

andreimatei commented Sep 29, 2021

Sentry is getting crash-loop reports from a handful of clusters about closed timestamp regressions.
Example:

store_raft.go:524: log.Fatal: ???: raft closed timestamp regression in cmd: "_3\"*\x83\xe6\xd2s" (term: 23, index: 415262); batch state: 1626099041.832244381,0, command: 1626092389.971770216,0, lease: repl=(n2,s2):3 seq=17 start=1626091869.496126607,0 exp=1626099053.832244381,4775 pro=1626099044.832244381,4775, req: ConditionalPut [×,×), EndTxn(commit:true modified-span (node-liveness)) [×], [txn: fa86e74a], [can-forward-ts], applying at LAI: 183761.

The assertion includes the tail of the Raft log, which should be very valuable in understanding what is going wrong, but that's redacted in Sentry.

Until version 21.1.4, it was possible to hit this assertion on lease acquisitions, but that was fixed (and backported) through #66512 (comment).
Indeed, some crashes from older versions appear to be on lease acquisitions. But we also have crashes on newer versions coming from commands proposed by other types of requests.

I've looked at reports from a couple of clusters, and the only commonality I see, which is also weird in itself, is that in multi-node clusters only 2 nodes report the crash, when you'd expect all (3) replicas to crash just the same. But, to make it extra confusing, there are also single-node clusters where this happens, so it both looks and doesn't look like a replication issue.

For analyzing the Sentry reports, one thing that helps is going to the very first set of crashes from a cluster (i.e. the first crash from every node). Out of those, one crash will correspond to the leaseholder, and that one will have extra information in it: the request that corresponds to the command with the regression. Getting to that first crash in Sentry is tricky: you have to get to a paginated view of all the events for one cluster (example) and then go to the last page by guessing the "cursor" URL argument. The argument is encoded as 0%3A followed by the actual index.
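For example (with a hypothetical index of 1500, not taken from an actual report), the last page would be reached with a URL argument like cursor=0%3A1500.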

Looking at a few of these requests causing the regression, I don't see a pattern. I've seen EndTxn and I've seen a batch of Merges.

One thing I do see and find curious: in each of the few cases I've looked at, the regression was for hours. I don't know what to make of that. I'm thinking that a clock jump perhaps has something to do with it, although I don't see how.

Jira issue: CRDB-10274

@andreimatei andreimatei added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 29, 2021
@sentry-io

sentry-io bot commented Sep 29, 2021

Sentry issue: COCKROACHDB-3HX

@andreimatei andreimatei added the A-kv-closed-timestamps Relating to closed timestamps label Sep 29, 2021
@tbg
Member

tbg commented Jan 17, 2022

I just managed to reproduce this by running this file:

#!/bin/bash

set -euo pipefail

make buildshort
roachprod destroy local || true
roachprod create -n 3 local

roachprod put local cockroach
roachprod start local:1-3
roachprod run local:1 -- ./cockroach workload init kv
roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

roachprod stop local
env COCKROACH_CLOCK_JUMP=8h roachprod start local

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

roachprod stop local
roachprod start local

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

and this diff

diff --git a/pkg/util/hlc/hlc.go b/pkg/util/hlc/hlc.go
index 075b461ecb..34abfd342d 100644
--- a/pkg/util/hlc/hlc.go
+++ b/pkg/util/hlc/hlc.go
@@ -15,6 +15,7 @@ import (
 	"sync/atomic"
 	"time"
 
+	"github.com/cockroachdb/cockroach/pkg/util/envutil"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/cockroach/pkg/util/syncutil"
 	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -186,11 +187,13 @@ func (m *HybridManualClock) Resume() {
 	m.mu.Unlock()
 }
 
+var hackClockJump = envutil.EnvOrDefaultDuration("COCKROACH_CLOCK_JUMP", 0)
+
 // UnixNano returns the local machine's physical nanosecond
 // unix epoch timestamp as a convenience to create a HLC via
 // c := hlc.NewClock(hlc.UnixNano, ...).
 func UnixNano() int64 {
-	return timeutil.Now().UnixNano()
+	return timeutil.Now().UnixNano() + hackClockJump.Nanoseconds()
 }
 
 // NewClock creates a new hybrid logical clock associated with the given

on top of e84001d.

‹/System/{NodeLive…-tsd}›,raft] 43 ???: raft closed timestamp regression in cmd: ">\xa3\x9a\x17#pN\xf9" (term: 8, index: 122); batch state: 164>

@tbg
Member

tbg commented Jan 17, 2022

Here's n1's log: https://gist.github.com/tbg/1dfc6ee1d324a2971307e3165448275d. The other two nodes survived.

@tbg
Member

tbg commented Jan 17, 2022

Got this again, so I think this repro is pretty good. In fact, I'm going to have to disable the assertion so that I can get on with what I was actually looking into (#74909) :-)

@tbg
Member

tbg commented Jan 17, 2022

Note: when I disable the assertions, it still crashes:

F220117 14:07:42.887954 249 kv/kvserver/replica.go:1269 ⋮ [n1,s1,r3/1:‹/System/{NodeLive…-tsd}›] 142 on-disk and in-memory state diverged: [Raft
ClosedTimestamp.WallTime: 1642428459884897696 != 1642457252063853439]

In other words, something pretty bad is going on.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 21, 2022
Before this patch, the following scenario was possible:
- a node is stopped
- the clock jumps backwards
- the node is restarted
- a replica from the node proposes a command with a closed timestamp
  lower than timestamps closed before the restart. This causes an
  assertion to fire on application.

The problem is that, after the restart, the propBuf, which is in charge
of assigning closed timestamps to proposals, doesn't have info on what
had been closed prior to the restart. The propBuf maintains the
b.assignedClosedTimestamp field, which is supposed to be in advance of
r.mu.state.RaftClosedTimestamp on leaseholder replicas, but nobody
initializes that field on restart. For ranges with epoch-based leases, I
believe we don't have a problem because the range will need to acquire a
new lease after restart before proposing any commands - and lease
acquisitions initialize b.assignedClosedTimestamp. But for
expiration-based leases, I believe a lease from before the restart
might be used(*).

(*) This is probably a bad thing which we should discuss separately. I
think we don't want leases to be used after a restart for multiple
reasons and we have the r.minLeaseProposedTS[1] guard that is supposed to
protect against using them (in addition to the epoch protection we have
for epoch-based leases, I think). But I believe this protection can be
elided by a backwards clock jump - we refuse to use leases acquired
*before* minLeaseProposedTS, and minLPTS is assigned to time.Now() on
start; if the clock went backwards, the leases will appear to be good.

[1] https://github.com/cockroachdb/cockroach/blob/6664d0c34df0fea61de4fff1e97987b7de609b9e/pkg/kv/kvserver/replica.go#L468

Release note: A bug causing nodes to repeatedly crash with a "raft
closed timestamp regression" was fixed. The bug occurred when the node
in question had been stopped and the machine clock jumped backwards
before the node was restarted.

Fixes cockroachdb#70894
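To illustrate the (*) point above, here is a minimal sketch (simplified, hypothetical types and names; not the actual replica code) of how a lease-validity guard keyed off a startup timestamp can be defeated by a backwards clock jump:

package kvserver

import "time"

// replicaSketch models just the guard described above: a lease proposed
// before minLeaseProposedTS is considered unusable after a restart.
type replicaSketch struct {
	// minLeaseProposedTS is assigned from the local clock at node startup.
	minLeaseProposedTS time.Time
}

// leaseUsable reports whether a lease proposed at proposedTS may still be
// used. If the machine clock jumped backwards while the node was down,
// proposedTS (taken before the stop) can still compare as not-before the
// new, earlier minLeaseProposedTS, so a pre-restart lease appears good.
func (r *replicaSketch) leaseUsable(proposedTS time.Time) bool {
	return !proposedTS.Before(r.minLeaseProposedTS)
}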
@yuzefovich
Member

Pinging this issue since we had a few more Sentry reports of this error.

@andreimatei
Contributor Author

I think #75298 has something to say about this. Tobi, can I pass it to you? :P

@pav-kv
Collaborator

pav-kv commented Oct 31, 2022

Tried the repro from #70894 (comment) a few times; no luck catching it so far. Also tried with a negative time shift.

The description of #75298 mentions that only expiration-based leases are affected, so I also tried making (almost) all leases expiration-based to increase the chances of the repro catching this. No luck either.

Looking closer at the failure report #90682, the lease is epoch-based, so maybe both types are affected.

@sentry-io

sentry-io bot commented Jan 30, 2023

Sentry issue: COCKROACHDB-6BJ

@tbg
Member

tbg commented Feb 15, 2023

I can reproduce this using #97173, looking into it.

@tbg
Member

tbg commented Feb 27, 2023

I ended up instead sorting the reproposal slice by (LAI, ClosedTimestamp, Lease), which I thought would avoid this issue. I never saw it again during my testing after that, until last night, when I randomly got it while repro'ing for #97287 (with the sorting). So I don't seem to have solved it completely, and there's something yet to be understood. Possibly this will then also answer how we're seeing this in the wild.

Looking at what I wrote there again, I don't know why that would avoid the issue. We can still have the same problem:

  • assign LAI 12345, CT=1000 to log position 100
  • assign LAI 99999, CT=500 to log position 101
  • boom

So I don't know what shifted to hide this problem in my reproductions, but it should still exist (at least when invalid LAIs are introduced).

@tbg
Member

tbg commented Feb 27, 2023

Ah, I know. By sorting by closed timestamp, the LAI=1234 block would now pick the smallest possible closed timestamp for slot LAI=1234. One of them had to be smaller than whatever we assigned to LAI=1235 (or higher). So we avoided the assertion, but not for any sound reason - we just reduced the chances of it a ton. Which explains why I haven't been seeing it any more, except that one time. I should've looked into that one repro more but didn't: it might've been a different mode, perhaps more relevant to the issue at hand.

@pav-kv
Collaborator

pav-kv commented Feb 28, 2023

@tbg How about trying to catch the (MaxLeaseIndex, ClosedTimestamp) inversion early (by adding more assertions 😬 of the invariant mli1 <= mli2 ==> ct1 <= ct2), e.g. right after Sort has reordered the proposals, and/or at Flush time? Maybe your repro will then catch it earlier in the pipeline (rather than only at application time, after it got through Raft).
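A minimal sketch of such a check (hypothetical types and field names, not the actual propBuf code), which could run right after Sort and again at Flush time:

package kvserver

import "fmt"

// proposalSketch carries only the two fields relevant to the invariant.
type proposalSketch struct {
	MaxLeaseIndex   uint64
	ClosedTimestamp int64 // wall time in nanoseconds
}

// assertClosedTSMonotonic checks the invariant mli1 <= mli2 ==> ct1 <= ct2
// over a slice already sorted by MaxLeaseIndex: closed timestamps must be
// non-decreasing, or a later LAI carries an earlier closed timestamp and
// the application-time assertion will eventually fire.
func assertClosedTSMonotonic(sorted []proposalSketch) error {
	for i := 1; i < len(sorted); i++ {
		prev, cur := sorted[i-1], sorted[i]
		if prev.MaxLeaseIndex <= cur.MaxLeaseIndex && prev.ClosedTimestamp > cur.ClosedTimestamp {
			return fmt.Errorf(
				"closed timestamp inversion: (mli=%d, ct=%d) ordered before (mli=%d, ct=%d)",
				prev.MaxLeaseIndex, prev.ClosedTimestamp, cur.MaxLeaseIndex, cur.ClosedTimestamp)
		}
	}
	return nil
}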

Is closed timestamp assigned once, during the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?

@tbg tbg removed their assignment Mar 21, 2023
@tbg
Member

tbg commented Mar 21, 2023

Earlier assertions sound good.

Is closed timestamp assigned once, during the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?

It's currently not reassigned, but only because that ended up being slightly more convenient in the current code, not because there's a reason it couldn't be. (There also isn't a reason to reassign it; the less change across proposal reuse, the better.)

I'm going to unassign for now, we'll pick this back up in https://cockroachlabs.atlassian.net/browse/CRDB-25287.

@tbg
Member

tbg commented Jul 18, 2023

We didn't get around to getting to the bottom of this. The comment [1] explains the understanding I have: there is a monotonicity requirement between MLAI and closed timestamp assignment - the assigned (MLAI, CT) pairs need to be totally ordered by the component-wise order (i.e. if MLAI > MLAI', then CT > CT'). This wasn't true in the experiment I ran (I was artificially introducing backward jumps in MLAI), and I suspect it may not always be true in practice either, though I'm hard-pressed to find an example. One thing to look into could be special commands like lease requests, etc., which don't fully use the LAI mechanism. Could these cause this reversion? I don't think so, but it's still worth double-checking. Similarly, I don't think rejected commands (illegal lease index, invalid proposer lease) can trigger closed timestamp updates.

Pavel's suggestion above has merit: we can verify the total order before commands enter raft, and this could give us an idea of where the problem arises.

It's also totally possible that the problem occurs in practice only due to some corruption in the system ("Ex falso sequitur quodlibet"), or that we've since "accidentally" fixed it in the course of simplifying tryReproposeWithNewLeaseIndex [2], though I don't recall reproposals being involved at all (refreshes were involved in my experiment).

One immediate thing we might do is to adopt must [3] once it's available and report the error to Sentry without triggering the crash in production builds [4]; a rough sketch of that idea follows after the footnotes. We don't even have to wait for must. It is pretty clear that users don't benefit from the assertion; we'll usually just tell them to disable it and go on with their lives.

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/70894#issuecomment-1433244880

  2. https://cockroachlabs.atlassian.net/browse/CRDB-25287

  3. https://github.com/cockroachdb/cockroach/pull/106508

  4. https://github.com/cockroachdb/cockroach/issues/107064
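As mentioned above, a rough sketch of the report-instead-of-crash idea (hypothetical flag and helper names; not the actual must API from [3]):

package kvserver

import (
	"context"
	"fmt"
)

// fatalOnClosedTSRegression would be true only in test builds; in
// production builds the regression is reported but not fatal.
// (Hypothetical flag; real gating would use the build/assertion
// machinery discussed above.)
var fatalOnClosedTSRegression = false

func onClosedTSRegression(ctx context.Context, prevClosed, cmdClosed int64) error {
	err := fmt.Errorf("raft closed timestamp regression: %d -> %d", prevClosed, cmdClosed)
	if fatalOnClosedTSRegression {
		panic(err) // stand-in for a fatal log in a real build
	}
	reportToTelemetry(ctx, err) // hypothetical Sentry reporting hook
	return err
}

func reportToTelemetry(ctx context.Context, err error) {
	// Placeholder: forward the assertion failure to error telemetry
	// without taking the node down.
	_ = ctx
	_ = err
}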

@pav-kv
Collaborator

pav-kv commented Dec 14, 2023

From a recent repro:

$ grep -o "\(Term.*Index.*\|max_lease_index.*closed_timestamp.*replic\)" closed-timestamp-log.txt | sed 's/ replic//g'

Term:6 Index:15 Type:EntryNormal  (ID 0ad4ef7915f2d9f7) by lease #6
max_lease_index:2 closed_timestamp:<wall_time:1699467454251881189 >
Term:6 Index:14 Type:EntryNormal  (ID 4a37f23fa745a2e1) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454363124760 >
Term:6 Index:13 Type:EntryNormal  (ID 174586ad919d0c61) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454220166867 >
Term:6 Index:12 Type:EntryNormal  (ID 4a37f23fa745a2e1) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454363124760 >
Term:6 Index:11 Type:EntryNormal : EMPTY [recovered]

All proposals seem to be made under the same leadership and lease.
All proposals are made right at the beginning of the range's lifecycle (log indices start at 10). The EMPTY log entry (index 11) is committed when the raft leader has just been elected.

Observations on the proposals:

  • Proposals at indexes 12 and 14 have the same ID, so they are likely full-copy reproposals.
  • The proposal at index 13 was made with the same MLI = 1, so it is applied as a no-op.
  • For some reason, the proposal at index 15 was made with a lower closed timestamp than proposals 12 and 14.

This is probably self-inflicted by the test-only LAI override, which in some cases returns 1 when it wanted to return a LAI > 1.

Probably the original ordering had entries 12 and 14 submitted at LAI 3 (since their closed timestamp is the biggest of all), but the test interceptor randomly assigned LAI 1 to them. This is a bug in the test harness.

The failures reported via Sentry fail at higher indices and LAIs, though, so the issue is real; the repro is just not relevant to it.

@pav-kv
Collaborator

pav-kv commented Jan 8, 2024

Latest occurrence of this panic in kvnemesis failure here: #116541 (comment).

$ grep -o "\(Term.*Index.*\|max_lease_index.*closed_timestamp.*replic\)" ct.txt | sed 's/ replic//g'
Term:6 Index:15 Type:EntryNormal  (ID 450747ef87ce7014) by lease #9
max_lease_index:2 closed_timestamp:<wall_time:1704640779372779513 >
Term:6 Index:14 Type:EntryNormal  (ID 6f1758389525834e) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779402664522 >
Term:6 Index:13 Type:EntryNormal  (ID 5c2068c6aee7e0e0) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779361734754 >
Term:6 Index:12 Type:EntryNormal  (ID 6f1758389525834e) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779402664522 >
Term:6 Index:11 Type:EntryNormal : EMPTY

Possibly the same test-harness bug as #70894 (comment).

Entries 12 and 14 were probably intended at MLI 3, but the test assigned MLI=1 to them, and one accidentally got submitted before entry 13 (which originally had MLI=1).

The fix would be to disallow the MLI=1 injection until the LAI has definitely passed 1. Then we'll avoid the unintended command reordering. Once we have this fix, nothing stops us from generalizing it: the MLI can be overridden to any value <= submitted LAI, and the command should not be applied in this case.
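A rough sketch of such a guard in the test harness (hypothetical function and parameter names; the real interceptor lives in the kvserver test code):

package kvserver

// maybeInjectLAI decides whether the test interceptor may override a
// proposal's MaxLeaseIndex with the injected value 1. It only does so once
// the range's applied lease index has definitely passed 1, so the injected
// command is guaranteed to be rejected as stale rather than reordered ahead
// of a legitimate MLI=1 proposal.
func maybeInjectLAI(appliedLAI, intendedMLI uint64) uint64 {
	const injected = 1
	if appliedLAI <= injected {
		// Too early: injecting MLI=1 now could race with the real MLI=1
		// command and invert the (MLI, closed timestamp) ordering.
		return intendedMLI
	}
	return injected
}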

@nicktrav
Collaborator

Likely closed as part of #123442. Sentry will tell us if / when we see this again.

@pav-kv
Collaborator

pav-kv commented Jun 28, 2024

#123442 saw a regression in closed timestamp sidetransport, while this issue is about a regression in the log. Not sure these are the same.

@nicktrav
Collaborator

nicktrav commented Jul 1, 2024

cc: @arulajmani - didn't we decide these were similar enough to warrant closing?

@nicktrav
Collaborator

nicktrav commented Jul 1, 2024

Re-opening just in case.
