
roachtest: transfer-leases/drain-other-node failed #104304

Closed

cockroach-teamcity opened this issue Jun 4, 2023 · 13 comments
@cockroach-teamcity
Member

cockroach-teamcity commented Jun 4, 2023

roachtest.transfer-leases/drain-other-node failed with artifacts on release-22.2 @ cfc200070bbf9fc4b6bc6c6556b5fd56f48db381:

test artifacts and logs in: /artifacts/transfer-leases/drain-other-node/run_1
(cluster.go:1946).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-28449

Epic CRDB-27234

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 4, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jun 4, 2023
@irfansharif
Contributor

q.runWithTimeout(ctx, func(ctx context.Context) { method(ctx, q.t, q.c, nodeID) })

Drain is given 1m to finish, but it takes just a bit longer (a minimal sketch of this timeout pattern follows the log excerpt below).

node is draining... remaining: 449
I230604 07:34:07.378499 1 cli/rpc_node_shutdown.go:186  [-] 2  drain details: liveness record: 1, range lease iterations: 435, descriptor leases: 13
node is draining... remaining: 3
I230604 07:34:07.601703 1 cli/rpc_node_shutdown.go:186  [-] 3  drain details: range lease iterations: 3
node is draining... remaining: 2
I230604 07:34:07.822336 1 cli/rpc_node_shutdown.go:186  [-] 4  drain details: range lease iterations: 2
node is draining... remaining: 2
...

node is draining... remaining: 2
I230604 07:34:16.229077 1 cli/rpc_node_shutdown.go:186  [-] 42  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:34:16.450075 1 cli/rpc_node_shutdown.go:186  [-] 43  drain details: range lease iterations: 2
node is draining... remaining: 555
I230604 07:34:16.676265 1 cli/rpc_node_shutdown.go:186  [-] 44  drain details: range lease iterations: 555
node is draining... remaining: 2
I230604 07:34:16.896783 1 cli/rpc_node_shutdown.go:186  [-] 45  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:34:17.118054 1 cli/rpc_node_shutdown.go:186  [-] 46  drain details: range lease iterations: 2

...

I230604 07:35:01.597720 1 cli/rpc_node_shutdown.go:186  [-] 247  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:01.819723 1 cli/rpc_node_shutdown.go:186  [-] 248  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:02.041823 1 cli/rpc_node_shutdown.go:186  [-] 249  drain details: range lease iterations: 2
node is draining... remaining: 555
I230604 07:35:02.274241 1 cli/rpc_node_shutdown.go:186  [-] 250  drain details: range lease iterations: 555
node is draining... remaining: 2
I230604 07:35:02.495407 1 cli/rpc_node_shutdown.go:186  [-] 251  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:02.715840 1 cli/rpc_node_shutdown.go:186  [-] 252  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:02.938259 1 cli/rpc_node_shutdown.go:186  [-] 253  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:03.159378 1 cli/rpc_node_shutdown.go:186  [-] 254  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:03.380404 1 cli/rpc_node_shutdown.go:186  [-] 255  drain details: range lease iterations: 2
node is draining... remaining: 2
I230604 07:35:03.601990 1 cli/rpc_node_shutdown.go:186  [-] 256  drain details: range lease iterations: 2
node is draining... remaining: 1
I230604 07:35:03.815661 1 cli/rpc_node_shutdown.go:186  [-] 257  drain details: range lease iterations: 1
node is draining... remaining: 0 (complete)
I230604 07:35:04.025723 68 2@rpc/context.go:1393  [-] 259  closing [{n0: localhost:26257 (system)}]
I230604 07:35:04.025699 16 2@rpc/context.go:1393  [-] 258  closing [{n0: localhost:26257 (system)}]
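
The timeout wrapper quoted at the top of this comment is the 1m budget in question. A minimal sketch of that pattern (the signature and durations here are assumptions, not the actual roachtest helper) shows why a drain that needs just a bit more than its budget surfaces as "context deadline exceeded":

    // Sketch only: the drain step runs under a fixed context deadline, so a
    // drain that needs slightly longer than the budget fails even though it
    // would have completed shortly afterwards.
    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    func runWithTimeout(ctx context.Context, timeout time.Duration, fn func(context.Context) error) error {
    	ctx, cancel := context.WithTimeout(ctx, timeout)
    	defer cancel()
    	return fn(ctx)
    }

    func main() {
    	// Scaled down for illustration: think of a 1m budget vs. ~1m03s of drain work.
    	budget, drainTime := 100*time.Millisecond, 130*time.Millisecond

    	err := runWithTimeout(context.Background(), budget, func(ctx context.Context) error {
    		select {
    		case <-time.After(drainTime): // stand-in for the node finishing its drain
    			return nil
    		case <-ctx.Done():
    			return ctx.Err()
    		}
    	})
    	fmt.Println(err) // context deadline exceeded
    }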

BTW, what's with the occasional jumps to "node is draining... remaining: 555"?

@irfansharif
Contributor

In #104185 we also saw drain get stuck-ish. I wonder if it's related.

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 16, 2023
@kvoli
Collaborator

kvoli commented Jun 16, 2023

Looks like it got stuck on a single range for a while. As @irfansharif mentioned, it just missed the 1-minute timeout, successfully draining (without shutdown). The SIGHUP also succeeds, but it looks like the context is cancelled just after the SIGKILL.

07:34:05 quit.go:93: stopping node 1
...
I230604 07:34:07.590445 63966 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1736 ⋮ [drain,n1,s1,r27/1:‹/Table/2{6-7}›] 6429  no lease transfer target found for r27
... # 511 more attempts
I230604 07:35:03.810899 442218 13@kv/kvserver/replicate_queue.go:1885 ⋮ [drain,n1,s1,r27/1:‹/Table/2{6-7}›] 6941  transferring lease to s3
...
I230604 07:35:04.025188 443390 1@server/drain.go:285 ⋮ [n1] 6809  drain remaining: 0
I230604 07:35:04.025223 443390 1@server/drain.go:167 ⋮ [n1] 6810  drain request completed without server shutdown
...
07:35:04 cluster.go:677: test status: stopping nodes :1
teamcity-10381227-1685856001-25-n3cpu4: stopping
07:35:04 cluster.go:677: test status: stopping nodes :1
teamcity-10381227-1685856001-25-n3cpu4: stopping and waiting
0: context deadline exceeded: 

These log lines are odd. They are probably due to the context's Done channel firing:

case <-transferCtx.Done():

I230604 07:35:02.264744 432231 kv/kvserver/store.go:1542 ⋮ [drain,n1,s1,r533/1:‹/Table/104/1/2{0-1}›] 6166  lease transfer aborted due to exceeded timeout
... # 550 more of these
I230604 07:35:02.267873 432429 kv/kvserver/store.go:1542 ⋮ [drain,n1,s1,r3/1:‹/System/{NodeLive…-tsd}›] 6296  lease transfer aborted due to exceeded timeout
...
I230604 07:35:02.273560 432718 kv/kvserver/store.go:1542 ⋮ [drain,n1,s1,r32/1:‹/{NamespaceTab…-Table/32}›] 6718  lease transfer aborted due to exceeded timeout
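
A sketch of the abort pattern implied by the transferCtx.Done() case above (names and structure are assumptions, not the CockroachDB implementation): the pending lease transfers in a drain pass share one deadline, so when it expires, every transfer still queued or in flight reports the same "lease transfer aborted due to exceeded timeout" message, which is why ~550 of those lines land within a few milliseconds:

    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    func transferLeases(parent context.Context, rangeIDs []int, transfer func(context.Context, int) error) {
    	// One shared budget for the whole pass (value assumed for illustration).
    	transferCtx, cancel := context.WithTimeout(parent, 5*time.Second)
    	defer cancel()

    	for _, id := range rangeIDs {
    		select {
    		case <-transferCtx.Done():
    			// Every remaining range reports the same abort at once.
    			fmt.Printf("r%d: lease transfer aborted due to exceeded timeout\n", id)
    			continue
    		default:
    		}
    		if err := transfer(transferCtx, id); err != nil {
    			fmt.Printf("r%d: %v\n", id, err)
    		}
    	}
    }

    func main() {
    	// A single slow transfer (like r27 above) eats the budget; every range
    	// processed after it then reports the timeout in the same instant.
    	slow := func(ctx context.Context, id int) error {
    		if id == 27 {
    			<-ctx.Done()
    			return ctx.Err()
    		}
    		return nil
    	}
    	transferLeases(context.Background(), []int{26, 27, 28, 29, 30}, slow)
    }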

So the test isn't that far off passing (less than a second), save for the stuck lease on r27.

The lease isn't being transferred because the allocator doesn't find a valid lease target. A lease transfer candidate is found for the prior and following ranges during the first pass:

I230604 07:34:07.246962 61514 13@kv/kvserver/replicate_queue.go:1885 ⋮ [drain,n1,s1,r408/1:‹/Table/104/1/14{5-6}›] 6242  transferring lease to s3
I230604 07:34:07.247129 61541 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1736 ⋮ [drain,n1,s1,r27/1:‹/Table/2{6-7}›] 6243  no lease transfer target found for r27
I230604 07:34:07.247215 61542 13@kv/kvserver/replicate_queue.go:1885 ⋮ [drain,n1,s1,r192/1:‹/Table/104/1/36{1-2}›] 6244  transferring lease to s3

This indicates that we aren't filtering on a store/node-level condition, but on a range-level one. This also occurs briefly for r44, which then succeeds:

I230604 07:34:07.368285 62607 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1736 ⋮ [drain,n1,s1,r44/1:‹/Table/4{3-4}›] 6426  no lease transfer target found for r44
I230604 07:34:07.589451 63726 13@kv/kvserver/replicate_queue.go:1885 ⋮ [drain,n1,s1,r44/1:‹/Table/4{3-4}›] 6428  transferring lease to s2

What range-level state would prevent a replica from being a valid target? We know it is failing here:

if len(existing) == 0 || (len(existing) == 1 && existing[0].StoreID == leaseRepl.StoreID()) {
	log.KvDistribution.VEventf(ctx, 2, "no lease transfer target found for r%d", leaseRepl.GetRangeID())
	return roachpb.ReplicaDescriptor{}
}

If we exclude any node/store-related checks, the following could invalidate a candidate (a sketch of the first check follows this list):

  1. The range descriptor doesn't contain any voters¹ (excl. the current leaseholder).

if err := roachpb.CheckCanReceiveLease(existing[i], replDescs, false /* wasLastLeaseholder */); err != nil {

  2. The range descriptor only contains voters (excl. the current leaseholder) which are in need of a snapshot².

candidates = append(validSnapshotCandidates, excludeReplicasInNeedOfSnapshots(
	ctx, status, leaseRepl.GetFirstIndex(), candidates)...)
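
To make check 1 concrete, here is a simplified sketch (local stand-in types, not the roachpb implementation) of the eligibility rule from footnote 1 together with the surrounding filter: drop the current leaseholder, drop replicas that cannot receive the lease, and fall into the "no lease transfer target found" branch quoted above when nothing is left:

    package main

    import "fmt"

    type replicaType int

    const (
    	voterFull replicaType = iota
    	voterIncoming
    	voterOutgoing
    	nonVoter
    )

    type replicaDescriptor struct {
    	storeID int
    	typ     replicaType
    }

    // checkCanReceiveLease is a simplified stand-in for roachpb.CheckCanReceiveLease:
    // only full/incoming voters qualify, plus an outgoing voter that was the
    // previous leaseholder.
    func checkCanReceiveLease(rd replicaDescriptor, wasLastLeaseholder bool) error {
    	switch rd.typ {
    	case voterFull, voterIncoming:
    		return nil
    	case voterOutgoing:
    		if wasLastLeaseholder {
    			return nil
    		}
    	}
    	return fmt.Errorf("s%d cannot receive the lease", rd.storeID)
    }

    func main() {
    	const leaseholderStore = 1
    	existing := []replicaDescriptor{
    		{storeID: 1, typ: voterFull}, // current leaseholder (n1,s1)
    		{storeID: 2, typ: voterFull},
    		{storeID: 3, typ: voterFull},
    	}
    	var candidates []replicaDescriptor
    	for _, rd := range existing {
    		if rd.storeID == leaseholderStore {
    			continue // the draining leaseholder is never a candidate
    		}
    		if checkCanReceiveLease(rd, false /* wasLastLeaseholder */) != nil {
    			continue
    		}
    		candidates = append(candidates, rd)
    	}
    	if len(candidates) == 0 {
    		fmt.Println("no lease transfer target found for r27") // the failing branch quoted above
    		return
    	}
    	fmt.Printf("transferring lease to s%d\n", candidates[0].storeID)
    }

With the r27 descriptor above (three full voters), this filter still yields s2 and s3, which is why check 1 alone doesn't explain the failure.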

The range log shows that the last activity for r27 was up-replication, promoting n3 from a learner to a voter about 80s before the failing drain run.

Range log entry
      "event": {
        "timestamp": "2023-06-04T07:32:33.422399Z",
        "range_id": 27,
        "store_id": 1,
        "event_type": 1,
        "info": {
          "UpdatedDesc": {
            "range_id": 27,
            "start_key": "og==",
            "end_key": "ow==",
            "internal_replicas": [
              {
                "node_id": 1,
                "store_id": 1,
                "replica_id": 1,
                "type": 0
              },
              {
                "node_id": 2,
                "store_id": 2,
                "replica_id": 2,
                "type": 0
              },
              {
                "node_id": 3,
                "store_id": 3,
                "replica_id": 3,
                "type": 0
              }
            ],
            "next_replica_id": 4,
            "generation": 4,
            "sticky_bit": {}
          },
          "AddReplica": {
            "node_id": 3,
            "store_id": 3,
            "replica_id": 3,
            "type": 0
          },
          "Reason": "range under-replicated",
          "Details": "{\"Target\":\"s3, diversity:0.25, converges:0, balance:1, rangeCount:40\"}"
        }
      },
      "pretty_info": {
        "updated_desc": "r27:/Table/2{6-7} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]",
        "added_replica": "(n3,s3):3",
        "reason": "range under-replicated",
        "details": "{\"Target\":\"s3, diversity:0.25, converges:0, balance:1, rangeCount:40\"}"
      }

So I somewhat doubt that there are no voters; the descriptor logged on each failure shows them throughout. It seems plausible that we are hitting the second condition, where we incorrectly think both of the other replicas may need a snapshot.

I filed #105066 to help answer this. I can't figure out how we ended up in this weird state, nor how it seemingly resolves itself afterwards. The range status from both n1 (to be drained) and n2 is available about 10s before the drain begins, but doesn't show any issues:

"problems": {
  "unavailable": false,
  "leaderNotLeaseHolder": false,
  "noRaftLeader": false,
  "underreplicated": false,
  "overreplicated": false,
  "noLease": false,
  "quiescentEqualsTicking": false,
  "raftLogTooLarge": false,
  "circuitBreakerError": false,
  "pausedFollowers": false
}

I'm going to leave this open but remove the release blocker. The behavior is unexpected, but it does eventually succeed. With added observability into why there were no valid candidates, we could dig deeper; I suspect a bug and not just a flake.

Footnotes

  1. Relevant code. The replica needs to be either an (incoming) voter, or an outgoing voter that was the previous leaseholder.

  2. Relevant code.

@cockroach-teamcity
Member Author

roachtest.transfer-leases/drain-other-node failed with artifacts on release-22.2 @ 2124c6b9adc61030bedc4a6a67c025566f989201:

test artifacts and logs in: /artifacts/transfer-leases/drain-other-node/run_1
(quit.go:429).func4: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@kvoli
Collaborator

kvoli commented Jul 5, 2023

Got a repro with v=5 logging. The culprit is replicas in need of a snapshot!

19:58:22 quit.go:93: stopping node 1
node is draining... remaining: 335
I230705 19:58:23.593971 1 cli/rpc_node_shutdown.go:186  [-] 2  drain details: descriptor leases: 13, liveness record: 1, range lease iterations: 321
node is draining... remaining: 2
...
node is draining... remaining: 2
I230705 19:59:22.189940 1 cli/rpc_node_shutdown.go:186  [-] 269  drain details: range lease iterations: 2
I230705 19:59:22.401308 456512 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1562 ⋮ [drain,n1,s1,r10/1:‹/Table/{6-8}›] 9023  not considering [n1, s1] as a potential candidate for a lease transferbecause transfer options exclude current leaseholder
I230705 19:59:22.401335 456512 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2362 ⋮ [drain,n1,s1,r10/1:‹/Table/{6-8}›] 9024  not considering [n3, s3] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot
I230705 19:59:22.401354 456512 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2362 ⋮ [drain,n1,s1,r10/1:‹/Table/{6-8}›] 9025  not considering [n2, s2] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot
I230705 19:59:22.401371 456512 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1775 ⋮ [drain,n1,s1,r10/1:‹/Table/{6-8}›] 9026  no lease transfer target found for r10

It is unexpected that a replica requires a snapshot for so long. I'll dig into why the leaseholder believes both of the other voter replicas require a snapshot.

@kvoli
Collaborator

kvoli commented Jul 5, 2023

Added some more logging. The draining node cannot transfer away the lease because it considers the other replicas to be in StateProbe:

I230705 22:43:21.104864 287349 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1562 ⋮ [drain,n2,s2,r48/3:‹/Table/4{7-8}›] 9371  not considering [n2, s2] as a potential candidate for a lease transferbecause transfer options exclude current leaseholder
I230705 22:43:21.104909 287349 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2363 ⋮ [drain,n2,s2,r48/3:‹/Table/4{7-8}›] 9372  not considering [n1, s1] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot. reason=‹replica in StateProbe› raft-status=‹{"id":"3","term":
13,"vote":"3","commit":61,"lead":"3","raftState":"StateLeader","applied":61,"progress":{"1":{"match":61,"next":62,"state":"StateProbe"},"2":{"match":61,"next":62,"state":"StateProbe"},"3":{"match":61,"next":62,"state":"StateReplicate"}},"leadtransferee":"0"}›, first-index=26
I230705 22:43:21.104953 287349 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2363 ⋮ [drain,n2,s2,r48/3:‹/Table/4{7-8}›] 9373  not considering [n3, s3] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot. reason=‹replica in StateProbe› raft-status=‹{"id":"3","term":
13,"vote":"3","commit":61,"lead":"3","raftState":"StateLeader","applied":61,"progress":{"1":{"match":61,"next":62,"state":"StateProbe"},"2":{"match":61,"next":62,"state":"StateProbe"},"3":{"match":61,"next":62,"state":"StateReplicate"}},"leadtransferee":"0"}›, first-index=26

Unfortunately, the generated debug.zip doesn't show any replicas in StateProbe, and it was captured 15 seconds after the test failed. I'll send it over Slack for those interested.
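
For reference, here is a sketch of how a leaseholder can classify followers from its raft progress in the way the log lines above describe. It uses the standalone go.etcd.io/raft/v3 module purely for illustration (CockroachDB vendors its own raft, and the real check lives in the allocator code quoted in the logs, so names differ); the rules are simplified:

    package main

    import (
    	"fmt"

    	"go.etcd.io/raft/v3"
    	"go.etcd.io/raft/v3/tracker"
    )

    // mayNeedSnapshot reports whether a follower looks like it might need a
    // snapshot from the leader's perspective: it is in StateSnapshot or
    // StateProbe, or its match index is below the leader's first log index.
    func mayNeedSnapshot(st raft.Status, replicaID, firstIndex uint64) (bool, string) {
    	pr, ok := st.Progress[replicaID]
    	if !ok {
    		return true, "no progress tracked for replica"
    	}
    	switch pr.State {
    	case tracker.StateSnapshot:
    		return true, "replica in StateSnapshot"
    	case tracker.StateProbe:
    		// The case hit in the logs above: a caught-up follower that the
    		// leader nevertheless has in StateProbe.
    		return true, "replica in StateProbe"
    	}
    	if pr.Match < firstIndex {
    		return true, "replica match index below leader first index"
    	}
    	return false, ""
    }

    func main() {
    	// Reconstructed from the raft-status in the log lines above: followers 1
    	// and 2 are in StateProbe at match=61 while the leader's first index is 26.
    	st := raft.Status{Progress: map[uint64]tracker.Progress{
    		1: {Match: 61, Next: 62, State: tracker.StateProbe},
    		2: {Match: 61, Next: 62, State: tracker.StateProbe},
    		3: {Match: 61, Next: 62, State: tracker.StateReplicate},
    	}}
    	for _, id := range []uint64{1, 2} {
    		if skip, reason := mayNeedSnapshot(st, id, 26); skip {
    			fmt.Printf("not considering replica %d: %s\n", id, reason)
    		}
    	}
    }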

r48 range.json
{
  "span": {
    "start_key": "/Table/47",
    "end_key": "/Table/48"
  },
  "raft_state": {
    "replica_id": 3,
    "hard_state": {
      "term": 13,
      "vote": 3,
      "commit": 62
    },
    "lead": 3,
    "state": "StateLeader",
    "applied": 62,
    "progress": {
      "1": {
        "match": 62,
        "next": 63,
        "state": "StateReplicate"
      },
      "2": {
        "match": 62,
        "next": 63,
        "state": "StateReplicate"
      },
      "3": {
        "match": 62,
        "next": 63,
        "state": "StateReplicate"
      }
    }
  },
  "state": {
    "state": {
      "raft_applied_index": 62,
      "lease_applied_index": 20,
      "desc": {
        "range_id": 48,
        "start_key": "tw==",
        "end_key": "uA==",
        "internal_replicas": [
          {
            "node_id": 1,
            "store_id": 1,
            "replica_id": 1,
            "type": 0
          },
          {
            "node_id": 3,
            "store_id": 3,
            "replica_id": 2,
            "type": 0
          },
          {
            "node_id": 2,
            "store_id": 2,
            "replica_id": 3,
            "type": 0
          }
        ],
        "next_replica_id": 4,
        "generation": 4,
        "sticky_bit": {}
      },
      "lease": {
        "start": {
          "wall_time": 1688596943808307970
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 3,
          "type": 0
        },
        "proposed_ts": {
          "wall_time": 1688596943814253362
        },
        "epoch": 3,
        "sequence": 11,
        "acquisition_type": 2
      },
      "truncated_state": {
        "index": 25,
        "term": 6
      },
      "gc_threshold": {},
      "stats": {
        "contains_estimates": 0,
        "last_update_nanos": 1688596878371847979,
        "intent_age": 0,
        "gc_bytes_age": 0,
        "live_bytes": 2765,
        "live_count": 52,
        "key_bytes": 1067,
        "key_count": 52,
        "val_bytes": 1698,
        "val_count": 52,
        "intent_bytes": 0,
        "intent_count": 0,
        "separated_intent_count": 0,
        "range_key_count": 0,
        "range_key_bytes": 0,
        "range_val_count": 0,
        "range_val_bytes": 0,
        "sys_bytes": 464,
        "sys_count": 6,
        "abort_span_bytes": 0
      },
      "version": {
        "major": 22,
        "minor": 2,
        "patch": 0,
        "internal": 0
      },
      "raft_closed_timestamp": {
        "wall_time": 1688597021808929739
      },
      "raft_applied_index_term": 13,
      "gc_hint": {
        "latest_range_delete_timestamp": {}
      }
    },
    "last_index": 62,
    "raft_log_size": 7890,
    "raft_log_size_trusted": true,
    "approximate_proposal_quota": 8388608,
    "proposal_quota_base_index": 62,
    "range_max_bytes": 536870912,
    "active_closed_timestamp": {
      "wall_time": 1688597036166477063
    },
    "rangefeed_registrations": 1,
    "tenant_id": 1,
    "closed_timestamp_sidetransport_info": {
      "replica_closed": {
        "wall_time": 1688597036166477063
      },
      "replica_lai": 20,
      "central_closed": {}
    }
  },
  "source_node_id": 2,
  "source_store_id": 2,
  "lease_history": [
    {
      "start": {
        "wall_time": 1688596943808307970
      },
      "expiration": {
        "wall_time": 1688596952808261126
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 3,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1688596943808261126
      },
      "sequence": 10,
      "acquisition_type": 1
    },
    {
      "start": {
        "wall_time": 1688596943808307970
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 3,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1688596943814253362
      },
      "epoch": 3,
      "sequence": 11,
      "acquisition_type": 2
    }
  ],
  "problems": {},
  "stats": {
    "queries_per_second": 0.1990982102794311,
    "requests_per_second": 0.19909820892540883,
    "reads_per_second": 3.8457390900848414,
    "write_bytes_per_second": 1.5927856705353611,
    "read_bytes_per_second": 223.96452695703275
  },
  "lease_status": {
    "lease": {
      "start": {
        "wall_time": 1688596943808307970
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 3,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1688596943814253362
      },
      "epoch": 3,
      "sequence": 11,
      "acquisition_type": 2
    },
    "now": {
      "wall_time": 1688597039265534769
    },
    "request_time": {
      "wall_time": 1688597039265534769
    },
    "state": 1,
    "liveness": {
      "node_id": 2,
      "epoch": 3,
      "expiration": {
        "wall_time": 1688597045920188603,
        "logical": 0
      },
      "draining": true
    },
    "min_valid_observed_timestamp": {
      "wall_time": 1688596943808307970
    }
  },
  "quiescent": true,
  "top_k_locks_by_wait_queue_waiters": null,
  "locality": {
    "tiers": [
      {
        "key": "cloud",
        "value": "gce"
      },
      {
        "key": "region",
        "value": "us-east1"
      },
      {
        "key": "zone",
        "value": "us-east1-b"
      }
    ]
  },
  "is_leaseholder": true,
  "lease_valid": true
}

A question for the @cockroachdb/replication folks: what could lead to not exiting StateProbe during a drain? Are there some good vmodule settings for capturing information about heartbeats sent to followers in StateProbe?

I noticed there has been some recent activity in #103827 to ensure ranges don't quiesce with followers in StateProbe. cc @tbg.

@erikgrinaker
Contributor

#103827 was backported to 22.2 FWIW. However, there's another related PR, #104212, which wasn't. I wonder if that's a contributing factor here.

Btw, I'm seeing a lot of these messages, which I think you've added manually since I don't find them in the code? Would they also prevent lease transfers? Why are they considered non-live? Liveness?

not considering [n1, s1] as a potential candidate for a lease transferbecause replica is non-live

Anyway, ignoring those, we see n2 receive the lease here:

I230705 22:42:23.836414 145 kv/kvserver/replica_proposal.go:273 ⋮ [T1,n2,s2,r48/3:‹/Table/4{7-8}›,raft] 101
  new range lease repl=(n2,s2):3 seq=11 start=1688596943.808307970,0 epo=3 pro=1688596943.814253362,0
  following repl=(n2,s2):3 seq=10 start=1688596943.808307970,0 exp=1688596952.808261126,0 pro=1688596943.808261126,0

Shortly after, we see a Raft status:

I230705 22:42:23.883730 4274 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2363 ⋮ [T1,n2,replicate,s2,r48/3:‹/Table/4{7-8}›] 20
  not considering [n1, s1] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot.
  reason=‹replica in StateProbe› raft-status=‹{"id":"3","term":13,"vote":"3","commit":61,"lead":"3","raftState":"StateLeader","applied":61,
  "progress":{"1":{"match":0,"next":59,"state":"StateProbe"},"2":{"match":61,"next":62,"state":"StateReplicate"},"3":{"match":61,"next":62,"state":"StateReplicate"}},"leadtransferee":"0"}›, first-index=26

Notice how both 2 and 3 (the leader n2) are in StateReplicate and are fully caught up at index 61, while 1 is lagging in StateProbe.

Some seconds later, we see that 1 has now caught up at index 61 and is in StateReplicate, while 2 has transitioned to StateProbe.

I230705 22:42:35.128527 26774 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2363 ⋮ [T1,n2,replicate,s2,r48/3:‹/Table/4{7-8}›] 1880
  not considering [n3, s3] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot.
  reason=‹replica in StateProbe› raft-status=‹{"id":"3","term":13,"vote":"3","commit":61,"lead":"3","raftState":"StateLeader","applied":61,
  "progress":{"1":{"match":61,"next":62,"state":"StateReplicate"},"2":{"match":61,"next":62,"state":"StateProbe"},"3":{"match":61,"next":62,"state":"StateReplicate"}},"leadtransferee":"0"}›, first-index=26

And then a few seconds later, both followers are in StateProbe:

I230705 22:42:38.688945 32927 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2363 ⋮ [T1,n2,replicate,s2,r48/3:‹/Table/4{7-8}›] 3677
  not considering [n3, s3] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot.
  reason=‹replica in StateProbe› raft-status=‹{"id":"3","term":13,"vote":"3","commit":61,"lead":"3","raftState":"StateLeader","applied":61,
  "progress":{"2":{"match":61,"next":62,"state":"StateProbe"},"3":{"match":61,"next":62,"state":"StateReplicate"},"1":{"match":61,"next":62,"state":"StateProbe"}},"leadtransferee":"0"}›, first-index=26

Here's what I'm guessing happened: once replica 1 caught up to index 61, all replicas were in StateReplicate. This allowed the range to quiesce. However, after quiescing, some replicas were reported as unreachable, and since the leader was quiesced it would never notice that they recovered. (A toy model of this is sketched at the end of this comment.)

Let's see where this could happen:

  1. Sending outbound Raft messages only happens during handleRaftReady processing, which does not happen if the range is quiesced (even when it just quiesced on the preceding tick in the same scheduling run).

    if !r.sendRaftMessageRequest(ctx, req) {
        if err := r.withRaftGroup(true, func(raftGroup *raft.RawNode) (bool, error) {
            r.mu.droppedMessages++
            raftGroup.ReportUnreachable(msg.To)
            return true, nil
        }); err != nil && !errors.Is(err, errRemoved) {
            log.Fatalf(ctx, "%v", err)
        }
    }

  2. When followers are paused, which doesn't happen when quiesced.

    for replicaID := range r.mu.pausedFollowers {
        // We're dropping messages to those followers (see handleRaftReady) but
        // it's a good idea to tell raft not to even bother sending in the first
        // place. Raft will react to this by moving the follower to probing state
        // where it will be contacted only sporadically until it responds to an
        // MsgApp (which it can only do once we stop dropping messages). Something
        // similar would result naturally if we didn't report as unreachable, but
        // with more wasted work.
        r.mu.internalRaftGroup.ReportUnreachable(uint64(replicaID))
    }

  3. When we fail to send a heartbeat to followers.

    if !s.cfg.Transport.SendAsync(chReq, rpc.SystemClass) {
        for _, beat := range beats {
            if repl, ok := s.mu.replicasByRangeID.Load(beat.RangeID); ok {
                repl.addUnreachableRemoteReplica(beat.ToReplicaID)
            }
        }
        for _, resp := range resps {
            if repl, ok := s.mu.replicasByRangeID.Load(resp.RangeID); ok {
                repl.addUnreachableRemoteReplica(resp.ToReplicaID)
            }
        }
        return 0
    }

The last case is the one that would be covered by #104212. Basically, we must have failed to send a heartbeat to the nodes, presumably the heartbeat sent as we quiesced the range, since we otherwise don't send them (or alternatively a previously queued heartbeat). I'll see if I can find any indications of that.

There are also a few cases in etcd/raft where we call BecomeProbe(), but only in response to actions taken by CRDB, which shouldn't happen on a quiesced range.
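
To make the hypothesis concrete, here is a toy model (invented types and behavior, purely illustrative, not CockroachDB code) of how a ReportUnreachable that coincides with quiescence can leave a follower parked in StateProbe; #103827, mentioned earlier, addresses the related case of not quiescing while a follower is already in StateProbe:

    package main

    import "fmt"

    type progressState string

    const (
    	stateReplicate progressState = "StateReplicate"
    	stateProbe     progressState = "StateProbe"
    )

    type replica struct {
    	quiesced  bool
    	followers map[uint64]progressState
    }

    // reportUnreachable mimics raft's reaction to a dropped message or failed
    // heartbeat send (cases 1-3 above): the follower is moved to StateProbe.
    func (r *replica) reportUnreachable(id uint64) {
    	r.followers[id] = stateProbe
    }

    // tick stands in for handleRaftReady/tick processing, where a live leader
    // would probe the follower and, on a response, restore StateReplicate.
    // Quiesced replicas skip this entirely.
    func (r *replica) tick() {
    	if r.quiesced {
    		return
    	}
    	for id := range r.followers {
    		r.followers[id] = stateReplicate // probe succeeded in this toy model
    	}
    }

    func main() {
    	r := &replica{followers: map[uint64]progressState{1: stateReplicate, 2: stateReplicate}}
    	r.reportUnreachable(1) // e.g. the failed heartbeat send in case 3
    	r.quiesced = true      // the range quiesces right around the same time
    	r.tick()               // no-op: follower 1 stays stuck in StateProbe
    	fmt.Println(r.followers)
    }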

@erikgrinaker
Contributor

erikgrinaker commented Jul 6, 2023

Sure enough, we see n2 failing connections to n1 and n3 right around this time:

I230705 22:42:28.620185 7325 2@rpc/context.go:1393 ⋮ [n2] 14  closing [{n3: ‹10.142.0.72:26257› (default)} {n0: ‹10.142.0.72:26257› (default)}]
I230705 22:42:28.718969 7589 2@rpc/context.go:1393 ⋮ [n2] 15  closing [{n3: ‹10.142.0.72:26257› (system)} {n0: ‹10.142.0.72:26257› (system)}]
I230705 22:42:28.917466 13361 2@rpc/context.go:1774 ⋮ [n2,rnode=3,raddr=‹10.142.0.72:26257›,class=system] 16  dialing
I230705 22:42:29.618228 14589 2@rpc/context.go:1774 ⋮ [n2,rnode=3,raddr=‹10.142.0.72:26257›,class=default] 17  dialing
I230705 22:42:31.918956 13412 2@rpc/context.go:1393 ⋮ [n2] 18  closing [{n3: ‹10.142.0.72:26257› (system)} {n0: ‹10.142.0.72:26257› (system)}]
I230705 22:42:32.421212 125 2@util/log/event_log.go:32 ⋮ [n2] 19 ={"Timestamp":1688596952421206651,"EventType":"runtime_stats","MemRSSBytes":228589568,"GoroutineCount":318,"MemStackSysBytes":3768320,"GoAllocBytes":65440344,"GoTotalBytes":116572440,"GoStatsStaleness":0.000988878,"HeapFragmentBytes":13948328,"HeapReservedBytes":16326656,"HeapReleasedBytes":1179648,"CGoAllocBytes":25360992,"CGoTotalBytes":31051776,"CGoCallRate":0.0000011707945,"CPUUserPercent":0.00000019187527,"CPUSysPercent":0.000000038493496,"GCPausePercent":0.00000000018450057,"GCRunCount":16,"NetHostRecvBytes":1663415,"NetHostSendBytes":882882}
I230705 22:42:32.421345 125 2@server/status/runtime_log.go:47 ⋮ [n2] 20  runtime stats: 218 MiB RSS, 318 goroutines (stacks: 3.6 MiB), 62 MiB/111 MiB Go alloc/total(stale) (heap fragmentation: 13 MiB, heap reserved: 16 MiB, heap released: 1.1 MiB), 24 MiB/30 MiB CGO alloc/total (0.0 CGO/sec), 0.0/0.0 %(u/s)time, 0.0 %gc (16x), 1.6 MiB/862 KiB (r/w)net
I230705 22:42:32.620820 14592 2@rpc/context.go:1393 ⋮ [n2] 21  closing [{n3: ‹10.142.0.72:26257› (default)} {n0: ‹10.142.0.72:26257› (default)}]
I230705 22:42:32.917555 20382 2@rpc/context.go:1774 ⋮ [n2,rnode=3,raddr=‹10.142.0.72:26257›,class=system] 22  dialing
I230705 22:42:33.141481 20792 2@rpc/context.go:1774 ⋮ [n2,rnode=3,raddr=‹10.142.0.72:26257›,class=default] 23  dialing 
I230705 22:42:33.386615 206 2@kv/kvserver/replica_proposal.go:376 ⋮ [n2,s2,r2/3:‹/System/NodeLiveness{-Max}›,raft] 24  acquired system range lease: repl=(n2,s2):3 seq=6 start=1688596953.375609416,0 exp=1688596962.375551770,0 pro=1688596953.375551770,0
I230705 22:42:33.608462 191 2@kv/kvserver/replica_proposal.go:376 ⋮ [n2,s2,r1/2:‹/{Min-System/NodeL…}›,raft] 25  acquired system range lease: repl=(n2,s2):2 seq=6 start=1688596953.495391157,0 exp=1688596962.495327860,0 pro=1688596953.495327860,0
W230705 22:42:35.000201 162 2@gossip/gossip.go:1447 ⋮ [n2] 26  no incoming or outgoing connections
I230705 22:42:35.000517 244 2@rpc/context.go:1393 ⋮ [n2] 27  closing [{n1: ‹10.142.0.56:26257› (system)} {n0: ‹10.142.0.56:26257› (system)}]
I230705 22:42:35.117638 26689 2@rpc/context.go:1774 ⋮ [n2,rnode=1,raddr=‹10.142.0.56:26257›,class=system] 28  dialing
W230705 22:42:35.118413 26689 2@rpc/nodedialer/nodedialer.go:193 ⋮ [n2] 29  unable to connect to n1: failed to connect to n1 at ‹10.142.0.56:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.56:26257: connect: connection refused"›
I230705 22:42:35.184562 222 2@rpc/context.go:1393 ⋮ [n2] 30  closing [{n1: ‹10.142.0.56:26257› (default)} {n0: ‹10.142.0.56:26257› (default)}]
I230705 22:42:35.184998 26844 2@rpc/context.go:1774 ⋮ [n2,rnode=1,raddr=‹10.142.0.56:26257›,class=default] 31  dialing
W230705 22:42:35.185623 26844 2@rpc/nodedialer/nodedialer.go:193 ⋮ [n2] 32  unable to connect to n1: failed to connect to n1 at ‹10.142.0.56:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.56:26257: connect: connection refused"›

I think it's plausible that #104212 would fix this then. I'll give it a try.

@erikgrinaker
Contributor

erikgrinaker commented Jul 6, 2023

Still saw 1 failure out of 100 runs with #104212. Interestingly, we're seeing the lease count bounce back up to >500 leases occasionally.

I230706 10:41:45.682020 1 cli/rpc_node_shutdown.go:186  [-] 66  drain details: range lease iterations: 555 
node is draining... remaining: 2
I230706 10:41:45.901536 1 cli/rpc_node_shutdown.go:186  [-] 67  drain details: range lease iterations: 2
node is draining... remaining: 2
I230706 10:41:46.120754 1 cli/rpc_node_shutdown.go:186  [-] 68  drain details: range lease iterations: 2
node is draining... remaining: 2
I230706 10:41:46.339899 1 cli/rpc_node_shutdown.go:186  [-] 69  drain details: range lease iterations: 2
node is draining... remaining: 555 
I230706 10:41:46.564888 1 cli/rpc_node_shutdown.go:186  [-] 70  drain details: range lease iterations: 555 
node is draining... remaining: 2
I230706 10:41:46.784263 1 cli/rpc_node_shutdown.go:186  [-] 71  drain details: range lease iterations: 2
node is draining... remaining: 2
I230706 10:41:47.003289 1 cli/rpc_node_shutdown.go:186  [-] 72  drain details: range lease iterations: 2
node is draining... remaining: 2

Didn't run this with any additional logging or anything, unfortunately.

Are there some good vmodule settings for capturing information about heartbeats sent to followers in StateProbe?

--vmodule=raft=5,replica_raft=5,replica_raft_quiesce=5,allocator=5 would give a ton of stuff. Maybe too much.

I can try another run with this and see what falls out.

@kvoli
Collaborator

kvoli commented Jul 6, 2023

Btw, I'm seeing a lot of these messages, which I think you've added manually since I don't find them in the code? Would they also prevent lease transfers? Why are they considered non-live? Liveness?

Repro branch here: kvoli@d3fb6f0

It means the candidate was filtered by the store pool for being suspect/draining/unavailable/dead. It is a bit rough; I'm thinking we could probably benefit from logging that is on by default (not vmodule) for drain lease transfers in the allocator.

Issue #105066 captures this.

Let's see where this could happen:

Thanks for the pointers - this was very helpful.

--vmodule=raft=5,replica_raft=5,replica_raft_quiesce=5,allocator=5 would give a ton of stuff. Maybe too much.

I can try another run with this and see what falls out.

Thanks! Let me know what you find. I'm going to stop actively stressing this test for now.

@erikgrinaker
Contributor

I think I found the issue, see #101624 (comment).

@erikgrinaker erikgrinaker assigned erikgrinaker and unassigned kvoli Jul 6, 2023
@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels Jul 6, 2023
@blathers-crl

blathers-crl bot commented Jul 6, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Resolved by #106805.
