2 Coordinators Elected Leader #16411

razinbouzar · 2024-05-07T19:44:49Z


Affected Version

28.0.1 (also observed in v25)
ZK version 3.7

Description

During patching of our underlying EKS nodes, we observe a condition wherein 2 coordinators are elected leader. When we encounter this condition, we see multiple task failures across different data sources.

razinbouzar · 2024-05-09T14:34:08Z

Another observation is that this condition occurred during a ZK leader election change.

gianm · 2024-05-09T22:48:06Z

We saw a double-leader situation recently when a ZK server cycled, and we suspect it has something to do with https://issues.apache.org/jira/browse/CURATOR-696. That Curator Jira suggests a bug was introduced by https://issues.apache.org/jira/browse/CURATOR-644 (PR: apache/curator#430).

It seems plausible that CURATOR-644 did introduce a bug. It changed the reconnection logic from always calling reset() (which recreates the ephemeral znode) to calling getChildren(), which looks for existing znodes and only calls reset() if they cannot be found.
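To make the difference between the two reconnection behaviors concrete, here is a minimal sketch of the decision as described in the comment above. It is not Curator's code: the function `on_reconnect`, its `always_reset` flag, and the znode name are hypothetical and exist only for illustration.

```python
def on_reconnect(children, our_path, always_reset):
    """Return the action a latch takes once its ZK connection comes back.

    children:     znode names currently visible under the latch path
    our_path:     the znode this latch created before the disconnect
    always_reset: True models the older behavior, False the newer one
    (All names here are hypothetical; this is a toy model, not Curator's API.)
    """
    if always_reset:
        return "reset"  # always drop leadership and recreate the ephemeral znode
    if our_path in children:
        return "keep"   # znode is still visible, so carry on with current state
    return "reset"      # znode is gone, so recreate it


# The znode was created in the old session, which has not expired yet.
stale_children = ["latch-0000000001"]
print(on_reconnect(stale_children, "latch-0000000001", always_reset=True))   # reset
print(on_reconnect(stale_children, "latch-0000000001", always_reset=False))  # keep
```

Note that the "keep" branch only checks whether the znode is visible, not which session owns it, which is the gap suspected in this thread.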

We updated to Curator 5.4 some time ago, in #13302. So if this is indeed what’s going on, it has potentially been an issue since Druid 25.

What we saw specifically was this scenario:

OL 1 was leader prior to ZK connection loss
OL 1 reconnected to ZK and got a session id that we believe is a new session id (although we were not able to confirm that)
OL 1's LeaderLatch recipe checked the latch path and saw an ephemeral znode there that it believed was its own, so it started leadership.
OL 2, 30s later, checked the latch path and saw no children at all (not even the one for OL 1). It created an ephemeral znode for itself, and started leadership.

We think what happened is that both OLs established new sessions, even though the old sessions hadn’t expired yet. Because the old sessions hadn’t expired yet, the old ephemeral znodes were still there upon reconnection. The old leader, OL 1, saw both old znodes there and assumed it was still leader. But because those znodes were associated with different sessions, they went away in 30s. When OL 2 noticed that, it assumed there was no active leader, so it became one and then we had two leaders.
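The suspected sequence can be replayed with a small toy model. This is only a sketch under stated assumptions: `FakeZk`, `Latch`, the session ids, and the 30-second timeout are all made up for illustration, and the model assumes that only the follower re-checks the latch path when the old znodes disappear (the leader has no predecessor znode to watch). It does not use ZooKeeper or Curator.

```python
SESSION_TIMEOUT_S = 30  # assumed, matching the 30s gap in the report


class FakeZk:
    """Toy ZooKeeper: ephemeral znodes that vanish when their session expires."""

    def __init__(self):
        self.znodes = {}          # znode name -> owning session id
        self.session_expiry = {}  # session id -> expiry time, None while connected
        self.seq = 0

    def create_ephemeral(self, owner, session):
        name = f"latch-{self.seq:04d}-{owner}"
        self.seq += 1
        self.znodes[name] = session
        self.session_expiry.setdefault(session, None)
        return name

    def abandon(self, session, now):
        # The client stops heartbeating; the server keeps the session until timeout.
        self.session_expiry[session] = now + SESSION_TIMEOUT_S

    def children(self, now):
        live = {}
        for name, session in self.znodes.items():
            expiry = self.session_expiry[session]
            if expiry is None or now < expiry:
                live[name] = session
        self.znodes = live
        return sorted(live)


class Latch:
    def __init__(self, owner):
        self.owner = owner
        self.our_path = None
        self.is_leader = False

    def check(self, zk, session, now):
        """Look for our znode first; recreate it only if it is missing."""
        kids = zk.children(now)
        if self.our_path not in kids:
            self.is_leader = False
            self.our_path = zk.create_ephemeral(self.owner, session)
            kids = zk.children(now)
        self.is_leader = kids[0] == self.our_path


def run_scenario():
    zk = FakeZk()
    ol1, ol2 = Latch("OL1"), Latch("OL2")

    # t=0: both join with their original sessions; OL1 has the lowest znode.
    ol1.check(zk, "s1-old", now=0)
    ol2.check(zk, "s2-old", now=0)

    # t=100: connection loss. Both come back with NEW sessions while the old
    # sessions (and their znodes) are still alive on the server.
    zk.abandon("s1-old", now=100)
    zk.abandon("s2-old", now=100)
    ol1.check(zk, "s1-new", now=100)  # sees its old znode, stays leader
    ol2.check(zk, "s2-new", now=100)  # sees its old znode, stays follower

    # t=130: the old sessions expire. Only OL2 re-checks (assumption above).
    ol2.check(zk, "s2-new", now=130)

    return [latch.owner for latch in (ol1, ol2) if latch.is_leader]


print(run_scenario())  # ['OL1', 'OL2']
```

In this model OL 1 ends up holding leadership without any znode backing it, while OL 2 legitimately owns the only znode on the latch path, so both report themselves as leader.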

gianm · 2024-05-09T22:52:55Z

I commented on CURATOR-696 linking back here.

razinbouzar · 2024-05-16T03:41:30Z

@cryptoe can we re-open this issue since #16425 was reverted in #16445?

razinbouzar · 2024-05-22T22:41:52Z

@gianm Curator 5.7.0 includes the fix for https://issues.apache.org/jira/browse/CURATOR-696. I'm unsure when this version will be made available, but have asked here.


razinbouzar added the Uncategorized problem report label May 7, 2024

gianm added the "Bug" and "Area - ZooKeeper/Curator" labels and removed the "Uncategorized problem report" label May 9, 2024

asdf2014 mentioned this issue May 10, 2024

Downgrade the version of Apache Curator from 5.5.0 to 5.3.0 to avoid a bug in the new version #16425

Merged


tisonkun mentioned this issue May 10, 2024

CURATOR-696. Fix double leader for LeaderLatch apache/curator#500

Merged

cryptoe closed this as completed in #16425 May 10, 2024

kfaraz reopened this May 16, 2024

kfaraz self-assigned this May 16, 2024

razinbouzar pushed a commit to razinbouzar/druid that referenced this issue May 31, 2024

Addressing apache#16411

b95b944

Added listener method that tracks ZK leader state

razinbouzar mentioned this issue May 31, 2024

Addressing 2 Coordinators Elected As Leader (#16411) #16528

Merged


kfaraz closed this as completed in #16528 Jun 7, 2024

razinbouzar pushed a commit to razinbouzar/druid that referenced this issue Jun 17, 2024

Addressing apache#16411

e3e2b9f

Added listener method that tracks ZK leader state

razinbouzar mentioned this issue Jun 17, 2024

Curator 5.7.0 Upgrade #16617

Open

