-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Ingesters unable to re-join the cluster then unregister_on_shutdown=false
+ -ingester.heartbeat-period=0
#4366
Conversation
unregister_on_shutdown=false
+ -ingester.heartbeat-period=0
pkg/ring/lifecycler.go
Outdated
@@ -588,6 +589,13 @@ func (i *Lifecycler) initRing(ctx context.Context) error { | |||
i.setTokens(tokens) | |||
|
|||
level.Info(log.Logger).Log("msg", "existing entry found in ring", "state", i.GetState(), "tokens", len(tokens), "ring", i.RingName) | |||
|
|||
// If we flip the instance from LEAVING to ACTIVE and the heartbeat is disabled, we need to update KV here | |||
if i.cfg.HeartbeatPeriod == 0 && ringDesc.Ingesters[i.ID].State == LEAVING && instanceDesc.State == ACTIVE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: are we going to hit the issue #4365 in the case the ingester crashes while in the PENDING
or JOINING
state in the ring and then restarts? Could you add unit tests to check that too, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure..
But i think we are fine there as..
if instanceDesc.State == JOINING {
level.Warn(log.Logger).Log("msg", "instance found in ring as JOINING, setting to PENDING",
"ring", i.RingName)
instanceDesc.State = PENDING
return ringDesc, true, nil
}
We will still hit the issue if the number of tokens change though... but this is already the case now
6f69d75
to
e27561c
Compare
Hi @pracucci Did u had a change to see the updated change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks reasonable, but I got lost in the test.
I made a few comments, but I think I'm not getting to the crux of what it is doing.
E.g. how many ingesters are there supposed to be?
pkg/ring/lifecycler_test.go
Outdated
require.NoError(t, err) | ||
require.NoError(t, services.StartAndAwaitRunning(context.Background(), l2)) | ||
|
||
pool := func(condition func(*Desc) bool) map[string]InstanceDesc { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't get what "pool" means here. Can you pick a more descriptive name, or add a comment saying what this function does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be poll
instead of pool
:P
This function is basically waiting for a condition and returning actual state of the ingesters after the condition succeed.
pkg/ring/lifecycler_test.go
Outdated
return ingesters | ||
} | ||
|
||
startIngesterAndWaitActive := func(lcConfig LifecyclerConfig) *Lifecycler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function starts one ingester then also waits for two more to be active?
Since starting one ingester is repeated many times, could that be a separate function from the one that waits for two to be active?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function does not need to check the ingester1. I did that just in case... removing it.
But basically this function does what you stated: "starts ingester2 and wait it to became active."
13aa6bf
to
4587b67
Compare
…tdown=false Signed-off-by: Alan Protasio <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Alan Protasio <[email protected]>
Co-authored-by: Bryan Boreham <[email protected]> Signed-off-by: Alan Protasio <[email protected]>
Co-authored-by: Bryan Boreham <[email protected]> Signed-off-by: Alan Protasio <[email protected]>
88353cb
to
8111704
Compare
Signed-off-by: Alan Protasio <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think I get it now, but a bit more consistency in naming and usage would improve clarity in the test.
pkg/ring/lifecycler_test.go
Outdated
|
||
// Starts Ingester2 and wait it to became active | ||
startIngester2AndWaitActive := func(lcConfig LifecyclerConfig) *Lifecycler { | ||
ingester, err := NewLifecycler(lcConfig, &noopFlushTransferer{}, "ingester", IngesterRingKey, true, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we create a new Lifecycler here? What happens to the old one in l2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And why is it named ingester
? Nearby we have a variable ingesters
which is a map[string]InstanceDesc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we create a new Lifecycler here? What happens to the old one in l2?
I think i tried to reuse it and it did not work.. but besides that, In order to simulated a brand new ingester i would assume that creating a new instance is the safest thing to do. make sense?
pkg/ring/lifecycler_test.go
Outdated
} | ||
|
||
// Starts Ingester2 and wait it to became active | ||
startIngester2AndWaitActive := func(lcConfig LifecyclerConfig) *Lifecycler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's nothing specific to Ingester 2 in this function, is there?
That is a consequence of passing LifecyclerConfig with ID
set to "ing2"
, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Signed-off-by: Alan Protasio <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - sorry it took a while but I think the flow of the test is much clearer now.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, LGTM!
Thanks guys! 😄 |
…own=false` + `-ingester.heartbeat-period=0` (cortexproject#4366) * Fix problem when ingester heartbeat is disabled and unregister_on_shutdown=false Signed-off-by: Alan Protasio <[email protected]> * Update CHANGELOG.md Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Update pkg/ring/lifecycler.go Co-authored-by: Bryan Boreham <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Update pkg/ring/lifecycler_test.go Co-authored-by: Bryan Boreham <[email protected]> Signed-off-by: Alan Protasio <[email protected]> * Address comments Signed-off-by: Alan Protasio <[email protected]> * Addressing comments on test Signed-off-by: Alan Protasio <[email protected]> Co-authored-by: Marco Pracucci <[email protected]> Co-authored-by: Bryan Boreham <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
What this PR does:
Address problem state on #4365
Which issue(s) this PR fixes:
Fixes #4365
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]