-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Persistence.Healthcheck probe stall because of failed warmup #275
Fix Persistence.Healthcheck probe stall because of failed warmup #275
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some questions
_currentLivenessStatus = livenessStatus; | ||
if (livenessStatus.Warmup && (!livenessStatus.SnapshotSaved || !livenessStatus.JournalPersisted)) | ||
_firstIndex++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this value goes up each time the warmup probe recovers without knowing whether or not the two persist operations have both completed successfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warmup is a property we pass back to the probe that the suicide actor is in "first attempt" mode, if any of the snapshot save or journal persist failed, we extend the "first attempt" time to the next cycle until both succeed.
This is because the liveness status does not allow for undefined/null recovery status for the probe to detect if recovery was performed on both snapshot and journal.
Maybe we should change the recovery properties of the persistence liveness status to be nullable instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should change the recovery properties of the persistence liveness status to be nullable instead?
Yes, if there's a better / more intuitive way of modeling this we should do that instead.
Persistence.Healthcheck will stall if the first warmup cycle failed.
NOTE: this would mean that if snapshot save and journal persist never succeed, the probe will be stuck in a perpetual warmup state because recovery would never be defined