Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Persistence.Healthcheck probe stall because of failed warmup #275

Conversation

Arkatufus
Copy link
Contributor

@Arkatufus Arkatufus commented Mar 25, 2024

Persistence.Healthcheck will stall if the first warmup cycle failed.

  • Warmup cycle need to be extended every time snapshot save or journal persist fail until both succeeded because recovery will never be defined until both succeeded for the first time.
  • Recovery status should not be a required check to declare that the persistence probe is a success.

NOTE: this would mean that if snapshot save and journal persist never succeed, the probe will be stuck in a perpetual warmup state because recovery would never be defined

Copy link
Member

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions

_currentLivenessStatus = livenessStatus;
if (livenessStatus.Warmup && (!livenessStatus.SnapshotSaved || !livenessStatus.JournalPersisted))
_firstIndex++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this value goes up each time the warmup probe recovers without knowing whether or not the two persist operations have both completed successfully?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warmup is a property we pass back to the probe that the suicide actor is in "first attempt" mode, if any of the snapshot save or journal persist failed, we extend the "first attempt" time to the next cycle until both succeed.

This is because the liveness status does not allow for undefined/null recovery status for the probe to detect if recovery was performed on both snapshot and journal.

Maybe we should change the recovery properties of the persistence liveness status to be nullable instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change the recovery properties of the persistence liveness status to be nullable instead?

Yes, if there's a better / more intuitive way of modeling this we should do that instead.

@Aaronontheweb Aaronontheweb merged commit 121db73 into petabridge:dev Mar 25, 2024
6 checks passed
@Arkatufus Arkatufus deleted the fix-persistence-healthcheck-probe-stalling branch July 15, 2024 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants