Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some clarifications about data persistence [HZG-257] #1510

Merged
merged 4 commits into from
Feb 5, 2025

Conversation

th0masb
Copy link
Contributor

@th0masb th0masb commented Jan 30, 2025

Some clarifications that data persistence is there to augment in-memory partition backups and that planned cluster shutdowns must be done cluster-wide. A rolling restart is not the same as a cluster shutdown, it is a sequence of individual member restarts and so is more appropriate to categorize as a speed improvement instead of a resiliency one.

@th0masb th0masb requested a review from a team as a code owner January 30, 2025 14:28
@devOpsHazelcast
Copy link
Collaborator

devOpsHazelcast commented Jan 30, 2025

CLA assistant check
All committers have signed the CLA.

Copy link

netlify bot commented Jan 30, 2025

Deploy Preview for hardcore-allen-f5257d ready!

Name Link
🔨 Latest commit a289dbe
🔍 Latest deploy log https://app.netlify.com/sites/hardcore-allen-f5257d/deploys/67a38cbff0c98f0008839293
😎 Deploy Preview https://deploy-preview-1510--hardcore-allen-f5257d.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Contributor

@JamesHazelcast JamesHazelcast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working to improve our documentation on this topic @th0masb - a few comments from me on this round

docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
- **Speeding up single member restarts**:

** **Planned**: During a rolling restart each cluster member is restarted one by one for scenarios such as installing an operating system patch or new hardware. xref:maintain-cluster:rolling-upgrades.adoc[Rolling upgrades] are an example of a rolling restart.
** **Unplanned**: A member may crash or terminate unexpectedly at any time, using persistence allows faster recovery.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also depends on configuration - if rebalancing is delayed, or the cluster is in PASSIVE state, it loads all data from disk without receiving any from other members

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that even if the member crashed when the cluster was in active state and rebalancing was not delayed the persisted data could still be used to speed up the member rejoining the cluster. Aren't merkle trees compared to reduce network traffic as much as possible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct, I'm saying that in addition to this "faster recovery" case, there are conditions where it recovers entirely using its own data, not just using it as a speedup mechanism.


- **Speeding up single member restarts**:

** **Planned**: During a rolling restart each cluster member is restarted one by one for scenarios such as installing an operating system patch or new hardware. xref:maintain-cluster:rolling-upgrades.adoc[Rolling upgrades] are an example of a rolling restart.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of a cluster-wide shutdown, we do not want to create confusion between these topics imo

Copy link
Contributor Author

@th0masb th0masb Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it confusing to read the existing way, how is a rolling restart a cluster wide shutdown if only one member at any time is stopped? Is it mandatory to put the cluster into passive mode for a rolling restart? It seems like it would work fine if the cluster was still in active mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there's arguments to be made both ways - I can see why you see it as a series of single member restarts, but I can also see that it does eventually result in the entire cluster restarting, so it could be seen as a cluster-wide shutdown (restart). I'm not too fussed, we can let docs team review this aspect and see if it makes sense to them 👍

Copy link
Contributor

@Rob-Hazelcast Rob-Hazelcast Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a cluster-wide restart implies the cluster becomes unavailable, and a rolling restart is designed to avoid that. I agree there's some overlap but I think the proposed text is good.

@Rob-Hazelcast
Copy link
Contributor

@th0masb Looks like you're still debating the technical details here, so ignoring for now. Feel free to @ me when this is ready for editorial review, and please add the backport to all versions label if appropriate.

Copy link
Contributor

@JamesHazelcast JamesHazelcast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments from me, but mainly nits so approving in advance - thanks for raising this @th0masb and improving our docs 👍

docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
@th0masb
Copy link
Contributor Author

th0masb commented Feb 3, 2025

@Rob-Hazelcast I think this is ready now, as for backporting I am unsure of the policy for the docs but since this is a clarification on existing functionality and the page content looks to be the same since 5.3 I would recommend backporting to 5.3.

@Rob-Hazelcast Rob-Hazelcast added the backport to all versions Backport commits to maintenance branches (from main) label Feb 4, 2025
Copy link
Contributor

@Rob-Hazelcast Rob-Hazelcast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested some small edits for clarify but generally looks good, thanks!

docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved

- **Speeding up single member restarts**:

** **Planned**: During a rolling restart each cluster member is restarted one by one for scenarios such as installing an operating system patch or new hardware. xref:maintain-cluster:rolling-upgrades.adoc[Rolling upgrades] are an example of a rolling restart.
Copy link
Contributor

@Rob-Hazelcast Rob-Hazelcast Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a cluster-wide restart implies the cluster becomes unavailable, and a rolling restart is designed to avoid that. I agree there's some overlap but I think the proposed text is good.

docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
** **Unplanned**: A cluster is restarted after all its members crash at the same time due an event such as a power outage. Note that some data loss is expected unless `fsync` is set to `true`. For more information, see xref:storage:configuring-persistence.adoc#data-structures[Data structure options].
- **A single member restart**: A single member is restarted for whatever reason.

- **Accelerating single member restarts**:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **Accelerating single member restarts**:
- **Single member restarts**:

The two bullets should be consistent. Removing the verb here; could alternatively add a verb ("recovering from cluster-wide shutdowns") above.

docs/modules/storage/pages/persistence.adoc Outdated Show resolved Hide resolved
@th0masb th0masb merged commit fdcbc16 into hazelcast:main Feb 5, 2025
6 checks passed
@th0masb th0masb deleted the hzg-257 branch February 5, 2025 16:17
Rob-Hazelcast pushed a commit that referenced this pull request Feb 6, 2025
Backport of #1510

Some clarifications that data persistence is there to augment in-memory
partition backups and that planned cluster shutdowns must be done
cluster-wide. A rolling restart is not the same as a cluster shutdown,
it is a sequence of individual member restarts and so is more
appropriate to categorize as a speed improvement instead of a resiliency
one.

Co-authored-by: Thomas Ball <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport to all versions Backport commits to maintenance branches (from main)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants