
Fix eager updateFromV1 #150

Merged: 5 commits merged into canonical:main on Jun 20, 2024

Conversation

@masnax (Contributor) commented Jun 14, 2024

updateFromV1 was introduced by #94 because our schema update handling logic was fundamentally broken. Internal and external schema updates incremented the same row in the schemas table, so when we receive a new update we cannot tell which list (internal or external) was incremented, and it's ambiguous which update we need to run.

To keep the implementation cleaner, updateFromV1 was made to run before cluster members agreed; it ran as soon as the first cluster member started up and saw that the update existed. I overlooked that this would break querying the cluster members table, because updateFromV1 changes that table's columns. Existing cluster members that have not yet received the upgrade will therefore break when querying the cluster members table, as a column they expect has been changed.

This PR attempts to solve that problem by making updateFromV1 behave like all other schema updates, where all cluster members have to agree first. However, this means we need to maintain two tables to check schema updates prior to running updateFromV1.

Here is how the process works:

  • When a cluster member starts, it checks whether updateFromV1 has run by checking whether a type column exists on the schemas table (see the sketch below). This type column distinguishes internal from external updates and is introduced by updateFromV1.
  • If not, that cluster member attempts to create internal_cluster_members_new, if another cluster member has not already done so. It populates internal_cluster_members_new with the same rows as internal_cluster_members, but fixes the schema columns. It also creates insert/delete triggers so that the two tables remain in sync.
  • Each cluster member updates its row in the internal_cluster_members_new table with the schema version it expects based on the size of the update slices.
  • Once all cluster members have settled on a schema update (all schema columns are equal for every row), we proceed to run schema updates.
  • updateFromV1 replaces the old internal_cluster_members table with internal_cluster_members_new, and drops the triggers as well.

What this means is that cluster members that have not yet received the update can continue to function using the old table. After the update has occurred, the old table is seamlessly replaced with the new one, which will be used for all schema updates going forward.
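
As a concrete illustration of the first step, here is a minimal sketch of the type-column check, assuming a plain database/sql transaction; the helper name is hypothetical and the real wiring in microcluster differs:

```go
package update

import (
	"context"
	"database/sql"
)

// schemasHaveTypeColumn reports whether updateFromV1 has already run, by
// checking for a "type" column on the schemas table. pragma_table_info is
// the table-valued form of PRAGMA table_info, so the check is one query.
func schemasHaveTypeColumn(ctx context.Context, tx *sql.Tx) (bool, error) {
	var count int
	err := tx.QueryRowContext(ctx,
		"SELECT count(*) FROM pragma_table_info('schemas') WHERE name = 'type'",
	).Scan(&count)
	if err != nil {
		return false, err
	}

	return count > 0, nil
}
```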

@masnax marked this pull request as draft on June 14, 2024 22:22
@masnax force-pushed the fix-eager-schema-update branch from 99dc639 to c993195 on June 14, 2024 22:50
@masnax marked this pull request as ready for review on June 14, 2024 22:55
@masnax (Contributor, Author) commented Jun 14, 2024

This should fix the error reported in canonical/microceph#367, as not-yet-upgraded cluster members will continue to function with the old table.

@tomponline (Member) commented

> However, since this update runs before we have ensured every cluster member is ready for the update, `micro* cluster list` will be broken on all running nodes when one node is refreshed with the update. This is a particular issue for MicroOVN, whose focus for this cycle has been hassle-free updates.

Why is this?

@masnax (Contributor, Author) commented Jun 16, 2024

> > However, since this update runs before we have ensured every cluster member is ready for the update, `micro* cluster list` will be broken on all running nodes when one node is refreshed with the update. This is a particular issue for MicroOVN, whose focus for this cycle has been hassle-free updates.
>
> Why is this?

The schema upgrade process works like LXD's: each member updates its row in the cluster members table with the size of the upgrade list, then compares this value with the other rows. If the rows don't all match for that column, the node waits for a notification over the internal API. The first cluster member to see all rows match for that column applies the updates.

That last node compares the upgrade count from the cluster members table to the latest entry in the schemas table to determine which updates to run. Since the value in the old tables is an aggregate of internal and external updates, we can't accurately determine which updates we have already run: this is the example I gave, where 1 internal + 3 external and 2 internal + 2 external both sum to 4.

So we need to split the internal/external update count first, outside of the main update mechanism. Right now we do this eagerly, as soon as the first node starts. This changes the cluster members table, which means old nodes querying that table will break.

If we want to run that update after all nodes agree, we need two separate implementations for how to establish that agreement, because after the agreement is established and we run the update, the next time the nodes restart they won't be able to use the old mechanism anymore, since the schema will have changed.
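
A minimal sketch of that agreement check, assuming hypothetical schema_internal/schema_external columns on internal_cluster_members_new; the real implementation lives in microcluster's update package and differs in detail:

```go
package update

import (
	"context"
	"database/sql"
)

// schemaAgreed records the schema versions this member expects (derived
// from the sizes of its internal and external update slices) and reports
// whether every cluster member now expects the same versions.
func schemaAgreed(ctx context.Context, tx *sql.Tx, name string, internal int, external int) (bool, error) {
	_, err := tx.ExecContext(ctx,
		`UPDATE internal_cluster_members_new
		    SET schema_internal = ?, schema_external = ?
		  WHERE name = ?`, internal, external, name)
	if err != nil {
		return false, err
	}

	// Updates may only run once every row reports identical versions.
	var distinctInternal, distinctExternal int
	err = tx.QueryRowContext(ctx,
		`SELECT count(DISTINCT schema_internal), count(DISTINCT schema_external)
		   FROM internal_cluster_members_new`,
	).Scan(&distinctInternal, &distinctExternal)
	if err != nil {
		return false, err
	}

	return distinctInternal == 1 && distinctExternal == 1, nil
}
```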

@masnax (Contributor, Author) commented Jun 17, 2024

@tomponline I've modified this PR with a slightly cleaner approach to the same problem. I've updated the PR description as well. I have #154 up as a draft with the old approach that uses a temporary table to compare.

One caveat with this approach (which wasn't present in the temporary table version) is that there is a slight chance of incorrectly concluding that two cluster members expect the same schema version.

@masnax (Contributor, Author) commented Jun 17, 2024

@roosterfish If you've got any thoughts on this one versus #154, or if you've got a better idea for how to handle this problem, please let me know, thanks :)

@masnax force-pushed the fix-eager-schema-update branch from 221ea8d to 8b5634c on June 18, 2024 16:33
@masnax (Contributor, Author) commented Jun 18, 2024

@tomponline @roosterfish I ended up sticking with a new table plus insert/delete triggers, rather than adding a view, for two reasons:

  • If a cluster member is accidentally refreshed to the new version, there is no way to undo the view creation, which means the cluster will be forced to update or become read-only. With a temporary table, we can still revert the refresh until all cluster members agree; the only side effect would be a leftover table that the old implementation doesn't use anyway.

  • A view is not writable, which means that if a cluster member is in an unhealthy state during an update, the cluster will never agree on the update; we would need to add support for something like LXD's patch.global.sql to work around this. Using a temporary table means other cluster members can still remove broken nodes, and with the triggers (sketched below), those changes will be reflected in the temporary table.
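
For illustration, a rough sketch of what such sync triggers could look like; the trigger and column names here are assumptions, not the actual DDL from this PR:

```go
// Assumed DDL for the sync triggers; names and columns are illustrative.
const syncTriggersDDL = `
CREATE TRIGGER internal_cluster_members_new_on_insert
  AFTER INSERT ON internal_cluster_members
BEGIN
  -- Mirror newly joined members into the negotiation table.
  INSERT INTO internal_cluster_members_new (name, address, schema_internal, schema_external)
  VALUES (NEW.name, NEW.address, 0, 0);
END;

CREATE TRIGGER internal_cluster_members_new_on_delete
  AFTER DELETE ON internal_cluster_members
BEGIN
  -- Removing a broken member from the old table removes it here too,
  -- so the remaining members can still reach agreement.
  DELETE FROM internal_cluster_members_new WHERE name = OLD.name;
END;`
```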

@roosterfish (Contributor) left a comment

I understand why having a view won't work as downstreams might want to delete rows from the internal_cluster_members table as part of removing some nodes from the cluster.

Review threads (resolved): internal/db/update/schema.go, internal/db/update/update.go, cluster/cluster_members.go
masnax added 5 commits June 19, 2024 15:29
Forcibly applying the first schema update causes issues on existing
nodes, particularly when restarting or during heartbeats.

Instead, we can just infer what we expect the maximum version to look
like, and wait until the last version is updated to actually apply
anything.

Signed-off-by: Max Asnaashari <[email protected]>
So that we don't break existing online cluster members, we can't just
update the schema from underneath them.

Instead, we can store schema updates in a temporary table
`internal_cluster_members_new`, and reference this table during the
schema update process, instead of the real `internal_cluster_members`
table.

Then, when we actually run the `updateFromV1` on the final cluster
member to receive the update, replace the real
`internal_cluster_members` table with the one that has been keeping
track of the updates.

Signed-off-by: Max Asnaashari <[email protected]>
@masnax force-pushed the fix-eager-schema-update branch from 8b5634c to a7859e3 on June 19, 2024 15:29
@masnax requested a review from roosterfish on June 19, 2024 21:42
@roosterfish (Contributor) left a comment

LGTM!

@masnax merged commit 526a3b6 into canonical:main on Jun 20, 2024
5 checks passed