-
Notifications
You must be signed in to change notification settings - Fork 342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: libpe_status: Don't fence a remote node due to failed migrate_from #3458
Conversation
Marking ready for review. This might be sufficient to fix the
If this isn't a viable solution (or close to it) as-is, @clumens or anyone else can feel free to take it and run with it themselves. I took a crack at it since I've been talking to Chris about it a lot last week. Okay, there is at least one big wrinkle in this... if a resource is running on the remote node when the connection resource
This is because the remote node is considered offline after the migration failure.
I don't think we want to skip setting the A failed |
cts/cts-scheduler.in
Outdated
@@ -1120,6 +1120,8 @@ TESTS = [ | |||
"Make sure partial migrations are handled before ops on the remote node" ], | |||
[ "remote-partial-migrate2", | |||
"Make sure partial migration target is prefered for remote connection" ], | |||
[ "remote-partial-migrate3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
realizing this doesn't qualify as a partial migration, which is really more of an "in-progress migration"
lib/pengine/pe_actions.c
Outdated
*/ | ||
if (rsc->is_remote_node | ||
&& pcmk__is_remote_node(pcmk_find_node(rsc->cluster, rsc->id)) | ||
&& !pcmk_is_probe(action_name, interval_ms) | ||
&& !pcmk__str_eq(action_name, PCMK_ACTION_START, pcmk__str_none)) { | ||
&& !pcmk__str_any_of(action_name, PCMK_ACTION_START, | ||
PCMK_ACTION_MIGRATE_FROM, NULL)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe even more permissive like MIGRATE_TO
, undecided right now
Edit: Yeah, thinking we should treat migrate_to
and migrate_from
the same for failure handling purposes for remote connection resources, and just call it a failed migration. It would be hard for migrate_to
to fail at all unless something wider is seriously wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, failed migrate_to or migrate_from shouldn't cause an automatic fencing, though the connection should be restarted.
The test results are currently wrong. Signed-off-by: Reid Wahl <[email protected]>
This also prevents the resource from remaining stopped until the cluster-recheck-interval expires. That's because we no longer set pcmk_on_fail_reset_remote after a migrate_from failure, so pcmk__role_after_failure() no longer returns pcmk_role_stopped. Ref T214 Signed-off-by: Reid Wahl <[email protected]>
Ref T214 Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Deprecated, for internal use only. Signed-off-by: Reid Wahl <[email protected]>
…grate Previously, if the remote node's connection resource failed to migrate, the remote node was considered offline. If a resource was running on the remote node, the remote node would be fenced. This doesn't work yet. Fixes T214 Signed-off-by: Reid Wahl <[email protected]>
This still has problems. The newly added tests work as expected, but existing tests break. Signed-off-by: Reid Wahl <[email protected]>
Still not working. Latest push rebases on main, adds another test, and adds some experimental commits on top. |
I'm thinking the high-level behavior should be:
That would still leave some decisions about exactly how and where in the call chain to implement the online-state patch. The possibility of multiple failed actions in the history, occurring in various orders, gives me a headache. |
Successful actions (including fencing) do not and should not trigger new transitions; every transition should be complete within itself, and a transition should need to be recalculated only if it's interrupted by a new event or some action fails. What's missing in this case is a call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first two Fix commits (which could be squashed) look like a good idea
lib/pengine/pe_actions.c
Outdated
*/ | ||
if (rsc->is_remote_node | ||
&& pcmk__is_remote_node(pcmk_find_node(rsc->cluster, rsc->id)) | ||
&& !pcmk_is_probe(action_name, interval_ms) | ||
&& !pcmk__str_eq(action_name, PCMK_ACTION_START, pcmk__str_none)) { | ||
&& !pcmk__str_any_of(action_name, PCMK_ACTION_START, | ||
PCMK_ACTION_MIGRATE_FROM, NULL)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, failed migrate_to or migrate_from shouldn't cause an automatic fencing, though the connection should be restarted.
This will be good to revisit in the future, but closing for now due to the uncertainties and significant changes in the code base since. Worth summarizing the discussion on the task. |
@clumens @kgaillot This is just a demo for the unnecessary fencing part of T214 / RHEL-23399. I'm not requesting to merge this until we figure out the rest of the failure response/cluster-recheck-interval behavior.
The result of this patch is as follows. Scenario:
ocf:pacemaker:remote
resource withreconnect_interval=30s
,cluster-recheck-interval=2min
, and fencing configured.Before patch:
reconnect_interval
has passed.)After patch:
migrate_from
, sincestart-failure-is-fatal
doesn't apply tomigrate_from
). This entails stopping on both nodes (due to multiple-active policy) and then trying to start on node 2. This will fail due to firewall block.reconnect_interval
expires, the resource tries to migrate back to node 2 again. Which will fail due to firewall block. This will continue happening everyreconnect_interval
untilmigration-threshold
is reached.Fixes T214