Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix when monitor of systemd resource continues to be pending. #3720

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

HideoYamauchi
Copy link
Contributor

@HideoYamauchi HideoYamauchi commented Nov 6, 2024

Hi All,

If a systemd resource fails and continues to return pending, detection of the failure will be delayed until the recheck-timer, and the service will be stopped until failover occurs.

This issue was confirmed by our user when using Zabbix 6.0 resources with systemd.

  • The problem occurred when a user killed a process to simulate a failure.
  • This did not occur in previous Zabbix versions.

To control this, we have added the use-monitor-pending-timeout parameter and made a correction so that if the pending state continues until the monitor timeout, it will be treated as an error.


  • Use the parameter use-monitor-pending-timeout, but it may be better to use a shorter parameter.
  • It may be better to use it commonly with sysetmd resources, but to maintain compatibility, we have introduced the use-monitor-pending-timeout parameter.
  • Considered controlling it on the controld side, but decided to control it on the execd side.
  • Parameter passing to systemd resources was enabled.
  • Perhaps it would be sufficient to only pass it in the use-monitor-pending-timeout case.
  • Only run the monitor until the timeout within the range of the monitor interval, and managed it only by continuing the deactivation.
  • Did not adopt the same fix as the follow up monitor. This is because there are monitors in progress, so management would be complicated. Also, we thought that if the monitor timeout is long, running the follow up monitor could cause a high load.

Please let me know your thoughts on the proposed amendments.

Best Regards,
Hideo Yamauchi.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Hideo,

Thanks for the fix!

@@ -901,6 +901,28 @@ action_complete(svc_action_t * action)
}
}
}
} else if (pcmk__strcase_any_of(cmd->action, PCMK_ACTION_MONITOR, PCMK_ACTION_STATUS, NULL) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A case-sensitive comparison would be appropriate (pcmk__str_any_of())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay!

/* For monitors other than follow up monitors, if "use-monitor-pending-timeout" */
/* is enabled and the pending state continues from the time of the first notification */
/* until the timeout, it will be treated as a timeout. */
if (pcmk__str_eq(g_hash_table_lookup(cmd->params, PCMK_XA_USE_MONITOR_PENDING_TIMEOUT), "true", pcmk__str_casei)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this fix makes sense to apply universally -- no new parameter is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right!
Adding parameters will be discontinued.

@@ -901,6 +901,28 @@ action_complete(svc_action_t * action)
}
}
}
} else if (pcmk__strcase_any_of(cmd->action, PCMK_ACTION_MONITOR, PCMK_ACTION_STATUS, NULL) &&
(cmd->interval_ms > 0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think probes would have the same issue. Probes will expect "not running" but could find it pending instead. There could be an argument that a probe should return "running" for pending but I think following the same timeout logic makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked into Probe much.
I'll look into it a bit more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgaillot

I checked it.

I think what you mean when you say that Probe can handle the same thing here is something like this, what do you think?


If the status becomes PENDING during Probe, it will not respond immediately, but will continue to monitor at regular intervals until a timeout occurs, and will respond to controld when it gets a status other than PENDING.
If the timeout has passed, it will return PENDING.

In terms of processing, I think it is probably closer to the already existing start/stop follow up monitor than this time.


If I'm not mistaken, I will include this Probe change when I resubmit this PR.

Best Regards,
Hideo Yamauchi.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HideoYamauchi ,

Yes, I think that's correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kgaillot

All right.
Prepare the pending monitor fixes and the pending probe fixes and submit the PR again.

Many thanks,
Hideo Yamauchi.

crm_warn("If monitor-pending-timeout is set to true and the interval is longer than the monitor timeout, "
"two consecutive monitor deactivations will be treated as an error. "
"We recommend setting a shorter interval to ensure that "
"multiple consecutive deactivations before the timeout are treated as an error.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow what the issue is here. The interval is always counted from the last completion, not the last initiation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, this is the case when extreme settings such as an interval of 3600s and a timeout of 60s are used.
If the interval is performed at a short interval such as 20s, it can be determined that there is a problem if deactivating is returned several times in a row.

If the setting is 3600s and deactivating is returned the Nth time, the N+1st time will be the second deactivating, so I thought that the number of times might not be enough.

However, on reflection, this warning does not seem to be very meaningless, so I will ultimately delete this warning.

@HideoYamauchi
Copy link
Contributor Author

@kgaillot

Thank you for your comment.
I will check it and reply.

Many thanks,
Hideo Yamauchi.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants