[WIP] Fix when monitor of systemd resource continues to be pending. #3720

HideoYamauchi · 2024-11-06T04:29:16Z

Hi All,

If a systemd resource fails and continues to return pending, detection of the failure will be delayed until the recheck-timer, and the service will be stopped until failover occurs.

This issue was confirmed by our user when using Zabbix 6.0 resources with systemd.

The problem occurred when a user killed a process to simulate a failure.
This did not occur in previous Zabbix versions.

To control this, we have added the use-monitor-pending-timeout parameter and made a correction so that if the pending state continues until the monitor timeout, it will be treated as an error.

Use the parameter use-monitor-pending-timeout, but it may be better to use a shorter parameter.
It may be better to use it commonly with sysetmd resources, but to maintain compatibility, we have introduced the use-monitor-pending-timeout parameter.
Considered controlling it on the controld side, but decided to control it on the execd side.
Parameter passing to systemd resources was enabled.
Perhaps it would be sufficient to only pass it in the use-monitor-pending-timeout case.
Only run the monitor until the timeout within the range of the monitor interval, and managed it only by continuing the deactivation.
Did not adopt the same fix as the follow up monitor. This is because there are monitors in progress, so management would be complicated. Also, we thought that if the monitor timeout is long, running the follow up monitor could cause a high load.

Please let me know your thoughts on the proposed amendments.

Best Regards,
Hideo Yamauchi.

kgaillot

Hi Hideo,

Thanks for the fix!

kgaillot · 2024-11-06T21:37:05Z

daemons/execd/execd_commands.c

@@ -901,6 +901,28 @@ action_complete(svc_action_t * action)
                    }
                }
            }
+        } else if (pcmk__strcase_any_of(cmd->action, PCMK_ACTION_MONITOR, PCMK_ACTION_STATUS, NULL) && 


A case-sensitive comparison would be appropriate (pcmk__str_any_of())

kgaillot · 2024-11-06T21:37:34Z

daemons/execd/execd_commands.c

+            /* For monitors other than follow up monitors, if "use-monitor-pending-timeout"       */
+            /* is enabled and the pending state continues from the time of the first notification */
+            /* until the timeout, it will be treated as a timeout.                                */
+            if (pcmk__str_eq(g_hash_table_lookup(cmd->params, PCMK_XA_USE_MONITOR_PENDING_TIMEOUT), "true", pcmk__str_casei)) {


I agree that this fix makes sense to apply universally -- no new parameter is needed

All right!
Adding parameters will be discontinued.

kgaillot · 2024-11-12T16:03:22Z

daemons/execd/execd_commands.c

@@ -901,6 +901,28 @@ action_complete(svc_action_t * action)
                    }
                }
            }
+        } else if (pcmk__strcase_any_of(cmd->action, PCMK_ACTION_MONITOR, PCMK_ACTION_STATUS, NULL) && 
+		  (cmd->interval_ms > 0)) { 


I think probes would have the same issue. Probes will expect "not running" but could find it pending instead. There could be an argument that a probe should return "running" for pending but I think following the same timeout logic makes sense.

I haven't looked into Probe much.
I'll look into it a bit more.

@kgaillot

I checked it.

I think what you mean when you say that Probe can handle the same thing here is something like this, what do you think?

If the status becomes PENDING during Probe, it will not respond immediately, but will continue to monitor at regular intervals until a timeout occurs, and will respond to controld when it gets a status other than PENDING.
If the timeout has passed, it will return PENDING.

In terms of processing, I think it is probably closer to the already existing start/stop follow up monitor than this time.

If I'm not mistaken, I will include this Probe change when I resubmit this PR.

Best Regards,
Hideo Yamauchi.

Hi @HideoYamauchi ,

Yes, I think that's correct

Hi @kgaillot

All right.
Prepare the pending monitor fixes and the pending probe fixes and submit the PR again.

Many thanks,
Hideo Yamauchi.

kgaillot · 2024-11-12T16:34:45Z

lib/services/services.c

+	     		    crm_warn("If monitor-pending-timeout is set to true and the interval is longer than the monitor timeout, "
+		                     "two consecutive monitor deactivations will be treated as an error. "
+	                             "We recommend setting a shorter interval to ensure that "
+	                             "multiple consecutive deactivations before the timeout are treated as an error.");


I don't follow what the issue is here. The interval is always counted from the last completion, not the last initiation.

For example, this is the case when extreme settings such as an interval of 3600s and a timeout of 60s are used.
If the interval is performed at a short interval such as 20s, it can be determined that there is a problem if deactivating is returned several times in a row.

If the setting is 3600s and deactivating is returned the Nth time, the N+1st time will be the second deactivating, so I thought that the number of times might not be enough.

However, on reflection, this warning does not seem to be very meaningless, so I will ultimately delete this warning.

HideoYamauchi · 2024-11-12T23:52:22Z

@kgaillot

Thank you for your comment.
I will check it and reply.

Many thanks,
Hideo Yamauchi.

Mid: systemd: Add use-monitor-pending-timeout parameter.

eefbbb7

kgaillot reviewed Nov 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Fix when monitor of systemd resource continues to be pending. #3720

[WIP] Fix when monitor of systemd resource continues to be pending. #3720

HideoYamauchi commented Nov 6, 2024 •

edited

Loading

kgaillot left a comment

kgaillot Nov 6, 2024

HideoYamauchi Nov 13, 2024

kgaillot Nov 6, 2024

HideoYamauchi Nov 13, 2024

kgaillot Nov 12, 2024

HideoYamauchi Nov 13, 2024

HideoYamauchi Nov 22, 2024

kgaillot Dec 2, 2024

HideoYamauchi Dec 2, 2024

kgaillot Nov 12, 2024

HideoYamauchi Nov 13, 2024

HideoYamauchi commented Nov 12, 2024

[WIP] Fix when monitor of systemd resource continues to be pending. #3720

Are you sure you want to change the base?

[WIP] Fix when monitor of systemd resource continues to be pending. #3720

Conversation

HideoYamauchi commented Nov 6, 2024 • edited Loading

kgaillot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HideoYamauchi commented Nov 12, 2024

HideoYamauchi commented Nov 6, 2024 •

edited

Loading