Fix: libpacemaker: crm_mon --one-shot can fail during pacemaker shutdown #2902
Conversation
Force-pushed from a3d8402 to ae51b37
Force-pushed from ae51b37 to 82885ee
Replaced the simple "just enough to unbreak it" fix with a redesign that uses the …
lib/pacemaker/pcmk_status.c
Outdated

```c
            }
            break;
        case EREMOTEIO:
            // @TODO: Why do we try to connect to the fencer and CIB here?
```
I haven't figured out why EREMOTEIO is considered okay and we try to connect to the fencer and the CIB after getting that RC. I went ahead and preserved the existing behavior there.
pacemakerd IPC is not proxied to remote nodes, so any attempt to contact it will always fail with EREMOTEIO when run from a remote node. However, the fencer and CIB are proxied and may be available.
In that case, is the "waiting to be connected by/contacted by the cluster" message inaccurate? It sounds like even if the cluster has already reached us, we'll still fail with EREMOTEIO.
Sort of. I think the idea is that if the CIB or fencer can be reached, the message will immediately be cleared off the screen, but otherwise it'll stay visible.
Quick note since CI is hung due to infra issues: the rpm target passed on my machine.
Whew, this got big ...
lib/pacemaker/pcmk_status.c
Outdated
```c
        if (rc != EREMOTEIO) {
            out->err(out, "error: Could not connect to %s: %s",
                     pcmk_ipc_name(api, true), pcmk_rc_str(rc));
        }
```
This means that all other callers would have to check for EREMOTEIO and output the error. Maybe add a bool argument for whether EREMOTEIO is OK (we'd probably also need to return pcmk_rc_ok if so).
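To make that concrete, here's a minimal sketch of the bool-argument idea. The `eremoteio_ok` flag, the helper name, and the signature are hypothetical, not this PR's code; the `pcmk_*` calls are the existing API.

```c
#include <errno.h>
#include <stdbool.h>

#include <crm/common/ipc.h>
#include <crm/common/output_internal.h>

/* Hypothetical sketch only: let the caller declare EREMOTEIO acceptable so
 * we don't log an error for it here.
 */
static int
ipc_connect_sketch(pcmk__output_t *out, enum pcmk_ipc_server server,
                   pcmk_ipc_api_t **api, bool eremoteio_ok)
{
    int rc = pcmk_new_ipc_api(api, server);

    if (rc == pcmk_rc_ok) {
        rc = pcmk_connect_ipc(*api, pcmk_ipc_dispatch_sync);
    }

    // Stay quiet about EREMOTEIO only when the caller said it's acceptable
    if ((rc != pcmk_rc_ok) && !(eremoteio_ok && (rc == EREMOTEIO))) {
        out->err(out, "error: Could not connect to %s: %s",
                 pcmk_ipc_name(*api, true), pcmk_rc_str(rc));
    }
    return rc;
}
```

Whether the helper should also map EREMOTEIO to pcmk_rc_ok in that case is the open question mentioned above.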
Yeah, I don't like the idea of duplicating work in the callers. I'd like to eventually find a way to use pcmk__pacemakerd_status() for the console and daemonized modes of crm_mon...

One of my thoughts was that we might need the caller to know that we got EREMOTEIO (e.g., to set on_remote_node) even though EREMOTEIO is considered OK. Setting on_remote_node is the only place that's actually needed right now, hypotheticals aside.

On closer inspection, on_remote_node is only used for choosing an error message if the initial do/while loop fails, and that logic probably isn't ideal anyway. I'm not convinced we should give up and exit in the scenario where pacemakerd_status() passes and cib_connect() fails... as it looks like we're doing now for Pacemaker Remote and cluster nodes.
My current thought is to return EREMOTEIO from pcmk_cluster_queries.c:ipc_connect() if eremoteio_ok (bool) is true, but without logging an error. That way, if the caller needs to do anything special, it can.

In particular, I'd like pcmk__status() to output a pacemakerd-health message with state "Running on Pacemaker Remote node". To do that, it needs to know that ipc_connect() returned EREMOTEIO.
We could also just skip the pacemakerd-health message altogether in that case. We'd just have less detailed info if we fail to connect to the CIB.
...yeah, probably better to just skip pacemakerd-health for remote nodes.
- No clean way to do a short vs. friendly state string for "Running on Pacemaker Remote node"
- From your comment on another thread, it sounds like "Running on" would be misleading since pacemakerd isn't proxied to remote nodes... and maybe we'd get EREMOTEIO even if pacemaker_remoted isn't running, I dunno
Yep this is a good sign you should test the affected commands on remote nodes (not running, running but not connected, connected)
In a later PR (not touching it here), I might rethink this... setting rc = pcmk_rc_ok within ipc_connect() is fine if and only if we also set data->pcmkd_state = pcmk_pacemakerd_state_running when we get EREMOTEIO. That no longer seems unreasonable. It seems that EREMOTEIO is returned if and only if pacemaker-remoted is running (regardless of whether it's connected). That's roughly equivalent to pacemakerd running.

The current and planned callers treat EREMOTEIO the same as pcmk_rc_ok && pcmk_pacemakerd_state_running, so we could cut out the middle man by doing it here if there's no objection.
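Roughly, as an illustrative fragment only (not the merged change; data->pcmkd_state is the field mentioned above):

```c
if (rc == EREMOTEIO) {
    /* pacemakerd IPC isn't proxied to remote nodes, but getting EREMOTEIO
     * implies pacemaker-remoted is running, which is roughly equivalent
     */
    data->pcmkd_state = pcmk_pacemakerd_state_running;
    rc = pcmk_rc_ok;
}
```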
Sounds reasonable
Force-pushed from 82885ee to 2af7741
Updated to address review. I think the below bullet points cover all the changes...
Force-pushed from 8c12250 to 98fcf00
Just about there
Signed-off-by: Reid Wahl <[email protected]>
Given an enum pcmk_pacemakerd_state value, this function returns a user-friendly string representation. This will be used in future commits. Signed-off-by: Reid Wahl <[email protected]>
If the XML_ATTR_TSTAMP attribute is present but can't be parsed as an integer, value_ll gets set to PCMK__PARSE_INT_DEFAULT (-1). This should never happen, but just in case, we should convert a negative to 0 before we cast to time_t, an unsigned type. Signed-off-by: Reid Wahl <[email protected]>
Default should be NULL so that the last_updated default gets used correctly in the pacemakerd-health message. Signed-off-by: Reid Wahl <[email protected]>
Use pcmk_rc_ipc_unresponsive if we don't get a response from the API, and EBADMSG if we get a bad reply or unexpected reply type. Signed-off-by: Reid Wahl <[email protected]>
Previously, the pacemakerd-health message accepted only a state string. This made it difficult to use different state strings for different output formats. Now, the pacemakerd-health message accepts an enum pcmk_pacemakerd_state value and an optional state string. If the state string is not set, then the formatter function looks up an appropriate string representation for the state. If the state string is set, it acts as an explicit override and is used in place of a lookup. Note that this will cause "invalid" to be printed instead of "<null>" for quiet text outputs, and it will cause "Invalid pacemakerd state" to be printed instead of "unknown state" for the default output. Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
build_uname_list -> pe__build_node_name_list() Signed-off-by: Reid Wahl <[email protected]>
Make only_node and only_rsc arguments const. Add doxygen blocks. Change "st" argument to "stonith". This is not comprehensive. We're updating pcmk__status() because we're about to add a new argument, and pcmk__output_cluster_status() because it shares most of its arguments with pcmk__status(). Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
We were using F_CRM_SYS_FROM as the name of the XML element, instead of something static like "pacemakerd". It happens that the value in F_CRM_SYS_FROM seems to always be CRM_SYSTEM_MCP ("pacemakerd"), so the element name was effectively deterministic. Nonetheless, the schema required the element be called "pacemakerd"; there was no allowance for another system name. That defeats any purpose of flexible element naming. It seems better to call the element "pacemakerd" and make sys_from a field, if we keep sys_from at all. (Can't use "pacemakerd-health" for backward compatibility reasons.) Additionally, if sys_from or last_updated is NULL, pass them directly to pcmk__output_create_xml_node(). Those attributes will simply be skipped if their values are NULL. Signed-off-by: Reid Wahl <[email protected]>
If message_timeout_ms == 0 for various functions in pcmk_cluster_queries.c, default to using sync dispatch instead of starting a mainloop with timeout 30s that behaves basically like sync dispatch. This makes it easier to reason about calling these functions when the caller may have its own mainloop. Signed-off-by: Reid Wahl <[email protected]>
crm_mon --one-shot checks the pacemakerd state before trying to get a CIB connection. If pacemakerd is shutting down, it returns ENOTCONN. This can cause a resource agent that calls crm_mon (for example, ocf:heartbeat:pgsql) to fail to stop during shutdown. This is a regression introduced by commit 3f342e3. crm_mon.c:pacemakerd_status() returns pcmk_rc_ok if pacemakerd is shutting down, since 49ebe4c and 46d6edd (fixes for CLBZ#5471). 3f342e3 refactored crm_mon --one-shot to use library functions. pcmk__status() now does most of the work, calling pcmk_status.c:pacemakerd_status(). That function returns ENOTCONN if pacemakerd is shutting down. As a result, we don't try to connect to the CIB during shutdown. Here we update pcmk__status() to use pcmk__pacemakerd_status() instead of a static and mostly redundant pacemakerd_status(). It receives the pacemakerd state via an output pointer argument. If pacemakerd is running or shutting down (or if we get an EREMOTEIO rc), we try connecting to the fencer and CIB. However, as long as we successfully get the pacemakerd state, we return success from pcmk__status(), since we did obtain the cluster status. A couple of minor notes: * pcmk__status() now takes a timeout argument that it passes to pcmk__pacemakerd_status(). timeout == 0 uses pcmk_ipc_dispatch_sync, matching the old implementation. A positive timeout uses pcmk_ipc_dispatch_main. * pcmk_cluster_queries.c:ipc_connect() no longer always prints a "Could not connect" error for EREMOTEIO. The caller may consider it OK. Fixes T579 Fixes CLBZ#5501 Signed-off-by: Reid Wahl <[email protected]>
sys_from should be a subsystem ("pacemakerd" is expected), not a node. Signed-off-by: Reid Wahl <[email protected]>
Force-pushed from 98fcf00 to a165cdf
Rebased on current main and updated to address review
```c
 * If positive, \p pcmk_ipc_dispatch_main
 * will be used, and a new mainloop will be
 * created for this purpose (freed before
 * return).
```
I wonder if it would be better to replace the mainloop stuff in pcmk_cluster_queries.c with pcmk_ipc_dispatch_poll when we need a timeout. That way we never have to worry about mainloop conflicts.

If so, it'll go in a separate PR.
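Something like this minimal sketch, just to visualize it (poll_pacemakerd_once() and the timeout argument are made up; the pcmk_* calls are the existing public IPC API, and error handling is abbreviated):

```c
#include <crm/common/ipc.h>
#include <crm/common/ipc_pacemakerd.h>

static void
ping_cb(pcmk_ipc_api_t *api, enum pcmk_ipc_event event_type, crm_exit_t status,
        void *event_data, void *user_data)
{
    // Inspect the pacemakerd reply here (see pcmk_pacemakerd_api_reply_t)
}

static int
poll_pacemakerd_once(int timeout_ms)
{
    pcmk_ipc_api_t *api = NULL;
    int rc = pcmk_new_ipc_api(&api, pcmk_ipc_pacemakerd);

    if (rc == pcmk_rc_ok) {
        pcmk_register_ipc_callback(api, ping_cb, NULL);
        rc = pcmk_connect_ipc(api, pcmk_ipc_dispatch_poll);
    }
    if (rc == pcmk_rc_ok) {
        rc = pcmk_pacemakerd_api_ping(api, "sketch");
    }
    if (rc == pcmk_rc_ok) {
        rc = pcmk_poll_ipc(api, timeout_ms);    // bounded wait, no mainloop
        if (rc == pcmk_rc_ok) {
            pcmk_dispatch_ipc(api);             // runs ping_cb on the reply
        }
    }
    pcmk_free_ipc_api(api);
    return rc;
}
```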
Makes sense, as long as anything crm_mon needs still works.
Ack. crm_mon already uses polling (timeout == reconnect_ms / 2) instead of sync or mainloop, for its console and daemonized modes. one-shot can continue using sync.
```c
    pcmk_ipc_api_t *controld_api = NULL;

    if (message_timeout_ms == 0) {
        dispatch_type = pcmk_ipc_dispatch_sync;
```
AFAICT this also changed the behavior of "crmadmin -D" under certain circumstances.

Now, upon cluster startup, if "crmadmin -D" is executed before a DC has been elected, it will hang forever.

AFAICS the lookup involves a ping request to the DC. But given that there's no DC to actually handle the request at the moment, crmadmin most probably won't receive any reply in that situation...

Previously, with an implied timeout of 30s, the command would eventually return with an error rather than hanging forever.

Of course, the first question is what the best default for such a case should be.
Hmm, it seems suboptimal that an IPC request sent to CRM_SYSTEM_DC will be broadcast to the CPG and no one will reply if there is no DC.
It would be nice if the local controller could return an error immediately in relay_message() if the message is for the DC but controld_globals.dc_name is NULL (possibly only for CRM_OP_PING to reduce the scope for unexpected consequences). Unfortunately the local controller might just not have learned the DC yet (in do_cl_join_announce() and do_cl_join_query() we set DC to NULL while waiting to hear from the actual DC). Maybe it's the lesser evil to get an error in that situation (especially if we can return something like EAGAIN/CRM_EX_UNSATISFIED).
@nrwahl2, do you want to look into that?
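To make the idea concrete, a hypothetical helper along these lines (should_fail_dc_ping() is invented; known_dc would be controld_globals.dc_name; F_CRM_SYS_TO, F_CRM_TASK, CRM_SYSTEM_DC, CRM_OP_PING, and crm_element_value() are existing names; building the actual error reply is omitted):

```c
#include <stdbool.h>
#include <string.h>

#include <crm/crm.h>          // F_CRM_SYS_TO, F_CRM_TASK, CRM_SYSTEM_DC, CRM_OP_PING
#include <crm/common/xml.h>   // crm_element_value()

/* Hypothetical helper: decide whether a DC-bound ping should fail fast
 * because no DC is currently known. Limiting this to CRM_OP_PING keeps the
 * scope of unexpected consequences small, as discussed above.
 */
static bool
should_fail_dc_ping(xmlNode *msg, const char *known_dc)
{
    const char *sys_to = crm_element_value(msg, F_CRM_SYS_TO);
    const char *task = crm_element_value(msg, F_CRM_TASK);

    return (known_dc == NULL)
           && (sys_to != NULL) && (strcmp(sys_to, CRM_SYSTEM_DC) == 0)
           && (task != NULL) && (strcmp(task, CRM_OP_PING) == 0);
}
```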
FYI this PR landed in 2.1.5
The idea was that synchronous requests are supposed to time out after 5 seconds...

> @nrwahl2, do you want to look into that?

"want" is such a loaded word ;)
> sleep() would still be synchronous (although it would be less work, at the expense of a delay in finding a message). I think pcmk_poll_ipc() would be just as busy -- it's just shifting the work to the poll() system call. It blocks on poll() until either we receive something or the timeout we pass expires.

Blocking on one poll() call is better than constantly iterating it. Since crm_ipc_read() calls qb_ipc_us_ready() -> poll() with a timeout of 0, continuously iterating it without an interval due to EAGAIN significantly raises CPU load.

> I mentioned in T735 that at a glance, it looks like pcmk_ipc_dispatch_sync might behave the same as pcmk_ipc_dispatch_poll with no timeout.

Indeed.
Hmm, the original thinking was that EAGAIN would be a brief condition (often one or two tries). It's definitely a problem that timeouts would have no effect if it's stuck in that loop. It's especially bad if a daemon is waiting on another daemon.
I'm thinking we could poll with a 5s timeout.
An async wait via mainloop might be better, especially for daemons, but it's impractical (every send would need a callback for when it completes), and we'd still need the timeout for non-mainloop sends.
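As a plain-POSIX sketch of "block on one poll() with a bounded timeout instead of spinning on EAGAIN" (not Pacemaker's actual retry code; the function name is made up):

```c
#include <errno.h>
#include <poll.h>

// Hypothetical helper: wait up to timeout_ms for the IPC fd to become
// readable, instead of retrying a zero-timeout poll() in a tight loop.
static int
wait_for_reply_fd(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);     // blocks until readable or timeout

    if (rc > 0) {
        return 0;                           // data is available; retry the read now
    }
    return (rc == 0)? ETIMEDOUT : errno;    // timed out, or poll() failed
}
```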
Has anything been done about this yet, by any chance? Otherwise, should I try doing something about it? :-)
@gao-yan I don't think anything's been started yet
Okay, thanks for the update. I'll take a look then.
…terLabs#2902 Right after starting the cluster, the DC is offline for a short time, and if you call 'crmadmin --dc_lookup' during this time, it would hang. This change leaves the xml-src attribute that crmadmin receives from pacemaker-controld empty, so that crmadmin knows the DC was not selected. ref: https://projects.clusterlabs.org/T735
The interaction between crmadmin and pacemaker-controld is done in XML format. When you do 'crmadmin -D', the sequence is as follows: 1) crmadmin::pcmk__send_ipc_request::crm_ipc_send --> controld::route_message::relay_message 2) controld::route_message::relay_message::send_msg_via_ipc --> crmadmin::pcmk__send_ipc_request Besides, there are two other commits in here: 1) The mylog + mylog_xml functionality 2) Fix: controld: leave xml-src attribute empty when no DC selected ClusterLabs#2902 (Right after starting the cluster, the DC is offline for a short time, and if you call 'crmadmin --dc_lookup' during this time, it would hang. This change leaves the xml-src attribute that crmadmin receives from pacemaker-controld empty, so that crmadmin knows the DC was not selected. ref: https://projects.clusterlabs.org/T735)
…hanging Fix: controld: leave xml-src attribute empty when no DC selected #2902
…sterLabs#3606 If the DC is not yet elected, the crmadmin will return an error. (This change complements ClusterLabs#3606).
crm_mon --one-shot checks the pacemakerd state before trying to get a CIB connection. If pacemakerd is shutting down, it returns ENOTCONN. This can cause a resource agent that calls crm_mon (for example, ocf:heartbeat:pgsql) to fail to stop during shutdown.

This may be a regression introduced by commit 3f342e3. crm_mon.c:pacemakerd_status() returns pcmk_rc_ok if pacemakerd is shutting down, since 49ebe4c and 46d6edd (fixes for CLBZ#5471). 3f342e3 refactored crm_mon --one-shot to use library functions, and pcmk_status.c:pacemakerd_status() returns ENOTCONN if pacemakerd is shutting down. As a result, we don't try to connect to the CIB.

Fixes CLBZ#5501