
Fix: libpacemaker: crm_mon --one-shot can fail during pacemaker shutdown #2902

Merged: 17 commits, Oct 12, 2022

Conversation

nrwahl2 (Contributor) commented Oct 6, 2022

crm_mon --one-shot checks the pacemakerd state before trying to get a CIB connection. If pacemakerd is shutting down, it returns ENOTCONN. This can cause a resource agent that calls crm_mon (for example, ocf:heartbeat:pgsql) to fail to stop during shutdown.

This may be a regression introduced by commit 3f342e3. crm_mon.c:pacemakerd_status() returns pcmk_rc_ok if pacemakerd is shutting down, since 49ebe4c and 46d6edd (fixes for CLBZ#5471). 3f342e3 refactored crm_mon --one-shot to use library functions, and pcmk_status.c:pacemakerd_status() returns ENOTCONN if pacemakerd is shutting down. As a result, we don't try to connect to the CIB.

Fixes CLBZ#5501

nrwahl2 force-pushed the nrwahl2-clbz#5501 branch from a3d8402 to ae51b37 on October 6, 2022 23:09
nrwahl2 mentioned this pull request Oct 6, 2022
nrwahl2 (Contributor, Author) commented Oct 11, 2022

Replaced the simple "just enough to unbreak it" fix with a redesign that uses the pcmk__pacemakerd_status() function and the pacemakerd-health message. It's a lot of commits, but most of them are trivial things that I found along the way. Other than that, it's prep work and then the main change in the last commit.

}
break;
case EREMOTEIO:
// @TODO: Why do we try to connect to the fencer and CIB here?
nrwahl2 (Contributor, Author):

I haven't figured out why EREMOTEIO is considered okay and we try to connect to the fencer and the CIB after getting that RC. I went ahead and preserved the existing behavior there.

Contributor:

pacemakerd IPC is not proxied to remote nodes, so any attempt to contact it will always fail with EREMOTEIO when run from a remote node. However the fencer and CIB are proxied and may be available.

nrwahl2 (Contributor, Author):

In that case, is the "waiting to be connected by/contacted by the cluster" message inaccurate? Sounds like even if the cluster has already reached us, we'll still fail with EREMOTEIO.

Contributor:

Sort of. I think the idea is that if the CIB or fencer can be reached, it will immediately be cleared off the screen, but otherwise it'll be visible.

lib/pacemaker/pcmk_output.c (review thread resolved)
nrwahl2 (Contributor, Author) commented Oct 11, 2022

Quick note since CI is hung due to infra issues: the rpm target passed on my machine.

kgaillot (Contributor) left a comment:

Whew, this got big ...

lib/pacemaker/pcmk_cluster_queries.c (review thread resolved)
lib/pacemaker/pcmk_output.c (review thread resolved)
lib/pacemaker/pcmk_cluster_queries.c (review thread resolved)
lib/pacemaker/pcmk_output.c (review thread resolved)
}
break;
case EREMOTEIO:
// @TODO: Why do we try to connect to the fencer and CIB here?
Contributor:

pacemakerd IPC is not proxied to remote nodes, so any attempt to contact it will always fail with EREMOTEIO when run from a remote node. However the fencer and CIB are proxied and may be available.

include/pcmki/pcmki_status.h (review thread resolved)
if (rc != EREMOTEIO) {
out->err(out, "error: Could not connect to %s: %s",
pcmk_ipc_name(api, true), pcmk_rc_str(rc));
}
Contributor:

This means that all other callers would have to check for EREMOTEIO and output the error. Maybe a bool argument for whether EREMOTEIO is ok (probably also need to return pcmk_rc_ok if so)

nrwahl2 (Contributor, Author):

Yeah I don't like the idea of duplicating work in the callers. I'd like to eventually find a way to use pcmk__pacemakerd_status() for console and daemonized modes of crm_mon...

One of my thoughts was that we might need the caller to know that we got EREMOTEIO (e.g., to set on_remote_node) even though EREMOTEIO is considered OK. Setting on_remote_node is the only place that's actually needed right now, hypotheticals aside.

On closer inspection, on_remote_node is only used for choosing an error message if the initial do/while loop fails, and that logic probably isn't ideal anyway. I'm not convinced we should give up and exit in the scenario that pacemakerd_status() passes and cib_connect() fails... as it looks like we're doing now for Pacemaker Remote and cluster nodes.

nrwahl2 (Contributor, Author):

My current thought is to return EREMOTEIO from pcmk_cluster_queries.c:ipc_connect() if eremoteio_ok (bool) is true, but without logging an error. That way if the caller needs to do anything special, it can.

In particular, I'd like pcmk__status() to output a pacemakerd-health message with state "Running on Pacemaker Remote node". To do that, it needs to know that ipc_connect() returned EREMOTEIO.

nrwahl2 (Contributor, Author):

We could also just skip the pacemakerd-health message altogether in that case. We'd just have less detailed info if we fail to connect to the CIB.

nrwahl2 (Contributor, Author):

...yeah, probably better to just skip pacemakerd-health for remote nodes.

  1. No clean way to do a short vs. friendly state string for "Running on Pacemaker Remote node"
  2. From your comment on another thread, it sounds like "Running on" would be misleading since pacemakerd isn't proxied to remote nodes... and maybe we'd EREMOTEIO even if pacemaker_remoted isn't running, I dunno

Contributor:

Yep, this is a good sign that you should test the affected commands on remote nodes (not running, running but not connected, connected).

nrwahl2 (Contributor, Author):

In a later PR (not touching it here), I might rethink this... setting rc = pcmk_rc_ok within ipc_connect() is fine if and only if we also set data->pcmkd_state = pcmk_pacemakerd_state_running when we get EREMOTEIO. That no longer seems unreasonable. It seems that EREMOTEIO is returned if and only if pacemaker-remoted is running (regardless of whether it's connected). That's roughly equivalent to pacemakerd running.

The current and planned callers treat EREMOTEIO the same as pcmk_rc_ok && pcmk_pacemakerd_state_running, so we could cut out the middle man by doing it here if there's no objection.

Contributor:

Sounds reasonable

lib/common/ipc_pacemakerd.c (review thread resolved)
lib/pacemaker/pcmk_status.c (review thread resolved)
nrwahl2 (Contributor, Author) commented Oct 11, 2022

Updated to address review. I think the below bullet points cover all the changes...

  • Use switch instead of array for enum2friendly()
  • Use guint for timeout_ms argument to pcmk__status()
  • Added commit on top to change sys_from default to "unknown subsystem"
  • Don't return an error for unexpected event type.
  • Use an enum pcmk_pacemakerd_state in pcmk_cluster_queries.c:data_t instead of void *private
  • Use hardcoded "pacemakerd" instead of hardcoded "pacemakerd-health" for XML element name, for backward compatibility
  • Use pcmk_ipc_dispatch_sync instead of pcmk_ipc_dispatch_main if message_timeout_ms == 0 in pcmk_cluster_queries.c. This lets us avoid possible complications if the caller has its own mainloop.
  • Add an eremoteio_ok boolean argument to pcmk_cluster_queries.c:ipc_connect. If true, don't log an error message for EREMOTEIO. We still return EREMOTEIO because the caller may need to handle it specially. In particular, state will be invalid, so we can't rely on pcmk_rc_ok && <a good state> in our switch statement in pcmk__status(). We need to treat EREMOTEIO as automatically okay in pcmk__status().

nrwahl2 force-pushed the nrwahl2-clbz#5501 branch 2 times, most recently from 8c12250 to 98fcf00, on October 11, 2022 23:19
kgaillot (Contributor) left a comment:

Just about there

xml/api/pacemakerd-health-2.25.rng (review thread resolved)
lib/pacemaker/pcmk_output.c (review thread resolved)
Given an enum pcmk_pacemakerd_state value, this function returns a
user-friendly string representation. This will be used in future
commits.

Signed-off-by: Reid Wahl <[email protected]>
If the XML_ATTR_TSTAMP attribute is present but can't be parsed as an
integer, value_ll gets set to PCMK__PARSE_INT_DEFAULT (-1). This should
never happen, but just in case, we should convert a negative to 0 before
we cast to time_t, an unsigned type.

Signed-off-by: Reid Wahl <[email protected]>
Default should be NULL so that the last_updated default gets used
correctly in the pacemakerd-health message.

Signed-off-by: Reid Wahl <[email protected]>
Use pcmk_rc_ipc_unresponsive if we don't get a response from the API,
and EBADMSG if we get a bad reply or an unexpected reply type.

Signed-off-by: Reid Wahl <[email protected]>
Previously, the pacemakerd-health message accepted only a state string.
This made it difficult to use different state strings for different
output formats.

Now, the pacemakerd-health message accepts an enum pcmk_pacemakerd_state
value and an optional state string. If the state string is not set, then
the formatter function looks up an appropriate string representation for
the state. If the state string is set, it acts as an explicit override
and is used in place of a lookup.

Note that this will cause "invalid" to be printed instead of "<null>"
for quiet text outputs, and it will cause "Invalid pacemakerd state" to
be printed instead of "unknown state" for the default output.

Signed-off-by: Reid Wahl <[email protected]>
build_uname_list -> pe__build_node_name_list()

Signed-off-by: Reid Wahl <[email protected]>
Make only_node and only_rsc arguments const. Add doxygen blocks. Change
"st" argument to "stonith".

This is not comprehensive. We're updating pcmk__status() because we're
about to add a new argument, and pcmk__output_cluster_status() because
it shares most of its arguments with pcmk__status().

Signed-off-by: Reid Wahl <[email protected]>
We were using F_CRM_SYS_FROM as the name of the XML element, instead of
something static like "pacemakerd". It happens that the value in
F_CRM_SYS_FROM seems to always be CRM_SYSTEM_MCP ("pacemakerd"), so the
element name was effectively deterministic. Nonetheless, the schema
required the element be called "pacemakerd"; there was no allowance for
another system name. That defeats any purpose of flexible element
naming.

It seems better to call the element "pacemakerd" and make
sys_from a field, if we keep sys_from at all. (Can't use
"pacemakerd-health" for backward compatibility reasons.)

Additionally, if sys_from or last_updated is NULL, pass them directly to
pcmk__output_create_xml_node(). Those attributes will simply be skipped
if their values are NULL.

Signed-off-by: Reid Wahl <[email protected]>
If message_timeout_ms == 0 for various functions in
pcmk_cluster_queries.c, default to using sync dispatch instead of
starting a mainloop with timeout 30s that behaves basically like sync
dispatch.

This makes it easier to reason about calling these functions when the
caller may have its own mainloop.

Signed-off-by: Reid Wahl <[email protected]>
crm_mon --one-shot checks the pacemakerd state before trying to get a
CIB connection. If pacemakerd is shutting down, it returns ENOTCONN.
This can cause a resource agent that calls crm_mon (for example,
ocf:heartbeat:pgsql) to fail to stop during shutdown.

This is a regression introduced by commit 3f342e3.
crm_mon.c:pacemakerd_status() returns pcmk_rc_ok if pacemakerd is
shutting down, since 49ebe4c and 46d6edd (fixes for CLBZ#5471). 3f342e3
refactored crm_mon --one-shot to use library functions. pcmk__status()
now does most of the work, calling pcmk_status.c:pacemakerd_status().
That function returns ENOTCONN if pacemakerd is shutting down. As a
result, we don't try to connect to the CIB during shutdown.

Here we update pcmk__status() to use pcmk__pacemakerd_status() instead
of a static and mostly redundant pacemakerd_status(). It receives the
pacemakerd state via an output pointer argument. If pacemakerd is
running or shutting down (or if we get an EREMOTEIO rc), we try
connecting to the fencer and CIB. However, as long as we successfully
get the pacemakerd state, we return success from pcmk__status(), since
we did obtain the cluster status.

A couple of minor notes:
* pcmk__status() now takes a timeout argument that it passes to
  pcmk__pacemakerd_status(). timeout == 0 uses pcmk_ipc_dispatch_sync,
  matching the old implementation. A positive timeout uses
  pcmk_ipc_dispatch_main.
* pcmk_cluster_queries.c:ipc_connect() no longer always prints a "Could
  not connect" error for EREMOTEIO. The caller may consider it OK.

Fixes T579
Fixes CLBZ#5501

Signed-off-by: Reid Wahl <[email protected]>
sys_from should be a subsystem ("pacemakerd" is expected), not a node.

Signed-off-by: Reid Wahl <[email protected]>
nrwahl2 (Contributor, Author) commented Oct 12, 2022

Rebased on current main and updated to address review

* If positive, \p pcmk_ipc_dispatch_main
* will be used, and a new mainloop will be
* created for this purpose (freed before
* return).
nrwahl2 (Contributor, Author):

I wonder if it would be better to replace the mainloop stuff in pcmk_cluster_queries.c with pcmk_ipc_dispatch_poll when we need a timeout. That way we never have to worry about mainloop conflicts.

If so, it'll go in a separate PR.

Contributor:

Makes sense, as long as anything crm_mon needs still works

nrwahl2 (Contributor, Author) commented Oct 12, 2022:

Ack. crm_mon already uses polling (timeout == reconnect_ms / 2) instead of sync or mainloop, for its console and daemonized modes. one-shot can continue using sync.

kgaillot merged commit 44db4ca into ClusterLabs:main on Oct 12, 2022
nrwahl2 deleted the nrwahl2-clbz#5501 branch on December 15, 2022 20:02
pcmk_ipc_api_t *controld_api = NULL;

if (message_timeout_ms == 0) {
dispatch_type = pcmk_ipc_dispatch_sync;
Member:

AFAICT this also changed the behavior of "crmadmin -D" under certain circumstances.

Now, upon cluster startup, if "crmadmin -D" is executed before a DC has been elected, it will hang forever.

AFAICS the lookup involves a ping request to the DC. But given that there's no DC to actually handle the request at that moment, crmadmin most probably won't receive any reply in this situation...

Previously, with an implied 30s timeout, the command would eventually return with an error rather than hanging forever.

Of course, first it's a question what the best default for such a case would be.

Contributor:

Hmm, it seems suboptimal that an IPC request sent to CRM_SYSTEM_DC will be broadcast to the CPG and no one will reply if there is no DC.

It would be nice if the local controller could return an error immediately in relay_message() if the message is for the DC but controld_globals.dc_name is NULL (possibly only for CRM_OP_PING to reduce the scope for unexpected consequences). Unfortunately the local controller might just not have learned the DC yet (in do_cl_join_announce() and do_cl_join_query() we set DC to NULL while waiting to hear from the actual DC). Maybe it's the lesser evil to get an error in that situation (especially if we can return something like EAGAIN/CRM_EX_UNSATISFIED).

@nrwahl2 , do you want to look into that?

Contributor:

FYI this PR landed in 2.1.5

nrwahl2 (Contributor, Author):

The idea was that synchronous requests are supposed to time out after 5 seconds...

@nrwahl2 , do you want to look into that?

"want" is such a loaded word ;)

Member:

sleep() would still be synchronous (although it would be less work, at the expense of a delay in finding a message)

I think pcmk_poll_ipc() would be just as busy -- it's just shifting the work to the poll() system call. It blocks on poll() until either we receive something or the timeout we pass expires.

Blocking on one poll() call is better than constantly iterating it. Since crm_ipc_read() calls qb_ipc_us_ready() -> poll() with a timeout of 0, continuously iterating it without an interval due to EAGAIN significantly raises CPU load.

I mentioned in T735 that at a glance, it looks like pcmk_ipc_dispatch_sync might behave the same as pcmk_ipc_dispatch_poll with no timeout.

Indeed.

Contributor:

Hmm, the original thinking was that EAGAIN would be a brief condition (often one or two tries). It's definitely a problem that timeouts would have no effect if it's stuck in that loop. It's especially bad if a daemon is waiting on another daemon.

I'm thinking we could poll with a 5s timeout.

An async wait via mainloop might be better, especially for daemons, but it's impractical (every send would need a callback for when it completes), and we'd still need the timeout for non-mainloop sends.

Member:

Has anything been done about this yet, by any chance? Otherwise, should I try doing something about it? :-)

Contributor:

@gao-yan I don't think anything's been started yet

Member:

Okay, thanks for the update. I'll take a look then.

aleksei-burlakov pushed a commit to aleksei-burlakov/pacemaker that referenced this pull request Oct 15, 2024
…terLabs#2902

Right after starting the cluster, the DC is offline for a short time,
and if you call 'crmadmin --dc_lookup' during this time, it would hang.
This change leaves the xml-src attribute that crmadmin receives from
pacemaker-controld empty, so that crmadmin knows the DC was not
selected.
ref: https://projects.clusterlabs.org/T735
aleksei-burlakov pushed a commit to aleksei-burlakov/pacemaker that referenced this pull request Oct 16, 2024
The interaction between crmadmin and pacemaker-controld is done in
XML format. When you do 'crmadmin -D', the sequence is as follows:
1) crmadmin::pcmk__send_ipc_request::crm_ipc_send --> controld::route_message::relay_message
2) controld::route_message::relay_message::send_msg_via_ipc --> crmadmin::pcmk__send_ipc_request

Besides, there are two other commits in here

1) The mylog + mylog_xml functionality

2) Fix: controld: leave xml-src attribute empty when no DC selected ClusterLabs#2902

( Right after starting the cluster the DC is offline for short time,
and if you call 'crmadmin --dc_lookup' during this time, it would hang.
This change leaves the xml-src attribute that the crmadmin receives from
the pacemaker-controld empty, so that crmadmin knows the DC was not
selected.
ref: https://projects.clusterlabs.org/T735 )
kgaillot added a commit that referenced this pull request Oct 16, 2024
…hanging

Fix: controld: leave xml-src attribute empty when no DC selected #2902
kgaillot added a commit that referenced this pull request Oct 17, 2024
…hanging-2.1

Fix: controld: leave xml-src attribute empty when no DC selected #2902 #3606
aleksei-burlakov pushed a commit to aleksei-burlakov/pacemaker that referenced this pull request Nov 2, 2024
…sterLabs#3606

If the DC is not yet elected, crmadmin will return an error.
(This change accomplishes ClusterLabs#3606).