Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: tools: crm_mon --daemonize should update when disconnected #2868

Merged
merged 13 commits into from
Nov 8, 2022

Conversation

nrwahl2
Copy link
Contributor

@nrwahl2 nrwahl2 commented Sep 15, 2022

With crm_mon --daemonize, the output will continue to show the last known status after the cluster is stopped on the local node. It should go back to something like "Cluster is not available".

Likewise, messages like "Reconnecting..." should not go to the daemonized output. The output file (or external handler) should receive only the cluster status. So we print those messages only if output_format is mon_output_console (or if we're in one-shot mode, where applicable). Normally those messages don't get printed to the daemonized output anyway because we don't flush the buffer, but theoretically they could get printed if the buffer fills up.

It's questionable whether we want to print "Cluster is not available" when we're in interactive mode or only in daemonized mode. For now, we'll do daemonized-only. Interactive (console) mode already gets the "Connection to cluster daemons terminated" message, which is quickly replaced by "Reconnecting...".

External agents are notified via traps rather than via output, so we can ignore them.

Closes T15

@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Sep 15, 2022

The first commit may not be comprehensive -- for example, "waiting for start" type messages may still appear in daemonized text output during startup. It's an improvement and should take care of most situations after we're already up and running.

tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Oct 6, 2022

pushed from the wrong local branch...

@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Oct 6, 2022

Updated. This gets the job done for daemonized mode without switching on output formats and such, and without negatively impacting console or one-shot mode.

It does not print a string representation of the pacemakerd status (e.g., "Pacemaker daemons shutting down") if the pacemakerd state is noteworthy when run in one-shot mode. That would require either merging pcmk_status.c:pacemakerd_status() and crm_mon.c:pacemakerd_status() into one function that can fulfill both use cases' requirements, or adding a fair amount of code to pcmk_status.c:pacemakerd_status() and its callers.

I think it would be a good idea to merge them into library functions at some point -- and likewise for other similar static functions. I started working on it and named it pcmk__pacemakerd_status()... and then it failed to compile because there's already a function with that name in pcmk_cluster_queries.c. That function and its event callback do some of the same things but were built with a different use case in mind. Getting everything merged cleanly shouldn't be a huge undertaking, but it's not trivial and so I don't want to let it hold up this PR.

I also put a couple of formatter functions for a new message in crm_mon.c and registered them directly. It would make intuitive sense to have crm_mon_register_messages() register them. That function currently is defined in crm_mon_curses.c and registers formatters specifically for curses. If we want, we can change that function name to "crm_mon_register_curses_messages" and then define a "crm_mon_register_messages" in crm_mon.c.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've improved the crm_mon code drastically over the past few years, but it's still difficult to follow :-/

tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Oct 11, 2022

Depending on how much effort it is, before having this merged, I may try to make the console and daemonized mode share more code with one-shot mode since we're doing that work in #2902. That would entail, at minimum, using pcmk__pacemakerd_status(). It could also involve sharing a cib_connect() and fencing_connect() implementation and maybe incorporating pcmk__status()... I haven't given it any detailed thought since starting on #2902.

@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Oct 17, 2022

Updated minimally to address review. Most of the massive Compare output is from rebasing on current main and resolving conflicts. There's still room for improvement in both the output and the implementation, but this gets the job done.

I'll be filing another PR with some other refactors as a start

@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Oct 17, 2022

The CI failures were RPM issues on three hosts. Two of them were OpenSUSE hosts that timed out when trying to access the libqb100 RPM. The other was CentOS 9 s390x: " - nothing provides libqb(s390-64) = 2.0.6-1.el9.next needed by libqb-devel-2.0.6-1.el9.next.s390x"

tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Outdated Show resolved Hide resolved
tools/crm_mon.c Show resolved Hide resolved
@nrwahl2 nrwahl2 marked this pull request as draft November 3, 2022 22:09
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Nov 3, 2022

Totally replaced commits.

Still working out some schema issues (edit: fixed in #2919)

@nrwahl2 nrwahl2 marked this pull request as ready for review November 3, 2022 23:00
tools/crm_mon.c Outdated Show resolved Hide resolved
Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

lib/cib/cib_utils.c Show resolved Hide resolved
tools/crm_mon.c Show resolved Hide resolved
tools/crm_mon.c Outdated
@@ -672,7 +672,7 @@ reconnect_after_timeout(gpointer data)
static void
mon_cib_connection_destroy(gpointer user_data)
{
out->info(out, "Connection to the cluster-daemons terminated");
out->transient(out, "\nConnection to the cluster lost");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a newline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because the string of ellipses (of varying length) before the first "Connection lost" message always bugged me:

Cluster Summary:
  * Stack: corosync
  * Current DC:	node1 (version 2.1.5-0b3656e85) - partition with quorum
  * Last updated: Thu Nov  3 17:02:27 2022
  * Last change:  Thu Nov  3 15:59:10 2022 by hacluster via crmd on node1
  * 1 node configured
  * 2 resource instances configured (1 DISABLED)

Node List:
  * Online: [ node1 ]

Active Resources:
  * dummy	(ocf:pacemaker:Dummy):   Started node1
......Connection to the cluster-daemons terminated
Connection to the cluster-daemons terminated

I'm not tied to the newline though. Other options besides the newline:

  • Leave it as-is
  • Modify the curses error and info/transient functions so that they move the cursor to the beginning of the current line before printing. Not sure if that would have any unintended consequences, but it definitely looks the nicest in this scenario:
Cluster Summary:
...
Active Resources:
  * dummy	(ocf:pacemaker:Dummy):   Started node1
Connection to the cluster lost
Connection to the cluster lost

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely think the handling should be in the implementation rather than the message string. The idea is that the message string should be identical across all formats and not have format-specific control characters.

Moving to the beginning of the line should be fine for error at least (clearing the line would also be a good idea in case the existing content is longer than the error message). For info/transient it would probably be fine too. I suppose the ideal would be to check the line position and output a newline if not at the beginning, but that might not be worth the hassle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to just remove the newline from the commit for now. We can revisit this later.

that might not be worth the hassle.

For curses, doing something based on the current position is no more difficult than moving to the beginning of the line. Which is to say, not very. It's probably doable in other formats as well (something like get the character at ftell(fd) - 1 and act if that character is not a newline? Haven't done much with streams).

There may or may not be existing places where we build part of a string piece-by-piece based on conditions, and then finish it with out->info(out, "the rest"). Any new approach that adds a newline or overwrites part of a line would break that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be, we rejected that possibility when designing the interface. We use the usual techniques to build a string then pass the whole string to the output object function. That's the only way to abstract e.g. text vs xml needing a newline or not. We considered separate methods for build-up-a-line and end-a-line but there are already functions for that purpose, and not all formats are line-oriented.

Just dropping the newline is fine for this PR

tools/crm_mon.c Show resolved Hide resolved
The XML message should use the short output, not the friendly output.

Signed-off-by: Reid Wahl <[email protected]>
This isn't really important, but it's more correct to set the pacemakerd
health last updated time based on the time of the query rather than the
time of the message call.

Signed-off-by: Reid Wahl <[email protected]>
As far as I can tell, it's equivalent to `clear(); refresh();`, except
that it loops over all the lines individually. If we feel it's worth
functionizing to save one line per call, we can add it back. I default
to being explicit here.

Signed-off-by: Reid Wahl <[email protected]>
Currently, if we fail to get a query result, we return pcmk_rc_no_input.
This masks the true source of the error, so we should only do this as a
failsafe when the return code is pcmk_rc_ok and there's still no query
result XML.

Signed-off-by: Reid Wahl <[email protected]>
@nrwahl2 nrwahl2 force-pushed the nrwahl2-T15 branch 2 times, most recently from a30e6f0 to 757936d Compare November 5, 2022 00:47
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Nov 5, 2022

Updated to address review, which was more involved than expected (see #2868 (comment)). This is already way too big, so if it gets any bigger, it'll be split up.

  • Updated cib__signon_query() to return pcmk_rc_no_input if (rc == pcmk_rc_ok) && (*cib_object == NULL).
  • Added a commit at the beginning to fix the pcmkd state display for XML format (should use the short display instead of friendly).
  • Added two commits to include pacemakerd status in the cluster-summary message for native CIB connections. It's excluded by default for file and remote CIB connections (if you include it, it will always say "Unknown"). It's also specifically excluded when pcmk_simulate calls cluster-summary.

Not addressed yet:

  • Newline before "Connection to the cluster lost". Need to figure out whether to leave that as-it-was, add the newline, or modify the curses error/info functions to start at the beginning of the line.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-T15 branch 3 times, most recently from b406e53 to 6d72ffa Compare November 5, 2022 01:36
Changes:
* New crm-mon-disconnected message when not connected to the CIB
* Pacemaker status in cluster stack

Ref T15

Signed-off-by: Reid Wahl <[email protected]>
crm_mon's cluster stack section now includes the pacemakerd status for
native CIB connections. The stack section does not include the
pacemakerd status for file and remote CIB connections.

Since the summary should only be shown when we have a CIB
connection, there should be three possible values in practice:
* Pacemaker is running
* Pacemaker is shutting down
* pacemaker-remoted is running (on a Pacemaker Remote node)

Signed-off-by: Reid Wahl <[email protected]>
pcmk__pacemakerd_status() can replace most of what the static
pacemakerd_status() was doing. A new setup_api_connections() function
can take the place of pacemakerd_status() and conditionally connect to
the fencer and CIB.

There's also no compelling reason to keep the use_cib_native and
on_remote_node variables. For the latter, we can use a global
pcmkd_state variable. (It may be desirable to use fewer globals, but
that's out of scope regardless.)

Signed-off-by: Reid Wahl <[email protected]>
To align with the cib and daemons functions

Signed-off-by: Reid Wahl <[email protected]>
Note also that the "Writing html..." message was obsolete.

Signed-off-by: Reid Wahl <[email protected]>
With crm_mon --daemonize, currently the output will continue to show the
last known status after the cluster is stopped on the local node. This
commit causes it to write "Not connected to CIB" (and more specific
details) to the output file:
* before the initial connection;
* as soon as the CIB connection is destroyed; and
* every time a reconnection attempt fails

External agents are notified via traps rather than via output, so we can
ignore them.

We register the new message formatter functions directly instead of via
crm_mon_register_messages(). The reason is that
crm_mon_register_messages() is used for messages specific to the curses
format, so that they can be registered from within crm_mon.c. The new
crm-mon-disconnected formatter isn't used with console output; it's not
really necessary, and it's more complicated to implement (attempts so
far led to display issues after connection loss).

In the future it might make sense to rename crm_mon_register_messages()
in crm_mon_curses.c and reuse that name in crm_mon.c.

Closes T15

Signed-off-by: Reid Wahl <[email protected]>
Today I learned you don't have to use "int" here.

Signed-off-by: Reid Wahl <[email protected]>
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Nov 7, 2022

Updated to:

  • remove the newline from the "Connection to cluster lost" transient message
  • move the pacemakerd state into the stack section:
    <stack type="corosync" pacemakerd-state="running"/>
    <stack type="corosync" pacemakerd-state="shutting_down"/>

  * Stack: corosync (Pacemaker is running)
  * Stack: corosync (Pacemaker daemons are shutting down)
  • Remove the pacemakerd-health messages from crm_mon.c (the crm-mon-disconnected message covers this already for the non-one-shot case)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants