Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory issues with pcmk__schedule_actions() #3810

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

nrwahl2
Copy link
Contributor

@nrwahl2 nrwahl2 commented Jan 24, 2025

No description provided.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-crm_resource branch 4 times, most recently from c954448 to a156e58 Compare January 25, 2025 05:44
@nrwahl2 nrwahl2 marked this pull request as ready for review January 25, 2025 09:36
@nrwahl2 nrwahl2 changed the title WIP: Fix memory issues with pcmk__schedule_actions Fix memory issues with pcmk__schedule_actions Jan 25, 2025
@nrwahl2 nrwahl2 changed the title Fix memory issues with pcmk__schedule_actions Fix memory issues with pcmk__schedule_actions() Jan 25, 2025
@nrwahl2 nrwahl2 requested a review from clumens January 25, 2025 10:06
We were freeing the string members of text_list_data_t, but not the
text_list_data_t object itself.

Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is the first step toward dropping that argument.

Signed-off-by: Reid Wahl <[email protected]>
@nrwahl2 nrwahl2 force-pushed the nrwahl2-crm_resource branch from ab48cae to d355993 Compare January 25, 2025 22:07
pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is another step toward dropping that argument.

Signed-off-by: Reid Wahl <[email protected]>
crm_simulate --profile crashes because of a bug in
pcmk__schedule_actions(), described below.

pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is another step toward dropping that argument, and it
fixes the crash in crm_simulate --profile.

Behavior is preserved. The only caller of pcmk__profile_dir() is
crm_simulate. The scheduler object has no state except for output flags
when pcmk__profile_dir() is called. In particular,
pcmk__sched_have_status is not set. So we achieve equivalent behavior by
explicitly calling pcmk_reset_scheduler(), setting scheduler->input,
setting scheduler flags, and calling cluster_status() -- instead of
relying on pcmk__schedule_actions() to do all this via unpack_cib().

This is a regression that was introduced by c24e732 (or a neighboring
commit) and has not made it into any release.

Signed-off-by: Reid Wahl <[email protected]>
Previously, we leaked cib_object in profile_file() if repeat > 1.

Currently the only caller is crm_simulate, and we know the scheduler
object has no relevant state at call time. Thus we can reset the
scheduler before doing the simulation.

We can avoid unnecessary copies and special case handling by setting
scheduler->input to cib_object and then setting it back to NULL before
resetting the scheduler.

Then we can make sure cib_object is freed unconditionally at the end.

Signed-off-by: Reid Wahl <[email protected]>
@nrwahl2 nrwahl2 force-pushed the nrwahl2-crm_resource branch 2 times, most recently from c07d590 to 5ac4bd1 Compare January 26, 2025 00:32
@nrwahl2
Copy link
Contributor Author

nrwahl2 commented Jan 26, 2025

This touches a lot of the same code as #3735 by the way

* to the scheduler.
*/
scheduler->input = pcmk__xml_copy(NULL, *cib_object);
pcmk_scheduler_t *scheduler = pcmk_new_scheduler();
Copy link
Contributor Author

@nrwahl2 nrwahl2 Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clumens This commit will probably be gone by the time you look at the PR, since it breaks things. But it's an interesting side effect of unsetting pcmk__config_has_{error,warning} when we free the scheduler. The caller of pcmk__verify() doesn't need the scheduler object at all, so this purpose of this commit was to drop the argument and create a scheduler object within pcmk__verify() instead -- and then free it before returning.

Problem is, that unsets pcmk__config_has_{error,warning}, so the caller thinks there were no errors. I'm not sure that unsetting them when we free the scheduler was a good idea. I mean, it works fine if I don't make this change, but it doesn't seem like what I'm doing here SHOULD break anything...

Logically, it's kind of odd that those are globals rather than part of the scheduler data, really... but it looks like it would be super clunky to put them in a scheduler object, since things that call pcmk__config_err() may not have a scheduler object.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-crm_resource branch 2 times, most recently from 6d1070c to 05679d6 Compare January 26, 2025 06:58
The only caller is crm_simulate. This makes the profile_file() helper
easier to reason about, by guaranteeing certain properties of the
scheduler object. (It was created in pcmk__profile_dir(), the sole
caller of profile_file(), so it's easy to see that it has no relevant
state except two potential flags.)

Note that we're now able to drop the scheduler flag setting from
crm_simulate. The only other function that uses this scheduler object is
pcmk__simulate(). The first thing it does is set scheduler flags based
on the pcmk_sim_flags (via the reset() helper), making the sets in
crm_simulate redundant.

Signed-off-by: Reid Wahl <[email protected]>
Generally make the code easier to follow.

Signed-off-by: Reid Wahl <[email protected]>
It would be more efficient to compare characters directly for "." and
".." instead of making string comparison function calls. I think
readability is more important. That overhead is insignificant compared
to the simulations that we're profiling.

We could simplify the filter further if we don't add the crm_trace
messages. I have no preference. I mostly added them as documentation so
that what we're doing is obvious. Profiling is really just for
developers.

Signed-off-by: Reid Wahl <[email protected]>
Due to the way glib parses integer options and casts them to assign them
to variables, a value of --repeat="-1" wrapped around to 2^32 - 1,
causing a very long loop. Instead, we default it to 0 if a negative
value is passed in.

Signed-off-by: Reid Wahl <[email protected]>
...to pcmk__schedule_actions().

When pcmk__schedule_actions() is called, cluster_status() has already
set the pcmk__sched_have_status flag. So unpack_cib() doesn't reset the
scheduler or set scheduler->input; it only sets the given flags.

There's also no longer any reason to pass flags, since
pcmk__schedule_actions() is no longer resetting the scheduler. The
existing flags will be kept.

pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is another step toward dropping that argument.

Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is another step toward dropping that argument.

Signed-off-by: Reid Wahl <[email protected]>
Previously we were passing scheduler->input as the CIB argument, which
should be a no-op. And we were already explicitly setting the
pcmk__sched_no_counts flag.

pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is another step toward dropping that argument.

Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() has a use-after-free bug if its cib argument is
the same as the current scheduler->input (or if it's an ancestor or
descendant). This is another step toward dropping that argument.

Signed-off-by: Reid Wahl <[email protected]>
Previously, if cluster_status() were called twice without resetting the
scheduler between calls, most of the unpacked data would be duplicated
in the scheduler object's data structures. For example, the nodes in
scheduler->input would be stored twice in the scheduler's node list.

This changes the behavior of a public API function, but the previous
behavior resulted in incorrect status, so that seems acceptable.

This is not a perfect fix: scheduler->flags is publicly accessible, so a
caller could clear the pcmk__sched_have_status flag without resetting
the scheduler. That would not be wise to do, however, so we can ignore
that possibility for now and plan to deprecate cluster_status() later.

An alternative approach would be to explicitly free each scheduler data
structure before we unpack into it. Then cluster_status() could be
called multiple times with different inputs, getting the new status each
time. However, there is no internal use case for that, and it seems
prone to unforeseen corner cases.

Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() expects that cluster_status() has already been
called, so just do it at the beginning. cluster_status() now checks
whether it's already been called and returns early if so, so this is
safe.

We actually did something similar in pcmk__schedule_actions() until a
few commits ago, but the intermediate commits were clearer without
pcmk__schedule_actions() calling cluster_status().

Signed-off-by: Reid Wahl <[email protected]>
pcmk_simulate() resets or overwrites most of the passed-in scheduler
object already. It resets everything if there are any injections. It
doesn't make any sense to pass in an already-initialized scheduler
object and rely on any particular behavior.

With that said, behavior may change in certain corner cases if the
scheduler argument already has some values (such as flags) set.

Signed-off-by: Reid Wahl <[email protected]>
@nrwahl2 nrwahl2 force-pushed the nrwahl2-crm_resource branch from 05679d6 to a8ab91b Compare January 26, 2025 07:16
@@ -34,6 +34,7 @@ free_list_data(gpointer data) {

free(list_data->singular_noun);
free(list_data->plural_noun);
free(list_data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This same problem will affect tools/crm_mon_curses.c. Might as well add that to this patch and break it out into a separate PR so we can merge it while talking about the rest of this PR.

@@ -32,6 +32,7 @@ extern "C" {
* \brief Modify operation of running a cluster simulation.
*/
enum pcmk_sim_flags {
// @COMPAT Use UINT32_C(1); should not affect behavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with UINT32_C() but I see that we are using it in a variety of places already. Should we be using it in the definitions of all these flags enums?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically yeah. It's a macro that makes a fixed-width integer constant basically. I don't think it matters AT ALL. It was something that I suggested as a super-pedantic point on some Pacemaker PR that I halfway reviewed while I was still in support. Ken started using it in a fair number of places after that.

lib/pacemaker/pcmk_scheduler.c Show resolved Hide resolved
}

pcmk_reset_scheduler(scheduler);
end = clock();
out->message(out, "profile", xml_file, start, end);

done:
pcmk__xml_free(cib_object);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could split this out into its own PR as well to make potential backporting easier, but it's only ever going to affect crm_simulate, and I don't think that's really worth the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check if it's trivial or not. Currently I've got it ordered after the the crash fix. I did that because right now on main, the mem leak is not reproducible (and thus the fix is not testable) until the crash is fixed. Meanwhile, the crash isn't present on 3.0.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah to your point I don't see how it could affect anything besides crm_simulate so why bother. It's not currently exposed via any public API, and if it ever is, then it'll be on a Pacemaker version that has this fix.

*/
return TRUE;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might find what I started in #3735 interesting here. I'd like to return to making progress on that, but I got bogged down with handling all the problems in our scheduler regression tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do, and in fact I mentioned that PR in a comment on this one:
#3810 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants