-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory issues with pcmk__schedule_actions() #3810
base: main
Are you sure you want to change the base?
Conversation
c954448
to
a156e58
Compare
We were freeing the string members of text_list_data_t, but not the text_list_data_t object itself. Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is the first step toward dropping that argument. Signed-off-by: Reid Wahl <[email protected]>
ab48cae
to
d355993
Compare
pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is another step toward dropping that argument. Signed-off-by: Reid Wahl <[email protected]>
crm_simulate --profile crashes because of a bug in pcmk__schedule_actions(), described below. pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is another step toward dropping that argument, and it fixes the crash in crm_simulate --profile. Behavior is preserved. The only caller of pcmk__profile_dir() is crm_simulate. The scheduler object has no state except for output flags when pcmk__profile_dir() is called. In particular, pcmk__sched_have_status is not set. So we achieve equivalent behavior by explicitly calling pcmk_reset_scheduler(), setting scheduler->input, setting scheduler flags, and calling cluster_status() -- instead of relying on pcmk__schedule_actions() to do all this via unpack_cib(). This is a regression that was introduced by c24e732 (or a neighboring commit) and has not made it into any release. Signed-off-by: Reid Wahl <[email protected]>
Previously, we leaked cib_object in profile_file() if repeat > 1. Currently the only caller is crm_simulate, and we know the scheduler object has no relevant state at call time. Thus we can reset the scheduler before doing the simulation. We can avoid unnecessary copies and special case handling by setting scheduler->input to cib_object and then setting it back to NULL before resetting the scheduler. Then we can make sure cib_object is freed unconditionally at the end. Signed-off-by: Reid Wahl <[email protected]>
c07d590
to
5ac4bd1
Compare
This touches a lot of the same code as #3735 by the way |
lib/pacemaker/pcmk_verify.c
Outdated
* to the scheduler. | ||
*/ | ||
scheduler->input = pcmk__xml_copy(NULL, *cib_object); | ||
pcmk_scheduler_t *scheduler = pcmk_new_scheduler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clumens This commit will probably be gone by the time you look at the PR, since it breaks things. But it's an interesting side effect of unsetting pcmk__config_has_{error,warning}
when we free the scheduler. The caller of pcmk__verify()
doesn't need the scheduler object at all, so this purpose of this commit was to drop the argument and create a scheduler object within pcmk__verify()
instead -- and then free it before returning.
Problem is, that unsets pcmk__config_has_{error,warning}
, so the caller thinks there were no errors. I'm not sure that unsetting them when we free the scheduler was a good idea. I mean, it works fine if I don't make this change, but it doesn't seem like what I'm doing here SHOULD break anything...
Logically, it's kind of odd that those are globals rather than part of the scheduler data, really... but it looks like it would be super clunky to put them in a scheduler object, since things that call pcmk__config_err()
may not have a scheduler object.
6d1070c
to
05679d6
Compare
The only caller is crm_simulate. This makes the profile_file() helper easier to reason about, by guaranteeing certain properties of the scheduler object. (It was created in pcmk__profile_dir(), the sole caller of profile_file(), so it's easy to see that it has no relevant state except two potential flags.) Note that we're now able to drop the scheduler flag setting from crm_simulate. The only other function that uses this scheduler object is pcmk__simulate(). The first thing it does is set scheduler flags based on the pcmk_sim_flags (via the reset() helper), making the sets in crm_simulate redundant. Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Generally make the code easier to follow. Signed-off-by: Reid Wahl <[email protected]>
It would be more efficient to compare characters directly for "." and ".." instead of making string comparison function calls. I think readability is more important. That overhead is insignificant compared to the simulations that we're profiling. We could simplify the filter further if we don't add the crm_trace messages. I have no preference. I mostly added them as documentation so that what we're doing is obvious. Profiling is really just for developers. Signed-off-by: Reid Wahl <[email protected]>
Due to the way glib parses integer options and casts them to assign them to variables, a value of --repeat="-1" wrapped around to 2^32 - 1, causing a very long loop. Instead, we default it to 0 if a negative value is passed in. Signed-off-by: Reid Wahl <[email protected]>
...to pcmk__schedule_actions(). When pcmk__schedule_actions() is called, cluster_status() has already set the pcmk__sched_have_status flag. So unpack_cib() doesn't reset the scheduler or set scheduler->input; it only sets the given flags. There's also no longer any reason to pass flags, since pcmk__schedule_actions() is no longer resetting the scheduler. The existing flags will be kept. pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is another step toward dropping that argument. Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is another step toward dropping that argument. Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Previously we were passing scheduler->input as the CIB argument, which should be a no-op. And we were already explicitly setting the pcmk__sched_no_counts flag. pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is another step toward dropping that argument. Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() has a use-after-free bug if its cib argument is the same as the current scheduler->input (or if it's an ancestor or descendant). This is another step toward dropping that argument. Signed-off-by: Reid Wahl <[email protected]>
Signed-off-by: Reid Wahl <[email protected]>
Previously, if cluster_status() were called twice without resetting the scheduler between calls, most of the unpacked data would be duplicated in the scheduler object's data structures. For example, the nodes in scheduler->input would be stored twice in the scheduler's node list. This changes the behavior of a public API function, but the previous behavior resulted in incorrect status, so that seems acceptable. This is not a perfect fix: scheduler->flags is publicly accessible, so a caller could clear the pcmk__sched_have_status flag without resetting the scheduler. That would not be wise to do, however, so we can ignore that possibility for now and plan to deprecate cluster_status() later. An alternative approach would be to explicitly free each scheduler data structure before we unpack into it. Then cluster_status() could be called multiple times with different inputs, getting the new status each time. However, there is no internal use case for that, and it seems prone to unforeseen corner cases. Signed-off-by: Reid Wahl <[email protected]>
pcmk__schedule_actions() expects that cluster_status() has already been called, so just do it at the beginning. cluster_status() now checks whether it's already been called and returns early if so, so this is safe. We actually did something similar in pcmk__schedule_actions() until a few commits ago, but the intermediate commits were clearer without pcmk__schedule_actions() calling cluster_status(). Signed-off-by: Reid Wahl <[email protected]>
pcmk_simulate() resets or overwrites most of the passed-in scheduler object already. It resets everything if there are any injections. It doesn't make any sense to pass in an already-initialized scheduler object and rely on any particular behavior. With that said, behavior may change in certain corner cases if the scheduler argument already has some values (such as flags) set. Signed-off-by: Reid Wahl <[email protected]>
05679d6
to
a8ab91b
Compare
@@ -34,6 +34,7 @@ free_list_data(gpointer data) { | |||
|
|||
free(list_data->singular_noun); | |||
free(list_data->plural_noun); | |||
free(list_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This same problem will affect tools/crm_mon_curses.c
. Might as well add that to this patch and break it out into a separate PR so we can merge it while talking about the rest of this PR.
@@ -32,6 +32,7 @@ extern "C" { | |||
* \brief Modify operation of running a cluster simulation. | |||
*/ | |||
enum pcmk_sim_flags { | |||
// @COMPAT Use UINT32_C(1); should not affect behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with UINT32_C() but I see that we are using it in a variety of places already. Should we be using it in the definitions of all these flags enums?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically yeah. It's a macro that makes a fixed-width integer constant basically. I don't think it matters AT ALL. It was something that I suggested as a super-pedantic point on some Pacemaker PR that I halfway reviewed while I was still in support. Ken started using it in a fair number of places after that.
} | ||
|
||
pcmk_reset_scheduler(scheduler); | ||
end = clock(); | ||
out->message(out, "profile", xml_file, start, end); | ||
|
||
done: | ||
pcmk__xml_free(cib_object); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could split this out into its own PR as well to make potential backporting easier, but it's only ever going to affect crm_simulate, and I don't think that's really worth the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can check if it's trivial or not. Currently I've got it ordered after the the crash fix. I did that because right now on main, the mem leak is not reproducible (and thus the fix is not testable) until the crash is fixed. Meanwhile, the crash isn't present on 3.0.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But yeah to your point I don't see how it could affect anything besides crm_simulate
so why bother. It's not currently exposed via any public API, and if it ever is, then it'll be on a Pacemaker version that has this fix.
*/ | ||
return TRUE; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might find what I started in #3735 interesting here. I'd like to return to making progress on that, but I got bogged down with handling all the problems in our scheduler regression tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do, and in fact I mentioned that PR in a comment on this one:
#3810 (comment)
No description provided.