
Handle queue workers inheriting values from worker_base: fix incorrect count and memory threshold values on workers tab #7725

Merged

Conversation

jrafanie
Member

The existing code checked only the worker's own value and fell back to
queue_worker_base, but it never fell back to worker_base when the other two
locations had no value.

This causes the select's value to default to the first item in the select, which is 200 MB.
If, as a user, you try to change some other unrelated value and then save, all
of these "incorrect" 200 MB values get saved instead of the correct value
inherited from worker_base.

For example, smart_proxy_worker was correct because that worker class specified a
memory_threshold. Reporting, C & U processor and collector, Generic, and
Priority workers were all showing 200 MB in the select because that's the first
entry in the select and because the default of 1.gigabyte was coming from
worker_base and no value was coming from queue_worker_base or the worker class
itself.

For more information, review https://github.com/ManageIQ/manageiq/blob/19b57d3c88679b12718e301f4a19d4b1fee11705/config/settings.yml#L1063-L1113 and see that the reporting worker specifies no memory_threshold and queue_worker_base also specifies no memory_threshold, so the reporting worker should inherit the worker_base value of 1.gigabytes. The same problem applies to the other workers mentioned above.
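
To make the intended three-level fallback concrete, here is a minimal sketch (illustrative only, not the PR's actual diff; `toBytes` and `resolveMemoryThreshold` are stand-in names, and `workerBase` stands in for the worker_base subtree of the server settings):

```js
// Illustrative fallback chain: worker class -> queue_worker_base -> worker_base.
const toBytes = (str) => {
  // crude parser for values like '1.gigabytes' or '500.megabytes'
  const [num, unit] = String(str).split('.');
  const units = { megabytes: 1024 ** 2, gigabytes: 1024 ** 3 };
  return Number(num) * (units[unit] || 1);
};

const resolveMemoryThreshold = (workerBase, workerName) => {
  const worker = workerBase.queue_worker_base[workerName] || {};
  return toBytes(
    worker.memory_threshold                                    // 1. the worker class itself
    || workerBase.queue_worker_base.defaults.memory_threshold  // 2. queue_worker_base
    || workerBase.defaults.memory_threshold                    // 3. worker_base
  );
};

// A reporting worker with no value of its own and an empty queue_worker_base
// defaults resolves to worker_base's 1.gigabytes rather than the select's
// first entry (200 MB).
```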

Before

[screenshot: workers_tab_1]
[screenshot: workers_tab_2]

After

[screenshot: workers_tab_after]

const baseMemDefault = toBytes(wb.queue_worker_base.defaults.memory_threshold);
const monitorDefault = toBytes(wb.event_catcher.defaults.memory_threshold);

const selectCount = value => (typeof value === 'number' ? value : countDefault);

Member

Only thing I'm curious about is what this === 'number' means... @kavyanekkalapu do you know?

Member

Maybe it's to handle 0 which is a number but is also falsey?

Member Author

Note, we also don't check that countDefault is a number, but theoretically it could be empty/"". We might need validation on all worker values + all *Default values here. I can't tell.
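
A small illustration of the 0-is-falsey point being discussed (the countDefault value here is made up, not the one from the PR):

```js
// A bare `value || countDefault` would replace a legitimate 0 with the
// default, because 0 is falsey; the typeof check preserves it.
const countDefault = 2; // illustrative default

const selectCountTruthy = (value) => value || countDefault;
const selectCountTyped = (value) => (typeof value === 'number' ? value : countDefault);

console.log(selectCountTruthy(0));        // 2  -- the 0 is silently lost
console.log(selectCountTyped(0));         // 0  -- preserved
console.log(selectCountTyped(undefined)); // 2  -- falls back as intended
```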

-      defaults: {
-        memory_threshold: '500.megabytes',
-      },
+      defaults: {},

Member Author

@jrafanie jrafanie Apr 28, 2021

I removed the 500.megabytes example for the queue_worker_base defaults to recreate the situation we have on master where some queue based workers have a value specified and others inherit from worker_base, such as reporting worker in this example.
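
To make that scenario concrete, a hypothetical fixture shape along these lines (names and values are illustrative, not the actual spec data):

```js
// queue_worker_base carries no memory_threshold of its own, so a queue worker
// without one, such as the reporting worker, must fall all the way back to
// worker_base -- the situation on master this change recreates.
const workerBaseFixture = {
  defaults: { memory_threshold: '1.gigabytes' },               // worker_base default
  queue_worker_base: {
    defaults: {},                                              // nothing to inherit here
    smart_proxy_worker: { memory_threshold: '500.megabytes' }, // has its own value (illustrative)
    reporting_worker: {},                                      // should resolve to 1.gigabytes
  },
};
```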

Member

@Fryguy Fryguy left a comment

LGTM - would also like @kavyanekkalapu to review, since I had a hand in helping @jrafanie .

@jrafanie jrafanie force-pushed the fix_inheritance_of_worker_settings branch from b44e2cc to 8474293 Compare April 28, 2021 17:38
@miq-bot
Member

miq-bot commented Apr 28, 2021

Checked commit jrafanie@8474293 with ruby 2.6.3, rubocop 0.82.0, haml-lint 0.35.0, and yamllint
0 files checked, 0 offenses detected
Everything looks fine. 🏆

@jrafanie
Member Author

> LGTM - would also like @kavyanekkalapu to review, since I had a hand in helping @jrafanie .

yes, I blame @Fryguy for recently being in this code and knowing enough to make me dangerous in changing this code. 🤣


const parsedValues = {
  generic_worker: {
    memory_threshold: parseWorker(wb.queue_worker_base.generic_worker).bytes || baseMemDefault,
    count: selectCount(wb.queue_worker_base.generic_worker.count),

Member

This UI code knows way too much about the worker class hierarchy. In other words, if we ever change that hierarchy, we need to be mindful of this area of code. Also, I'm wondering if changing the same settings via our API is possible and, if so, how safe that code is.

Member Author

@jrafanie jrafanie Apr 28, 2021

Yeah, this is exactly the problem here. We're reimplementing the backend logic in the frontend.

Apparently, we don't have an API for getting worker settings yet. Unfortunately, the current code is really bad: it's the source of a user accidentally changing lots of workers to use a 200 MB memory threshold, so we need to fix this. I'm not sure when this was broken.

See below...

Member

💯 agree - unfortunately, we don't have an API that normalizes the worker settings, and that would be the "real" fix

Member Author

@jrafanie jrafanie Apr 29, 2021

Yeah, thanks... I should have clarified: the UI is calling API.get('/api/servers/${id}/settings'), so we do have an API for getting server settings, but it does not walk the workers subhash to resolve each subclass's settings from inherited values.

This isn't the first time that this special inheritance structure of the workers subhash has caused confusion and complicated code. Even if we flatten worker settings, we still have this inheritance, which isn't always obvious to most users.

Is it time to consider whether doing inheritance in the workers hash makes sense at all? I believe the original idea was to enable setting more general defaults that get inherited, but if that isn't easy, is it buying us anything other than complicated code?
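
For context, the resolution the frontend ends up re-deriving amounts to a settings merge along these lines (a rough sketch, not ManageIQ's actual implementation; key names mirror the workers subhash):

```js
// Merge order gives the worker's own entry highest precedence:
// worker_base defaults < queue_worker_base defaults < the worker's own entry.
const resolveQueueWorkerSettings = (workerBase, workerName) => ({
  ...workerBase.defaults,
  ...workerBase.queue_worker_base.defaults,
  ...(workerBase.queue_worker_base[workerName] || {}),
});

// A worker entry that omits memory_threshold picks it up from the nearest
// ancestor that defines one.
```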

Member

@Fryguy Fryguy Apr 30, 2021

I spoke with @jrafanie and I'm leaning towards removing the inheritance as well (not for this PR). If we did so, we'd need to update this UI to present all of the workers and allow them to edit it. In its current presentation it is not feasible but perhaps in a tabular form it would make more sense.

Technically we'd lose "bulk" changing of all workers of a particular class, but it's arguable whether that is useful. I'd think users would only change values where they are having problems, not necessarily across the board. Even so, it's not hard to go in and change, at most, 20 workers to the same value. @chessbyte Thoughts?

The flatten workers code would still be necessary for upgrade, but it would be simpler as well, because it would remove the inheritance entirely.

Member

cc @agrare Would like your perspective too, because this overlaps with providers a lot

Member

No issues on my side re: removing the inheritance. The only thing I can see being more complex (and this is just on the developer side, so we can manage it) is that if we want to add a setting that is used on, e.g., the base MiqWorker class, we have to blast it out to every worker config section instead of just adding it in one place.

Member

@Fryguy Fryguy Apr 30, 2021

I'd rather take on that developer inconvenience than push it onto users. In fact, that makes me lean even more to removing the inheritance.

Member

@agrare agrare Apr 30, 2021

Yeah, and it's something we can write a shared provider spec test for, e.g. make sure all provider queue workers have :dequeue_method, :poll_method, and :queue_timeout settings, because core will fail without them.

@kavyanekkalapu
Member

@Fryguy @jrafanie I couldn't understand the code fully, but I checked this form in the UI and the changes look good.
One question: we are just one day away from release, so do we really need to merge this change to lasker?

@jrafanie
Member Author

jrafanie commented Apr 28, 2021

> @Fryguy @jrafanie I couldn't understand the code fully, but I checked this form in the UI and the changes look good.
> One question: we are just one day away from release, so do we really need to merge this change to lasker?

I'm fine with doing a different change. We could keep the selectCount that was removed and drop any changes that aren't immediately related, but the alternative to not fixing this is that users will save a 200 MB memory threshold, which means none of our workers will be able to run. Additionally, a 200 MB memory threshold is less than the default 500 MB memory request in podified, meaning the orchestrator will reject that worker's configuration.

I found this yesterday by accident while doing UI testing for ManageIQ/manageiq#20889. I saw the 200 MB values in the various memory thresholds and realized that was the reason we had the exact same problem with a user the other day, where they changed something else on the workers tab and it set a 200 MB memory threshold on a bunch of workers.

@jrafanie
Member Author

jrafanie commented Apr 28, 2021

> the reason we had the exact same problem with a user the other day, where they changed something else on the workers tab and it set a 200 MB memory threshold on a bunch of workers.

Actually, I just tested it: even if it shows 200 MB like in my screenshots and you change something unrelated, such as setting the generic worker count to 1, and then save, only the worker count of 1 shows as changed in the advanced tab.

So the real problem is that it doesn't show the current value correctly: if you decide to increase the generic worker's value from 200 MB to 400 MB, you're actually changing it from the default of 1 GB down to 400 MB.

@Fryguy
Member

Fryguy commented Apr 29, 2021

> I'm fine with doing a different change.

No, it's fine as is.

> One question: we are just one day away from release, so do we really need to merge this change to lasker?

While we are cutting downstream soon, upstream release will still happen. Additionally, we will ideally have more downstream fixpacks, so we want this to land in the future. This bug is pretty bad, so I'd like to get it back as soon as possible.

@jrafanie jrafanie changed the title Handle queue workers inheriting values from worker_base. Handle queue workers inheriting values from worker_base: fix workers tab showing incorrect count and memory threshold values for queue workers May 6, 2021
@jrafanie jrafanie changed the title Handle queue workers inheriting values from worker_base: fix workers tab showing incorrect count and memory threshold values for queue workers Handle queue workers inheriting values from worker_base: fix incorrect count and memory threshold values on workers tab May 6, 2021
@Fryguy
Member

Fryguy commented May 12, 2021

@kavyanekkalapu If you are good with this change, can you merge? I was involved in writing it, so I don't feel comfortable doing so.

@kavyanekkalapu kavyanekkalapu merged commit b242799 into ManageIQ:master May 12, 2021
@Fryguy
Member

Fryguy commented May 18, 2021

Backported to lasker in commit b35c1ed.

commit b35c1edffd7de163152474d0ae6c8b48cb1cbcb5
Author: Kavya Nekkalapu <[email protected]>
Date:   Wed May 12 09:29:34 2021 -0400

    Merge pull request #7725 from jrafanie/fix_inheritance_of_worker_settings
    
    Handle queue workers inheriting values from worker_base: fix incorrect count and memory threshold values on workers tab
    
    (cherry picked from commit b2427990dbb0a1242440533f2af5453d1da90247)

@Fryguy Fryguy removed the lasker/yes label May 18, 2021
Fryguy pushed a commit that referenced this pull request May 18, 2021
Merge pull request #7725 from jrafanie/fix_inheritance_of_worker_settings

Handle queue workers inheriting values from worker_base: fix incorrect count and memory threshold values on workers tab

(cherry picked from commit b242799)
@jrafanie jrafanie deleted the fix_inheritance_of_worker_settings branch June 9, 2021 18:38