Ensure ems workers are killed by their server/orchestrator pod #20290
@carbonin I'll need help testing this in pods. 🤣
3fb43b2 to 9dde30e
Fixes ManageIQ#20288

Previously, direct calls to ems#destroy would assume you're calling it local to each of the ems's workers and would fail to find the pid if not local. Additionally, in pods, only the orchestrator pod of the worker has permissions to kill the pod, so this would fail with permission errors such as:

deployments.apps "1-xyz-event-catcher-1" is forbidden: User "abc" cannot patch resource "deployments" in API group "apps" in the namespace "123" for PATCH https:...]

The ems.destroy_queue method calls _queue_task from the AsyncDeleteMixin, which doesn't specify the server_guid or queue_name, so a UI request to delete the ems COULD be initiated in a UI appliance and picked up by the same appliance, which isn't where the ems's worker processes are running, and would ultimately call kill on workers that don't exist locally.

Now, we queue the worker's kill method for the queue_name 'miq_server' so it's handled by the server "process" in appliances or the orchestrator in pods, and for the server_guid of the worker's server, as an ems's workers can be on different servers.
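To make the routing described above concrete, here is a minimal in-memory sketch in plain Ruby. It is not the ManageIQ implementation: `SketchQueue`, `SketchWorker`, `QueueMessage`, `deliverable_to`, and the guid strings are hypothetical stand-ins. Only the idea comes from the description: each worker's kill is queued with queue_name 'miq_server' and the server_guid of that worker's own server, so only that server (or orchestrator) picks it up.

```ruby
# Hypothetical stand-ins for illustration; none of these are ManageIQ classes.
QueueMessage = Struct.new(:instance, :method_name, :queue_name, :server_guid, keyword_init: true)

class SketchQueue
  attr_reader :messages

  def initialize
    @messages = []
  end

  def put(**attrs)
    @messages << QueueMessage.new(**attrs)
  end

  # Messages a given server process would pick up: matching queue name and server guid.
  def deliverable_to(server_guid, queue_name: "miq_server")
    @messages.select { |m| m.queue_name == queue_name && m.server_guid == server_guid }
  end
end

class SketchWorker
  attr_reader :name, :server_guid, :killed

  def initialize(name, server_guid)
    @name = name
    @server_guid = server_guid
    @killed = false
  end

  # Stands in for the local kill that only the owning server can perform.
  def kill
    @killed = true
  end

  # Queue the kill for the worker's own server instead of running it wherever we are.
  def kill_async(queue)
    queue.put(instance: self, method_name: :kill, queue_name: "miq_server", server_guid: server_guid)
  end
end

queue = SketchQueue.new
workers = [SketchWorker.new("event-catcher", "guid-a"), SketchWorker.new("refresh", "guid-b")]
workers.each { |w| w.kill_async(queue) }

# Only the server with guid "guid-a" delivers the message addressed to it.
queue.deliverable_to("guid-a").each { |m| m.instance.public_send(m.method_name) }
```

In this sketch the second worker stays alive until the server with guid "guid-b" processes its own message, which mirrors the point that an ems's workers can be on different servers.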
9dde30e to aadc622
Checked commit jrafanie@aadc622 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
end

def wait_for_ems_workers_removal
  return if Rails.env.test?
I'm not sure how else to test this if this method will loop and wait for the worker rows to be removed.
I wonder if we could stub the #kill_async and have it execute it directly instead of putting test specifics in the main code...might play with this later
Yeah, that's a possibility. The problems I had in the 2 ext_management_system_spec.rb examples changed in this PR:
- I'd need to stub #kill_async to delete the rows
- I'd need to either stub using any_instance, since destroy queues for the Ems and deliver would get a different Ems unless you any_instance, or you'd need to stub deliver to get your specific Ems instance with the stubbed method.
  return if Rails.env.test?

  quiesce_loop_timeout = ::Settings.server.worker_monitor.quiesce_loop_timeout || 5.minutes
  worker_monitor_poll = (::Settings.server.worker_monitor.poll || 1.second).to_i_with_method
I grabbed these values from the worker quiesce code.
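For readers following along, a loop that waits for the worker rows to go away using a timeout and a poll interval could look roughly like the sketch below. This is an assumption about the general shape, not the code in this PR: `wait_for_removal`, the `remaining` callable, and the numeric values are hypothetical, and plain Ruby numbers stand in for the Settings values shown in the diff.

```ruby
# Hypothetical sketch: poll until no worker rows remain or the timeout expires.
# `remaining` is any callable returning the number of worker rows still present.
def wait_for_removal(timeout:, poll:, remaining:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    return true if remaining.call.zero?
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep(poll)
  end
end

# Simulate rows being removed one at a time, as if each worker's server processed its kill.
rows = 3
removed = wait_for_removal(
  timeout:   5,
  poll:      0.01,
  remaining: -> { rows -= 1 if rows > 0; rows }
)
```

Returning a boolean rather than raising lets the caller decide whether to proceed with the destroy when some workers never exit within the timeout.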
@agrare @carbonin I think this is ready to go. I tested in an upstream appliance and in pods by monkey patching the queueing code in console, ensuring calling
[1] event catchers for amazon are not killed immediately (seems like aws might be rescuing Exception or trapping signals)
[2] refresh worker for amazon with multiple ems queue names don't get killed immediately because we're only looking for a singular ems queue name. These workers exit after the managers and ems is destroyed, so it's possible that a refresh puts the ems/managers back.
@agrare Look good to you?
👍 LGTM
…ver_before_ems_destroy Ensure ems workers are killed by their server/orchestrator pod (cherry picked from commit 9600648)
Jansa backport details:
Fixes #20288
Previously, direct calls to ems#destroy would assume you're calling it local
to each of the ems's workers and would fail to find the pid if not local. Additionally,
in pods, only the orchestrator pod of the worker has permissions to kill the pod
so this would fail with permission errors such as:
deployments.apps "1-xyz-event-catcher-1" is forbidden: User "abc" cannot patch resource "deployments" in API group "apps" in the namespace "123" for PATCH https:...]
The ems.destroy_queue method calls _queue_task from the AsyncDeleteMixin, which doesn't specify the server_guid or queue_name, so a UI request to delete the ems COULD be initiated in a UI appliance and picked up by the same appliance, which isn't where the ems's worker processes are running, and would ultimately call kill on workers that don't exist locally.
Now, we queue the worker's kill method for the queue_name 'miq_server' so it's handled by the server "process" in appliances or orchestrator in pods and server_guid of the worker's server as an ems's workers can be on different servers.