Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ems workers are killed by their server/orchestrator pod #20290

Conversation

jrafanie
Copy link
Member

@jrafanie jrafanie commented Jun 18, 2020

Fixes #20288

Previously, direct calls to ems#destroy would assume you're calling it local
to each of the ems's workers and would fail to find the pid if not local. Additionally,
in pods, only the orchestrator pod of the worker has permissions to kill the pod
so this would fail with permission errors such as:

deployments.apps "1-xyz-event-catcher-1" is forbidden: User "abc" cannot patch resource "deployments" in API group "apps" in the namespace "123" for PATCH https:...]

The ems.destroy_queue method calls _queue_task from the AsyncDeleteMixin, which doesn't specify the server_guid or queue_name so a UI request to delete the ems COULD be initiated in a UI appliance and picked up by the same appliance, which isn't where the ems's worker processes are running, and would ultimately call kill on each workers that don't exist locally.

Now, we queue the worker's kill method for the queue_name 'miq_server' so it's handled by the server "process" in appliances or orchestrator in pods and server_guid of the worker's server as an ems's workers can be on different servers.

@jrafanie
Copy link
Member Author

@carbonin I'll need help testing this in pods. 🤣
It works in appliances.

@jrafanie jrafanie force-pushed the kill_ems_workers_on_their_server_before_ems_destroy branch from 3fb43b2 to 9dde30e Compare June 18, 2020 20:45
Fixes ManageIQ#20288

Previously, direct calls to ems#destroy would assume you're calling it local
to each of the ems's workers and would fail to find the pid if not local.  Additionally,
in pods, only the orchestrator pod of the worker has permissions to kill the pod
so this would fail with permission errors such as:

deployments.apps "1-xyz-event-catcher-1" is forbidden: User "abc" cannot patch resource "deployments"
in API group "apps" in the namespace "123" for PATCH https:...]

The ems.destroy_queue method calls _queue_task from the AsyncDeleteMixin, which doesn't specify the server_guid or queue_name
so a UI request to delete the ems COULD be initiated in a UI appliance and picked up by the same appliance,
which isn't where the ems's worker processes are running, and would ultimately call kill on each workers that don't exist locally.

Now, we queue the worker's kill method for the queue_name 'miq_server' so it's handled by the server "process" in appliances
or orchestrator in pods and server_guid of the worker's server as an ems's workers can be on different servers.
@jrafanie jrafanie force-pushed the kill_ems_workers_on_their_server_before_ems_destroy branch from 9dde30e to aadc622 Compare June 19, 2020 20:38
@miq-bot
Copy link
Member

miq-bot commented Jun 19, 2020

Checked commit jrafanie@aadc622 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
4 files checked, 0 offenses detected
Everything looks fine. 🏆

end

def wait_for_ems_workers_removal
return if Rails.env.test?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how else to test this if this method will loop and wait for the worker rows to be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could stub the #kill_async and have it execute it directly instead of putting test specifics in the main code...might play with this later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a possibility. The problems I had in the 2 ext_management_system_spec.rb examples changed in this PR:

  • I'd need to stub #kill_async to delete the rows
  • I'd need to either stub using any_instance since destroy queues for the Ems and deliver would get a different Ems unless you any_instance or you'd need to stub deliver to get your specific Ems instance with the stubbed method.

return if Rails.env.test?

quiesce_loop_timeout = ::Settings.server.worker_monitor.quiesce_loop_timeout || 5.minutes
worker_monitor_poll = (::Settings.server.worker_monitor.poll || 1.second).to_i_with_method
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grabbed these values from the worker quiesce code.

@jrafanie
Copy link
Member Author

@agrare @carbonin I think this is ready to go. I tested in an upstream appliance and in pods by monkey patching the queueing code in console, ensuring calling Ems#destroy queues a message that was targeted to and picked up successfully by the orchestrator and kills the workers with the problems mentioned above[1][2]

[1] event catchers for amazon are not killed immediately (seems like aws might be rescuing Exception or trapping signals) [2] refresh worker for amazon with multiple ems queue names don't get killed immediately because we're only looking for a singular ems queue name. These workers exit after the managers and ems is destroyed, so it's possible that a refresh puts the ems/managers back.

@carbonin
Copy link
Member

@agrare Look good to you?

Copy link
Member

@agrare agrare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

end

def wait_for_ems_workers_removal
return if Rails.env.test?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could stub the #kill_async and have it execute it directly instead of putting test specifics in the main code...might play with this later

@carbonin carbonin self-assigned this Jun 24, 2020
@carbonin carbonin merged commit 9600648 into ManageIQ:master Jun 24, 2020
@jrafanie jrafanie deleted the kill_ems_workers_on_their_server_before_ems_destroy branch June 24, 2020 14:41
simaishi pushed a commit that referenced this pull request Jun 25, 2020
…ver_before_ems_destroy

Ensure ems workers are killed by their server/orchestrator pod

(cherry picked from commit 9600648)
@simaishi
Copy link
Contributor

Jansa backport details:

$ git log -1
commit 167fc6f6d1d4e0a76c184d023d13f036a19dbfe1
Author: Nick Carboni <[email protected]>
Date:   Wed Jun 24 10:41:38 2020 -0400

    Merge pull request #20290 from jrafanie/kill_ems_workers_on_their_server_before_ems_destroy

    Ensure ems workers are killed by their server/orchestrator pod

    (cherry picked from commit 9600648fdbd0803809742fbf33f25c69b2c99f06)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExtManagementSystem#destroy will fail to kill ems's workers if called from a remote server
6 participants