Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call ResetFailedUnit when cleaning up failed services #20810

Merged
merged 1 commit into from
Nov 19, 2020

Conversation

agrare
Copy link
Member

@agrare agrare commented Nov 12, 2020

If a systemd service has failed it stays around until you call systemd reset-failed unit-name otherwise it stays around as:

# systemctl status
● vmware_infra_manager_event_catcher@8f2f33a3-d1b1-4603-a8aa-44a7156347d9.service   loaded failed failed    vmware_infra_manager_event_catcher@8f2f33a3-d1b1-4603-a8aa-44a7156347d9.service

# systemctl status vmware_infra_manager_event_catcher@8f2f33a3-d1b1-4603-a8aa-44a7156347d9.service
Warning: The unit file, source configuration file or drop-ins of vmware_infra_manager_event_catcher@8f2f33a3-d1b1-4603-a8aa-44a7156347d9.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● vmware_infra_manager_event_catcher@8f2f33a3-d1b1-4603-a8aa-44a7156347d9.service
   Loaded: loaded (/etc/systemd/system/[email protected]; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/vmware_infra_manager_event_catcher@8f2f33a3-d1b1-4603-a8aa-44a7156347d9.service.d
           └─override.conf
   Active: inactive (dead) since Thu 2020-11-12 11:26:12 EST; 1h 25min ago
  Process: 178314 ExecStart=/bin/bash -lc exec ruby lib/workers/bin/run_single_worker.rb ManageIQ::Providers::Vmware::InfraManager::EventCatcher --heartbeat --guid=8f2f33a3-d1b1-4603-a8aa-44a7156347d9 (code=exi>
 Main PID: 178314 (code=exited, status=1/FAILURE)

@@ -29,6 +29,7 @@ def systemd_manager
def systemd_stop_services(service_names)
service_names.each do |service_name|
systemd_manager.StopUnit(service_name, "replace")
systemd_manager.ResetFailedUnit(service_name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP because I want to confirm that this order is correct, (stop, reset, disable)

Copy link
Member Author

@agrare agrare Nov 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay taking out of WIP, tested this live on an appliance by kill -9'ing a worker and the failed systemd service is properly cleaned up

@miq-bot miq-bot added the wip label Nov 12, 2020
@jrafanie
Copy link
Member

Does it stay around in the miq_workers table as a running worker while it's failed in systemd?

@agrare
Copy link
Member Author

agrare commented Nov 12, 2020

No it is not in the workers table (checked that on the appliance that I pulled that systemd status from)

@jrafanie
Copy link
Member

I know this is still wip, but the travis failure looks relevant for this change:

  1) MiqServer::WorkerManagement::Monitor::Systemd#cleanup_failed_systemd_services with failed services calls DisableUnitFiles with the service name
     Failure/Error: systemd_manager.ResetFailedUnit(service_name)
       #<Double "DBus::Systemd::Manager"> received unexpected message :ResetFailedUnit with ("[email protected]")
     # ./app/models/miq_server/worker_management/monitor/systemd.rb:32:in `block in systemd_stop_services'
     # ./app/models/miq_server/worker_management/monitor/systemd.rb:30:in `each'
     # ./app/models/miq_server/worker_management/monitor/systemd.rb:30:in `systemd_stop_services'
     # ./app/models/miq_server/worker_management/monitor/systemd.rb:9:in `cleanup_failed_systemd_services'
     # ./spec/models/miq_server/worker_management/monitor/systemd_spec.rb:28:in `block (4 levels) in <top (required)>'

If a systemd service has failed it stays around until you call
`systemd reset-failed unit-name`
@agrare agrare force-pushed the reset_failed_systemd_unit_files branch from 97f0f8b to 1678e35 Compare November 18, 2020 14:24
@agrare agrare requested a review from gtanzillo as a code owner November 18, 2020 14:24
@miq-bot
Copy link
Member

miq-bot commented Nov 18, 2020

Checked commit agrare@1678e35 with ruby 2.6.3, rubocop 0.82.0, haml-lint 0.35.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. 🍪

@agrare agrare changed the title [WIP] Call ResetFailedUnit when cleaning up failed services Call ResetFailedUnit when cleaning up failed services Nov 18, 2020
@miq-bot miq-bot removed the wip label Nov 18, 2020
@agrare
Copy link
Member Author

agrare commented Nov 18, 2020

@jrafanie this is ready to go, please take a look

@jrafanie jrafanie merged commit 0a56551 into ManageIQ:master Nov 19, 2020
@agrare agrare deleted the reset_failed_systemd_unit_files branch November 19, 2020 15:50
simaishi pushed a commit that referenced this pull request Nov 19, 2020
Call ResetFailedUnit when cleaning up failed services

(cherry picked from commit 0a56551)
@simaishi
Copy link
Contributor

Kasparov backport details:

$ git log -1
commit 2be0d2302b7873fbb0e6f67a6d5a708a3d73f073
Author: Joe Rafaniello <[email protected]>
Date:   Thu Nov 19 10:23:12 2020 -0500

    Merge pull request #20810 from agrare/reset_failed_systemd_unit_files

    Call ResetFailedUnit when cleaning up failed services

    (cherry picked from commit 0a5655118d8b2817400406b44d697292c47b0893)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants