Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for timed out active tasks #15231

Merged
merged 6 commits into from
May 31, 2017

Conversation

yrudman
Copy link
Contributor

@yrudman yrudman commented May 25, 2017

Issue:
If reporting worker killed (kill -9 <pid>) when report is still running than task status stays Running forever.

original BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1397600

This PR:

  • adds active_task_timeout and task_timeout_check_frequency to settings.yml
  • adds check for timed out tasks to MiqScheduleWorker

@miq-bot add-label bug, core

\cc @jrafanie @gtanzillo

@yrudman yrudman force-pushed the introduce-report-execution-timeout branch 4 times, most recently from e6e182e to 3970a3b Compare May 25, 2017 22:48
def schedule_check_for_task_timeout
every = worker_settings[:task_timeout_check_frequency]
scheduler = scheduler_for(:scheduler)
scheduler.schedule_every(every, :first_at => Time.current + 1.minute) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yrudman have you tested that rufus supports ActiveSupport::TimeWithZone?

irb(main):003:0> Time.current.class
=> ActiveSupport::TimeWithZone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've done manual testing: leaving active records in DB and restarting server - works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>> require 'rufus-scheduler'
>> s = Rufus::Scheduler.new
>> n = Time.current
=> Fri, 26 May 2017 15:50:17 UTC +00:00
>> p [ :scheduled_at, n, n.to_f ]
[:scheduled_at, Fri, 26 May 2017 15:50:17 UTC +00:00, 1495813817.563702]
=> [:scheduled_at, Fri, 26 May 2017 15:50:17 UTC +00:00, 1495813817.563702]

scope :timed_out, -> { where("updated_on < ?", Time.now.utc - ::Settings.task.active_task_timeout.to_i_with_method) }

def self.update_status_for_timed_out_active_tasks
MiqTask.active.timed_out.no_associated_job.find_each do |task|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about tasks that have a job but are "timed out"? What cleans them up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Job should have its own implementation for checking stalled job due to worker killing. There is 'timing out' implementation for job but i think it would not accommodate killing worker.

@@ -33,6 +33,17 @@ class MiqTask < ApplicationRecord
t.grouping(Arel::Nodes::Case.new(t[:state]).when(STATE_FINISHED).then(t[:status]).else(t[:state]))
end)

scope :active, -> { where(:state => STATE_ACTIVE) }
scope :no_associated_job, -> { where.not("id IN (SELECT miq_task_id from jobs)") }
scope :timed_out, -> { where("updated_on < ?", Time.now.utc - ::Settings.task.active_task_timeout.to_i_with_method) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, can you align the ->

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


def self.update_status_for_timed_out_active_tasks
MiqTask.active.timed_out.no_associated_job.find_each do |task|
task.update_status(STATE_FINISHED, STATUS_ERROR,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason to not use STATUS_TIMEOUT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout not present as possible state and task with status Timeout will not be shown on Task management screens:

screen shot 2017-05-26 at 1 40 56 pm

I think we should add `Timeout` to a list of possible state filter.

@gtanzillo what do you think, should we first allow timed out tasks to be visible on task management screens and change this PR to set status to STATUS_TIMEOUT ?
or proceed with with this PR setting timed out task Error status ?
or set status to Warn ?

@@ -1299,6 +1301,7 @@
:session_timeout_interval: 30.seconds
:storage_file_collection_interval: 1.days
:storage_file_collection_time_utc: 21600
:task_timeout_check_frequency: 1.hour
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what's are the correct numbers here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me too ❓

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.hour here seems reasonable. We don't want to be checking too often but, we also don't want to have tasks sitting in an active state for an extended period of time when they are already dead.

@@ -49,4 +49,15 @@
)
end
end

describe "#check_for_timed_out_active_tasks" do
it "queue request to updtae status for timed out tasks" do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: update.... Maybe enqueues update_status_for_timed_out_active_tasks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing it !

end

context "task is not active" do
it "does not updates status to 'Error' if task state is 'Finished'" do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here - "updates" should be "update"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

job.miq_task
end

it "does not updates status to 'Error'" do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same typo here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -1129,6 +1129,8 @@
:max_parallel_scans_per_host: 1
:max_qitems_per_scan_request: 0
:watchdog_interval: 1.minute
:task:
:active_task_timeout: 5.hours
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6.hours somehow feels better here. I can't explain why, though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -1299,6 +1301,7 @@
:session_timeout_interval: 30.seconds
:storage_file_collection_interval: 1.days
:storage_file_collection_time_utc: 21600
:task_timeout_check_frequency: 1.hour
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.hour here seems reasonable. We don't want to be checking too often but, we also don't want to have tasks sitting in an active state for an extended period of time when they are already dead.

@yrudman
Copy link
Contributor Author

yrudman commented May 30, 2017

@miq-bot assign @gtanzillo


def self.update_status_for_timed_out_active_tasks
MiqTask.active.timed_out.no_associated_job.find_each do |task|
task.update_status(STATE_FINISHED, STATUS_ERROR,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused by the STATUS_ERROR. We should fix this in a followup so the UI displays timed out tasks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i will look at it

@yrudman yrudman force-pushed the introduce-report-execution-timeout branch from ae224d7 to 8051e5e Compare May 30, 2017 22:52
@miq-bot
Copy link
Member

miq-bot commented May 30, 2017

Checked commits yrudman/manageiq@032d422~...8051e5e with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
6 files checked, 0 offenses detected
Everything looks fine. 🍰

@jrafanie jrafanie merged commit 7fce59f into ManageIQ:master May 31, 2017
@jrafanie jrafanie added this to the Sprint 62 Ending Jun 5, 2017 milestone May 31, 2017
@jrafanie jrafanie deleted the introduce-report-execution-timeout branch May 31, 2017 14:59
@jrafanie
Copy link
Member

@gtanzillo what versions should this go back to?

@yrudman
Copy link
Contributor Author

yrudman commented May 31, 2017

@miq-bot add-label fine/yes, euwe/yes

@simaishi
Copy link
Contributor

simaishi commented Jun 1, 2017

@yrudman Please remove euwe/conflict label when you have an Euwe PR. Thanks!

@simaishi
Copy link
Contributor

simaishi commented Jun 5, 2017

Backported to Euwe via #15277

simaishi pushed a commit that referenced this pull request Jun 9, 2017
@simaishi
Copy link
Contributor

simaishi commented Jun 9, 2017

Fine backport details:

$ git log -1
commit 83c3180ddc1b532d3bcf10f86337fe433e721033
Author: Joe Rafaniello <[email protected]>
Date:   Wed May 31 10:59:27 2017 -0400

    Merge pull request #15231 from yrudman/introduce-report-execution-timeout
    
    Check for timed out active tasks
    (cherry picked from commit 7fce59fb0dd5ac4ecc01d5163a7bf55ffbe495b9)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1460349

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants