Give active queue worker time to complete message #15529
@gtanzillo We'll need a BZ for this. What do you think? I hate the evm:status change, but if "stopping" is confusing to users, I'm not sure how else to clarify this without actually changing "stopping" to some other value. It's been "stopping" for so many years.
This pull request is not mergeable. Please rebase and repush.
Branch updated from 2b057ba to 71dc00e.
https://bugzilla.redhat.com/show_bug.cgi?id=1481800

In e5f4bd3, we added a 10 minute timeout to give workers a little time to complete their work after exceeding their memory threshold, before we killed them. This causes workers to be killed prematurely, before they complete the work item. What we really want is to let the work item complete, but kill the worker if it has exceeded its memory/time thresholds and the work item hasn't completed in a reasonable time. This reasonable time is the msg_timeout associated with the queue message.
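The switch from a fixed grace period to the message's own msg_timeout can be sketched as follows. This is a minimal Ruby illustration, not ManageIQ's implementation. The method names `kill_worker_old?` and `kill_worker_new?` and the `OLD_GRACE_PERIOD` constant are hypothetical. The sketch also assumes that msg_timeout is measured from when the worker started the message.

```ruby
# Hypothetical sketch, not ManageIQ's actual worker-monitoring code.
# Times are Ruby Time objects; durations are in seconds.

OLD_GRACE_PERIOD = 10 * 60 # the fixed 10 minute timeout introduced in e5f4bd3

# Old behavior: kill once 10 minutes have passed since the worker exceeded
# its threshold, even if its current message is still running normally.
def kill_worker_old?(exceeded_at:, now:)
  return false if exceeded_at.nil? # thresholds not exceeded
  now - exceeded_at >= OLD_GRACE_PERIOD
end

# New behavior: a worker over its thresholds is killed only once its current
# message has run for at least that message's msg_timeout (assumed here to
# be measured from when the worker started the message).
def kill_worker_new?(exceeded_at:, message_started_at:, msg_timeout:, now:)
  return false if exceeded_at.nil? # thresholds not exceeded
  now - message_started_at >= msg_timeout
end

# Example: a message with a 30 minute msg_timeout. The worker exceeds its
# memory threshold 1 minute in, and we check 15 minutes in.
start = Time.at(0)
kill_worker_old?(exceeded_at: start + 60, now: start + 15 * 60)
# => true  (14 minutes past the threshold, more than the 10 minute grace period)
kill_worker_new?(exceeded_at: start + 60, message_started_at: start,
                 msg_timeout: 30 * 60, now: start + 15 * 60)
# => false (15 minutes into a message that is allowed 30)
```

Tying the deadline to the message rather than to the moment the threshold was crossed means a long but legitimate work item is no longer cut off just because the worker happened to grow past its memory limit early on.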
https://bugzilla.redhat.com/show_bug.cgi?id=1481800

The worker is probably working on a queue message that takes a long time, so we let it try to complete this work item and have a follow-up work item in which we ask the worker to exit cleanly on its own. "Stop pending" better describes this graceful worker exit workflow.

```
** Using session_store: ActionDispatch::Session::MemCacheStore
Checking EVM status...
 Zone    | Server | Status  | ID            | PID   | SPID  | URL                     | Started On           | Last Heartbeat       | Master? | Active Roles
---------+--------+---------+---------------+-------+-------+-------------------------+----------------------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------
 default | EVM    | started | 1000000000001 | 38192 | 38206 | druby://127.0.0.1:50844 | 2017-07-07T21:29:20Z | 2017-07-07T21:32:34Z | true    | automate:database_operations:database_owner:ems_inventory:ems_operations:event:reporting:scheduler:smartstate:user_interface:web_services:websocket

 Worker Type      | Status       | ID            | PID   | SPID  | Server id     | Queue Name / URL    | Started On           | Last Heartbeat       | MB Usage
------------------+--------------+---------------+-------+-------+---------------+---------------------+----------------------+----------------------+----------
 MiqGenericWorker | stop pending | 1000000000207 | 38374 | 38380 | 1000000000001 | generic             | 2017-07-07T21:32:19Z | 2017-07-07T21:32:33Z | 245
 MiqUiWorker      | started      | 1000000000206 | 38234 |       | 1000000000001 | http://0.0.0.0:3000 | 2017-07-07T21:29:21Z | 2017-07-07T21:32:34Z | 533
```
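The graceful exit workflow can be sketched with a simplified, single-threaded worker that holds an in-memory list of work items. This is a hypothetical model, not ManageIQ's `MiqWorker`. `SketchWorker`, its methods, and the final `"stopped"` status are assumptions. Only the `"started"` and `"stop pending"` status strings come from the evm:status output above.

```ruby
# Hypothetical model of the graceful exit workflow; not ManageIQ code.
class SketchWorker
  attr_reader :status, :completed

  def initialize
    @status = "started"
    @inbox = []     # hypothetical per-worker list of pending work items
    @completed = [] # work items this worker has finished
  end

  def enqueue(item)
    @inbox << item
  end

  # Instead of killing an over-threshold worker, mark it "stop pending"
  # and append a follow-up item asking it to exit on its own.
  def request_graceful_exit
    @status = "stop pending"
    enqueue(:exit)
  end

  # Process items in order. The :exit item runs only after earlier work.
  def run
    while (item = @inbox.shift)
      if item == :exit
        @status = "stopped" # assumed final status after a clean exit
        break
      end
      @completed << item
    end
  end
end

worker = SketchWorker.new
worker.enqueue("long running queue message")
worker.request_graceful_exit
worker.status    # => "stop pending"
worker.run
worker.completed # => ["long running queue message"]
worker.status    # => "stopped"
```

The key design point is ordering. The exit request is queued behind the in-flight work, so the worker finishes its message before exiting instead of being killed partway through the item.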
Branch updated from 71dc00e to 8388fdf.
Checked commits jrafanie/manageiq@31c07a1~...8388fdf with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
@gtanzillo This is ready to go, and it should be backported to fine and darga; the original was backported to both. See: https://bugzilla.redhat.com/show_bug.cgi?id=1395736#c34
Looks good! 👍
…_to_complete_message Give active queue worker time to complete message (cherry picked from commit 09d2aae) https://bugzilla.redhat.com/show_bug.cgi?id=1482670
Fine backport details:
…_to_complete_message Give active queue worker time to complete message (cherry picked from commit 09d2aae) https://bugzilla.redhat.com/show_bug.cgi?id=1482672
Euwe backport details:
…ker_time_to_complete_message Give active queue worker time to complete message (cherry picked from commit 09d2aae) https://bugzilla.redhat.com/show_bug.cgi?id=1482670