Give active queue worker time to complete message #15529

jrafanie · 2017-07-07T21:42:44Z

Let queue workers process an active message
- In e5f4bd3, we added a 10-minute timeout that would give workers a little
  time to complete their work after they exceed their memory threshold before
  we'd kill them. This fixed timeout causes workers to be killed prematurely,
  before they complete the work item.
- What we really want is for the work item to complete but kill the worker
  if the worker has exceeded memory/time thresholds and the work item hasn't
  completed in a reasonable time. This reasonable time is the msg_timeout
  associated with the queue message.
The stop is pending; it's not actively stopping (clarify rails evm:status output)
- The worker is probably working on a queue message that takes a long time,
  so we let it try to complete this work item and have a follow-up work
  item where we ask the worker to exit cleanly on its own. "Stop pending"
  better describes this graceful worker exit workflow.
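
As a rough illustration of the behavior described above, here is a minimal Ruby sketch of the decision: a worker over its threshold is first asked to stop ("stop pending") and is only killed once the active message's msg_timeout has elapsed. Every name here (`SketchWorker`, `action_for`, `FALLBACK_GRACE`) is hypothetical and not taken from the ManageIQ code, and the 600-second fallback for a worker with no active message is an assumption made for the example.

```ruby
# Illustrative sketch only; all names are hypothetical, not ManageIQ's API.
# over_threshold:    worker exceeded its memory/time threshold
# stop_requested_at: time (in seconds) the stop was requested, nil if not yet
# msg_timeout:       timeout (in seconds) of the active queue message, nil if idle
SketchWorker = Struct.new(:over_threshold, :stop_requested_at, :msg_timeout)

# Assumed fallback grace period (seconds) when no queue message is active.
FALLBACK_GRACE = 600

def action_for(worker, now)
  return :keep_running unless worker.over_threshold
  # First pass: ask the worker to exit on its own (the "stop pending" state).
  return :request_stop if worker.stop_requested_at.nil?

  # Give the active message up to its own msg_timeout before killing.
  grace = worker.msg_timeout || FALLBACK_GRACE
  (now - worker.stop_requested_at) > grace ? :kill : :wait
end

busy = SketchWorker.new(true, 0, 3600)
puts action_for(busy, 900)   # prints "wait": past 10 minutes, within msg_timeout
puts action_for(busy, 3601)  # prints "kill": msg_timeout has elapsed
```

With a fixed 10-minute limit, the first call would already have killed the worker; tying the limit to the message's own timeout lets a long-running work item finish.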

** Using session_store: ActionDispatch::Session::MemCacheStore
Checking EVM status...
 Zone    | Server | Status  |            ID |   PID |  SPID | URL                     | Started On           | Last Heartbeat       | Master? | Active Roles
---------+--------+---------+---------------+-------+-------+-------------------------+----------------------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------
 default | EVM    | started | 1000000000001 | 38192 | 38206 | druby://127.0.0.1:50844 | 2017-07-07T21:29:20Z | 2017-07-07T21:32:34Z | true    | automate:database_operations:database_owner:ems_inventory:ems_operations:event:reporting:scheduler:smartstate:user_interface:web_services:websocket

 Worker Type      | Status       |            ID |   PID | SPID  |     Server id | Queue Name / URL    | Started On           | Last Heartbeat       | MB Usage
------------------+--------------+---------------+-------+-------+---------------+---------------------+----------------------+----------------------+----------
 MiqGenericWorker | stop pending | 1000000000207 | 38374 | 38380 | 1000000000001 | generic             | 2017-07-07T21:32:19Z | 2017-07-07T21:32:33Z |      245
 MiqUiWorker      | started      | 1000000000206 | 38234 |       | 1000000000001 | http://0.0.0.0:3000 | 2017-07-07T21:29:21Z | 2017-07-07T21:32:34Z |      533

https://bugzilla.redhat.com/show_bug.cgi?id=1481800

jrafanie · 2017-07-07T21:45:12Z

@gtanzillo we'll need a BZ for this... What do you think? I hate the evm:status change, but if "stopping" is confusing to users, I'm not sure how else to clarify this without actually changing "stopping" to some other value... it's been "stopping" for so many years.

miq-bot · 2017-07-20T18:36:44Z

This pull request is not mergeable. Please rebase and repush.

miq-bot · 2017-08-15T18:42:48Z

Checked commits jrafanie/manageiq@31c07a1~...8388fdf with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
3 files checked, 0 offenses detected
Everything looks fine. ⭐

jrafanie · 2017-08-16T18:01:16Z

@gtanzillo this is ready to go and for backport to fine and darga... the original was backported to both, see: https://bugzilla.redhat.com/show_bug.cgi?id=1395736#c34

gtanzillo

Looks good! 👍

simaishi · 2017-08-17T20:49:22Z

Fine backport details:

$ git log -1
commit edce922e5e46a853f34e9e7aed5538df6adaa19d
Author: Gregg Tanzillo <[email protected]>
Date:   Thu Aug 17 09:43:02 2017 -0400

    Merge pull request #15529 from jrafanie/give_active_queue_worker_time_to_complete_message
    
    Give active queue worker time to complete message
    (cherry picked from commit 09d2aaec085ab1d58512a36ca1c68f5cc1e3da7c)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1482670

simaishi · 2017-08-17T21:04:10Z

Euwe backport details:

$ git log -1
commit e893a4e22cfbdcd18bcd2c397954cbc027310b21
Author: Gregg Tanzillo <[email protected]>
Date:   Thu Aug 17 09:43:02 2017 -0400

    Merge pull request #15529 from jrafanie/give_active_queue_worker_time_to_complete_message
    
    Give active queue worker time to complete message
    (cherry picked from commit 09d2aaec085ab1d58512a36ca1c68f5cc1e3da7c)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1482672

jrafanie requested a review from gtanzillo July 7, 2017 21:42

miq-bot added the wip label Jul 7, 2017

miq-bot added the unmergeable label Jul 20, 2017

chessbyte assigned gtanzillo Jul 24, 2017

jrafanie force-pushed the give_active_queue_worker_time_to_complete_message branch from 2b057ba to 71dc00e Compare August 11, 2017 18:28

miq-bot removed the unmergeable label Aug 11, 2017

jrafanie added 2 commits August 15, 2017 14:31

jrafanie force-pushed the give_active_queue_worker_time_to_complete_message branch from 71dc00e to 8388fdf Compare August 15, 2017 18:32

jrafanie closed this Aug 15, 2017

jrafanie reopened this Aug 15, 2017

jrafanie changed the title ~~[WIP] Give active queue worker time to complete message~~ Give active queue worker time to complete message Aug 15, 2017

miq-bot removed the wip label Aug 15, 2017

jrafanie closed this Aug 15, 2017

jrafanie reopened this Aug 15, 2017

jrafanie added darga/yes bug core/workers labels Aug 16, 2017

gtanzillo approved these changes Aug 17, 2017

gtanzillo added this to the Sprint 67 Ending Aug 21, 2017 milestone Aug 17, 2017

gtanzillo merged commit 09d2aae into ManageIQ:master Aug 17, 2017

simaishi added the fine/yes label Aug 17, 2017

simaishi added the fine/backported label Aug 17, 2017

simaishi removed the fine/yes label Aug 17, 2017

simaishi added euwe/backported and removed euwe/yes labels Aug 17, 2017

jrafanie deleted the give_active_queue_worker_time_to_complete_message branch September 20, 2017 20:41
