Give active queue worker time to complete message #15529

@jrafanie jrafanie commented Jul 7, 2017

  • Let queue workers process an active message
    • In e5f4bd3, we added a 10 minute timeout that gave workers a little time to
      complete their work after they exceeded their memory threshold before we killed
      them. This fixed timeout still causes workers to be killed prematurely, before
      they complete the work item.
    • What we really want is for the work item to complete, and to kill the worker
      only if it has exceeded its memory/time thresholds and the work item hasn't
      completed in a reasonable time. This reasonable time is the msg_timeout
      associated with the queue message.
  • The stop is pending; the worker is not actively stopping (clarify rails evm:status output)
    • The worker is probably working on a queue message that takes a long time,
      so we let it try to complete this work item and queue a follow-up work
      item that asks the worker to exit cleanly on its own. "Stop pending"
      better describes this graceful worker exit workflow.
** Using session_store: ActionDispatch::Session::MemCacheStore
Checking EVM status...
 Zone    | Server | Status  |            ID |   PID |  SPID | URL                     | Started On           | Last Heartbeat       | Master? | Active Roles
---------+--------+---------+---------------+-------+-------+-------------------------+----------------------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------
 default | EVM    | started | 1000000000001 | 38192 | 38206 | druby://127.0.0.1:50844 | 2017-07-07T21:29:20Z | 2017-07-07T21:32:34Z | true    | automate:database_operations:database_owner:ems_inventory:ems_operations:event:reporting:scheduler:smartstate:user_interface:web_services:websocket

 Worker Type      | Status       |            ID |   PID | SPID  |     Server id | Queue Name / URL    | Started On           | Last Heartbeat       | MB Usage
------------------+--------------+---------------+-------+-------+---------------+---------------------+----------------------+----------------------+----------
 MiqGenericWorker | stop pending | 1000000000207 | 38374 | 38380 | 1000000000001 | generic             | 2017-07-07T21:32:19Z | 2017-07-07T21:32:33Z |      245
 MiqUiWorker      | started      | 1000000000206 | 38234 |       | 1000000000001 | http://0.0.0.0:3000 | 2017-07-07T21:29:21Z | 2017-07-07T21:32:34Z |      533

https://bugzilla.redhat.com/show_bug.cgi?id=1481800
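
The timeout rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WorkerState`, `FALLBACK_TIMEOUT`, `allowed_stop_time`, `kill_worker?`, and the `:stop_pending` symbol are all hypothetical names, and falling back to the old 10 minute window when no message is active is an assumption.

```ruby
# Illustrative sketch only: none of these names come from ManageIQ.
WorkerState = Struct.new(:status, :stop_requested_at, :active_msg_timeout)

# Assumed fallback (seconds) when the worker has no active queue message.
FALLBACK_TIMEOUT = 10 * 60

# Seconds a worker may keep running after it was asked to stop:
# the msg_timeout of the message it is processing, else the fallback.
def allowed_stop_time(worker)
  worker.active_msg_timeout || FALLBACK_TIMEOUT
end

# True once a worker with a pending stop has outlived its allowed time.
def kill_worker?(worker, now)
  return false unless worker.status == :stop_pending

  (now - worker.stop_requested_at) > allowed_stop_time(worker)
end

asked = Time.at(0)
busy  = WorkerState.new(:stop_pending, asked, 3600) # message with a 1 hour msg_timeout
puts kill_worker?(busy, asked + 900)   # still inside the message's timeout
puts kill_worker?(busy, asked + 3601)  # past the message's timeout
```

Under these assumptions, a worker 15 minutes into a message with a one hour msg_timeout is left alone, where a fixed 10 minute window would already have killed it.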

@jrafanie jrafanie requested a review from gtanzillo July 7, 2017 21:42

jrafanie commented Jul 7, 2017

@gtanzillo we'll need a BZ for this... What do you think? I hate the evm:status change, but if "stopping" is confusing to users, I'm not sure how else to clarify this without actually changing "stopping" to some other value... it's been "stopping" for so many years

@miq-bot miq-bot added the wip label Jul 7, 2017

miq-bot commented Jul 20, 2017

This pull request is not mergeable. Please rebase and repush.

@jrafanie jrafanie force-pushed the give_active_queue_worker_time_to_complete_message branch from 2b057ba to 71dc00e Compare August 11, 2017 18:28
@jrafanie jrafanie force-pushed the give_active_queue_worker_time_to_complete_message branch from 71dc00e to 8388fdf Compare August 15, 2017 18:32
@jrafanie jrafanie closed this Aug 15, 2017
@jrafanie jrafanie reopened this Aug 15, 2017
@jrafanie jrafanie changed the title [WIP] Give active queue worker time to complete message Give active queue worker time to complete message Aug 15, 2017
@miq-bot miq-bot removed the wip label Aug 15, 2017

miq-bot commented Aug 15, 2017

Checked commits jrafanie/manageiq@31c07a1~...8388fdf with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
3 files checked, 0 offenses detected
Everything looks fine. ⭐

@jrafanie

@gtanzillo this is ready to go and for backport to fine and darga... the original was backported to both, see: https://bugzilla.redhat.com/show_bug.cgi?id=1395736#c34

@gtanzillo gtanzillo left a comment

Looks good! 👍

@gtanzillo gtanzillo added this to the Sprint 67 Ending Aug 21, 2017 milestone Aug 17, 2017
@gtanzillo gtanzillo merged commit 09d2aae into ManageIQ:master Aug 17, 2017
simaishi pushed a commit that referenced this pull request Aug 17, 2017
…_to_complete_message

Give active queue worker time to complete message
(cherry picked from commit 09d2aae)

https://bugzilla.redhat.com/show_bug.cgi?id=1482670
@simaishi

Fine backport details:

$ git log -1
commit edce922e5e46a853f34e9e7aed5538df6adaa19d
Author: Gregg Tanzillo <[email protected]>
Date:   Thu Aug 17 09:43:02 2017 -0400

    Merge pull request #15529 from jrafanie/give_active_queue_worker_time_to_complete_message
    
    Give active queue worker time to complete message
    (cherry picked from commit 09d2aaec085ab1d58512a36ca1c68f5cc1e3da7c)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1482670

@simaishi simaishi removed the fine/yes label Aug 17, 2017
simaishi pushed a commit that referenced this pull request Aug 17, 2017
…_to_complete_message

Give active queue worker time to complete message
(cherry picked from commit 09d2aae)

https://bugzilla.redhat.com/show_bug.cgi?id=1482672
@simaishi

Euwe backport details:

$ git log -1
commit e893a4e22cfbdcd18bcd2c397954cbc027310b21
Author: Gregg Tanzillo <[email protected]>
Date:   Thu Aug 17 09:43:02 2017 -0400

    Merge pull request #15529 from jrafanie/give_active_queue_worker_time_to_complete_message
    
    Give active queue worker time to complete message
    (cherry picked from commit 09d2aaec085ab1d58512a36ca1c68f5cc1e3da7c)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1482672

@jrafanie jrafanie deleted the give_active_queue_worker_time_to_complete_message branch September 20, 2017 20:41
d-m-u pushed a commit to d-m-u/manageiq that referenced this pull request Jun 6, 2018
…ker_time_to_complete_message

Give active queue worker time to complete message
(cherry picked from commit 09d2aae)

https://bugzilla.redhat.com/show_bug.cgi?id=1482670