Send "keep alive" (empty) messages to avoid idle timeouts when streaming #1187

Closed
AlanCoding opened this issue Feb 1, 2023 · 7 comments · Fixed by #1191

Comments

@AlanCoding
Member

The problem we want to address is that we have an ansible-runner worker running, talking to an ansible-runner process agent. Because of the infrastructure set up between them (usually K8S), people can have issues with idle timeouts dropping the connection and doing other bad things.

This proposes that ansible-runner itself send do-nothing messages from worker --> process at a periodic interval configured by the user. The default would probably be to not send these messages.

We may also provide a do-nothing callback in the process code.
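
A rough sketch of what the worker side could look like (the names start_keepalive and the message shape are hypothetical illustrations, not a committed API):

    import json
    import threading

    def start_keepalive(stream, interval_seconds, stop_event):
        # Periodically write a do-nothing message so any proxies or load
        # balancers between worker and process see traffic on the connection.
        def _loop():
            while not stop_event.wait(interval_seconds):
                stream.write(json.dumps({'keepalive': True}).encode() + b'\n')
                stream.flush()
        thread = threading.Thread(target=_loop, daemon=True)
        thread.start()
        return thread

The process side would then need to recognize messages of that shape and discard them, or hand them to the do-nothing callback mentioned above.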

github-actions bot added the needs_triage label on Feb 1, 2023
@kurokobo

kurokobo commented Feb 1, 2023

AWX is one use case where this issue can cause unintended job failures.
The issue seems to be particularly apparent in AKS.

@AlanCoding
Member Author

^ wow, your comment linked there is very close to exactly what this would be.

This issue could be summarized as implementing a supported means of doing what you did as a part of the ansible-runner code.

I think the trickiest thing is that we need to do this as a part of the while child.isalive(): main loop. So I'm leaning towards having a general callback method for this purpose. It could also avoid having to run a separate thread or process in other cases, where someone is using ansible-runner and has to do unrelated periodic tasks.
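
To make the shape of that concrete, here is a minimal sketch of a periodic callback driven by the existing main loop; run_loop, period, and periodic_callback are illustrative names, not actual ansible-runner internals:

    import time

    def run_loop(child, periodic_callback=None, period=30.0):
        # child is the pexpect spawn object from the existing main loop
        last_fired = time.monotonic()
        while child.isalive():
            # ... the existing expect()/event handling happens here ...
            now = time.monotonic()
            if periodic_callback is not None and now - last_fired >= period:
                periodic_callback()  # e.g. emit a keep-alive message
                last_fired = now

Because the loop only wakes up between expect() calls, the callback would fire with up to one pexpect timeout of jitter.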

@nitzmahone
Member

I've played with about 5 different solutions to this; my favorite so far is actually hiding this behavior entirely in the Worker class, since IIUC we're not trying to solve this for a naked run (only worker), correct?

The pexpect isalive loop is definitely attractive for not requiring a new source of concurrency to implement this, but it has some issues:

  1. while there's a rough correlation between stdout activity from core and event stdout activity from Worker, that's not always guaranteed. If what we really care about is idle stdout from Worker, we should be directly monitoring that (which isn't reliably possible from that location).
  2. the expect timeout would need to be changed to use the minimum of the configured expect timeout or some fraction of the configured keepalive timeout, and would likely introduce some "wobble" into the keepalive timing (though maybe we don't care so long as we can guarantee a keepalive will be issued at least as frequently as configured)

Just injecting surrogate events from a worker thread that's monitoring writes to the output we actually care about (logically or directly) is probably simpler, and also avoids adding further complexity to the pexpect wakeup code (which is already pretty gnarly IMO). The thread is cheap, and the logic can be mostly/completely isolated in Worker.
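
As a sketch of that shape (KeepaliveWriter and the surrogate event payload are hypothetical, and this elides the real Worker event plumbing):

    import threading
    import time

    class KeepaliveWriter:
        # Wrap the stream Worker writes events to, remember the time of the
        # last real write, and let a daemon thread inject a surrogate
        # keepalive event whenever the stream has been idle too long.
        def __init__(self, stream, interval):
            self._stream = stream
            self._interval = interval
            self._lock = threading.Lock()
            self._last_write = time.monotonic()
            threading.Thread(target=self._keepalive_loop, daemon=True).start()

        def write(self, data):
            with self._lock:
                self._stream.write(data)
                self._last_write = time.monotonic()

        def _keepalive_loop(self):
            while True:
                time.sleep(self._interval / 2)
                with self._lock:
                    if time.monotonic() - self._last_write >= self._interval:
                        self._stream.write(b'{"event": "keepalive"}\n')
                        self._last_write = time.monotonic()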

One thing we do need to decide is if this should interact with idle_timeout; my initial thought is "no", since that's presumably about a playbook that's been sitting too long (dunno what the actual use case was for that setting). Hiding the keepalive stuff completely in Worker will also keep those concerns completely separate, assuming that's what we want.

@AlanCoding
Member Author

The default pexpect_timeout is 5 seconds.

self.pexpect_timeout = self.settings.get('pexpect_timeout', 5)

As a matter of the practical ask here, what people are really struggling with is K8S rules which will time out log connections after minutes or hours. In AWX there isn't even a mechanism to allow users to change the pexpect_timeout value, and very few users would have any idea of what it is. Yes, your 2nd point is technically correct, but I just don't think anyone cares, and to avoid incorrect behavior it would be easy to error if the requested idle message period were smaller than pexpect_timeout. You could even have it specified as a multiple of the pexpect_timeout in order to keep the time deltas almost accurate.
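
Something like this hypothetical validation would cover that; keepalive_factor is an illustrative setting name, not a real ansible-runner option:

    def resolve_keepalive_interval(settings):
        pexpect_timeout = settings.get('pexpect_timeout', 5)
        # 0 keeps the proposed default of not sending keepalives at all
        factor = settings.get('keepalive_factor', 0)
        if factor and factor < 1:
            raise ValueError('keepalive interval must be >= pexpect_timeout')
        return pexpect_timeout * factor  # seconds between keepalive messages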

I'm not fully sure what your point 1 is saying. But I think it suggests that the expected behavior isn't clear. The documented functionality could be:

  • Send a message every X seconds, regardless of whatever else is going on
  • Send a message every X seconds after the last message from worker

Either one would solve the problem, and the latter option would do so with fewer keep-alive messages in total. But again, I think we've got a grasp on the type of problem we want to solve so I'm not very worried about keep-alive message spam, should we do the former.
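
A toy sketch contrasting the two; send, last_write, and stop are stand-ins (a function that emits the keep-alive, a function returning the monotonic time of the last real worker message, and a threading.Event used to shut the loop down):

    import time

    def fixed_interval(send, period, stop):
        # option 1: send every `period` seconds regardless of other traffic
        while not stop.wait(period):
            send()

    def idle_reset(send, period, last_write, stop):
        # option 2: send only after `period` idle seconds; the keep-alive
        # itself counts as worker output, so track when we last sent one
        last_sent = time.monotonic()
        while not stop.wait(period / 4):
            if time.monotonic() - max(last_write(), last_sent) >= period:
                send()
                last_sent = time.monotonic()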

What I am worried about is our ability to debug anything that goes wrong in worker. Sometimes we have a hard enough time getting output from it at all.

@qiwangrh

qiwangrh commented Feb 9, 2023

The related Jira: https://issues.redhat.com/browse/AAP-6482

@luckass1

luckass1 commented Mar 6, 2023

I have the latest version of AWX, 21.12.0, running in OKE and I have the same problem. What is the procedure to implement the solution proposed in #1187?

@AlanCoding
Member Author

@luckass1 with ansible/awx#13608 the idea is to pick it up automatically in some release in the future.

But you should try it now! If you have your default EE (or a custom EE) updated with this patch, you could change the pod spec in the way done in that AWX PR. Go to instance groups in the UI and you should be able to edit the default group's pod spec there.

sivel removed the needs_triage label on May 16, 2023