Send "keep alive" (empty) messages to avoid idle timeouts when streaming #1187

Closed
AlanCoding opened this issue Feb 1, 2023 · 7 comments · Fixed by #1191

Comments

@AlanCoding
Member

The problem we want to address is that we have an ansible-runner worker running, talking to an ansible-runner process agent. Because of the infrastructure set up between them (usually K8S), people can have issues with idle timeouts dropping the connection and doing other bad things.

This proposes that ansible-runner itself send do-nothing messages from worker --> process at a periodic interval configured by the user. The default would probably be to not send these messages.

We may also provide a do-nothing callback in the process code.
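
A rough sketch of what the worker side could look like (the names start_keepalive and the message shape are hypothetical illustrations, not a committed API):

    import json
    import threading

    def start_keepalive(stream, interval_seconds, stop_event):
        # Periodically write a do-nothing message so any proxies or load
        # balancers between worker and process see traffic on the connection.
        def _loop():
            while not stop_event.wait(interval_seconds):
                stream.write(json.dumps({'keepalive': True}).encode() + b'\n')
                stream.flush()
        thread = threading.Thread(target=_loop, daemon=True)
        thread.start()
        return thread

The process side would then need to recognize messages of that shape and discard them, or hand them to the do-nothing callback mentioned above.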

github-actions bot added the needs_triage label on Feb 1, 2023
@kurokobo

kurokobo commented Feb 1, 2023

AWX is one use case where this issue can cause unintended job failures.
The issue seems to be particularly apparent in AKS.

@AlanCoding
Member Author

^ wow, your comment linked there is very close to exactly what this would be.

This issue could be summarized as implementing a supported means of doing what you did as a part of the ansible-runner code.

I think the trickiest thing is that we need to do this as a part of the while child.isalive(): main loop. So I'm leaning towards having a general callback method for this purpose. It could also avoid having to run a separate thread or process in other cases, where someone is using ansible-runner and has to do unrelated periodic tasks.
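
To make the shape of that concrete, here is a minimal sketch of a periodic callback driven by the existing main loop; run_loop, period, and periodic_callback are illustrative names, not actual ansible-runner internals:

    import time

    def run_loop(child, periodic_callback=None, period=30.0):
        # child is the pexpect spawn object from the existing main loop
        last_fired = time.monotonic()
        while child.isalive():
            # ... the existing expect()/event handling happens here ...
            now = time.monotonic()
            if periodic_callback is not None and now - last_fired >= period:
                periodic_callback()  # e.g. emit a keep-alive message
                last_fired = now

Because the loop only wakes up between expect() calls, the callback would fire with up to one pexpect timeout of jitter.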

@nitzmahone
Member

I've played with about 5 different solutions to this; my favorite so far is actually hiding this behavior entirely in the Worker class, since IIUC we're not trying to solve this for a naked run (only worker), correct?

The pexpect isalive loop is definitely attractive for not requiring a new source of concurrency to implement this, but it has some issues:

  1. while there's a rough correlation between stdout activity from core and event stdout activity from Worker, that's not always guaranteed. If what we really care about is idle stdout from Worker, we should be directly monitoring that (which isn't reliably possible from that location).
  2. the expect timeout would need to be changed to use the minimum of the configured expect timeout or some fraction of the configured keepalive timeout, and would likely introduce some "wobble" into the keepalive timing (though maybe we don't care so long as we can guarantee a keepalive will be issued at least as frequently as configured)

Just injecting surrogate events from a worker thread that's monitoring writes to the output we actually care about (logically or directly) is probably simpler, and also avoids adding further complexity to the pexpect wakeup code (which is already pretty gnarly IMO). The thread is cheap, and the logic can be mostly/completely isolated in Worker.
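
As a sketch of that shape (KeepaliveWriter and the surrogate event payload are hypothetical, and this elides the real Worker event plumbing):

    import threading
    import time

    class KeepaliveWriter:
        # Wrap the stream Worker writes events to, remember the time of the
        # last real write, and let a daemon thread inject a surrogate
        # keepalive event whenever the stream has been idle too long.
        def __init__(self, stream, interval):
            self._stream = stream
            self._interval = interval
            self._lock = threading.Lock()
            self._last_write = time.monotonic()
            threading.Thread(target=self._keepalive_loop, daemon=True).start()

        def write(self, data):
            with self._lock:
                self._stream.write(data)
                self._last_write = time.monotonic()

        def _keepalive_loop(self):
            while True:
                time.sleep(self._interval / 2)
                with self._lock:
                    if time.monotonic() - self._last_write >= self._interval:
                        self._stream.write(b'{"event": "keepalive"}\n')
                        self._last_write = time.monotonic()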

One thing we do need to decide is if this should interact with idle_timeout; my initial thought is "no", since that's presumably about a playbook that's been sitting too long (dunno what the actual use case was for that setting). Hiding the keepalive stuff completely in Worker will also keep those concerns completely separate, assuming that's what we want.

@AlanCoding
Member Author

The default pexpect_timeout is 5 seconds.

self.pexpect_timeout = self.settings.get('pexpect_timeout', 5)

As a matter of the practical ask here, what people are really struggling with is K8S rules which will time out log connections after minutes or hours. In AWX there isn't even a mechanism to allow users to change the pexpect_timeout value, and very few users would have any idea of what it is. Yes, your 2nd point is technically correct, but I just don't think anyone cares, and to avoid incorrect behavior it would be easy to error if the requested idle message period were smaller than pexpect_timeout. You could even have it specified as a multiple of the pexpect_timeout in order to keep the time deltas almost accurate.
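
Something like this hypothetical validation would cover that; keepalive_factor is an illustrative setting name, not a real ansible-runner option:

    def resolve_keepalive_interval(settings):
        pexpect_timeout = settings.get('pexpect_timeout', 5)
        # 0 keeps the proposed default of not sending keepalives at all
        factor = settings.get('keepalive_factor', 0)
        if factor and factor < 1:
            raise ValueError('keepalive interval must be >= pexpect_timeout')
        return pexpect_timeout * factor  # seconds between keepalive messages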

I'm not fully sure what your point 1 is saying. But I think it suggests that the expected behavior isn't clear. The documented functionality could be:

  • Send a message every X seconds, regardless of whatever else is going on
  • Send a message every X seconds after the last message from worker

Either one would solve the problem, and the latter option would do so with fewer keep-alive messages in total. But again, I think we've got a grasp on the type of problem we want to solve so I'm not very worried about keep-alive message spam, should we do the former.
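
A toy sketch contrasting the two; send, last_write, and stop are stand-ins (a function that emits the keep-alive, a function returning the monotonic time of the last real worker message, and a threading.Event used to shut the loop down):

    import time

    def fixed_interval(send, period, stop):
        # option 1: send every `period` seconds regardless of other traffic
        while not stop.wait(period):
            send()

    def idle_reset(send, period, last_write, stop):
        # option 2: send only after `period` idle seconds; the keep-alive
        # itself counts as worker output, so track when we last sent one
        last_sent = time.monotonic()
        while not stop.wait(period / 4):
            if time.monotonic() - max(last_write(), last_sent) >= period:
                send()
                last_sent = time.monotonic()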

What I am worried about is our ability to debug anything that goes wrong in worker. Sometimes we have a hard enough time getting output from it at all.

@qiwangrh

qiwangrh commented Feb 9, 2023

The related Jira: https://issues.redhat.com/browse/AAP-6482

@luckass1

luckass1 commented Mar 6, 2023

I have the latest version of AWX, 21.12.0, running in OKE and I have the same problem. What is the procedure to implement the solution proposed in #1187?

@AlanCoding
Member Author

@luckass1 with ansible/awx#13608 the idea is to pick it up automatically in some release in the future.

But you should try it now! If you have your default EE (or a custom EE) updated with this patch, you could change the pod spec in the way done in that AWX PR. Go to instance groups in the UI and you should be able to edit the default group's pod spec there.

sivel removed the needs_triage label on May 16, 2023