EAGAIN hazard in isolated runs using rsync + oc rsh #6692

Closed
domq opened this issue Apr 14, 2020 · 9 comments

@domq

domq commented Apr 14, 2020

ISSUE TYPE
  • Bug Report
SUMMARY

Because rsync and oc rsh disagree on whether socket I/O should be blocking or non-blocking by default, and awx/playbooks/check_isolated.yml uses a combination of both to pump the JSON job events out of runner containers running on Kubernetes, there is a hazard of oc rsh treating EAGAIN as a crippling error (killing a goroutine, but leaving the process alive) and causing rsync to deadlock.

ENVIRONMENT
  • AWX version: X.Y.Z
  • AWX install method: openshift, minishift, docker on linux, docker for mac, boot2docker
  • Ansible version: X.Y.Z
  • Operating System:
  • Web Browser:
STEPS TO REPRODUCE
  1. Deploy AWX on Kubernetes
  2. Create a container group to run on Kubernetes and configure it with the appropriate credentials
  3. Create a job template that uses the container group in a way that will create a lot of job events per unit of time, e.g. slicing = 1, a high inventory count, lots of short tasks in the playbook, verbosity set to 3
  4. Execute the job template
EXPECTED RESULTS

The job should succeed or fail depending solely on the underlying Ansible tasks' outcomes.

ACTUAL RESULTS

rsync deadlocks. That is, rsync processes are visible in both the awx-task master container and the runner container, but they make no progress. After exactly 10 minutes in that state, awx/playbooks/check_isolated.yml kills rsync and terminates in failure. AWX continues with no error (because failure is an expected outcome for that playbook), and soon finds a .~tmp~ directory (left over by the dead rsync) in a place where it shouldn't be, causing a misleading Python backtrace to appear in the AWX UI.

ADDITIONAL INFORMATION AND INVESTIGATION
  • During the 10 minutes it takes for rsync to die, the logs are frozen in the Web UI and JobEvent.objects.count() makes no progress from the vantage point of awx-manage shell_plus
  • Killing rsync from within the ephemeral runner container results in the same IsADirectoryError backtrace, except that it happens immediately (as opposed to after 10 minutes).

Running rsync manually, by copying and pasting the deadlocked command line, demonstrates that the issue does not lie in Ansible / AWX code, e.g.:

find /tmp/awx_*/artifacts/ -type f -name "*.json" -delete
/usr/bin/rsync --delay-updates -F --compress --delete-after --archive --rsh="oc rsh --config=/tmp/awx_792_zk0tq7xm/.kubeconfig9kwpv7nq" --out-format="<<CHANGED>>%i %n%L" awx-job-792:/tmp/awx_792_zk0tq7xm/artifacts/ /tmp/awx_792_zk0tq7xm/artifacts/

(The find -delete is there to increase the rsync workload, and thus the probability of triggering the bug.)

In case of deadlock, a log line like this one will appear in the middle of the rsync spew:

E0413 17:35:40.920139    6976 v2.go:147] write /dev/stdout: resource temporarily unavailable

Investigation shows that v2.go:147 is in a goroutine that pumps stdout around, implemented in terms of io.Copy (0cbc58b being the precise commit reported by oc version in the ansible/awx_task:10.0 image that I am using).

After that log message is shown, part of oc rsh's byte-pumping apparatus becomes inoperative, and therefore rsync stops making progress.
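
To make the failure mode concrete, here is a self-contained sketch (my own illustrative stand-in, not the actual oc code): a pipe whose write end has been switched to non-blocking mode plays the role of the descriptor that rsync hands to oc rsh, and an io.Copy-based goroutine plays the role of the pump at v2.go:147. One full pipe buffer is enough to surface EAGAIN; io.Copy returns on that first write error, and the goroutine exits while the process stays alive.

package main

import (
    "io"
    "log"
    "strings"
    "syscall"
)

// fdWriter writes to a raw file descriptor: it retries partial writes but,
// like a naive pump, treats EAGAIN as an ordinary (fatal) error.
type fdWriter struct{ fd int }

func (w fdWriter) Write(p []byte) (int, error) {
    total := 0
    for total < len(p) {
        n, err := syscall.Write(w.fd, p[total:])
        if n > 0 {
            total += n
        }
        if err != nil {
            return total, err
        }
    }
    return total, nil
}

func main() {
    // A pipe whose write end is non-blocking and that nobody reads,
    // standing in for the connection rsync sets up for its --rsh child.
    fds := make([]int, 2)
    if err := syscall.Pipe(fds); err != nil {
        log.Fatal(err)
    }
    syscall.SetNonblock(fds[1], true)

    done := make(chan struct{})
    go func() { // the "byte pump" goroutine
        defer close(done)
        src := strings.NewReader(strings.Repeat("x", 1<<20))
        if _, err := io.Copy(fdWriter{fds[1]}, src); err != nil {
            // Prints something like: pump died: resource temporarily unavailable
            log.Printf("pump died: %v", err)
        }
    }()
    <-done
    // The process is still alive here, but the pump is gone; from rsync's
    // point of view the stream has simply stopped moving data.
}

With a blocking write end, the same io.Copy would simply stall until the reader catches up instead of erroring out, which is the behaviour the rest of the pipeline implicitly relies on.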

Second-hand information only below this point:

  • This has possibly happened before while attempting to use rsync on top of another piece of Go pipework (in that case, docker exec)
  • In that previous instance, telling rsync to use blocking I/O solved the problem.
@domq
Author

domq commented Apr 14, 2020

Here is some speculation that will need substantiating:

  • Because io.Copy does not have access to a shareable event-loop abstraction that would let it wait and retry on EAGAIN, it likely has no choice but to assume EAGAIN never happens (i.e., blocking I/O); a sketch of the kind of retry that is missing follows this list
  • The fact that rsync --blocking-io helped there might indicate that rsync conflates the blocking mode of the descriptors it uses for its “own” I/O (i.e. the files) with that of the descriptors it creates only to pass to the --rsh= forked process (i.e. the sockets)
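
For comparison, here is a minimal sketch (illustration only, under assumptions of mine; not a proposed oc patch and not rsync code) of the kind of retry a byte pump needs in order to tolerate a non-blocking destination: on EAGAIN, back off and try again rather than treating the error as fatal.

package main

import (
    "fmt"
    "syscall"
    "time"
)

// writeAll keeps writing buf to fd; on EAGAIN it backs off briefly and
// retries instead of giving up, which is what a pump needs once its
// destination descriptor has been switched to non-blocking mode.
func writeAll(fd int, buf []byte) error {
    for len(buf) > 0 {
        n, err := syscall.Write(fd, buf)
        if n > 0 {
            buf = buf[n:]
        }
        if err == syscall.EAGAIN {
            time.Sleep(time.Millisecond) // crude; a real pump would poll for writability
            continue
        }
        if err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // Non-blocking pipe with a deliberately slow reader: the writer hits
    // EAGAIN whenever the pipe buffer is full, retries, and still delivers
    // every byte.
    fds := make([]int, 2)
    if err := syscall.Pipe(fds); err != nil {
        panic(err)
    }
    syscall.SetNonblock(fds[1], true)
    go func() {
        tmp := make([]byte, 65536)
        for {
            time.Sleep(10 * time.Millisecond)
            if _, err := syscall.Read(fds[0], tmp); err != nil {
                return
            }
        }
    }()
    if err := writeAll(fds[1], make([]byte, 1<<20)); err != nil {
        panic(err)
    }
    fmt.Println("1 MiB written despite intermittent EAGAIN")
}

A real fix would live either there (teach the copier to wait and retry) or on the rsync side, where --blocking-io asks rsync to use blocking I/O for the remote-shell transport so that EAGAIN never surfaces in the first place.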

@domq
Author

domq commented Apr 14, 2020

In fact it is even debatable whether this is an rsync bug, an oc bug, or something that should somehow be worked around in AWX's own pipework.

@domq
Author

domq commented Apr 14, 2020

Here is why this issue doesn't show up with “ordinary” rsync over ssh.

@ryanpetrello
Contributor

ryanpetrello commented Apr 14, 2020

Wow. First, thanks for spending so much time on this one - it's pretty gross 😂 and very complicated.

Having read all of this, this comment captures my first impression:

In fact it is even debatable whether this is an rsync bug, an oc bug, or something that should somehow be worked around in AWX's own pipework.

Based on your description, this really sounds like a bug in oc itself. Have you considered reaching out to the OpenShift folks and seeing if they have any ideas?

Also, any chance you've tried and found better results by adding --blocking-io here?

https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L41
https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L54

@shanemcd
Member

Yeah, holy shit. Thank you for digging so deep into this. I see you found my synchronize + oc rsh hack. I knew that it was probably a matter of time before someone found a fatal flaw with it. 😄

If @ryanpetrello's suggestion doesn't work, we might want to move away from using run_isolated.yml and check_isolated.yml for container groups. Those (as the name implies) were for our Isolated Node feature (not an upstream concept), but I reused them because it seemed to work. If we're learning that it doesn't, it would be easy enough to add a conditional right before these playbooks are invoked and handle the container groups case differently.

@domq
Author

domq commented Apr 16, 2020

Also, any chance you've tried and found better results by adding --blocking-io here?

https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L41
https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L54

I've been distracted by other work lately, but I'll give this a try ASAP.

@domq
Author

domq commented Apr 16, 2020

Burn-in test is in progress; looks good so far (3 down, 3 to go).

[screenshot: rsync-EAGAIN]

@domq
Author

domq commented Apr 16, 2020

Also, any chance you've tried and found better results by adding --blocking-io here?

Indeed, --blocking-io helps. Pull request to follow.

@ryanpetrello
Contributor

👍 thanks a ton for testing this, @domq - if you want to let it marinate for a day or two, I'll merge your patch early next week once you're feeling confident. cc @shanemcd
