Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] [Ansible::Runner] wait on artifacts/ to exist #20667

Closed

Conversation

NickLaMuro
Copy link
Member

@NickLaMuro NickLaMuro commented Oct 7, 2020

Before returning a result, wait on the artifacts/ directory to be created before returning a response object.

This will ensure that calling .running? will return false properly when the ansible-runner has ended, and not before it has even had a chance to start.

Thanks for Jason for the find on this bug.

Note: This should only really be a problem in a docker/podified environment, where launching the process and waiting for a response happens within the same background job. This isn't something that can happen on an Appliance, since it will have an inherent delay when monitoring the result via a new background job.

TODO

  • Maybe raise an error if we get through the entire loop...

Links

Steps for Testing/QA

I was using a test script for this in docker to run this.

$ docker run --rm -it manageiq/manageiq:latest-jansa /bin/bash
[root@5f4c439ca069 vmdb]# cd /var/www/vmdb/miq
[root@5f4c439ca069 vmdb]# curl -O https://raw.githubusercontent.com/ansible/test-playbooks/master/sleep.yml
[root@5f4c439ca069 vmdb]# cat script.rb
require 'pathname'

class Rails
  def self.root
    Pathname.new("/var/www/miq/vmdb")
  end
end

class Vmdb
  module Logging
  end
end

require 'awesome_spawn'
require 'ansible/runner'
require 'ansible/content'
require 'ansible/runner/response'
require 'ansible/runner/response_async'
require 'tmpdir'
require 'active_support/all'

response = Ansible::Runner.run_async({}, {}, "/var/www/miq/vmdb/hello_world.yml")
puts response.base_dir
puts response.running?

200.times do
  puts response.running?
end
[root@5f4c439ca069 vmdb]# ruby -I lib script.rb
/tmp/ansible-runner20201007-120-2d8yu1
false
false
true
true
...
true
true
false
false
false
false
  • A good run will return true values first... then false false values.
  • A bad run (where it isn't working correctly) with return some false values, followed by some true values, and then false values.

The idea is that true equals the process is running. If we get false prior to receiving some true values, we can reliably expect that .running? is a good indicator that the ansible-runner process was ever running properly.

The example run above shows it not working properly, and a correct run would be when the first to false output values are omitted.

@NickLaMuro
Copy link
Member Author

The only advantage to this PR I see is that it will be easier to backport, since there isn't a gem dependency. However, I think that #20666 probably makes sense for the long term.

@Fryguy
Copy link
Member

Fryguy commented Oct 7, 2020

Do we want to wait on the artifacts dir or the pid file? I think the latter is more accurate, though it's hard to tell.

@NickLaMuro
Copy link
Member Author

@Fryguy maybe more accurate, but seems like also more ephemeral, so it could exist then not within the time period of sleep(0.1), so we might miss it.

Figured it would be better to catch something that would remain after the run instead of something that could go away based off an error.

Before returning a result, wait on the `artifacts/` directory to be
created before returning a response object.

This will ensure that calling `.running?` will return `false` properly
when the `ansible-runner` has ended, and not before it has even had a
chance to start.

Thanks for Jason for the find on this bug.
@NickLaMuro NickLaMuro force-pushed the ansible-runner-wait-on branch from 1cb36ed to 764d71d Compare October 7, 2020 18:30
@miq-bot
Copy link
Member

miq-bot commented Oct 7, 2020

Checked commit NickLaMuro@764d71d with ruby 2.6.3, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
1 file checked, 0 offenses detected
Everything looks fine. 👍

@NickLaMuro NickLaMuro closed this Oct 7, 2020
@Fryguy
Copy link
Member

Fryguy commented Oct 7, 2020

We may want to use this one for backport...let's discuss in #20666


def wait_on(dir)
100.times do
Dir.exist?(dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you wanted break if Dir.exist? ?

Copy link
Member Author

@NickLaMuro NickLaMuro Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are correct... how did I miss that...

Anyway, thanks, will change in the new PR.

Edit: This is what I get for having to re-write something I tested in-place as a POC... without the raise though, it ✌️ technically ✌️ had the same effect...

@agrare
Copy link
Member

agrare commented Oct 7, 2020

@Fryguy maybe more accurate, but seems like also more ephemeral, so it could exist then not within the time period of sleep(0.1), so we might miss it.

👍 As long as artifacts/ is guaranteed to be created after pid

@NickLaMuro
Copy link
Member Author

👍 As long as artifacts/ is guaranteed to be created after pid

/me pulls up runner code base to look for proof...

@agrare
Copy link
Member

agrare commented Oct 7, 2020

The listen gem has them showing up in the same set of updates

/tmp/ansible-runner20201007-63398-188ezai/daemon.log
/tmp/ansible-runner20201007-63398-188ezai/pid, /tmp/ansible-runner20201007-63398-188ezai/artifacts/result/command

But I think inotify was reporting pid was first, let me double check

@agrare
Copy link
Member

agrare commented Oct 7, 2020

rb-inotify gives a better view of the events in order:

[:create]: daemon.log
[:create]: pid
[:modify]: pid
[:isdir, :create]: artifacts

so 👍 from me

@NickLaMuro
Copy link
Member Author

@agrare okay...

tl; dr: Yes, it is created after...

Long answer...

So, to start, the pidfile is created here:

https://github.com/ansible/ansible-runner/blob/1194519d275bf99ad296b07e2bc8ce33e37a7c34/ansible_runner/__main__.py#L882

(and is created in the same fashion in 1.4.6 as well, as opposed to the devel link I shared above)

That library is the python-daemon library, which also makes use of the lockfile lib as well. There is a bunch of code that I could link to the library... but to save some braincells, basically the TimeoutPIDLockFile that is called is from lockfile, and that is used by the python-daemon library to generate the lock file when the context is entered.

On the ansible-runner side, the definition of the artifacts_dir variable is initialized here:

https://github.com/ansible/ansible-runner/blob/blob/stable/1.4.x/ansible_runner/__main__.py#L285-L289

The code that creates the artifacts/ directory is called from a top level in ansible_runner/runner.py in the run() method:

https://github.com/ansible/ansible-runner/blob/stable/1.4.x/ansible_runner/runner.py#L89-L99

When that, the runner daemon context, is eventually entered here:

https://github.com/ansible/ansible-runner/blob/stable/1.4.x/ansible_runner/__main__.py#L559

it will create the pidfile in the python-daemon lib first, lock it, and then after that it will call the code that actually creates the artifacts directory as part of the .run() of ansible-runner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants