[core] Replace subprocess.Popen with safer get_subprocess_output #1892

Merged: 1 commit merged into master from zeller/subprocess on Nov 2, 2015

JohnLZeller
Contributor

This issue came up while investigating host down events.
https://datadog.zendesk.com/agent/tickets/30510

Periodically, a subprocess call appears to hang, which ultimately causes the watchdog to reset the agent. During that window a host down event is posted to the customer's events page, followed by a host recovered event. The hang does not happen in the same check for every host down event: offending checks have included the customer's own custom iostat check, our own postfix check, and even resources/processes.py.

The theory is that this is caused by a hanging call to subprocess.Popen. There is a known issue where using subprocess.PIPE for stdout (or even stderr) can cause the process to hang once the pipe fills up. The solution is to use get_subprocess_output() from our own utils/subprocess_output.py, which uses a file for stdout and stderr and so is not limited by the pipe's small capacity.
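For illustration, the file-based approach described above can be sketched roughly as follows. This is a minimal sketch and not the agent's actual get_subprocess_output(): the helper name run_with_tempfiles and its three-value return are assumptions made for this example. The idea it shows is that the child writes to temporary files rather than pipes, so it cannot block on a full pipe buffer while the parent waits.

```python
import subprocess
import sys
import tempfile


def run_with_tempfiles(command):
    """Run `command` (a list of arguments) and return (stdout, stderr, returncode).

    Hypothetical helper for illustration; not the agent's get_subprocess_output().
    """
    # Temporary files can keep growing, so the child never stalls on a full pipe.
    with tempfile.TemporaryFile() as stdout_f, tempfile.TemporaryFile() as stderr_f:
        proc = subprocess.Popen(command, stdout=stdout_f, stderr=stderr_f)
        proc.wait()  # safe here: no output is sitting in a pipe waiting to be read

        # Rewind both files before reading what the child wrote.
        stdout_f.seek(0)
        stderr_f.seek(0)
        return stdout_f.read().decode(), stderr_f.read().decode(), proc.returncode


if __name__ == "__main__":
    # Emit about 200 KB, more than a typical OS pipe buffer holds.
    out, err, code = run_with_tempfiles(
        [sys.executable, "-c", "print('x' * 200000)"]
    )
    print(len(out.strip()), code)
```

By contrast, calling wait() on a process started with stdout=subprocess.PIPE, without anything draining the pipe, is the pattern that can hang once the child has written more than the pipe buffer holds.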

To avoid chasing this problem around the agent any further, this PR attempts to replace subprocess.Popen with get_subprocess_output() everywhere it shows up in the agent. A few instances need further work to refactor the code cleanly; for those, I have added a FIXME note.

@JohnLZeller
Contributor Author

I have tested these on personal_chef, and fixed any issues that came up.

@@ -55,8 +57,13 @@ def _get_queue_count(self, directory, queues, tags):
 # can dd-agent user run sudo?
 test_sudo = os.system('setsid sudo -l < /dev/null')
 if test_sudo == 0:
-    count = os.popen('sudo find %s -type f | wc -l' % queue_path)
-    count = count.readlines()[0].strip()
+    find = get_subprocess_output(['sudo', 'find', queue_path, '-type', 'f'], self.log)

The second command is wc -l, so you can replace this whole block with:

count = len(get_subprocess_output(['sudo', 'find', queue_path, '-type', 'f'], self.log).splitlines())
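As a small illustration of why this works: counting the lines of the captured output in Python gives the same number that piping it through wc -l would. The helper name count_lines and the sample paths below are made up for this sketch, and it assumes the output is a plain string, as in the suggestion above.

```python
def count_lines(output):
    """Return the number of lines in `output`, like piping it through `wc -l`."""
    return len(output.splitlines())


# Sample text standing in for the output of `find <queue_path> -type f`.
sample = "/var/spool/postfix/incoming/A1\n/var/spool/postfix/incoming/B2\n"

print(count_lines(sample))  # 2
print(count_lines(""))      # 0
```

One small difference: splitlines() also counts a final line with no trailing newline, whereas wc -l counts newline characters. For find output, where every path ends in a newline, the two agree.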

Contributor Author

aha, clever :) I missed it! shakes fist at the sky

@JohnLZeller
Contributor Author

-systemStats['cpuCores'] = int(wc.communicate()[0])
+grep = get_subprocess_output(['grep', 'model name', '/proc/cpuinfo'], log)
+# Must use tempfiles to redirect stdout to second command
+tempgrep = tempfile.TemporaryFile()

same thing here

@JohnLZeller
Contributor Author

@remh incorporated all your changes :)

Sorry I sorta jumped the 🔫 on squashing the commits.
Here are the changes:
postfix.py Line 59
config.py Line 585
subprocess_output.py Line 21

All changes have been tested! 👍

@@ -385,8 +385,7 @@ def _get_server_pid(self, db):
 if pid is None:
     try:
         if sys.platform.startswith("linux"):
-            ps = subprocess.Popen(['ps', '-C', 'mysqld', '-o', 'pid'],
-                                  stdout=subprocess.PIPE, close_fds=True).communicate()[0]
+            ps = get_subprocess_output(['ps', '-C', 'mysqld', '-o', 'pid'], self.log)
Contributor Author

Use splitlines()
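A rough sketch of what parsing the ps output with splitlines() might look like. The helper name parse_single_pid and the sample output are assumptions for illustration, not the check's actual code; it expects the usual header line followed by exactly one PID line.

```python
def parse_single_pid(ps_output):
    """Return the PID from `ps -C <name> -o pid` output, or None.

    Hypothetical helper: expects a header line followed by exactly one PID line.
    """
    lines = ps_output.strip().splitlines()
    if len(lines) == 2:
        # int() tolerates the leading spaces ps uses to right-align the column.
        return int(lines[1])
    return None


# Sample output: a header line, then one right-aligned PID.
sample = "  PID\n 1234\n"
print(parse_single_pid(sample))     # 1234
print(parse_single_pid("  PID\n"))  # None
```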

@remh

remh commented Sep 28, 2015

@JohnLZeller can you fix the conflicts, please?

def _get_version_info(self, varnishstat_path):
# Get the varnish version from varnishstat
# FIXME: Use get_subprocess_output() instead of subprocess.Popen

Can you do it while you're at it?

@JohnLZeller
Contributor Author

Fixed a couple of the FIXMEs. Needs a little more love, and some testing.

@remh

remh commented Sep 30, 2015

Thanks @JohnLZeller, can you fix the conflicts, please?

@remh remh added this to the 5.5.1 milestone Sep 30, 2015
@JohnLZeller
Contributor Author

Okay, fixed the other places where there were existing FIXME comments for subprocess, as well as merged master to catch this branch up, and resolved conflicts.

I had to change the way get_subprocess_output() works so that it covers all the cases we need. I did not, however, end up using it for jmxfetch.py, because that would have made get_subprocess_output() too messy. Instead, I implemented a safer way of calling subprocess for jmxfetch.py.

I need to do some testing and refactoring on broken tests in the 'morrow. Getting closer.

@JohnLZeller
Contributor Author

Could use some feedback.

@olivielpeau would you mind reviewing half? And @yannmh could you review the other half?

-df_out = utils.subprocess_output.get_subprocess_output(
-    self.DF_COMMAND + ['-k'], self.log
-)
+df_out, err, rtcode = get_subprocess_output(self.DF_COMMAND + ['-k'], self.log)
Member

Just a general comment when using functions that return multiple values: you can discard the values that you don't use with _. It's just a convention but makes it clear that you're not using the value afterwards. For instance here you can use:

df_out, _, _ = get_subprocess_output(self.DF_COMMAND + ['-k'], self.log)
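A self-contained sketch of the same convention. The function get_output_stub is a hypothetical stand-in that returns canned (stdout, stderr, returncode) values, so the snippet runs without starting a process; it is not the agent's helper.

```python
def get_output_stub(command):
    """Hypothetical stand-in returning (stdout, stderr, returncode)."""
    # Canned `df -k`-style text; no process is actually started.
    return "Filesystem 1K-blocks Used\n/dev/sda1 1000 400\n", "", 0


# Only stdout is needed, so stderr and the return code are discarded.
df_out, _, _ = get_output_stub(["df", "-k"])

# Keep the return code but still ignore stdout and stderr.
_, _, returncode = get_output_stub(["df", "-k"])

print(df_out.splitlines()[1].split()[0], returncode)
```

Note that _ is an ordinary variable name, so this is purely a readability convention: the values are still assigned, just signalled as unused.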

@olivielpeau
Member

Great job, looks way cleaner now!

Looks good to me apart from my comments 👍

@remh remh modified the milestones: 5.5.2, 5.5.1 Oct 19, 2015
@JohnLZeller
Contributor Author

Thanks @olivielpeau :)

Made the changes you mentioned, thanks for the good feedback!

@remh

remh commented Oct 21, 2015

@JohnLZeller can you squash your commits, please?

@JohnLZeller
Contributor Author

@remh will do

@yannmh yannmh modified the milestones: 5.6.0, 5.5.2 Oct 23, 2015
@JohnLZeller
Contributor Author

@remh okay, I have squashed and rebased to match the current state of master.

@JohnLZeller JohnLZeller changed the title Replace subprocess.Popen with safer get_subprocess_output [core] Replace subprocess.Popen with safer get_subprocess_output Oct 27, 2015
remh pushed a commit that referenced this pull request Nov 2, 2015
[core] Replace subprocess.Popen with safer get_subprocess_output
@remh remh merged commit bfc4cb4 into master Nov 2, 2015
@miketheman
Contributor

I'm very interested in trying out some of these methods in a new check. Where might I find a recent build that includes this code?

@miketheman
Contributor

The reason I ask is that http://apt.datadoghq.com/dists/nightly/ has datadog-agent_5.5.0.git.6.ff669c3-1_amd64.deb, which dates back to Sept 17.

@remh

remh commented Nov 4, 2015

http://apt.datad0g.com/dists/nightly/

@miketheman
Contributor

Thanks @remh. I must have missed that when looking.

@olivielpeau olivielpeau deleted the zeller/subprocess branch November 4, 2015 18:09