Get metrics regarding opened file handlers #853

Closed
ssbarnea opened this issue Mar 6, 2014 · 26 comments

@ssbarnea

ssbarnea commented Mar 6, 2014

Another common problem with systems is the number of open files.
Datadog should provide metrics regarding their use and, more importantly, present one metric that measures % use, allowing us to add alerts if usage goes above, say, 80%.

The raw count is not very useful by itself, but measured against the maximum value (which is configurable) it becomes much more meaningful.

http://www.cyberciti.biz/tips/linux-procfs-file-descriptors.html
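
For reference, both numbers such a percentage would be measured against are readable on Linux; a minimal sketch (not dd-agent code, standard library only):

import resource

# system-wide maximum number of file handles, configurable via the fs.file-max sysctl
print(open("/proc/sys/fs/file-max").read().strip())

# per-process limit (what `ulimit -n` reports), returned as a (soft, hard) pair
print(resource.getrlimit(resource.RLIMIT_NOFILE))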

@clutchski
Contributor

This is a good idea. Thanks very much.

@remh
Contributor

remh commented Mar 6, 2014

The name is confusing but it already exists in the process check:

https://github.com/DataDog/dd-agent/blob/master/checks.d/process.py#L13

@remh remh closed this as completed Mar 6, 2014
@remh
Contributor

remh commented Mar 7, 2014

Sorry, I misread your use case of getting it as a percentage.
Reopening it.

@remh remh reopened this Mar 7, 2014
@clutchski
Contributor

I think we also want the system limit, not just handles open per process.

@ssbarnea
Author

ssbarnea commented Mar 7, 2014

@clutchski you are right. Now we may have another problem: there is a system limit and a limit per process, and we will probably need both. From my experience you mostly want to monitor file handles for the monitored processes, as these are the ones that surprise you from time to time.

Also, I have no idea how to enable these metrics. I checked the process.yaml file and it only contains information on how to monitor different processes, not on how to enable these metrics (obviously I tried searching for them in the web UI and they are not there).

And regarding documentation, the best way to improve it is to improve the yaml templates and include all supported parameters in them. If something is too hard/complex to explain in the yaml file, you can always put a URL to a knowledge base article :)

@ssbarnea
Author

ssbarnea commented Mar 7, 2014

Just discovered that psutil was not installed. Should I open another bug as "the installer does not try to install psutil by default"?

I installed psutil, but now what do I need to do? Do I need to restart dd-agent, or change something in the config? I wasn't able to see any error related to psutil in the dd-agent logs.

@remh
Contributor

remh commented Mar 7, 2014

@ssbarnea

We don't bundle check dependencies with the agent to avoid conflicts with existing versions on the user's system.

But we are working towards a self-contained agent which would actually install these dependencies, so there is no need to open another bug for that.

Regarding the process check, it currently doesn't collect the system limit, but it does collect the number of open file descriptors for your watched processes.

Can you get in touch with [email protected] to help you configure the check?
Thanks
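
For illustration, this per-process count is roughly what psutil exposes; a minimal sketch (1234 is a hypothetical PID of a watched process, and num_fds() is POSIX-only):

import psutil

proc = psutil.Process(1234)  # hypothetical PID of a watched process
print(proc.num_fds())        # file descriptors currently open by that process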

@remh remh added this to the 4.3.x milestone Mar 7, 2014
@ssbarnea
Author

I will contact support; they are really good and also quick :)

Now, just as customer experience feedback: I find it annoying that by default only ⅓ of the functionality is available, just because the required libraries are not installed. I hope the next installer will try to install them one way or another; I don't care how. The second annoyance is that the default .yaml files are far from extensive enough. I think you should make a rule of updating them with all available options, so that they can serve as the primary source of documentation. Most Linux tools ship config files with commented-out options inside, and most of the time that is all you need to configure the product. That's what I call self-documented. Thanks.

Also, it would be great to build a list of metrics with a description for each one, so we would know which ones we want to track and exactly what each metric means; sometimes the name is not explicit enough, and you may not be aware of the range of values it will take, the unit of measure, and so on.

@remh remh modified the milestones: 5.1.0, 4.3.x May 8, 2014
@remh remh modified the milestones: Future, 5.1.0 Sep 26, 2014
@ssbarnea
Author

Please do try to install psutil when installing the agent, otherwise you are just providing a bad user experience. It is OK to ignore a failure, but something like an apt-get install python-psutil would be a great UX improvement.

@remh
Contributor

remh commented Apr 13, 2015

Thanks for the feedback @ssbarnea

As of Agent 5.0.0, psutil is bundled in the deb, rpm and msi packages of the agent, and is installed on the fly with source installs.

$ /opt/datadog-agent/embedded/bin/python -c "import psutil; print psutil.__version__"
2.2.1

We will work on this issue to implement the count of open file handles, as it's an important metric, but feel free to open a pull request if you've already implemented it!

Thanks again for the feedback!

@remh remh modified the milestones: 5.4.0, Future Apr 13, 2015
@ssbarnea
Author

I am quite busy fixing other broken things at the moment, but rest assured that if I implement something for Datadog I will open pull requests. I prefer not to run my own patched versions.

I had an outage due to file handles being exhausted for one of the monitored processes (nginx), and it took me some time to find the cause.

So if Datadog can monitor the % of file handles it would be perfect, as we could have a single rule: if % open files (current/max) is over 90%, raise an alarm.

I like being able to use relative conditions, as they are much easier to manage and you do not have to update the monitors when you tune the configuration on the server side.


@remh remh modified the milestones: 5.5.0, 5.4.0 May 11, 2015
@remh remh modified the milestones: Contribution needed, 5.5.0 Jul 30, 2015
@remh
Contributor

remh commented Jan 4, 2016

Looks like we could get that from /proc/sys/fs/file-nr

@ssbarnea what do you think ?
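
For context, that file holds three counters; illustrative contents (values vary per system):

$ cat /proc/sys/fs/file-nr
1632    0       98304

The first field is the number of allocated file handles, the second is allocated-but-unused handles, and the third is the system-wide maximum (the fs.file-max sysctl).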

@remh remh modified the milestones: 5.8.0, Contribution needed Jan 6, 2016
@ssbarnea
Author

This doesn't seem to fix the issue: we need to be able to read the number of file descriptors per user, and this returns the same result for any user.

@remh
Contributor

remh commented Jan 20, 2016

Thanks for the feedback @ssbarnea

It's not possible to get the number of open FDs per user without root access.
Getting the number of open FDs per user could also generate hundreds of different timeseries for a use case that's not very clear. It would also be much slower than just reading from /proc/sys/fs/file-nr, as it would have to go through all running PIDs (that's basically what lsof does).

Reading the number of open FDs and the limit from /proc/sys/fs/file-nr, on the other hand, would be pretty straightforward, fast to execute, wouldn't require root access, and would give you the visibility to detect FD leaks.

So that's likely the way we will go. What do you think?
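
As a rough illustration of the cost (not check code): counting open FDs that way means listing every PID's fd directory under /proc, and listing other users' processes requires root:

import os

total = 0
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        total += len(os.listdir("/proc/%s/fd" % pid))  # permission denied without root
    except OSError:                                    # or the process already exited
        continue
print(total)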

@ssbarnea
Author

We are running ~6 serious JVM applications on the same bare-metal machine, each of them under its own username, and they all have custom ulimits. We never ran out of file handles for the system itself, but every 3-4 months we have an issue related to them, caused by either a bug or just normal usage growth.

If we monitored only the global number of file handles we would not be able to tell who is generating the problem.

As a workaround I could set up the same limits for all applications, putting each of them at 90% of the total system limit, and monitor only the total values.

I do agree that under no circumstances should we count all FDs for each PID.

Needing root access is not a problem from my point of view; doing proper monitoring almost always requires root access. There are ways to secure this; allowing the datadog user to run a specific command as root could be one option.

@ssbarnea
Author

I hope someone from DataDog will merge DataDog/ansible-datadog#13, which is needed for this bug.

@remh
Contributor

remh commented Jan 29, 2016

@ssbarnea thanks for the feedback
We closed DataDog/ansible-datadog#13 as psutil is already included in the agent.

One way to do this would be for you to grant the dd-agent user access to lsof in the sudoers file.

Then we could have the process check call lsof on the PIDs it finds.
Would that work?
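
A hypothetical sketch of that sudoers entry (the file path, agent user name and lsof path may differ per system):

# /etc/sudoers.d/datadog
dd-agent ALL=(ALL) NOPASSWD: /usr/bin/lsof

The check could then count a watched PID's entries with something like sudo lsof -p <pid> | wc -l, keeping in mind that lsof also lists mapped files and the like, so it is not a strict FD count.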

@ssbarnea
Author

Yes, in combination with the correct configuration of lsof via Ansible, this would work. Thanks!

@alexef

alexef commented Aug 24, 2016

Just for the record, and to save others the time spent searching for it: currently Datadog only supports open file handles per process. For the system as a whole, system.processes.open_file_descriptors is just the sum of the monitored processes' open FDs.

@lalarsson

@alexef Thanks for the information.

Is there any possibility that the total open file handles per system will be included in the future?

@tomstockton

+1 - @remh what happened to monitoring the relevant values in /proc/sys/fs/file-nr? This would be really useful. I could write a custom check but I feel like it should be a core metric.

@jippi
Contributor

jippi commented Dec 1, 2016

👍 the aggregate of /proc/sys/fs/file-nr would be super useful!

@remh remh reopened this Jan 25, 2017
@remh
Contributor

remh commented Jan 25, 2017

Reopening; we can indeed add the content of /proc/sys/fs/file-nr, although it's not as precise.

@abeluck

abeluck commented Nov 9, 2017

Any movement on this issue? This is quite an important metric for us.

@pdecat

pdecat commented Nov 9, 2017

@abeluck, I've got a PR at DataDog/integrations-core#715, but some changes were requested before it can be considered for merging. I don't have time to implement them right now, though.

FWIW, we have been using this patch as is since August.

@olivielpeau
Member

On Linux, the Agent now reports the total number of open file handles over the system limit (as the fraction system.fs.file_handles.in_use) by default. The value is collected from /proc/sys/fs/file-nr.
So I'll go ahead and close this issue.

Root permissions are needed to collect per-process values. @pdecat's PR DataDog/integrations-core#1235 implements a way to grab these metrics without making the whole agent run as root.
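
As a rough sketch of the computation (not the Agent's actual code), the fraction can be derived from the first and third fields of that file:

allocated, unused, maximum = open("/proc/sys/fs/file-nr").read().split()
print("system.fs.file_handles.in_use = %.4f" % (float(allocated) / float(maximum)))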
