-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make /proc/pid/io lookup failures skippable #129
Make /proc/pid/io lookup failures skippable #129
Conversation
The buildkite agent is itself a container isn't it? Can it run an ubuntu container for us here? Can we just test this directly in CI in this repository? |
@cmacknz so, to clarify, we need to run it in a container where we're trying to monitor the host OS. In order to test this specific issue in a non-flaky way, we'd need to fetch a running process from outside the container, and attempt to read that. Not sure how possible that is with our current CI configs. The next-best way is to just monitor a process that doesn't belong to the same user the running beat instance, which is (slightly) easier in theory, but a lot of the CI environments are isolated enough that we can only see PIDs from the currently running test user. |
In the past we've occasionally run into adjacent problems testing the intersection of cgroups+various container configs. It's not enough to run the test in a container, the test needs to have control of the container orchestration, so it can mount in host file systems, alter container security options, have knowledge of the host OS, etc. |
CI does not want you to escape the container, so not very possible. At least not without a dedicated CI running specifically for this purpose.
Can you try this and confirm?
Is there anything we can do with https://pkg.go.dev/io/fs to mock out the filesystem conditions we want to observe? Or by just capturing representative content from /proc and point the code at that? Do we really need live running processes for this? Or can we create a reference instance of the part of the /proc tree we care about instead? I am wary of declaring system metrics as untestable considering our recent experience with changes in here have unintended severe consequences. |
Already done in a previous PR, that's why the
So, it depends on what we're trying to test. If we just want to test the more basic condition of "how does this library behave when it runs into some kind of permission failure" we can test that, but the more nefarious set of errors here are "does metricbeat behave as expected when monitoring the host system from inside a container" that's a little harder. This isn't a normal file permission error, as the man page explains, "Permission to access this file is governed by a ptrace access mode PTRACE_MODE_READ_FSCREDS check". This is really only relevant to docker, where it runs containers in a more locked-down permissions context, which means that a container needs I do believe we should have some kind of CI or testing platform that's a little more low-level. In addition to issues like this, where it would help to spin up our own containers, there's a lot of edge cases that would benefit from a test matrix that allows for more fine-grained control over the specific OS. But that's a separate thing. I'm gonna poke around and see if the test environment lets me make a user and start a process as it. |
The buildkite docker plugin supports We should be able to try creating a build step with the docker plugin, and if it doesn't work immediately I would ask eng prod if it can be setup. It should allow privileged access to the underlying buildkite agent, which is what you'd have if you just ran on it directly, which seems reasonable. |
Could we maybe use |
Oh, that's a fun idea, I like that. No idea how practical it is. Much to my surprise, the CI environment lets me make a new user. I've started trying to build out some kind of test that'll allow us to run the system code against a process that doesn't belong to the current user. |
So, it's still kind of a hack, but I've come up with a test that checks the base scenario: can we read from a process where we don't have sufficient without failing? This is a broader test than what actually spurred this issue, but I'd argue this is something of a positive, as it means we're testing for changes that might cause issues when beats is run as non-root. |
What does this PR do?
This fixes a change in the behavior of the system metrics where fetching process metrics on a container without
SYS_PTRACE
results in a failure, as/proc/pid/io
requiresSYS_PTRACE
.This flies under the radar in a handful of use cases, as our docs mention
SYS_PTRACE
in other contexts, and many users monitoring system data will (in my experience) end up passing--privileged
due to a variety of issues with permissions inproc
. This issue also doesn't appear outside of docker.There is a test already that does cover this (although it'll catch it for a permissions error, not a container capability error):
TestFetchProcessFromOtherUser
, however, that test is usually skipped in CI, as the CI environment lacks more than one user. The exact test condition itself can't really be tested without running in a container.If we want to test this more thoroughly, we'll need some kind of integration test in beats that runs in a container and specifically looks out for skipped processes. These kinds of tests are often flaky, so they tend not to happen. Once we get buildkite set up in the beats repo, we should add a test that specifically looks for skipped metrics, probably by spinning up multiple sub-processes and having beats monitor them directly.
Why is it important?
This changes the behavior of system metrics inside a container.
Checklist
CHANGELOG.md