Refactor of metricbeat process-gathering metrics and system/process #30076
Conversation
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
This looks good to me, the new changes are definitely easier to follow than what was there before.
There is so much platform specific code in here that testing will probably do a better job of validating this than code review.
Alright, whenever I try running the Windows tests on my local machine they're incredibly flaky, and I'm not sure if I somehow made them more flaky.
@unittest.skipUnless(re.match("(?i)linux|darwin|freebsd", sys.platform), "os")
Having to disable tests on Windows in a big refactor like this concerns me a little bit when we don't know what the root cause is. Can we prove that the tests are equally flaky on main?
Yah, I'm super baffled by this, and I don't have enough experience with the testing setup or Windows. I can run them on my Windows boxes and the tests will pass, but the entire suite of unit tests, not just the ones impacted by the refactor, is super flaky on any Windows machine I test. The tests I commented out on Windows are flaky because there's a small chance that the parallel unit tests will return "wrong" events that don't have the full processor data.
I assume I can fix this by refactoring the Python tests to add a few filter functions, but I'm currently not sure that's worth it, considering these tests are 6 years old.
There is nothing about Windows in particular that should cause flakiness. If you are testing on older hardware, the slightly slower execution speed can sometimes reveal timing issues that result in flakiness.
I wouldn't focus on the age of the tests but on whether they are actually providing valuable test coverage. If they are, it is worth trying to improve them (or just replace them with something better, like an equivalent Go test). If they aren't testing anything useful, we could remove them.
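For what it's worth, here's a rough idea of what an equivalent Go test could look like. This is only a sketch: it assumes the standard metricbeat testing helpers (mbtest.NewReportingMetricSetV2Error and mbtest.ReportingFetchV2Error) work against the refactored system/process metricset, and it deliberately asserts only that a fetch returns events without errors, since the refactored code omits fields it cannot collect.

```go
package process_test

import (
	"testing"

	mbtest "github.com/elastic/beats/v7/metricbeat/mb/testing"

	// Blank imports register the module and metricset; assumed to be the
	// right packages for system/process.
	_ "github.com/elastic/beats/v7/metricbeat/module/system"
	_ "github.com/elastic/beats/v7/metricbeat/module/system/process"
)

// TestProcessFetch sketches a Go replacement for the flaky Python tests:
// run a single fetch of the system/process metricset and make sure it
// produces events without errors. Field-level assertions are left out on
// purpose, since the refactored code only reports fields it could collect.
func TestProcessFetch(t *testing.T) {
	config := map[string]interface{}{
		"module":     "system",
		"metricsets": []string{"process"},
	}

	ms := mbtest.NewReportingMetricSetV2Error(t, config)
	events, errs := mbtest.ReportingFetchV2Error(ms)

	if len(errs) > 0 {
		t.Fatalf("fetch returned errors: %v", errs)
	}
	if len(events) == 0 {
		t.Fatal("expected at least one process event")
	}
}
```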
Yah, the flakiness is really odd, considering it's running on a fairly nice Windows PC, but I'm still seeing inconsistent timeouts. I think it's worth trying to improve the tests, but they need enough cleanup and additions that I'm wondering if I should do it in a separate PR.
For anyone watching, I'm still tweaking the Windows tests. I did a little more testing on Darwin and different Linux distros; however, we'll need QA involvement for the upcoming release, since the diff here is so huge. I'm also tempted to open an issue for refactoring the unit tests, since they are huge and could use some basic improvements.
thanks @fearful-symmetry for working on this!
Update regarding CI test failures: the Python unit tests are weirdly probabilistic, and something about the Windows CI environment in particular makes them more prone to timeouts and to odd events that require additional filtering logic. I'm tweaking things to get Windows CI to pass, but part of the issue is that post-refactor the code is much more strict about not reporting fields that don't technically exist, so the Python testing framework's fast-and-loose approach doesn't work as well. In parallel, I'm working on another PR to clean up the tests (my Python linter lights up the entire codebase, it's wild) and hopefully add some new features to make them more deterministic.
We have performed testing of the system integration for the "Process" event dataset on various operating systems and found it working fine. Below are our observations:
Platforms tested: macOS 12, Ubuntu 20, CentOS Linux 7.3, SLES 15, and Windows 10
Build Details:
Please let us know if we are missing anything here to validate. Thanks
What does this PR do?
This is a total refactor of the code used to gather process metrics in metricbeat. This was done with a few goals:
- Reduce our dependency on the gosigar library
- Cut down on how much of the code passes around MapStr types
- Correctly report null/zero metrics by using the Opt library (see the sketch after this list)
- Make hostfs handling consistent, with no more global hostfs setting

As of now, I haven't added any reviewers to this, as I'm still not done testing, and I imagine it's gonna take me a good day or two to just make CI happy. However, it's mostly done, and I'd at least like to have it up so CI can fail.
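To make the null-vs-zero goal concrete, here is a minimal, self-contained sketch of the idea behind Opt-style types; the real library's type and method names may differ, this only illustrates why a bare uint64 dropped into a MapStr can't tell "not collected" apart from "genuinely zero".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OptUint is an illustrative optional uint: it distinguishes "metric was not
// collected" from "metric is genuinely zero".
type OptUint struct {
	exists bool
	value  uint64
}

// UintWith wraps a value that was actually collected.
func UintWith(v uint64) OptUint { return OptUint{exists: true, value: v} }

// UintNone represents a metric the OS never provided.
func UintNone() OptUint { return OptUint{} }

// Exists reports whether the metric was collected at all.
func (o OptUint) Exists() bool { return o.exists }

// MarshalJSON renders missing metrics as null; callers can also check
// Exists() and skip the field entirely, so events never report metrics that
// were never gathered.
func (o OptUint) MarshalJSON() ([]byte, error) {
	if !o.exists {
		return []byte("null"), nil
	}
	return json.Marshal(o.value)
}

func main() {
	collected := UintWith(0) // a real zero value
	missing := UintNone()    // a metric the OS did not provide

	fmt.Println(collected.Exists(), missing.Exists()) // true false

	b1, _ := json.Marshal(collected)
	b2, _ := json.Marshal(missing)
	fmt.Println(string(b1), string(b2)) // 0 null
}
```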
How the new code works
In order to improve performance, we try to make as few iterations over a list of processes as possible. To accomplish this, the code frequently alternates between generic OS-agnostic functions and OS-specific implementations. When system/process fetches a list of processes, the steps are something like this (a condensed code sketch follows the list):
1. Get() is called from an external library, such as metricbeat.
2. Get() calls the FetchPids() method. This fetches and iterates over the PIDs using whatever OS-specific methods are available, then for each PID calls pidFill(), an OS-agnostic function that:
   - calls the OS-specific GetInfoForPid function that fills out basic process info (name and PID at minimum)
   - calls the OS-specific FillPidMetrics() function that fetches the rest of the process metrics
3. The resulting process list is filtered according to the top setting, converted to MapStr types, and returned.

This contains no user-facing changes.
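To make that flow easier to picture, here is a condensed, self-contained sketch. Only FetchPids, pidFill, GetInfoForPid, and FillPidMetrics are named in the description above; ProcState, Stats, and enumeratePids are hypothetical stand-ins, and the real signatures in the PR will differ.

```go
// Package procsketch is an illustrative sketch of the fetch flow described
// above, not the PR's actual implementation.
package procsketch

// ProcState holds everything gathered for one process.
type ProcState struct {
	Pid  int
	Name string
	// ... CPU, memory, and other metrics would live here.
}

// Stats carries configuration (hostfs path, filters) that used to be global.
type Stats struct {
	Hostfs string
	procs  map[int]ProcState
}

// enumeratePids stands in for OS-specific PID enumeration (procfs on Linux,
// sysctl on the BSDs, Win32 APIs on Windows). Hypothetical stub.
func enumeratePids() []int { return []int{1} }

// GetInfoForPid stands in for the OS-specific call that fills out basic
// process info (name and PID at minimum). Hypothetical stub.
func GetInfoForPid(pid int) (ProcState, error) {
	return ProcState{Pid: pid, Name: "init"}, nil
}

// FillPidMetrics stands in for the OS-specific call that fetches the rest of
// the process metrics. Hypothetical stub.
func FillPidMetrics(pid int, state ProcState) (ProcState, error) {
	return state, nil
}

// FetchPids walks the PID list exactly once, handing each PID to the
// OS-agnostic pidFill, so the process list is only iterated a single time.
func (s *Stats) FetchPids() (map[int]ProcState, error) {
	if s.procs == nil {
		s.procs = map[int]ProcState{}
	}
	for _, pid := range enumeratePids() {
		state, err := s.pidFill(pid)
		if err != nil {
			continue // skip PIDs we cannot read rather than failing the fetch
		}
		s.procs[pid] = state
	}
	return s.procs, nil
}

// pidFill is OS-agnostic: basic info first, then the remaining metrics.
func (s *Stats) pidFill(pid int) (ProcState, error) {
	state, err := GetInfoForPid(pid)
	if err != nil {
		return state, err
	}
	return FillPidMetrics(pid, state)
}
```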
Why is it important?
This closes #29796 and represents a major part of the ongoing cleanup of system code and further deprecation of external libraries.
Checklist
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist
How to test this PR locally