
Refactor of metricbeat process-gathering metrics and system/process #30076

Merged

Conversation

fearful-symmetry
Contributor

What does this PR do?

This is a total refactor of the code used to gather process metrics on metricbeat. This was done with a few goals:

  • Depend less on upstream libraries such as gosigar
  • Performance improvements
  • Reduce use of the MapStr type
  • Get rid of any dangling null/zero metrics by using the Opt library (see the sketch after this list)
  • Prepare for upcoming changes to metricset/input design by decoupling metrics-gathering code from the logic of metricbeat
  • Make hostfs handling consistent, no more global hostfs setting
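
To illustrate the Opt idea, here is a minimal, self-contained sketch; `OptUint`, `UintWith`, and the field names below are simplified stand-ins for illustration, not the actual Opt library API:

```go
package main

import "fmt"

// OptUint is a simplified stand-in for an optional metric: it carries a
// value plus whether that value was actually collected from the OS.
type OptUint struct {
	exists bool
	value  uint64
}

// UintWith marks a metric as present.
func UintWith(v uint64) OptUint { return OptUint{exists: true, value: v} }

// addIfExists only emits the field when it was actually read, so a metric
// that was never collected does not show up as a misleading 0.
func addIfExists(event map[string]interface{}, key string, v OptUint) {
	if v.exists {
		event[key] = v.value
	}
}

func main() {
	event := map[string]interface{}{}
	addIfExists(event, "memory.rss.bytes", UintWith(4096)) // collected
	addIfExists(event, "memory.swap.bytes", OptUint{})     // never read on this OS
	fmt.Println(event)                                      // map[memory.rss.bytes:4096]
}
```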

As of now, I haven't added any reviewers to this, as I'm still not done testing, and I imagine it's gonna take me a good day or two to just make CI happy. However, it's mostly done, and I'd at least like to have it up so CI can fail.

How the new code works

To improve performance, we try to make as few passes over the process list as possible. To accomplish this, the code frequently alternates between generic OS-agnostic functions and OS-specific implementations. When system/process fetches a list of processes, the steps are roughly as follows (see the sketch after this list):

  • A call to Get() from an external library, such as metricbeat
    • Calls an OS-specific FetchPids() method. This fetches and iterates over the PIDs using whatever OS-specific methods are available, then for each PID:
      • Calls pidFill(), an OS-agnostic function that:
        • Calls an OS-specific GetInfoForPid function that fills out basic process info (name and PID at minimum)
        • Decides if the process has been filtered out by user regex
        • Calls an OS-specific FillPidMetrics() function that fetches the rest of the process metrics
        • Calculates percentages and cgroup metrics if applicable
      • Appends the completed process entry to a hashmap (for internal tracking) and an array (for reporting)
    • Sorts and reduces the returned list of PIDs based on the user-supplied top setting
    • Transforms the remaining metrics to a list of MapStr types and returns
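
A minimal, self-contained sketch of that control flow is below. The function names mirror the list above, but the types, signatures, filtering, and CPU math are simplified stand-ins rather than the actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// ProcState is a trimmed-down sketch of one process's metrics.
type ProcState struct {
	PID     int
	Name    string
	CPUTime uint64  // cumulative counter from the OS
	CPUPct  float64 // derived between fetches
}

// Stats holds user config plus the previous fetch, which is needed to turn
// cumulative counters into percentages.
type Stats struct {
	filters   []*regexp.Regexp
	topN      int
	prevProcs map[int]ProcState
}

// --- stand-ins for the OS-specific layer ---------------------------------

// fetchPids would walk /proc on Linux, call sysctl on Darwin, and so on.
func fetchPids() []int { return []int{1, 42, 1337} }

// getInfoForPid fills out basic process info (name and PID at minimum).
func getInfoForPid(pid int) ProcState {
	return ProcState{PID: pid, Name: fmt.Sprintf("proc-%d", pid)}
}

// fillPidMetrics fetches the rest of the metrics for one process.
func fillPidMetrics(p ProcState) ProcState { p.CPUTime = uint64(p.PID) * 100; return p }

// --- OS-agnostic layer ----------------------------------------------------

// pidFill gathers everything for a single PID and reports whether the
// process survived the user's regex filter.
func (s *Stats) pidFill(pid int) (ProcState, bool) {
	p := getInfoForPid(pid)
	for _, f := range s.filters {
		if !f.MatchString(p.Name) {
			return p, false
		}
	}
	p = fillPidMetrics(p)
	if prev, ok := s.prevProcs[pid]; ok {
		p.CPUPct = float64(p.CPUTime-prev.CPUTime) / 100 // toy delta, not the real formula
	}
	return p, true
}

// Get is the entry point a caller such as metricbeat would use.
func (s *Stats) Get() []ProcState {
	procMap := map[int]ProcState{}
	var procList []ProcState
	for _, pid := range fetchPids() {
		if p, ok := s.pidFill(pid); ok {
			procMap[pid] = p
			procList = append(procList, p)
		}
	}
	s.prevProcs = procMap
	// Sort and reduce to the user-supplied "top" setting.
	sort.Slice(procList, func(i, j int) bool { return procList[i].CPUTime > procList[j].CPUTime })
	if s.topN > 0 && len(procList) > s.topN {
		procList = procList[:s.topN]
	}
	return procList
}

func main() {
	s := &Stats{
		filters:   []*regexp.Regexp{regexp.MustCompile(".*")},
		topN:      2,
		prevProcs: map[int]ProcState{},
	}
	fmt.Println(s.Get())
}
```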

This contains no user-facing changes.

Why is it important?

This closes #29796 and represents a major part of the ongoing cleanup of system code and further deprecation of external libraries.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Manually run + check integration tests on Linux, macOS, and Windows

How to test this PR locally

  • Pull down PR
  • Build
  • Run metricbeat with system/process, change various flags, and make sure everything works and the data looks correct (an example module config is sketched below)
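
For the last step, a metricbeat module config along these lines exercises the process metricset; the values are just examples to vary while testing, and the exact option names should be checked against the system module docs:

```yaml
- module: system
  period: 10s
  metricsets:
    - process
  processes: ['.*']        # user regex filter
  process.include_top_n:   # the "top" reduction described in the PR summary
    by_cpu: 5
    by_memory: 5
```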

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent-Data-Plane, 8.2-candidate, and backport-v8.2.0 (automated backport with mergify) labels on Jan 27, 2022
@fearful-symmetry fearful-symmetry self-assigned this Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot added and then removed the needs_team label on Jan 27, 2022
@botelastic

botelastic bot commented Jan 27, 2022

This pull request doesn't have a Team:<team> label.

@elasticmachine
Collaborator

elasticmachine commented Jan 27, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-02-11T18:31:10.336+0000

  • Duration: 135 min 7 sec

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 47686
  • Skipped: 4293
  • Total: 51979

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Member

@cmacknz cmacknz left a comment

This looks good to me; the new changes are definitely easier to follow than what was there before.

There is so much platform-specific code in here that testing will probably do a better job of validating this than code review.

@fearful-symmetry
Contributor Author

Alright, whenever I try running the Windows tests on my local machine, they're incredibly flaky, and I'm not sure if I just somehow made them more flaky.


@unittest.skipUnless(re.match("(?i)linux|darwin|freebsd", sys.platform), "os")
Member

Having to disable tests for Windows in a big refactor like this concerns me a little bit if we don't know what the root cause is. Can we prove that the tests are equally flaky on main?

Contributor Author

Yah, I'm super baffled by this, and I don't have enough experience with the testing setup or Windows. I can run them on my Windows boxes and the tests will pass, but the entire suite of unit tests, not just the ones impacted by the refactor, is super flaky on any Windows machine I test. The tests I commented out on Windows are flaky because there's a small chance that the parallel unit tests will return "wrong" events that don't have the full processor data.

I assume I can fix this by refactoring the Python tests to add a few filter functions, but I'm currently not sure if that's worth it, considering these tests are 6 years old.

Member

There is nothing about Windows in particular that should cause flakiness. If you are testing on older hardware, the slightly slower execution speed can sometimes reveal timing issues that result in flakiness.

I wouldn't focus on the age of the tests but on whether they are actually providing valuable test coverage. If they are, it is worth trying to improve them (or just replace them with something better, like an equivalent Go test). If they aren't testing anything useful, we could remove them.

Contributor Author

Yah, the flakiness is really odd, considering it's running on a fairly nice Windows PC, but I'm still seeing inconsistent timeouts. I think it's worth trying to improve the tests, but they need enough cleanup and additions that I'm wondering if I should do it in a separate PR.

@fearful-symmetry
Contributor Author

For anyone watching, I'm still tweaking the Windows tests. I did a little more testing on Darwin and different Linux distros; however, we'll need QA involvement for the upcoming release, since the diff here is so huge. I'm also tempted to open an issue for refactoring the unit tests, since they are huge and could use some basic improvements.

@ph
Contributor

ph commented Feb 10, 2022

thanks @fearful-symmetry for working on this!

@fearful-symmetry
Contributor Author

Update regarding CI test failures: the Python unit tests are weirdly probabilistic, and something about the Windows CI environment in particular makes it more prone to timeouts and odd events that require additional filtering logic.

I'm tweaking things to get Windows CI to pass, but part of the issue is that post-refactor, the code is much more strict about not reporting fields that don't technically exist, so the Python testing framework's fast-and-loose approach doesn't work as well.

In parallel, I'm working on another PR to clean up the tests (my Python linter lights up the entire codebase, it's wild) and hopefully add some new features to make them more deterministic.

@fearful-symmetry fearful-symmetry merged commit 48868e6 into elastic:main Feb 14, 2022
v1v added a commit to v1v/beats that referenced this pull request Feb 21, 2022
…into feature/use-with-kind-k8s-env

* 'feature/use-with-kind-k8s-env' of github.com:v1v/beats: (52 commits)
  ci: home is declared within withBeatsEnv
  ci: use withKindEnv step
  ci: use getBranchesFromAliases and support next-patch-8 (elastic#30400)
  Update fields.yml (elastic#29609)
  Heartbeat: fix browser metrics and trace mappings (elastic#30258)
  Apply light edits to 8.0 changelog (elastic#30351)
  packetbeat/beater: make sure Npcap installation runs before interfaces are needed (elastic#30396)
  Add a ring-buffer reporter to libbeat (elastic#28750)
  Osquerybeat: Add install verification for osquerybeat (elastic#30388)
  update windows matrix support (elastic#30373)
  Refactor of metricbeat process-gathering metrics and system/process (elastic#30076)
  adjust next changelog wording (elastic#30371)
  [Metricbeat] azure: move event report into loop validDim loop (elastic#29945)
  fix: report GitHub Check before the cache (elastic#30372)
  Add support for non-unique keys in Kafka output headers (elastic#30369)
  ci: 6 major branch reached EOL (elastic#30357)
  reduce Elastic Agent shut down time by stopping processes concurrently (elastic#29650)
  [Filebeat] Add message to register encode/decode debug logs (elastic#30271)
  [libbeat] kafka message header support (elastic#29940)
  Heartbeat: set duration to zero for syntax errors (elastic#30227)
  ...
@dikshachauhan-qasource

Hi @fearful-symmetry

We have performed testing of the system integration's "Process" event dataset on various operating systems and found it working fine. Below are our observations:

Platforms tested: macOS 12, Ubuntu 20, CentOS 7.3, SLES 15, and Windows 10

  • The Agent on each operating system is able to stream data for the system integration under the data stream tab for all metrics; the "process" event dataset was validated in particular.
  • After toggling event collection off and back on under the system integration, data continues to stream once resumed.
  • Restarted the Agent service and validated that data streaming continued.
  • Validated data under the Discover tab.


Build Details:
Version: v8.2.0_Snapshot
Build: 51077
Commit: 595b6948a1492a04fff14c8d6b2f8b50308681c0
Artifact used: https://snapshots.elastic.co/8.2.0-304d6052/summary-8.2.0-SNAPSHOT.html

Please let us know if we are missing anything here to validate.

Thanks
QAS
