
Refactor of metricbeat process-gathering metrics and system/process #30076

Merged

Conversation

fearful-symmetry
Contributor

What does this PR do?

This is a total refactor of the code used to gather process metrics on metricbeat. This was done with a few goals:

  • Depend less on upstream libraries such as gosigar
  • Performance improvements
  • Reduce use of the MapStr type
  • Get rid of any dangling null/zero metrics by using the Opt library (see the sketch after this list)
  • Prepare for upcoming changes to metricset/input design by decoupling metrics-gathering code from the logic of metricbeat
  • Make hostfs handling consistent, no more global hostfs setting
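
To illustrate the Opt idea, here is a minimal, self-contained sketch; `OptUint`, `UintWith`, and the field names below are simplified stand-ins for illustration, not the actual Opt library API:

```go
package main

import "fmt"

// OptUint is a simplified stand-in for an optional metric: it carries a
// value plus whether that value was actually collected from the OS.
type OptUint struct {
	exists bool
	value  uint64
}

// UintWith marks a metric as present.
func UintWith(v uint64) OptUint { return OptUint{exists: true, value: v} }

// addIfExists only emits the field when it was actually read, so a metric
// that was never collected does not show up as a misleading 0.
func addIfExists(event map[string]interface{}, key string, v OptUint) {
	if v.exists {
		event[key] = v.value
	}
}

func main() {
	event := map[string]interface{}{}
	addIfExists(event, "memory.rss.bytes", UintWith(4096)) // collected
	addIfExists(event, "memory.swap.bytes", OptUint{})     // never read on this OS
	fmt.Println(event)                                      // map[memory.rss.bytes:4096]
}
```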

As of now, I haven't added any reviewers to this, as I'm still not done testing, and I imagine it's gonna take me a good day or two to just make CI happy. However, it's mostly done, and I'd at least like to have it up so CI can fail.

How the new code works

To improve performance, we try to make as few passes over the process list as possible. To accomplish this, the code frequently alternates between generic OS-agnostic functions and OS-specific implementations. When system/process fetches a list of processes, the steps are roughly as follows (see the sketch after this list):

  • A call to Get() from an external library, such as metricbeat
    • Calls an OS-specific FetchPids() method. This fetches and iterates over the PIDs using whatever OS-specific methods are available, then for each PID:
      • Calls pidFill(), an OS-agnostic function that:
        • Calls an OS-specific GetInfoForPid function that fills out basic process info (name and PID at minimum)
        • Decides if the process has been filtered out by user regex
        • Calls an OS-specific FillPidMetrics() function that fetches the rest of the process metrics
        • Calculates percentages and cgroup metrics if applicable
      • Appends the completed process entry to a hashmap (for internal tracking) and an array (for reporting)
    • Sorts and reduces the returned list of PIDs based on the user-supplied top setting
    • Transforms the remaining metrics to a list of MapStr types and returns
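
A minimal, self-contained sketch of that control flow is below. The function names mirror the list above, but the types, signatures, filtering, and CPU math are simplified stand-ins rather than the actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// ProcState is a trimmed-down sketch of one process's metrics.
type ProcState struct {
	PID     int
	Name    string
	CPUTime uint64  // cumulative counter from the OS
	CPUPct  float64 // derived between fetches
}

// Stats holds user config plus the previous fetch, which is needed to turn
// cumulative counters into percentages.
type Stats struct {
	filters   []*regexp.Regexp
	topN      int
	prevProcs map[int]ProcState
}

// --- stand-ins for the OS-specific layer ---------------------------------

// fetchPids would walk /proc on Linux, call sysctl on Darwin, and so on.
func fetchPids() []int { return []int{1, 42, 1337} }

// getInfoForPid fills out basic process info (name and PID at minimum).
func getInfoForPid(pid int) ProcState {
	return ProcState{PID: pid, Name: fmt.Sprintf("proc-%d", pid)}
}

// fillPidMetrics fetches the rest of the metrics for one process.
func fillPidMetrics(p ProcState) ProcState { p.CPUTime = uint64(p.PID) * 100; return p }

// --- OS-agnostic layer ----------------------------------------------------

// pidFill gathers everything for a single PID and reports whether the
// process survived the user's regex filter.
func (s *Stats) pidFill(pid int) (ProcState, bool) {
	p := getInfoForPid(pid)
	for _, f := range s.filters {
		if !f.MatchString(p.Name) {
			return p, false
		}
	}
	p = fillPidMetrics(p)
	if prev, ok := s.prevProcs[pid]; ok {
		p.CPUPct = float64(p.CPUTime-prev.CPUTime) / 100 // toy delta, not the real formula
	}
	return p, true
}

// Get is the entry point a caller such as metricbeat would use.
func (s *Stats) Get() []ProcState {
	procMap := map[int]ProcState{}
	var procList []ProcState
	for _, pid := range fetchPids() {
		if p, ok := s.pidFill(pid); ok {
			procMap[pid] = p
			procList = append(procList, p)
		}
	}
	s.prevProcs = procMap
	// Sort and reduce to the user-supplied "top" setting.
	sort.Slice(procList, func(i, j int) bool { return procList[i].CPUTime > procList[j].CPUTime })
	if s.topN > 0 && len(procList) > s.topN {
		procList = procList[:s.topN]
	}
	return procList
}

func main() {
	s := &Stats{
		filters:   []*regexp.Regexp{regexp.MustCompile(".*")},
		topN:      2,
		prevProcs: map[int]ProcState{},
	}
	fmt.Println(s.Get())
}
```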

This contains no user-facing changes.

Why is it important?

This closes #29796 and represents a major part of the ongoing cleanup of system code and further deprecation of external libraries.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Manually run + check integration tests on Linux, macOS, and Windows

How to test this PR locally

  • Pull down PR
  • Build
  • Run metricbeat with system/process, change various flags, and make sure everything works and the data looks correct (an example module config is sketched below)
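
For the last step, a metricbeat module config along these lines exercises the process metricset; the values are just examples to vary while testing, and the exact option names should be checked against the system module docs:

```yaml
- module: system
  period: 10s
  metricsets:
    - process
  processes: ['.*']        # user regex filter
  process.include_top_n:   # the "top" reduction described in the PR summary
    by_cpu: 5
    by_memory: 5
```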

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent-Data-Plane, 8.2-candidate, and backport-v8.2.0 (automated backport with mergify) labels on Jan 27, 2022
@fearful-symmetry fearful-symmetry self-assigned this Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot added and then removed the needs_team label on Jan 27, 2022
@botelastic

botelastic bot commented Jan 27, 2022

This pull request doesn't have a Team:<team> label.

@elasticmachine
Collaborator

elasticmachine commented Jan 27, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-02-11T18:31:10.336+0000

  • Duration: 135 min 7 sec

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 47686
  • Skipped: 4293
  • Total: 51979

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Member

@cmacknz cmacknz left a comment

This looks good to me; the new changes are definitely easier to follow than what was there before.

There is so much platform-specific code in here that testing will probably do a better job of validating this than code review.

@fearful-symmetry
Contributor Author

Alright, whenever I try running the Windows tests on my local machine, they're incredibly flaky, and I'm not sure if I just somehow made them more flaky.


@unittest.skipUnless(re.match("(?i)linux|darwin|freebsd", sys.platform), "os")
Member

Having to disable tests for Windows in a big refactor like this concerns me a little bit if we don't know what the root cause is. Can we prove that the tests are equally flaky on main?

Contributor Author

Yah, I'm super baffled by this, and I don't have enough experience with the testing setup or Windows. I can run them on my Windows boxes and the tests will pass, but the entire suite of unit tests, not just the ones impacted by the refactor, is super flaky on any Windows machine I test. The tests I commented out on Windows are flaky because there's a small chance that the parallel unit tests will return "wrong" events that don't have the full processor data.

I assume I can fix this by refactoring the Python tests to add a few filter functions, but I'm currently not sure if that's worth it, considering these tests are 6 years old.

Member

There is nothing about Windows in particular that should cause flakiness. If you are testing on older hardware, the slightly slower execution speed can sometimes reveal timing issues that result in flakiness.

I wouldn't focus on the age of the tests but on whether they are actually providing valuable test coverage. If they are, it is worth trying to improve them (or just replace them with something better, like an equivalent Go test). If they aren't testing anything useful, we could remove them.

Contributor Author

Yah, the flakiness is really odd, considering it's running on a fairly nice Windows PC, but I'm still seeing inconsistent timeouts. I think it's worth trying to improve the tests, but they need enough cleanup and additions that I'm wondering if I should do it in a separate PR.

@fearful-symmetry
Contributor Author

For anyone watching, I'm still tweaking the Windows tests. I did a little more testing on Darwin and different Linux distros; however, we'll need QA involvement for the upcoming release, since the diff here is so huge. I'm also tempted to open an issue for refactoring the unit tests, since they are huge and could use some basic improvements.

@ph
Contributor

ph commented Feb 10, 2022

thanks @fearful-symmetry for working on this!

@fearful-symmetry
Contributor Author

Update regarding CI test failures: the Python unit tests are weirdly probabilistic, and something about the Windows CI environment in particular makes it more prone to timeouts and odd events that require additional filtering logic.

I'm tweaking things to get Windows CI to pass, but part of the issue is that post-refactor, the code is much more strict about not reporting fields that don't technically exist, so the Python testing framework's fast-and-loose approach doesn't work as well.

In parallel, I'm working on another PR to clean up the tests (my Python linter lights up the entire codebase, it's wild) and hopefully add some new features to make them more deterministic.

@fearful-symmetry fearful-symmetry merged commit 48868e6 into elastic:main Feb 14, 2022
v1v added a commit to v1v/beats that referenced this pull request Feb 21, 2022
…into feature/use-with-kind-k8s-env

* 'feature/use-with-kind-k8s-env' of github.com:v1v/beats: (52 commits)
  ci: home is declared within withBeatsEnv
  ci: use withKindEnv step
  ci: use getBranchesFromAliases and support next-patch-8 (elastic#30400)
  Update fields.yml (elastic#29609)
  Heartbeat: fix browser metrics and trace mappings (elastic#30258)
  Apply light edits to 8.0 changelog (elastic#30351)
  packetbeat/beater: make sure Npcap installation runs before interfaces are needed (elastic#30396)
  Add a ring-buffer reporter to libbeat (elastic#28750)
  Osquerybeat: Add install verification for osquerybeat (elastic#30388)
  update windows matrix support (elastic#30373)
  Refactor of metricbeat process-gathering metrics and system/process (elastic#30076)
  adjust next changelog wording (elastic#30371)
  [Metricbeat] azure: move event report into loop validDim loop (elastic#29945)
  fix: report GitHub Check before the cache (elastic#30372)
  Add support for non-unique keys in Kafka output headers (elastic#30369)
  ci: 6 major branch reached EOL (elastic#30357)
  reduce Elastic Agent shut down time by stopping processes concurrently (elastic#29650)
  [Filebeat] Add message to register encode/decode debug logs (elastic#30271)
  [libbeat] kafka message header support (elastic#29940)
  Heartbeat: set duration to zero for syntax errors (elastic#30227)
  ...
@dikshachauhan-qasource

Hi @fearful-symmetry

We have performed testing of the system integration's "Process" event dataset on various operating systems and found it working fine. Below are our observations:

Platforms tested: macOS 12, Ubuntu 20, CentOS 7.3, SLES 15, and Windows 10

  • The Agent on each operating system is able to stream data for the system integration under the data stream tab for all metrics; the "process" event dataset was validated in particular.
  • After toggling event collection off and back on under the system integration, data continues to stream once resumed.
  • Restarted the Agent service and validated that data streaming continued.
  • Validated data under the Discover tab.


Build Details:
Version: v8.2.0_Snapshot
Build: 51077
Commit: 595b6948a1492a04fff14c8d6b2f8b50308681c0
Artifact used: https://snapshots.elastic.co/8.2.0-304d6052/summary-8.2.0-SNAPSHOT.html

Please let us know if we are missing anything here to validate.

Thanks
QAS
