Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix thread safety in process code #43

Merged
merged 5 commits into from
Jul 25, 2022

Conversation

fearful-symmetry
Copy link
Contributor

@fearful-symmetry fearful-symmetry commented Jul 22, 2022

What does this PR do?

This fixes elastic/beats#32467

This code originates from metricbeat, which never worried too much about threadsafety, since a single metricset just lives in its own thread. However, the code has carried over to beat's self-monitoring in report/. A previous PR fixed a bug where GetSelf() would not calculate percentages, as it never accessed its own internal map of processes that it uses to calculate percentages. However, setup.go was creating a single instance of the process Stat struct to send to multiple function callbacks in the monitoring, hence a concurrent map access.

Wanted to get this PR up fast, still testing on non-linux systems.

Why is it important?

Potential concurrent hashmap access issue.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added an entry in CHANGELOG.md
  • Test on darwin
  • Test on Windows

@fearful-symmetry fearful-symmetry added bug Something isn't working Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team v8.3.0 labels Jul 22, 2022
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner July 22, 2022 17:51
@fearful-symmetry fearful-symmetry self-assigned this Jul 22, 2022
@fearful-symmetry fearful-symmetry requested review from rdner and leehinman and removed request for a team July 22, 2022 17:51
@cmacknz cmacknz self-requested a review July 22, 2022 17:52
@cmacknz
Copy link
Member

cmacknz commented Jul 22, 2022

To unblock 8.3.3 as safely as possible we are going to downgrade the system metrics package in affected versions: elastic/beats#32467 (comment)

We will merge this to beats main and 8.3 after the 8.3.3 release completes (really after the next build candidate is generated).

@cmacknz
Copy link
Member

cmacknz commented Jul 22, 2022

Is there a test we can write to prove this is thread safe? Maybe the test we need should be added in beats?

@elasticmachine
Copy link
Collaborator

elasticmachine commented Jul 22, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-07-25T18:09:13.223+0000

  • Duration: 9 min 58 sec

Test stats 🧪

Test Results
Failed 0
Passed 173
Skipped 3
Total 176

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@fearful-symmetry
Copy link
Contributor Author

Is there a test we can write to prove this is thread safe? Maybe the test we need should be added in beats?

@cmacknz Yah, I was wondering about that. If anything, I think the test should go in the report library, since it doesn't have any tests, and that's where the threaded code actually originated.

@cmacknz
Copy link
Member

cmacknz commented Jul 22, 2022

Yah, I was wondering about that. If anything, I think the test should go in the report library, since it doesn't have any tests, and that's where the threaded code actually originated.

Should we just document that this code isn't thread safe and put the locking in beats? I haven't looked at that code yet so I don't know how hard it is, but it seems a bit weird for the fix and the test to live in two different places.

@fearful-symmetry
Copy link
Contributor Author

Should we just document that this code isn't thread safe and put the locking in beats?

Well, theoretically it should be threadsafe now. The reporting/monitoring code that actually creates and spins up per-thread callbacks are in setup.go and elastic-agent-libs/reporting, so any further thread safety fixes would have to go there.

If we wanted to be extra paranoid, we could have just have each monitoring callback function have its own instance of Stat, but if we're confident this code is thread safe that seems unnecessary.

@fearful-symmetry
Copy link
Contributor Author

FWIW, a test on the monitoring code would almost be more an integration test, since we're hitting both the monitoring subsystem and any metrics code attached to it.

@fearful-symmetry
Copy link
Contributor Author

Added a small test to the report library. System metrics like this are always a tad awkward to write tests for, since a lot of metrics will just be context/OS dependent.

@cmacknz
Copy link
Member

cmacknz commented Jul 25, 2022

Needs a CHANGELOG entry :)

@LeeDr
Copy link

LeeDr commented Aug 11, 2022

fearful-symmetry merged commit c8f77ef into elastic:main 17 days ago

This commit may have been made to the 8.3 branch, but it was made well after 8.3.0 was released so I think this should have made it into the 8.3.3 release and should be labeled with that.

@cmacknz cmacknz added v8.4.0 and removed v8.3.0 labels Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic from concurrent map write after system metrics package update
4 participants