Fix BeatV2Manager to configure inputs and set log level #34066

Merged: 14 commits, Dec 22, 2022

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented Dec 16, 2022

What does this PR do?

This refactors the BeatV2Manager so that it works correctly with the Elastic Agent V2 model of components/units. The log level is now computed from the defined units and applied with logp.SetLevel.
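
As a rough illustration of computing one log level from several units, here is a minimal sketch. The `logLevel` and `unit` types and the `calcLogLevel` helper are hypothetical stand-ins, not the types used in the PR, and the "most verbose unit wins" policy is an assumption made for the example.

```go
package main

// logLevel is a hypothetical stand-in for the log level a unit carries.
// Larger values are more verbose.
type logLevel int

const (
	levelError logLevel = iota
	levelWarn
	levelInfo
	levelDebug
)

// unit is a hypothetical, simplified unit; only its expected log level
// matters for this sketch.
type unit struct {
	id    string
	level logLevel
}

// calcLogLevel returns the most verbose level requested by any unit, or
// fallback when no units are defined. The "most verbose wins" policy is an
// assumption for illustration.
func calcLogLevel(units []unit, fallback logLevel) logLevel {
	if len(units) == 0 {
		return fallback
	}
	level := levelError
	for _, u := range units {
		if u.level > level {
			level = u.level
		}
	}
	return level
}
```

In the PR the computed level is applied with logp.SetLevel; that call is left out here to keep the sketch free of dependencies.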

Why is it important?

Previously, when the Elastic Agent added a new unit to the Beat, the BeatV2Manager acted incorrectly: it replaced the already existing unit with the new unit's configuration instead of merging the configurations so that the Beat would operate with 2 inputs. Previously the log level was also not settable through the V2 control protocol; this now works.
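
To make the replace-versus-merge distinction concrete, here is a minimal sketch of tracking units by ID so that adding a second unit leaves the first one in place. `unitTracker`, `unitConfig`, and their methods are hypothetical names for illustration and do not mirror the actual BeatV2Manager code.

```go
package main

import "sort"

// unitConfig is a hypothetical stand-in for the configuration a unit carries.
type unitConfig struct {
	inputType string
}

// unitTracker keeps every unit keyed by its ID, so a newly added unit never
// overwrites a different, already existing unit.
type unitTracker struct {
	units map[string]unitConfig
}

func newUnitTracker() *unitTracker {
	return &unitTracker{units: make(map[string]unitConfig)}
}

// upsert adds a new unit or updates the unit that already has this ID.
func (t *unitTracker) upsert(id string, cfg unitConfig) {
	t.units[id] = cfg
}

// remove drops one unit and reports whether any units remain, so the caller
// can keep running while other units still exist.
func (t *unitTracker) remove(id string) bool {
	delete(t.units, id)
	return len(t.units) > 0
}

// inputs returns the input types of all tracked units, ordered by unit ID.
func (t *unitTracker) inputs() []string {
	ids := make([]string, 0, len(t.units))
	for id := range t.units {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		out = append(out, t.units[id].inputType)
	}
	return out
}
```

With two units added, `inputs()` returns both input types, and `remove` only reports an empty tracker once the last unit is gone.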

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@blakerouse self-assigned this Dec 16, 2022
botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Dec 16, 2022
@mergify
Contributor

mergify bot commented Dec 16, 2022

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @blakerouse? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@blakerouse added the backport-v8.6.0 label (Automated backport with mergify) Dec 16, 2022
@blakerouse
Contributor Author

I am keeping this in draft until I get the unit tests done, but I have tested this manually with the Elastic Agent and it is working correctly. I am still not comfortable landing this until I have unit tests to cover it. I wanted to get it up early so others can review and test it as well.

@elasticmachine
Collaborator

elasticmachine commented Dec 16, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-12-22T15:20:44.130+0000

  • Duration: 105 min 37 sec

Test stats 🧪

Test Results
Failed 0
Passed 25155
Skipped 1954
Total 27109

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Contributor

@fearful-symmetry left a comment

A few comments/questions, mostly about overall logic.

x-pack/libbeat/management/managerV2.go
units map[string]*client.Unit
mainUnit string
// track individual units given to us by the V2 API
mx sync.Mutex
Contributor

nit: the above comment makes it look like this mutex is just used for controlling access to the hashmap, but it's used in other places. Maybe rename it or move it to the top of the struct?

Contributor Author

Yeah, I can move it. I originally had multiple mutexes for different parts, but then I would still need to grab the unitsMx, so I changed it to a single mutex because that was safer than risking a deadlock from lock ordering.

x-pack/libbeat/management/managerV2.go
// `reload` method and will be marked stopped in that code path)
continue
}
err := unit.UpdateState(status, message, payload)
Contributor

Not sure why we're manually looping over the unit UpdateState here? Is there a reason why we just can't call the global UpdateStatus or something similar?

Contributor Author

The global UpdateStatus will grab the mutex for units; to prevent that, we call this logic directly on the units that the reload just processed. We also would not want to update the status of units that might have been added to cm.units in the meantime and that differ from the units actually being processed in the reload (because they were passed in).

cm.deleteUnit(change.Unit)
}
case <-cm.reloadCh:
Contributor

I'm not sure why there's so much logic involved in the reloading process? triggerReload() and reload() are both only called in this for block, it might just be simpler to call reload() in a goroutine or something? Also, why is reload() getting its own map of the units and not just using cm.units?

Contributor Author

This is because performing a reload might result in the config logic of the Beat calling UpdateStatus, which will then grab the mutex for the units.

So to ensure that a deadlock doesn't occur, we ensure that the actual reload logic is performed in the main loop of the manager, but not in a path that would hold that mutex. That is why a copy is sent to the reload function. Each client.Unit has an internal lock for its state, so it is also safe to have a pointer to the same unit.
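
The copy-then-reload idea described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `manager`, `unit`, `snapshotUnits`, and the `apply` callback are hypothetical, and only the locking pattern is the point.

```go
package main

import "sync"

// unit is a hypothetical stand-in for *client.Unit.
type unit struct {
	id string
}

// manager is a hypothetical, stripped-down manager guarded by a single mutex.
type manager struct {
	mx     sync.Mutex
	units  map[string]*unit
	status string
}

// UpdateStatus models the call that the Beat's config logic may make while a
// reload is in progress. It takes the manager mutex.
func (m *manager) UpdateStatus(status string) {
	m.mx.Lock()
	defer m.mx.Unlock()
	m.status = status
}

// snapshotUnits copies the unit map while holding the mutex. The copy shares
// the unit pointers, which is fine as long as each unit guards its own state.
func (m *manager) snapshotUnits() map[string]*unit {
	m.mx.Lock()
	defer m.mx.Unlock()
	cp := make(map[string]*unit, len(m.units))
	for id, u := range m.units {
		cp[id] = u
	}
	return cp
}

// reload works on the snapshot and never holds m.mx, so apply is free to
// call back into UpdateStatus without deadlocking.
func (m *manager) reload(units map[string]*unit, apply func(*unit)) {
	for _, u := range units {
		apply(u)
	}
}
```

If `reload` held `m.mx` while calling `apply`, a callback into `UpdateStatus` would block forever, because `sync.Mutex` is not reentrant.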

Contributor

When you mention a deadlock starting with UpdateStatus I assume you mean the Reload() call blocking while the beat waits for UpdateStatus?

I wonder if we can do something like have addUnit() return a new copy of the unit map that's sent to reload(), so we can avoid the extra reloadCh? If not, can you at least add a comment explaining the loop? It was a tad confusing at first.

Member

Can you add a comment explaining the potential for deadlock here so it's obvious the next time we read this code?

Contributor Author

I have added a comment and also added a debounce that consolidates unit changes into a single reload. It makes the code simpler to understand as well.
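
For readers unfamiliar with the pattern, a debounce that folds a burst of unit changes into one reload could look roughly like this. It is a minimal sketch under assumptions: the `debounce` and `demoDebounce` functions, the channel layout, and the 100 ms window are all made up for the example and are not taken from the PR.

```go
package main

import "time"

// debounce calls reload once per burst of signals on changes: every signal
// restarts a quiet-period timer, and reload runs only when that timer expires.
// When changes is closed, a still-pending reload is flushed before returning.
func debounce(changes <-chan struct{}, window time.Duration, reload func()) {
	var timer *time.Timer
	var fire <-chan time.Time // nil (never ready in select) while nothing is pending
	for {
		select {
		case _, ok := <-changes:
			if timer != nil {
				timer.Stop()
			}
			if !ok {
				if fire != nil {
					reload()
				}
				return
			}
			// A fresh timer per change sidesteps draining the old channel.
			timer = time.NewTimer(window)
			fire = timer.C
		case <-fire:
			fire = nil
			reload()
		}
	}
}

// demoDebounce sends three unit changes back to back and returns how many
// reloads they caused.
func demoDebounce() int {
	changes := make(chan struct{})
	fired := make(chan struct{}, 4)
	done := make(chan struct{})
	reloads := 0
	go func() {
		defer close(done)
		debounce(changes, 100*time.Millisecond, func() {
			reloads++
			fired <- struct{}{}
		})
	}()
	for i := 0; i < 3; i++ {
		changes <- struct{}{}
	}
	<-fired // wait for the debounced reload to run
	close(changes)
	<-done
	return reloads
}
```

Because the three changes arrive well inside the quiet period, `demoDebounce` is expected to trigger a single reload rather than three.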

// now update the statuses of all units
cm.mx.Lock()
status := getUnitState(cm.status)
message := cm.message
Contributor

What's the reasoning behind using the global message? Seems like we should have a status message specifically mentioning the reload operation?

Contributor Author

This is basically setting the units to Healthy, or whatever Beats has set with UpdateStatus for the current state of the Beat. The state is global because Beats doesn't have a way of passing each unit to each running input and having it handle its own state.

@blakerouse
Contributor Author

Okay, I think this is ready to be merged. I have added unit tests, answered all questions, updated comments, and added debounce to unit changes. I also merged #34049 into this PR, so it's already present and ready to go once this lands.

@blakerouse marked this pull request as ready for review December 20, 2022 20:53
@blakerouse requested review from a team as code owners December 20, 2022 20:53
@blakerouse requested review from aleksmaus and removed request for a team December 20, 2022 20:53
Member

@cmacknz left a comment

A few new questions, but generally this looks good to me.

We will need a matching change to the Beat spec files to:

  1. Enable the output_restart/restart_on_output_change option for each Beat.
  2. Restore the default log level to INFO.

x-pack/libbeat/management/managerV2.go (outdated)
x-pack/libbeat/management/managerV2.go
x-pack/libbeat/management/managerV2.go
x-pack/libbeat/management/managerV2.go
x-pack/libbeat/management/config.go (outdated)
x-pack/libbeat/management/managerV2_test.go
x-pack/libbeat/management/managerV2_test.go (outdated)
@cmacknz added the Team:Elastic-Agent label (Label for the Agent team) Dec 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Dec 21, 2022
@cmacknz
Member

cmacknz commented Dec 21, 2022

@axw and @oren-zohar you will want to make sure you pick up this change to libbeat in APM server and Cloudbeat when it merges. The major highlights of what this addresses are:

  1. Allow changing the log level of the logp logger dynamically in response to configuration changes from the Elastic agent. The previous behaviour where the agent knew to restart Beats when the log level changed has been removed.
  2. Fixes several potential bugs in how input configurations are reloaded dynamically, including avoiding a situation where a Beat with multiple units would incorrectly shut down when only one of them was removed.
  3. Restores an old workaround for dynamic output configuration reloads not working in libbeat, by automatically restarting the Beat to have the change take effect. See elastic-agent#1913 ("Beats no longer restart automatically when the output configuration changes"). This does not apply to APM server.

Member

@cmacknz left a comment

Looks good, thanks for the extra testing!

@blakerouse
Contributor Author

/test

@sonarqubecloud

SonarCloud Quality Gate failed.

Bug: A, 0 Bugs
Vulnerability: A, 0 Vulnerabilities
Security Hotspot: A, 0 Security Hotspots
Code Smell: A, 6 Code Smells

No Coverage information
3.9% Duplication

@blakerouse merged commit 15d9a87 into elastic:main Dec 22, 2022
mergify bot pushed a commit that referenced this pull request Dec 22, 2022
* Refactor the V2 manager.

* Add debounce to unit changes.

* add stop functionality for output config changes

* Add tests.

* Fix typo.

* Fix code review, add more to the test.

* Re-order the processor injection so proper order is maintained.

* Fix unit tests.

* Copy global processors per stream to ensure that multiple streams don't get the same slice.

Co-authored-by: Alex Kristiansen <[email protected]>
(cherry picked from commit 15d9a87)
blakerouse added a commit that referenced this pull request Dec 22, 2022
Co-authored-by: Blake Rouse <[email protected]>
@cmacknz
Member

cmacknz commented Dec 22, 2022

@axw and @oren-zohar reminder ping now that this has merged that you will want to pick up this change in libbeat on both main and 8.6.

You will also want to ensure that the elastic-agent-client version is bumped to https://github.com/elastic/elastic-agent-client/releases/tag/v7.0.3 in your go.mod files if the libbeat update does not do this automatically.

@axw
Member

axw commented Dec 23, 2022

Thanks @cmacknz, apm-server main and 8.6 are both updated, and I've confirmed that elastic-agent-client was bumped to 7.0.3.

@oren-zohar oren-zohar mentioned this pull request Dec 25, 2022
2 tasks
@oren-zohar
Contributor

Thanks @cmacknz @blakerouse, cloudbeat was updated and I can confirm this fix works as expected 🚀

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Labels
backport-v8.6.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Projects
None yet
6 participants