Fix race condition in otelcol component wrappers #2027

Merged
thampiotr merged 3 commits into main from update-otel-112-fix-race
Nov 15, 2024

@thampiotr thampiotr commented Nov 4, 2024

PR Description

We have a general issue with OTel components: consumers may be used before the OTel Start functions have finished running. In OTel, Start functions are non-blocking and sometimes do setup work, as was the case for batch_processor. In Alloy, however, we have a Run function that blocks for the lifetime of the component, and as soon as it is called we consider the component running. In OTel, by contrast, Start must be called and return before a component is considered running.

The solution here is to pause the Consumer until we are sure that the OTel component scheduler has called Start on all OTel components. While paused, the consumer blocks any attempt to feed data to it.

Which issue(s) this PR fixes

Notes to the Reviewer

thampiotr@MacWork ~/w/alloy (update-otel-112-fix-race)> go test ./internal/component/otelcol/internal/scheduler/ -count 50 -race
ok  	github.com/grafana/alloy/internal/component/otelcol/internal/scheduler	3.543s
thampiotr@MacWork ~/w/alloy (update-otel-112-fix-race)> go test ./internal/component/otelcol/internal/lazyconsumer/ -count 50 -race
ok  	github.com/grafana/alloy/internal/component/otelcol/internal/lazyconsumer	17.495s

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thampiotr thampiotr force-pushed the update-otel-112-fix-race branch from f07ea86 to 122a517 Compare November 4, 2024 17:47
@thampiotr thampiotr marked this pull request as ready for review November 5, 2024 10:10
@thampiotr thampiotr requested a review from a team as a code owner November 5, 2024 10:10
Base automatically changed from update-otel-112 to main November 5, 2024 10:20

@wildum wildum left a comment

Thanks for working on this, I think that's a step in the right direction.
The logic in the scheduler was not super clear to me at first. The idea is that the components are created in the paused state, so they are not paused on the first run, but they are always resumed after being started because they are paused when the scheduler stops running, right? Maybe we could have this explanation in the scheduler code.

@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Nov 5, 2024

@clayton-cornell clayton-cornell left a comment

Just some edits to align with the same changes in #2012. These might also be "fixed" when the PR conflicts are resolved?

@thampiotr thampiotr marked this pull request as draft November 5, 2024 17:36
@thampiotr thampiotr force-pushed the update-otel-112-fix-race branch from c7891f6 to 3a1524d Compare November 7, 2024 15:49
@thampiotr thampiotr force-pushed the update-otel-112-fix-race branch from c2d75e8 to 7d84754 Compare November 7, 2024 15:51
@thampiotr thampiotr removed the type/docs Docs Squad label across all Grafana Labs repos label Nov 12, 2024
@thampiotr thampiotr marked this pull request as ready for review November 12, 2024 12:51
@thampiotr thampiotr changed the title from "Fix race condition in otelcol processors" to "Fix race condition in otelcol component wrappers" Nov 12, 2024

@wildum wildum left a comment

LGTM, did you test it with the version of the batch processor that was causing the flaky tests?

@thampiotr

> LGTM, did you test it with the version of the batch processor that was causing the flaky tests?

Yes. There is a test added in batch_processor that fails if the changes here are removed.

@thampiotr thampiotr merged commit f24c2f9 into main Nov 15, 2024
18 checks passed
@thampiotr thampiotr deleted the update-otel-112-fix-race branch November 15, 2024 11:50
vaxvms pushed a commit to vaxvms/alloy that referenced this pull request Nov 20, 2024
* Fix race condition in otelcol processors

* fix issues

* Test the pausing scheduler

---------

Co-authored-by: William Dumont <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 17, 2024