Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make otel scheduler sync #2262

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Make otel scheduler sync #2262

wants to merge 2 commits into from

Conversation

wildum
Copy link
Contributor

@wildum wildum commented Dec 11, 2024

The fix in #2027 did not address all the race conditions that could happen with the scheduler. We noticed it when getting rid of our Otel fork brought back the flakiness of a batch processor test.

With this change, the scheduling of otel components is not an async operation anymore. This simplifies the logic and ensures that the components are started before they can start consuming.

@wildum wildum requested a review from a team as a code owner December 11, 2024 16:50
Copy link
Contributor

@ptodev ptodev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTel contract states that Start must return quickly, so I suppose sync would be fine. Do we know if the Collector does this async or sync? It'd be best to just do what it does, just to be on the safe side.

Regarding the issue with the batch processor - does it happen because something is wrong on the OTel side, or because #2027 maybe doesn't work for some edge case?

@ptodev
Copy link
Contributor

ptodev commented Dec 11, 2024

It might be good for @thampiotr to opine here, given that this PR reverts changes in #2027.

@wildum
Copy link
Contributor Author

wildum commented Dec 12, 2024

The OTel contract states that Start must return quickly, so I suppose sync would be fine. Do we know if the Collector does this async or sync? It'd be best to just do what it does, just to be on the safe side.

It's done sync in the otel collector. The service starts everything sync (https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/service.go#L230) and it's called sync here (https://github.com/open-telemetry/opentelemetry-collector/blob/main/otelcol/collector.go#L228)

Regarding the issue with the batch processor - does it happen because something is wrong on the OTel side, or because #2027 maybe doesn't work for some edge case?

Piotr and I had a discussion about it here: https://raintank-corp.slack.com/archives/C02GSU8SHBN/p1733846257348439
TLDR is that on Update we will create a new underlying unstarted Otel component. Because we start it async, there is always a window where the lazy consumer is unpaused and the unstarted component might start consuming before being started.

Copy link
Contributor

@ptodev ptodev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a changelog entry? If it's a problem end users could see, then mentioning the error in the changelog could indicate to them that they shouldn't see it again.

@thampiotr
Copy link
Contributor

It might be good for @thampiotr to opine here, given that this PR reverts changes in #2027.

It doesn't revert all the changes, it's reusing the approach.

Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I like it more. Thanks for fixing @wildum 🚀

@wildum wildum mentioned this pull request Dec 12, 2024
3 tasks
// A message is already queued for refreshing running components so we
// don't have to do anything here.
}
level.Debug(cs.log).Log("msg", "scheduling components", "count", len(cc))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
level.Debug(cs.log).Log("msg", "scheduling components", "count", len(cc))
level.Debug(cs.log).Log("msg", "scheduling otelcol components", "count", len(cc))

It's quite an ambiguous log line, but if you think there is value in keeping it I'm happy to have it as a debug log.

* Don't start components until Run is called

* Update consumers after stopping the component

* Minor fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants