Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Target Allocator not saving targets when collector instances take time to come up #2351

Merged

Conversation

rashmichandrashekar
Copy link
Contributor

Description:
Fixing a bug - In a kubernetes cluster, have scrape jobs configured using TargetAllocator. Due to timing issue(which is very common with kubernetes clusters, where no ordering can be guaranteed), the instances are not discovered until after the first set of targets are discovered. The configuration applied until the instances are discovered are lost.
This PR fixes that issue where it saves the targets until the collector instances come up

Link to tracking Issue:
#2350

Testing:
Tested that in such scenarios, the targets are not lost and assigned to collector instances when they come up

@swiatekm
Copy link
Contributor

Can you add a unit test for this case?

Also, does the same problem not apply to the other allocation strategy as well?

@rashmichandrashekar
Copy link
Contributor Author

Can you add a unit test for this case?

Also, does the same problem not apply to the other allocation strategy as well?

Thanks @swiatekm-sumo. I have made the changed for the other hashing algo as well and added a couple unit tests. please lmk if anything else is needed.

@swiatekm
Copy link
Contributor

Can you also check the case where you add targets to a strategy with 0 collectors, and later add collectors to see if targets get allocated to them correctly?

There's also a lint error you need to fix.

@rashmichandrashekar
Copy link
Contributor Author

Can you also check the case where you add targets to a strategy with 0 collectors, and later add collectors to see if targets get allocated to them correctly?

There's also a lint error you need to fix.

Thanks @swiatekm-sumo . That's a good test case, I have added that. Please take a look.

@rashmichandrashekar
Copy link
Contributor Author

@swiatekm-sumo - I see an e2e test failure due to timeouts during CR creation. Could you please rerun the tests? Thanks!

@rashmichandrashekar
Copy link
Contributor Author

@swiatekm-sumo - All the checks seem to have passed, could you pls help get this merged? Thanks!

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there should be a way to do this without all the special cases, but we can refactor it later now that we have tests for them. I don't think there's any point in keeping this PR held up on DRY issues, so I'm approving it.

Thank you for the contribution!

@rashmichandrashekar
Copy link
Contributor Author

Thanks @swiatekm-sumo! I dont have permissions to merge, could you pls merge this?
or maybe @jaronoff97 can help merge this?

@jaronoff97 jaronoff97 merged commit 181fefa into open-telemetry:main Nov 17, 2023
27 checks passed
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…take time to come up (open-telemetry#2351)

* change to account for targets discoverd before instances

* adding change log

* adding instance changed to leat weighted as well

* adding tests

* adding unit tests

* updaintg tests

* fixing lint error and updating tests and fixing bug in least weighted algo

* fix indent
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants