
[target-allocator] Introduce "per node" allocation strategy to target allocator #2430

Merged: 38 commits into open-telemetry:main, Feb 9, 2024

Conversation

matej-g
Contributor

@matej-g matej-g commented Dec 8, 2023

Description:
Resolves #1828.

Introduces a new allocation strategy, "per node", that distributes targets to collectors based on the node on which the targets reside, i.e. each collector will be instructed to scrape the targets running on the collector's node.

This strategy is intended to be used only with a collector in daemonset mode (running as an agent on each node). It is not suitable for other modes.
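The core idea can be sketched in a few lines. This is a hypothetical, simplified illustration (the type and function names here are invented for this sketch; the real implementation lives in cmd/otel-allocator/allocation/per_node.go): build a node-to-collector index, then assign each target to the collector on its node.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Target struct {
	JobName  string
	NodeName string // node the target runs on, from service discovery labels
}

type Collector struct {
	Name     string
	NodeName string // node the DaemonSet pod is scheduled on
}

// allocatePerNode assigns each target to the collector running on the
// same node; targets on nodes without a collector stay unassigned.
func allocatePerNode(targets []Target, collectors []Collector) (map[string][]Target, []Target) {
	collectorByNode := make(map[string]string, len(collectors))
	for _, c := range collectors {
		collectorByNode[c.NodeName] = c.Name
	}
	assigned := make(map[string][]Target)
	var unassigned []Target
	for _, t := range targets {
		if name, ok := collectorByNode[t.NodeName]; ok {
			assigned[name] = append(assigned[name], t)
		} else {
			unassigned = append(unassigned, t)
		}
	}
	return assigned, unassigned
}

func main() {
	collectors := []Collector{{Name: "collector-a", NodeName: "node-1"}}
	targets := []Target{
		{JobName: "kubelet", NodeName: "node-1"},
		{JobName: "kubelet", NodeName: "node-2"},
	}
	assigned, unassigned := allocatePerNode(targets, collectors)
	fmt.Println(len(assigned["collector-a"]), len(unassigned)) // 1 1
}
```

Note how a target on a node with no collector cannot be assigned at all, which is why this strategy only makes sense for daemonset deployments.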

Link to tracking Issue: #1828

Testing:

  • Added unit tests
  • Also did manual testing

Documentation: Updated API documentation.

Signed-off-by: Matej Gera <[email protected]>
@matej-g matej-g requested review from a team December 8, 2023 13:47
@matej-g
Contributor Author

matej-g commented Dec 8, 2023

I also noticed we are reusing a lot of code between allocation strategies. I was thinking about adjusting the abstraction (perhaps the allocator does not need to be an interface; we could have a more specific interface that implements only the methods that differ across strategies). But I did not want to introduce too many changes at once.

Signed-off-by: Matej Gera <[email protected]>
@swiatekm
Contributor

I also noticed we are reusing a lot of code between allocation strategies. I was thinking about adjusting the abstraction (perhaps allocator does not need to be an interface, we could maybe have more specific interface that would implement only methods that differ across strategies). But I did not want to introduce too many changes at once.

Yeah, we can do that afterwards in a separate PR.

Contributor

@swiatekm swiatekm left a comment


The implementation looks good at first glance. Can you add an E2E test for it? I think it'd be easiest to check by collecting kubelet metrics via the following scrape config:

        - job_name: kubelet
          scheme: https
          authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          honor_labels: true
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: keep
              regex: "kubelet_running_pods"
              source_labels: [__name__]

Have a look at the existing E2E test for Prometheus CR for reference.

@matej-g
Contributor Author

matej-g commented Dec 18, 2023

Thanks for the pointer @swiatekm-sumo, I have added an E2E test and also reorganized the tests a bit so that the TA E2E tests can be run separately under a separate target (hopefully you don't mind 🙂).

Also regarding logging unassignable jobs / targets (#1828 (comment)), I tried to implement it in a way that will not lead to overly noisy logs, i.e. it avoids reporting on every target load and only reports on the diff.
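The diff-based reporting described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the reporter remembers the last observed count and logs only when it changes, so steady-state allocation passes stay silent.

```go
package main

import "fmt"

// reporter logs the number of unassignable targets only when the count
// changes between allocation passes (hypothetical sketch).
type reporter struct {
	lastUnassigned int
}

func (r *reporter) maybeLog(unassigned int, logf func(format string, args ...any)) {
	if unassigned != r.lastUnassigned {
		logf("unassignable targets: %d", unassigned)
		r.lastUnassigned = unassigned
	}
}

func main() {
	var logs []string
	logf := func(format string, args ...any) {
		logs = append(logs, fmt.Sprintf(format, args...))
	}
	r := &reporter{}
	r.maybeLog(3, logf) // logged: count changed from 0 to 3
	r.maybeLog(3, logf) // silent: no change since last pass
	r.maybeLog(0, logf) // logged: targets got assigned again
	fmt.Println(len(logs)) // 2
}
```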

Contributor

@jaronoff97 jaronoff97 left a comment


Overall I really like the improvement here. I have one suggestion that may help make it more efficient; interested in hearing your thoughts. Thank you!

cmd/otel-allocator/allocation/per_node.go
@swiatekm
Contributor

Thanks for the pointer @swiatekm-sumo, I have added an E2E test and also reorganized the tests a bit so that TA E2E test can be run separately under separate target (hopefully you don't mind 🙂).

Is there a reason to do this? If you only want the target allocator tests, you can use kuttl's --test flag to select them.

And if you really want to, you should make sure they're run in the CI as well.

@matej-g
Contributor Author

matej-g commented Dec 19, 2023

Is there a reason to do this? If you only want the target allocator tests, you can use kuttl's --test flag to select them.

And if you really want to, you should make sure they're run in the CI as well.

I think it makes sense with respect to how other test cases are organized, since we usually have different test directories for different components (OpAMP bridge, auto-instrumentation). But you're right, it was missing from the list in the action config; I fixed that now.

apis/v1alpha1/collector_webhook.go
@@ -190,12 +190,8 @@ func (allocator *perNodeAllocator) handleTargets(diff diff.Changes[*target.Item]

// Check for unassigned targets
if len(unassignedTargetsForJobs) > 0 {
Contributor


This can just become an integer that's incremented on line 186; I don't think there's a need for a map, unless we care about deduping (which should already happen in the filter strategy).

Contributor Author


I was wondering if there can be a scenario where we have multiple targets per job that cannot be assigned. But I see the point; I simplified this to just count the number of targets, which should hopefully give enough info when debugging.
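The simplification discussed here amounts to replacing per-job bookkeeping with a single count. A hypothetical before/after sketch (names invented for illustration, not the PR's actual identifiers):

```go
package main

import "fmt"

// item is a hypothetical stand-in for a discovered scrape target.
type item struct {
	jobName  string
	nodeName string
}

// countUnassigned returns how many targets sit on nodes without a
// collector. Previously the sketch would have recorded each job in a
// map (e.g. unassignedTargetsForJobs[it.jobName] = struct{}{}); a plain
// counter is enough for a debug log line.
func countUnassigned(items []item, collectorNodes map[string]bool) int {
	unassigned := 0
	for _, it := range items {
		if !collectorNodes[it.nodeName] {
			unassigned++
		}
	}
	return unassigned
}

func main() {
	items := []item{
		{jobName: "kubelet", nodeName: "node-1"},
		{jobName: "kubelet", nodeName: "node-2"},
	}
	// Only node-1 has a collector, so one target is unassigned.
	fmt.Println(countUnassigned(items, map[string]bool{"node-1": true})) // 1
}
```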

Contributor

@swiatekm swiatekm left a comment


Looks good overall. I've left some comments around trying to simplify the bookkeeping code in the strategy implementation, but I don't think any of it is necessarily blocking - if you'd like to clean it up in a follow-up PR, I'm ok with that as well.

cmd/otel-allocator/allocation/per_node.go (multiple review threads)
cmd/otel-allocator/allocation/strategy.go
@matej-g
Contributor Author

matej-g commented Jan 19, 2024

Hey @swiatekm-sumo, thanks for the detailed review. I'm planning to address all of it, not sure how fast I will get to it. I'm happy to do it in this PR or the next one, whatever will be easier for maintainers to review.

I'm also in no rush to merge this (I'm using a fork for my needs), but I'm not sure if there are other users waiting for this feature.

@matej-g
Contributor Author

matej-g commented Feb 6, 2024

Hey folks,
Finally managed to get back to this, I tried to address all the feedback:

  • I simplified the code as suggested, where methods carried over from other strategies were unnecessarily complicated
  • Corrected the documentation where it referenced incorrect information
  • Made the unassigned-targets metric a gauge instead of a counter
  • Added unit tests for edge cases (analogous to the consistent hashing ones)

Hopefully this is ready for the final review, thank you @swiatekm-sumo @jaronoff97 🙇

Signed-off-by: Matej Gera <[email protected]>
@jaronoff97
Contributor

@matej-g thank you! What was the reason for changing the metric type? That may break anyone who has existing dashboards for the TA; if we do want it to be a gauge, we should rename it too.

Contributor

@jaronoff97 jaronoff97 left a comment


Just one concern around the metric type change, otherwise this LGTM.

@matej-g
Contributor Author

matej-g commented Feb 6, 2024

Hey @jaronoff97, this was based on the suggestion in #2430 (comment). This is a new metric we're introducing with this change, so it should not be breaking.
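The gauge-vs-counter distinction behind this exchange is worth spelling out: the number of unassigned targets is a point-in-time value that can decrease (e.g. when collectors get scheduled on previously uncovered nodes), so it is recorded by overwriting (gauge semantics, `Set` in Prometheus client libraries) rather than accumulating (counter semantics, `Inc`). A minimal stdlib-only sketch, with the allocator type invented for illustration:

```go
package main

import "fmt"

// allocator is a hypothetical stand-in; in the real code the field
// would be a Prometheus gauge registered with the metrics registry.
type allocator struct {
	unassigned float64
}

// recordUnassigned overwrites the value on every allocation pass
// (gauge semantics) instead of accumulating it (counter semantics).
func (a *allocator) recordUnassigned(n int) {
	a.unassigned = float64(n)
}

func main() {
	a := &allocator{}
	a.recordUnassigned(3) // first pass: 3 targets on nodes without collectors
	a.recordUnassigned(0) // collectors scheduled everywhere: count drops
	fmt.Println(a.unassigned) // 0
}
```

With a counter, the second pass could only add to the total, so the metric could never reflect the count going back down to zero.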

Contributor

@jaronoff97 jaronoff97 left a comment


Ah, that's right. Looks good to me! I'll wait on @swiatekm-sumo's review to merge. Thank you so much!

@jaronoff97 jaronoff97 merged commit 02e44fb into open-telemetry:main Feb 9, 2024
29 checks passed
@jaronoff97
Contributor

@matej-g thanks for this awesome work!! I really appreciate it.

@swiatekm
Contributor

swiatekm commented Feb 9, 2024

👏

I think it's now possible to carry out nearly all data collection in K8s from a DaemonSet, which is pretty significant.

@matej-g
Contributor Author

matej-g commented Feb 12, 2024

My pleasure, thank you both for helping me move it over the finish line!

ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024

Successfully merging this pull request may close these issues.

TargetAllocator for daemonset
4 participants