
[target-allocator] Introduce "per node" allocation strategy to target allocator #2430

Merged: 38 commits into open-telemetry:main, Feb 9, 2024

Conversation

matej-g
Contributor

@matej-g matej-g commented Dec 8, 2023

Description:
Resolves #1828.

Introduces a new allocation strategy, "per node", that distributes targets to collectors based on the node on which the targets reside, i.e. each collector will be instructed to scrape the targets running on the collector's node.

This strategy is intended to be used only with a collector in daemonset mode (running as an agent on each node). It is not suitable for other modes.
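The core idea can be sketched in a few lines. This is a hypothetical, simplified illustration (the type and function names here are invented for this sketch; the real implementation lives in cmd/otel-allocator/allocation/per_node.go): build a node-to-collector index, then assign each target to the collector on its node.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Target struct {
	JobName  string
	NodeName string // node the target runs on, from service discovery labels
}

type Collector struct {
	Name     string
	NodeName string // node the DaemonSet pod is scheduled on
}

// allocatePerNode assigns each target to the collector running on the
// same node; targets on nodes without a collector stay unassigned.
func allocatePerNode(targets []Target, collectors []Collector) (map[string][]Target, []Target) {
	collectorByNode := make(map[string]string, len(collectors))
	for _, c := range collectors {
		collectorByNode[c.NodeName] = c.Name
	}
	assigned := make(map[string][]Target)
	var unassigned []Target
	for _, t := range targets {
		if name, ok := collectorByNode[t.NodeName]; ok {
			assigned[name] = append(assigned[name], t)
		} else {
			unassigned = append(unassigned, t)
		}
	}
	return assigned, unassigned
}

func main() {
	collectors := []Collector{{Name: "collector-a", NodeName: "node-1"}}
	targets := []Target{
		{JobName: "kubelet", NodeName: "node-1"},
		{JobName: "kubelet", NodeName: "node-2"},
	}
	assigned, unassigned := allocatePerNode(targets, collectors)
	fmt.Println(len(assigned["collector-a"]), len(unassigned)) // 1 1
}
```

Note how a target on a node with no collector cannot be assigned at all, which is why this strategy only makes sense for daemonset deployments.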

Link to tracking Issue: #1828

Testing:

  • Added unit tests
  • Also did manual testing

Documentation: Updated API documentation.

Signed-off-by: Matej Gera <[email protected]>
@matej-g matej-g requested review from a team December 8, 2023 13:47
@matej-g
Contributor Author

matej-g commented Dec 8, 2023

I also noticed we are reusing a lot of code between allocation strategies. I was thinking about adjusting the abstraction (perhaps the allocator does not need to be an interface; we could have a more specific interface that implements only the methods that differ across strategies). But I did not want to introduce too many changes at once.

Signed-off-by: Matej Gera <[email protected]>
@swiatekm
Contributor

I also noticed we are reusing a lot of code between allocation strategies. I was thinking about adjusting the abstraction (perhaps allocator does not need to be an interface, we could maybe have more specific interface that would implement only methods that differ across strategies). But I did not want to introduce too many changes at once.

Yeah, we can do that afterwards in a separate PR.

Contributor

@swiatekm swiatekm left a comment


The implementation looks good at first glance. Can you add an E2E test for it? I think it'd be easiest to check by collecting kubelet metrics via the following scrape config:

        - job_name: kubelet
          scheme: https
          authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          honor_labels: true
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: keep
              regex: "kubelet_running_pods"
              source_labels: [__name__]

Have a look at the existing E2E test for Prometheus CR for reference.

@matej-g
Contributor Author

matej-g commented Dec 18, 2023

Thanks for the pointer @swiatekm-sumo, I have added an E2E test and also reorganized the tests a bit so that the TA E2E tests can be run separately under a separate target (hopefully you don't mind 🙂).

Also regarding logging unassignable jobs / targets (#1828 (comment)), I tried to implement it in a way that will not lead to overly noisy logs, i.e. it avoids reporting on every target load and only reports on the diff.
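The diff-based reporting described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the reporter remembers the last observed count and logs only when it changes, so steady-state allocation passes stay silent.

```go
package main

import "fmt"

// reporter logs the number of unassignable targets only when the count
// changes between allocation passes (hypothetical sketch).
type reporter struct {
	lastUnassigned int
}

func (r *reporter) maybeLog(unassigned int, logf func(format string, args ...any)) {
	if unassigned != r.lastUnassigned {
		logf("unassignable targets: %d", unassigned)
		r.lastUnassigned = unassigned
	}
}

func main() {
	var logs []string
	logf := func(format string, args ...any) {
		logs = append(logs, fmt.Sprintf(format, args...))
	}
	r := &reporter{}
	r.maybeLog(3, logf) // logged: count changed from 0 to 3
	r.maybeLog(3, logf) // silent: no change since last pass
	r.maybeLog(0, logf) // logged: targets got assigned again
	fmt.Println(len(logs)) // 2
}
```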

Contributor

@jaronoff97 jaronoff97 left a comment


Overall I really like the improvement here. I have one suggestion that may help make it more efficient; interested in hearing your thoughts. Thank you!

cmd/otel-allocator/allocation/per_node.go
@swiatekm
Contributor

Thanks for the pointer @swiatekm-sumo, I have added an E2E test and also reorganized the tests a bit so that TA E2E test can be run separately under separate target (hopefully you don't mind 🙂).

Is there a reason to do this? If you only want the target allocator tests, you can use kuttl's --test flag to select them.

And if you really want to, you should make sure they're run in the CI as well.

@matej-g
Contributor Author

matej-g commented Dec 19, 2023

Is there a reason to do this? If you only want the target allocator tests, you can use kuttl's --test flag to select them.

And if you really want to, you should make sure they're run in the CI as well.

I think it makes sense with respect to how other test cases are organized, since we usually have different test directories for different components (OpAMP bridge, auto-instrumentation). But you're right, it was missing from the list in the action config; I fixed that now.

apis/v1alpha1/collector_webhook.go
@@ -190,12 +190,8 @@ func (allocator *perNodeAllocator) handleTargets(diff diff.Changes[*target.Item]

// Check for unassigned targets
if len(unassignedTargetsForJobs) > 0 {
Contributor


This can just become an integer that's incremented on line 186; I don't think there's a need for a map, unless we care about deduping (which should already happen in the filter strategy).

Contributor Author


I was wondering if there can be a scenario where we have multiple targets per job that cannot be assigned. But I see the point; I simplified this to just count the number of targets, which should hopefully give enough info when debugging.
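The simplification discussed here amounts to replacing per-job bookkeeping with a single count. A hypothetical before/after sketch (names invented for illustration, not the PR's actual identifiers):

```go
package main

import "fmt"

// item is a hypothetical stand-in for a discovered scrape target.
type item struct {
	jobName  string
	nodeName string
}

// countUnassigned returns how many targets sit on nodes without a
// collector. Previously the sketch would have recorded each job in a
// map (e.g. unassignedTargetsForJobs[it.jobName] = struct{}{}); a plain
// counter is enough for a debug log line.
func countUnassigned(items []item, collectorNodes map[string]bool) int {
	unassigned := 0
	for _, it := range items {
		if !collectorNodes[it.nodeName] {
			unassigned++
		}
	}
	return unassigned
}

func main() {
	items := []item{
		{jobName: "kubelet", nodeName: "node-1"},
		{jobName: "kubelet", nodeName: "node-2"},
	}
	// Only node-1 has a collector, so one target is unassigned.
	fmt.Println(countUnassigned(items, map[string]bool{"node-1": true})) // 1
}
```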

Contributor

@swiatekm swiatekm left a comment


Looks good overall. I've left some comments around trying to simplify the bookkeeping code in the strategy implementation, but I don't think any of it is necessarily blocking - if you'd like to clean it up in a follow-up PR, I'm ok with that as well.

cmd/otel-allocator/allocation/per_node.go (multiple review threads)
cmd/otel-allocator/allocation/strategy.go
@matej-g
Contributor Author

matej-g commented Jan 19, 2024

Hey @swiatekm-sumo, thanks for the detailed review. I'm planning to address all of it, not sure how fast I will get to it. I'm happy to do it in this PR or the next one, whatever will be easier for maintainers to review.

I'm also in no rush to merge this (I'm using a fork for my needs), but I'm not sure if there are other users waiting for this feature.

@matej-g
Contributor Author

matej-g commented Feb 6, 2024

Hey folks,
Finally managed to get back to this, I tried to address all the feedback:

  • I simplified the code as suggested, where methods carried over from other strategies were unnecessarily complicated
  • Corrected the documentation where it referenced incorrect information
  • Made the unassigned-targets metric a gauge instead of a counter
  • Added unit tests for edge cases (analogous to the consistent hashing ones)

Hopefully this is ready for the final review, thank you @swiatekm-sumo @jaronoff97 🙇

Signed-off-by: Matej Gera <[email protected]>
@jaronoff97
Contributor

@matej-g thank you! What was the reason for changing the metric type? That may break anyone who has existing dashboards for the TA; if we do want it to be a gauge, we should rename it too.

Contributor

@jaronoff97 jaronoff97 left a comment


Just one concern around the metric type change, otherwise this LGTM.

@matej-g
Contributor Author

matej-g commented Feb 6, 2024

Hey @jaronoff97, this was based on the suggestion in #2430 (comment). This is a new metric we're introducing with this change, so it should not be breaking.
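The gauge-vs-counter distinction behind this exchange is worth spelling out: the number of unassigned targets is a point-in-time value that can decrease (e.g. when collectors get scheduled on previously uncovered nodes), so it is recorded by overwriting (gauge semantics, `Set` in Prometheus client libraries) rather than accumulating (counter semantics, `Inc`). A minimal stdlib-only sketch, with the allocator type invented for illustration:

```go
package main

import "fmt"

// allocator is a hypothetical stand-in; in the real code the field
// would be a Prometheus gauge registered with the metrics registry.
type allocator struct {
	unassigned float64
}

// recordUnassigned overwrites the value on every allocation pass
// (gauge semantics) instead of accumulating it (counter semantics).
func (a *allocator) recordUnassigned(n int) {
	a.unassigned = float64(n)
}

func main() {
	a := &allocator{}
	a.recordUnassigned(3) // first pass: 3 targets on nodes without collectors
	a.recordUnassigned(0) // collectors scheduled everywhere: count drops
	fmt.Println(a.unassigned) // 0
}
```

With a counter, the second pass could only add to the total, so the metric could never reflect the count going back down to zero.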

Contributor

@jaronoff97 jaronoff97 left a comment


Ah, that's right. Looks good to me! I'll wait on @swiatekm-sumo's review to merge. Thank you so much!

@jaronoff97 jaronoff97 merged commit 02e44fb into open-telemetry:main Feb 9, 2024
29 checks passed
@jaronoff97
Contributor

@matej-g thanks for this awesome work!! I really appreciate it.

@swiatekm
Contributor

swiatekm commented Feb 9, 2024

👏

I think it's now possible to carry out nearly all data collection in K8s from a DaemonSet, which is pretty significant.

@matej-g
Contributor Author

matej-g commented Feb 12, 2024

My pleasure, thank you both for helping me move it over the finish line!

ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024

Successfully merging this pull request may close these issues.

TargetAllocator for daemonset
4 participants