
Telemetry webhook validation #482

Closed

Conversation

@Miles-Garnsey (Member) commented Mar 28, 2022

What this PR does:

Validates telemetry specifications within the K8ssandraCluster, taking into account whether Prometheus is installed in each Kubernetes cluster hosting a DC and what telemetry is requested for that DC (a rough sketch of the check appears after the list below).

This is preferable to the current situation, where validation occurs during reconciliation, for two reasons:

  1. Users get fast feedback on a failure because the resource is rejected outright, instead of having to consult logs to determine why ServiceMonitors were not created.
  2. Invalid resources do not pass through into the cluster. Once a broken resource is admitted, additional updates to it cannot propagate, which is a stability risk if the breakage goes unnoticed and subsequent changes need to be made.
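To make the check concrete, here is a minimal sketch of the kind of validation the webhook performs; the type and function names below are illustrative placeholders, not the actual k8ssandra-operator API:

    // Minimal sketch only: TelemetrySpec and ValidateTelemetry are illustrative
    // placeholders, not the actual k8ssandra-operator types.
    package webhook

    import "fmt"

    // TelemetrySpec stands in for the per-datacenter telemetry configuration.
    type TelemetrySpec struct {
        PrometheusEnabled bool
    }

    // ValidateTelemetry rejects a configuration that requests Prometheus telemetry
    // for a datacenter whose Kubernetes cluster has no Prometheus installation.
    func ValidateTelemetry(dcTelemetry map[string]TelemetrySpec, prometheusInstalled map[string]bool) error {
        for dcName, spec := range dcTelemetry {
            if spec.PrometheusEnabled && !prometheusInstalled[dcName] {
                return fmt.Errorf("datacenter %q requests Prometheus telemetry but Prometheus is not installed in its cluster", dcName)
            }
        }
        return nil
    }

Running a check like this in an admission webhook means the API server rejects the K8ssandraCluster at apply time, which is what provides the fast feedback described in point 1.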

Which issue(s) this PR fixes:
Fixes #235

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner March 28, 2022 01:12
@Miles-Garnsey Miles-Garnsey changed the title from "Telemetry validation webhook" to "Telemetry webhook validation" on Mar 28, 2022
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch 3 times, most recently from a7a62a5 to f698284, on March 28, 2022 at 04:11
@Miles-Garnsey (Member, Author) commented Mar 29, 2022

I'm trying to figure out what is causing the tests to fail on this one. It isn't really straightforward, especially since some tests are apparently still flaky. I'm not sure I can progress this until #472 is merged.

One source of new flakes appears to be the envtest StopDatacenter test, which fails at line 184 with the following error:

    Operation cannot be fulfilled on k8ssandraclusters.k8ssandra.io "stop-dc-test": the object has been modified; please apply your changes to the latest version and try again

This definitely shouldn't be fatal: it's a transient failure due to optimistic locking. It doesn't happen locally, but it does on GHA because the Patch call has no retry tolerance. I'm wrapping it in an Eventually now to see if that improves things.

@Miles-Garnsey (Member, Author) commented Mar 29, 2022

I've wrapped the patch operation in an Eventually:

    patch := client.MergeFromWithOptions(kc.DeepCopy(), client.MergeFromWithOptimisticLock{})
    kc.Spec.Cassandra.Datacenters[0].Stopped = true
    require.Eventually(t, func() bool {
        err := f.Client.Patch(ctx, kc, patch)
        return err == nil
    }, timeout, interval, "timeout waiting to patch dc1 while stopping")

The StopDatacenter test still fails on GHA, and now fails locally as well (BONUS!). This is going well.

@jsanda (Contributor) commented Mar 29, 2022

You shouldn't need to wrap writes in Eventually calls unless you are doing so to add retry logic. It's not needed for consistency though.

@Miles-Garnsey (Member, Author)

You shouldn't need to wrap writes in Eventually calls unless you are doing so to add retry logic. It's not needed for consistency though.

Yes, but I need to add retry logic because of the optimistic locking error above: Operation cannot be fulfilled on k8ssandraclusters.k8ssandra.io "stop-dc-test": the object has been modified; please apply your changes to the latest version and try again

@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 1f58be8 to 2611db0 on March 29, 2022 at 03:44
@burmanm (Contributor) commented Mar 29, 2022

Writing the old object 100 times will not make a difference; it's still the old object. You need to refresh it first (Get).

@Miles-Garnsey (Member, Author)

Writing the old object 100 times will not make a difference; it's still the old object. You need to refresh it first (Get).

Oh wow, how did I miss that? Thanks @burmanm, you just saved me a fair bit of time, honestly.
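For reference, the refreshed-object pattern looks roughly like this; it's only a sketch based on the earlier snippet, and kcKey is an assumed ObjectKey for the K8ssandraCluster rather than a variable from the actual test:

    require.Eventually(t, func() bool {
        // Re-fetch the object so the patch is computed against the latest
        // resourceVersion instead of the stale local copy.
        if err := f.Client.Get(ctx, kcKey, kc); err != nil {
            return false
        }
        patch := client.MergeFromWithOptions(kc.DeepCopy(), client.MergeFromWithOptimisticLock{})
        kc.Spec.Cassandra.Datacenters[0].Stopped = true
        return f.Client.Patch(ctx, kc, patch) == nil
    }, timeout, interval, "timeout waiting to patch dc1 while stopping")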

@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 055e6ff to 7a526ba on March 29, 2022 at 05:07
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 7a526ba to c93a7cf on March 29, 2022 at 05:25
@Miles-Garnsey Miles-Garnsey mentioned this pull request Mar 30, 2022
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from ae3557f to 706a70b on April 1, 2022 at 03:04
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 706a70b to 347fffc on April 1, 2022 at 03:10
@Miles-Garnsey (Member, Author) commented Apr 1, 2022

I think this PR is ready for preliminary review.

I've fixed five instances where tests were fragile due to timing sensitivity.

There are 8 tests still failing. All fail due to timing-related issues; 6 of the 8 are in the multi-cluster suite.

I'm going to investigate the two that are failing in the single-cluster suite. I'm hoping the multi-cluster failures will be dealt with by #472, but we can look at making the calls to the remote clusters concurrent if that doesn't resolve things.

(Edited as a different set of tests is now failing...)

Make patches in stargate tests async-safe.
Make e2e cleanup retry on failure.
@adejanovski (Contributor)

This is largely outdated and we never managed to reach a consensus, so I'll go ahead and close it.
We can revisit this if we think there is a need and a path forward.
