
Telemetry webhook validation #482

Closed

Conversation

@Miles-Garnsey (Member) commented Mar 28, 2022

What this PR does:

Validates telemetry specifications within the K8ssandraCluster, taking into account whether Prometheus is installed in each Kubernetes cluster hosting a DC and what telemetry is requested for that DC (a rough sketch of the check appears after the list below).

This is preferable to the current situation, where validation occurs during reconciliation, for two reasons:

  1. Users get fast feedback on a failure because the resource is rejected outright, instead of having to consult logs to determine why ServiceMonitors were not created.
  2. Invalid resources do not pass through into the cluster. Once a broken resource is admitted, additional updates to it cannot propagate, which is a stability risk if the breakage goes unnoticed and subsequent changes need to be made.
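To make the check concrete, here is a minimal sketch of the kind of validation the webhook performs; the type and function names below are illustrative placeholders, not the actual k8ssandra-operator API:

    // Minimal sketch only: TelemetrySpec and ValidateTelemetry are illustrative
    // placeholders, not the actual k8ssandra-operator types.
    package webhook

    import "fmt"

    // TelemetrySpec stands in for the per-datacenter telemetry configuration.
    type TelemetrySpec struct {
        PrometheusEnabled bool
    }

    // ValidateTelemetry rejects a configuration that requests Prometheus telemetry
    // for a datacenter whose Kubernetes cluster has no Prometheus installation.
    func ValidateTelemetry(dcTelemetry map[string]TelemetrySpec, prometheusInstalled map[string]bool) error {
        for dcName, spec := range dcTelemetry {
            if spec.PrometheusEnabled && !prometheusInstalled[dcName] {
                return fmt.Errorf("datacenter %q requests Prometheus telemetry but Prometheus is not installed in its cluster", dcName)
            }
        }
        return nil
    }

Running a check like this in an admission webhook means the API server rejects the K8ssandraCluster at apply time, which is what provides the fast feedback described in point 1.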

Which issue(s) this PR fixes:
Fixes #235

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@Miles-Garnsey Miles-Garnsey requested a review from a team as a code owner March 28, 2022 01:12
@Miles-Garnsey Miles-Garnsey changed the title from "Telemetry validation webhook" to "Telemetry webhook validation" on Mar 28, 2022
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch 3 times, most recently from a7a62a5 to f698284, on March 28, 2022 at 04:11
@Miles-Garnsey (Member, Author) commented Mar 29, 2022

I'm trying to figure out what is causing the tests to fail on this one. It isn't really straightforward, especially since some tests are apparently still flaky. I'm not sure I can progress this until #472 is merged.

One source of new flakes appears to be the envtest StopDatacenter test, which fails at line 184 with the following error:

    Operation cannot be fulfilled on k8ssandraclusters.k8ssandra.io "stop-dc-test": the object has been modified; please apply your changes to the latest version and try again

This definitely shouldn't be fatal: it's a transient failure due to optimistic locking. It doesn't happen locally, but it does on GHA because the Patch call has no retry tolerance. I'm wrapping it in an Eventually now to see if that improves things.

@Miles-Garnsey (Member, Author) commented Mar 29, 2022

I've wrapped the patch operation in an Eventually:

    patch := client.MergeFromWithOptions(kc.DeepCopy(), client.MergeFromWithOptimisticLock{})
    kc.Spec.Cassandra.Datacenters[0].Stopped = true
    require.Eventually(t, func() bool {
        err := f.Client.Patch(ctx, kc, patch)
        return err == nil
    }, timeout, interval, "timeout waiting to patch dc1 while stopping")

The StopDatacenter test still fails on GHA, and now fails locally as well (BONUS!). This is going well.

@jsanda (Contributor) commented Mar 29, 2022

You shouldn't need to wrap writes in Eventually calls unless you are doing so to add retry logic. It's not needed for consistency though.

@Miles-Garnsey (Member, Author)

You shouldn't need to wrap writes in Eventually calls unless you are doing so to add retry logic. It's not needed for consistency though.

Yes, but I need to add retry logic because of the optimistic locking error above: Operation cannot be fulfilled on k8ssandraclusters.k8ssandra.io "stop-dc-test": the object has been modified; please apply your changes to the latest version and try again

@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 1f58be8 to 2611db0 on March 29, 2022 at 03:44
@burmanm (Contributor) commented Mar 29, 2022

Writing the old object 100 times will not make a difference; it's still the old object. You need to refresh it first (Get).

@Miles-Garnsey (Member, Author)

Writing the old object 100 times will not make a difference; it's still the old object. You need to refresh it first (Get).

Oh wow, how did I miss that? Thanks @burmanm, you just saved me a fair bit of time, honestly.
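For reference, the refreshed-object pattern looks roughly like this; it's only a sketch based on the earlier snippet, and kcKey is an assumed ObjectKey for the K8ssandraCluster rather than a variable from the actual test:

    require.Eventually(t, func() bool {
        // Re-fetch the object so the patch is computed against the latest
        // resourceVersion instead of the stale local copy.
        if err := f.Client.Get(ctx, kcKey, kc); err != nil {
            return false
        }
        patch := client.MergeFromWithOptions(kc.DeepCopy(), client.MergeFromWithOptimisticLock{})
        kc.Spec.Cassandra.Datacenters[0].Stopped = true
        return f.Client.Patch(ctx, kc, patch) == nil
    }, timeout, interval, "timeout waiting to patch dc1 while stopping")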

@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 055e6ff to 7a526ba on March 29, 2022 at 05:07
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 7a526ba to c93a7cf on March 29, 2022 at 05:25
@Miles-Garnsey Miles-Garnsey mentioned this pull request Mar 30, 2022
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from ae3557f to 706a70b on April 1, 2022 at 03:04
@Miles-Garnsey Miles-Garnsey force-pushed the feature/telemetry-webhook branch from 706a70b to 347fffc on April 1, 2022 at 03:10
@Miles-Garnsey (Member, Author) commented Apr 1, 2022

I think this PR is ready for preliminary review.

I've fixed five instances where tests were fragile due to timing sensitivity.

There are 8 tests still failing. All fail due to timing-related issues; 6 of the 8 are in the multi-cluster suite.

I'm going to investigate the two that are failing in the single-cluster suite. I'm hoping the multi-cluster failures will be dealt with by #472, but we can look at making the calls to the remote clusters concurrent if that doesn't resolve things.

(Edited as a different set of tests is now failing...)

Make patches in stargate tests async-safe.
Make e2e cleanup retry on failure.
@adejanovski (Contributor)

This is largely outdated and we never managed to reach a consensus, so I'll go ahead and close it.
We can revisit this if we think there is a need and a path forward.
