Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: status check for config-connector #6766

Merged
merged 11 commits into from
Nov 4, 2021

Conversation

gsquared94
Copy link
Contributor

@gsquared94 gsquared94 commented Oct 25, 2021

Fixes: #6709

Description

This PR implements skaffold's status checker for config connector resources.

Background:
Status check currently works for only deployments and standalone pods. For either type, we poll for the latest status (running kubectl rollout deployment <foo> for deployments, and kubectl get pods ... for standalone pods). We also check the status of individual pods and containers in the deployment.

For config connector resources:

Config connector resources are created from custom resource definitions (CRDs) and there's no direct command to invoke to get their status. However, they all expose a Ready condition that can be waited on.

Testing instructions

Starting deploy...
 - service.serviceusage.cnrm.cloud.google.com/pubsub.googleapis.com created
 - pubsubtopic.pubsub.cnrm.cloud.google.com/random-xxljslierjlj-sample created
Waiting for deployments to stabilize...
 - config-connector-resource/pubsub.cnrm.cloud.google.com/v1beta1, Kind=PubSubTopic, Name=random-xxljslierjlj-sample is ready. [1/2 deployment(s) still pending]
 - config-connector-resource/serviceusage.cnrm.cloud.google.com/v1beta1, Kind=Service, Name=pubsub.googleapis.com is ready.
Deployments stabilized in 2.576 seconds
Press Ctrl+C to exit
  • In the directory failing run skaffold dev. This should cause the status check to fail
Starting deploy...
 - storagebucket.storage.cnrm.cloud.google.com/random-xxljslierjlj-sample created
Waiting for deployments to stabilize...
 - config-control:config-connector-resource/storage.cnrm.cloud.google.com/v1beta1, Kind=StorageBucket, Name=random-xxljslierjlj-sample: Update call failed: error applying desired state: summary: googleapi: Error 400: The storage class you specified is not valid., invalid
    - config-control:StorageBucket/random-xxljslierjlj-sample: Update call failed: error applying desired state: summary: googleapi: Error 400: The storage class you specified is not valid., invalid
 - config-control:config-connector-resource/storage.cnrm.cloud.google.com/v1beta1, Kind=StorageBucket, Name=random-xxljslierjlj-sample failed. Error: Update call failed: error applying desired state: summary: googleapi: Error 400: The storage class you specified is not valid., invalid.
Cleaning up...
 - storagebucket.storage.cnrm.cloud.google.com "random-xxljslierjlj-sample" deleted
1/1 deployment(s) failed

// See https://cloud.google.com/config-connector/docs/overview
var ConfigConnectorResourceSelector = []GroupKindSelector{
// add preliminary support for config connector services; group name is currently in flux
&wildcardGroupKind{Group: regexp.MustCompile(`([[:alpha:]]+\.)+cnrm\.cloud\.google\.com`)},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@briandealwis for the purpose of testing I've removed the restriction of only allowing Service kinds. We can add that back, although if this PR is merged, IMO we can start allowing all resource kinds. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@codecov
Copy link

codecov bot commented Oct 25, 2021

Codecov Report

Merging #6766 (70dafd2) into main (290280e) will decrease coverage by 1.10%.
The diff coverage is 60.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6766      +/-   ##
==========================================
- Coverage   70.48%   69.38%   -1.11%     
==========================================
  Files         515      540      +25     
  Lines       23150    24567    +1417     
==========================================
+ Hits        16317    17045     +728     
- Misses       5776     6389     +613     
- Partials     1057     1133      +76     
Impacted Files Coverage Δ
cmd/skaffold/app/cmd/flags.go 89.00% <ø> (-1.82%) ⬇️
cmd/skaffold/skaffold.go 0.00% <ø> (ø)
cmd/skaffold/app/cmd/lint.go 52.94% <52.94%> (ø)
cmd/skaffold/app/cmd/cmd.go 70.49% <75.00%> (-0.57%) ⬇️
cmd/skaffold/app/cmd/debug.go 100.00% <100.00%> (ø)
cmd/skaffold/app/cmd/runner.go 64.17% <100.00%> (ø)
.../skaffold/kubernetes/status/resource/deployment.go 65.48% <0.00%> (-20.32%) ⬇️
pkg/skaffold/initializer/build/builders.go 42.85% <0.00%> (-17.15%) ⬇️
pkg/skaffold/build/cluster/logs.go 0.00% <0.00%> (-16.67%) ⬇️
pkg/skaffold/event/v2/status_check.go 85.45% <0.00%> (-14.55%) ⬇️
... and 132 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15f88b6...70dafd2. Read the comment docs.

@gsquared94 gsquared94 added the kokoro:force-run forces a kokoro re-run on a PR label Oct 26, 2021
@kokoro-team kokoro-team removed the kokoro:force-run forces a kokoro re-run on a PR label Oct 26, 2021
if result.Message == "" {
status.updateAE(proto.StatusCode_STATUSCHECK_CONFIG_CONNECTOR_IN_PROGRESS, result.Message)
} else {
// config connector status doesn't always correctly parse to failed, but shows InProgress with an error message
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a bug that can be referenced here?

if recentEvent == nil || recentEvent.Type == v1.EventTypeNormal {
return
}
// TODO: Add unique error codes for reasons
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to walk the events instead of just using the resource conditions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something implemented for deployment type resources. I didn't think there's any harm continuing that pattern here. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the events are there for debugging and diagnostics purposes, but the object status should be considered the truth.

I think processing the events risks reporting false positives and should not be used for this purpose:

  1. The list of event types is not fixed and may have more added, including diagnostic or informational events. This code would mis-report such events as errors.
  2. What if the last event is at odds with the actual resource status?

Is there a particular situation you've seen where the events were at odds with the resource status? If there's a specific sequence that you've seen that is a precursor for an error, that might make sense to report.

We could log something at debug level if the event type is not normal. Or even an info message and ask that they report an issue to the Skaffold issue tracker.

Comment on lines 148 to 168
type configConnectorSelector struct {
client kubernetes.Interface
dynClient dynamic.Interface
uResource unstructured.Unstructured
}

func NewConfigConnectorSelector(client kubernetes.Interface, dynClient dynamic.Interface, uResource unstructured.Unstructured) CustomResourceSelector {
return &configConnectorSelector{client: client, dynClient: dynClient, uResource: uResource}
}

func (c *configConnectorSelector) Select(ctx context.Context, namespace string, opts metav1.ListOptions) (*unstructured.UnstructuredList, error) {
_, r, err := util.GroupVersionResource(c.client.Discovery(), c.uResource.GroupVersionKind())
if err != nil {
return nil, fmt.Errorf("failed to query config connector resources: %w", err)
}
resList, err := c.dynClient.Resource(r).Namespace(namespace).List(ctx, opts)
if err != nil {
return nil, fmt.Errorf("failed to query config connector resources: %w", err)
}
return resList, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the purpose of this code: we create a selector for a resource that returns the list of other resources with this same type and name — which should be uResource, right?

Is this to fetch the latest copy of the uResource?

Copy link
Contributor Author

@gsquared94 gsquared94 Nov 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's to fetch the latest version. Also we poll on this selector so that if there are any new resources of this type that show up later and they are checked on that iteration (as against just fetching the list once and checking the status of known resources individually).

@yuwenma
Copy link
Contributor

yuwenma commented Oct 26, 2021

I have a general question:

skaffold does not specifically tie itself to kubernetes which gives it more room to grow as a standalone CICD tooling. Do we really want it to deal with the kubernetes objects (GVK, unstructured) and mechanisms? If so, is that possible to reuse some existing functions to handle GVK, selector operations? I'm worried it may eventually hit many problems that other kubernetes tools have or have been resolved. @tejal29 @briandealwis

@briandealwis
Copy link
Member

We're happy for pointers @yuwenma!

@tejal29
Copy link
Contributor

tejal29 commented Nov 1, 2021

Thanks @yuwenma. Going to take a look at this in depth today.

@gsquared94
Copy link
Contributor Author

I have a general question:

skaffold does not specifically tie itself to kubernetes which gives it more room to grow as a standalone CICD tooling.
Do we really want it to deal with the kubernetes objects (GVK, unstructured) and mechanisms?

This has become the design philosophy now with the implementation of new deployers like docker deployer. However, the entire codebase already has references to kubernetes internals in deploy and status-check phases. We're already using the following packages in our go.mod file:

	k8s.io/api v0.21.3
	k8s.io/apimachinery v0.22.2
	k8s.io/client-go v0.21.3
	k8s.io/kubectl v0.21.3
	k8s.io/utils v0.0.0-20201110183641-67b214c5f920
	knative.dev/pkg v0.0.0-20201119170152-e5e30edc364a // indirect
	sigs.k8s.io/kustomize/kyaml v0.10.17
	sigs.k8s.io/yaml v1.2.0

The status-check phase currently only works for selected kubernetes resources. The Deployer interface however allows any randomly implemented deployer to define it's own implementation of a Monitor.

If so, is that possible to reuse some existing functions to handle GVK, selector operations? I'm worried it may eventually hit many problems that other kubernetes tools have or have been resolved. @tejal29 @briandealwis

we are using the kstatus library to parse the resource status and other client libraries and constructs. We're happy for pointers to other packages that you think could simplify our status-checker implementation.

if recentEvent == nil || recentEvent.Type == v1.EventTypeNormal {
return
}
// TODO: Add unique error codes for reasons
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the events are there for debugging and diagnostics purposes, but the object status should be considered the truth.

I think processing the events risks reporting false positives and should not be used for this purpose:

  1. The list of event types is not fixed and may have more added, including diagnostic or informational events. This code would mis-report such events as errors.
  2. What if the last event is at odds with the actual resource status?

Is there a particular situation you've seen where the events were at odds with the resource status? If there's a specific sequence that you've seen that is a precursor for an error, that might make sense to report.

We could log something at debug level if the event type is not normal. Or even an info message and ask that they report an issue to the Skaffold issue tracker.

Comment on lines 95 to 96
case kstatus.NotFoundStatus:
status.updateAE(proto.StatusCode_STATUSCHECK_CONFIG_CONNECTOR_NOT_FOUND, result.Message)
ae = proto.ActionableErr{ErrCode: proto.StatusCode_STATUSCHECK_CONFIG_CONNECTOR_NOT_FOUND, Message: result.Message}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how this would happen!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually think that we'd run into this. But I added it for completion's sake.

Copy link
Member

@briandealwis briandealwis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. A few more comments.

// See https://cloud.google.com/config-connector/docs/overview
var ConfigConnectorResourceSelector = []GroupKindSelector{
// add preliminary support for config connector services; group name is currently in flux
&wildcardGroupKind{Group: regexp.MustCompile(`([[:alpha:]]+\.)+cnrm\.cloud\.google\.com`)},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

)

// Filter returns the manifest list filtered by the given selectors
func (l *ManifestList) Filter(selectors []GroupKindSelector) (ManifestList, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these functions, but they need tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on UTs

return nil, fmt.Errorf("unmarshaling config: %w", err)
}
gvk := obj.GroupVersionKind()
for _, w := range selectors {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if we should have separate interfaces for allowing resource selectors (taking an unstructured.Ubstructured), or GVK, or group+kind. That can be done separately.

@gsquared94 gsquared94 enabled auto-merge (squash) November 4, 2021 15:40
@gsquared94 gsquared94 merged commit 4a2dd8e into GoogleContainerTools:main Nov 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skaffold should wait for Config Connector resources to be up to date before reporting success
5 participants