-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collect Gloo metrics and some snapshots on test failure #10400
base: main
Are you sure you want to change the base?
Conversation
@@ -225,6 +225,8 @@ func (s *testingSuite) TestConfigureVirtualHostOptionsWithSectionNameManualSetup | |||
[]string{"conflict with more specific or older VirtualHostOptions"}, | |||
defaults.KubeGatewayReporter, | |||
) | |||
|
|||
s.Assert().Equal(true, false, "intentionally failing to trigger drump, remove when done debugging") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will be removing this after I have an artifact example to link.
test/kubernetes/e2e/test.go
Outdated
// Fetch the name of the Gloo Gateway controller pod | ||
getGlooPodNameCmd := i.Actions.Kubectl().Command(ctx, "get", "pod", "-n", i.Metadata.InstallNamespace, | ||
"--selector", "gloo=gloo", "--output", "jsonpath='{.items[0].metadata.name}'") | ||
_ = getGlooPodNameCmd.WithStdout(podStdOut).WithStderr(podStdErr).Run() | ||
|
||
// Clean up and check the output | ||
glooPodName := strings.Trim(podStdOut.String(), "'") | ||
if glooPodName == "" { | ||
fmt.Printf("Failed to get the name of the Gloo Gateway controller pod: %s\n", podStdErr.String()) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to make this it's own method if we think we will need it more than in this case.
Visit the preview URL for this PR (updated for commit 34ee46c): https://gloo-edge--pr10400-rolds-test-failure-c-kkld9hrb.web.app (expires Tue, 03 Dec 2024 18:59:38 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677 |
test/kubernetes/e2e/test.go
Outdated
// Using an ephemeral debug pod fetch the metrics from the Gloo Gateway controller | ||
metricsCmd := i.Actions.Kubectl().Command(ctx, "debug", "-n", i.Metadata.InstallNamespace, | ||
"-it", "--image=curlimages/curl:7.83.1", glooPodName, "--", | ||
"curl", "http://localhost:9091/metrics") | ||
_ = metricsCmd.WithStdout(metricsFile).WithStderr(metricsFile).Run() | ||
metricsFile.Close() | ||
|
||
// Get krt snapshot from the Gloo Gateway controller pod and write it to a file | ||
krtSnapshotFilePath := filepath.Join(failureDir, "krt_snapshot.log") | ||
krtSnapshotFile, err := os.OpenFile(krtSnapshotFilePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, os.ModePerm) | ||
i.Assertions.Require.NoError(err) | ||
|
||
// Using an ephemeral debug pod fetch the krt snapshot from the Gloo Gateway controller | ||
krtSnapshotCmd := i.Actions.Kubectl().Command(ctx, "debug", "-n", i.Metadata.InstallNamespace, | ||
"-it", "--image=curlimages/curl:7.83.1", glooPodName, "--", | ||
"curl", "http://localhost:9095/snapshots/krt") | ||
_ = krtSnapshotCmd.WithStdout(krtSnapshotFile).WithStderr(krtSnapshotFile).Run() | ||
krtSnapshotFile.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while these make sense it would be ideal if we could add them to the main admin gloo endpoint tooling as its a pretty general case. I know its annoying to do it on a first pass (ie one case) but Im pretty sure I would love to have it on all cases
https://github.com/solo-io/gloo/blob/main/test/helpers/kube_dump.go#L46 is an example of our end dump and it includes hitting the stats endpoint like https://github.com/solo-io/gloo/blob/main/test/helpers/kube_dump.go#L311 which would be a nice setup that we can add to these types of tests
…b.com/solo-io/gloo into rolds/test_failure_collect_metrics_dump
test/kubernetes/e2e/test.go
Outdated
"-it", "--image=curlimages/curl:7.83.1", glooPodName, "--", | ||
"curl", "http://localhost:9091/metrics") | ||
cmdErr = metricsCmd.WithStdout(metricsFile).WithStderr(metricsFile).Run() | ||
if cmdErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably doesn't need the if check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried it without and it was always being treated at a failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it doesn't like something about it explicitly being a pointer, but I'm not sure and didn't dig in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good!
test/kubernetes/e2e/test.go
Outdated
podStdOut := bytes.NewBuffer(nil) | ||
podStdErr := bytes.NewBuffer(nil) | ||
|
||
// Fetch the name of the Gloo Gateway controller pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! I had written up some thoughts around this in https://github.com/solo-io/solo-projects/issues/7298. I think what you've added is a great start, though I wonder if we can iterate by making some of these actions (capturing details about a cluster, and dumping them in files) availabe to users as well in the CLI. That way improvements we build to debug our tests, also benefit customers debugging their environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this work seems to build off of k8sgateway#10239. I would take a look at that PR and see what we can do to avoid duplicate functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it easier for customers to get snapshots and other useful debugging information is a good idea. My goal of this PR is to ensure that we/engs can more easily debug test failures in the short-term.
I agree with unifying the code for getting the snapshots between our test frameworks, which if done correctly should make the user face work easier to do later. I will get the framework unification work done in this PR.
I would like to defer any user facing work to future feature tickets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me! More wanted to point out the existing user-facing work as something we could piggy back off of, since it does some of what you are trying to do here. I'm good with taking the phased approach of enabling devs first, and then we can work to enable users after
test/kubernetes/e2e/test.go
Outdated
i.Assertions.Require.NoError(err) | ||
|
||
// Using an ephemeral debug pod fetch the metrics from the Gloo Gateway controller | ||
metricsCmd := i.Actions.Kubectl().Command(ctx, "debug", "-n", i.Metadata.InstallNamespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to have an overlap with https://github.com/solo-io/gloo/blob/main/test/kube2e/helper/curl.go. It might be nice if we could align those two
test/kubernetes/e2e/test.go
Outdated
}() | ||
|
||
adminClient := admincli.NewClient(). | ||
WithReceiver(io.Discard). // adminAssertion can overwrite this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think this comment is relevant. It seems related to tests that perform assertions with the adminClient
test/kubernetes/e2e/test.go
Outdated
WithStderr(krtSnapshotFile). | ||
Run() | ||
if cmdErr != nil { | ||
i.Assertions.Require.NoError(cmdErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the behavior of our pre-fail handler if this assertion fails? Does it cause the action to short-circuit, or just ensure that everything makes a best attempt and then we fail the job?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which would you like it to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My default position is to fail fast and fail often. So, I think if we are unable to generate snapshots that we should fail promptly, but I suspect there are cases where a test fails and the state of the system isn't able to collect all the debug information and double fails. I don't know if that is a problem if the original failure is still presented to the engineer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My default position is to fail fast and fail often.
I'm with you on this. In this case however, we are attempting to grab potentially critical and relevant details about the cluster, and so I figure a best-effort where we capture all failures, might make sense. We should still error the job (I think) if the snapshot generation/capture fails partially, but I think it's safer if that still produces some output for people to consume.
@@ -74,6 +74,8 @@ gloo: | |||
limits: | |||
cpu: 1000m | |||
memory: 10Gi | |||
stats: | |||
enabled: true # enable stats server for gloo so we can collect the metrics in CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the comment! I wonder if we should consider making stats on the gloo pod default to on in future releases? https://github.com/solo-io/solo-projects/issues/7292 is an example of something similar. If you think this could be valuable, could you write up an RFE to capture it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most controllers I've worked with have metrics on by default as it's pretty typical to be running Prometheus Operator, making scraping easy (PodMonitor/ServiceMonitor resources).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree with you. in fact, I was surprised to see this wasn't the default. Thanks for writing up the issue.
…b.com/solo-io/gloo into rolds/test_failure_collect_metrics_dump
Description
I've encountered a couple of cases where this information would have been helpful in debugging
Interesting decisions
I wanted to use the ephemeral debug containers and had to first fetch the pod name because
kubectl debug
doesn't supportdeployment/gloo
as a pod.Testing steps
Take a look at the artifacts found this failed test.
Checklist: