
Helm tests failing with v0.71.0 operator #1574

Closed
Allex1 opened this issue Mar 13, 2023 · 15 comments · Fixed by #1582

Comments

@Allex1
Contributor

Allex1 commented Mar 13, 2023

Hi all,

I'm trying to upgrade the helm stack to use operator v0.71.0, and the tests are failing in a non-obvious way.

Cloning the 0.71 release tag, I get https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4396632296/jobs/7699210715#step:13:991

Cloning main (since I saw #1553 is not included in the release), I get https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4402557919/jobs/7710082951

I'm not very familiar with the operator tests, so I could use some help. @moh-osman3 can you have a look?

Thanks

@moh-osman3
Contributor

I'm unsure what's going on here; it's strange this didn't fail for the operator tests in this repo. From your second run (cloning main), it seems that a step is skipped (https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4402557919/jobs/7710082951#step:13:761). This test should be running this chart, which should have the desired behavior fields. Once this is installed in step 2, it should be validated by the script in step 3. But since 02-install.yaml doesn't get installed, this causes a segfault instead.

It's also strange that the 0.71.0 tests, which don't include my recent changes, fail as well; it seems scale-down doesn't happen in time despite the stabilization window being set to 15 seconds. I wonder if there's something obvious I'm missing here from when I added to these e2e tests. (cc @iblancasa @pavolloffay)
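For reference, the scale-down setting under discussion maps to the behavior stanza of an autoscaling/v2 HorizontalPodAutoscaler. A minimal sketch (the target names here are illustrative, taken from the test log; the operator generates the actual HPA):

```yaml
# Illustrative HPA showing the scaleDown stabilization window discussed above.
# Names and the scaleTargetRef are hypothetical examples, not the operator's exact output.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: simplest-set-utilization-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simplest-set-utilization-collector
  minReplicas: 1
  maxReplicas: 2
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 15  # when unset, Kubernetes defaults to 300s
```

If `behavior.scaleDown` is absent, the HPA falls back to the 300-second default window, which matches the slow scale-down described later in this thread.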

@iblancasa
Contributor

When cloning the 0.71 release tag

  1. For me, the error you're hitting sounds like a timeout. It takes some time for the HPA to scale down the collector, and maybe the value for stabilizationWindowSeconds is not small enough. Also, the tests run in parallel, which consumes more resources, so the HPA will take longer to scale down. Can you reproduce it consistently? And locally?

When cloning main

  1. I can see some errors related to namespace creation/deletion. That is a problem related directly to KUTTL. Running the tests sequentially can work around it.
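If running sequentially helps, kuttl's parallelism can be capped in the TestSuite configuration rather than on the command line. A sketch, with an illustrative test directory:

```yaml
# kuttl-test.yaml (sketch; testDirs path is an example)
apiVersion: kuttl.dev/v1beta1
kind: TestSuite
testDirs:
  - ./tests/e2e/
parallel: 1  # run test cases one at a time instead of kuttl's default parallelism
```

Running one case at a time reduces resource contention in the cluster, at the cost of a longer overall run.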

@moh-osman3
Contributor

@Allex1 I created a PR which adds nil checks and reduces the stabilization window. The nil checks should at least remove the segfault for your test based on main, and I reduced the stabilization window in case it is somehow too high.

Can you run the helm tests again based on my PR's branch? Hopefully the reason for the failure will become more obvious.

@Allex1
Contributor Author

Allex1 commented Mar 14, 2023

@moh-osman3 It currently fails: https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4414191149/jobs/7735581591

logger.go:42: 09:46:18 | autoscale/3-install | Incorrect scaleDown stabilization window found for HPA simplest-set-utilization-collector

I'm also running the tests locally.

@moh-osman3
Contributor

moh-osman3 commented Mar 14, 2023

Hmm, since it's failing on this step, it must have already validated the metrics correctly. This means metrics are being updated properly, but behavior is not, which would be a bug. I'm wondering what's different in how the helm repo sets up the environment for the operator e2e tests, i.e. why is this bug found in this test but not in the e2e tests the operator runs?

Behavior also seems to update properly in my own remote cluster testing.

@moh-osman3
Contributor

I reproduced this locally in the helm tests and now realize that there is not a bug with behavior updates (the behavior-update failure is because the helm test runs operator v0.71.0, which doesn't include my HPA behavior change). The primary issue is that scale-down is not happening as expected: the scaleDown stabilizationWindowSeconds is not taking effect, so scale-down takes the default 300s.

I am unsure why the scale-down setting is being ignored, but I noticed that the helm repo uses Kubernetes v1.23 while the operator repo uses Kubernetes v1.25. I don't think this is an operator bug, since these tests pass on Kubernetes v1.25.

@moh-osman3
Contributor

To fix this I removed the behavior update from the e2e tests and added a timeout of 360s to give the HPA enough time to scale down. @Allex1 can you try running this again with my latest PR updates?

@pavolloffay @iblancasa I'd appreciate some guidance on the best way forward for this issue. Is a timeout okay? Do you have any idea why this scale-down window is not being honored in Kubernetes v1.23?
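The 360s timeout mentioned above can be set once for the whole suite in the kuttl TestSuite config (a sketch; the testDirs path is illustrative):

```yaml
# kuttl-test.yaml (sketch)
apiVersion: kuttl.dev/v1beta1
kind: TestSuite
testDirs:
  - ./tests/e2e/
timeout: 360  # seconds per step; kuttl's default is 30
```

This matches the `--timeout` CLI flag quoted later in the thread; the TestSuite value serves as the suite-wide default.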

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

@moh-osman3 Thanks, I started the CI again. We can probably test against Kubernetes 1.25 in the helm repo as well.
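Since the CI uses a kind cluster, moving the helm repo to Kubernetes 1.25 could be done by pinning the node image; a sketch (the image tag is an example, and the matching digest should be taken from the release notes of the kind version in use):

```yaml
# kind cluster config (sketch; image tag is an example)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.25.3  # pin the digest recommended for your kind release
```

This would let the helm tests run against the same Kubernetes minor version the operator repo already tests with.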

@moh-osman3
Contributor

@Allex1 It still seems to fail, which is strange because the autoscale test passes locally for me. I'm really unsure what's going on.

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

Yeah, not sure either; it fails locally as well with the helm tests.

@iblancasa
Contributor

To fix this I removed the behavior update from the e2e tests and I added a timeout of 360s to give enough time for the hpa to scaledown. @Allex1 can you try running this again with my latest PR updates?

Since the problem is in the Helm project, it would make more sense to adjust it in their repository:

$ kuttl test --help | grep -a timeout
      --timeout int                   The timeout to use as default for TestSuite configuration. (default 30)

Regarding the problem: I'll try to reproduce it locally.

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

@iblancasa the timeout arg can be passed from the TestSuite config as well, which we already do.

Strangely enough, my PR that failed CI passes just fine locally when run via act after some tweaks:

| --- PASS: kuttl/harness/autoscale (498.10s)

@iblancasa
Contributor

@iblancasa the timeout arg can be passed from the testsuite config as well which we already did.

Strangely enough, my PR that failed CI passes just fine locally when run via act after some tweaks:

| --- PASS: kuttl/harness/autoscale (498.10s)

I know. What I wanted to avoid was making the modifications in this repository. I don't understand why the test takes so long to pass in your repository.

I just ran them locally and all the tests passed:

--- PASS: kuttl (384.51s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/smoke-simplest (19.05s)
        --- PASS: kuttl/harness/smoke-sidecar-other-namespace (24.92s)
        --- PASS: kuttl/harness/smoke-pod-annotations (35.70s)
        --- PASS: kuttl/harness/smoke-restarting-deployment (37.50s)
        --- PASS: kuttl/harness/statefulset-features (48.68s)
        --- PASS: kuttl/harness/instrumentation-python (98.93s)
        --- PASS: kuttl/harness/route (138.46s)
        --- PASS: kuttl/harness/instrumentation-java-other-ns (25.15s)
        --- PASS: kuttl/harness/instrumentation-nodejs (114.76s)
        --- PASS: kuttl/harness/instrumentation-java-multicontainer (148.22s)
        --- PASS: kuttl/harness/targetallocator-features (31.23s)
        --- PASS: kuttl/harness/instrumentation-python-multicontainer (145.86s)
        --- PASS: kuttl/harness/smoke-targetallocator (11.06s)
        --- PASS: kuttl/harness/ingress (6.84s)
        --- PASS: kuttl/harness/instrumentation-java (14.44s)
        --- PASS: kuttl/harness/smoke-statefulset (8.97s)
        --- PASS: kuttl/harness/daemonset-features (9.26s)
        --- PASS: kuttl/harness/instrumentation-dotnet (29.21s)
        --- PASS: kuttl/harness/instrumentation-dotnet-multicontainer (32.00s)
        --- PASS: kuttl/harness/instrumentation-nodejs-multicontainer (169.44s)
        --- PASS: kuttl/harness/smoke-sidecar (50.14s)
        --- PASS: kuttl/harness/instrumentation-sdk (8.17s)
        --- PASS: kuttl/harness/autoscale (384.16s)
$ git rev-parse HEAD
10918c501c0f6886701058afee719b29e5bdb276
$ cd opentelemetry-operator/ && git rev-parse HEAD
0e39ee77693146e0924da3ca474a0fe14dc30b3a
$ k version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.25.3

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

Locally it works for me as well with the opentelemetry-operator v0.72.0 tag:

|     --- PASS: kuttl/harness (0.00s)
|         --- PASS: kuttl/harness/route (2.17s)
|         --- PASS: kuttl/harness/smoke-restarting-deployment (34.73s)
|         --- PASS: kuttl/harness/smoke-pod-annotations (34.74s)
|         --- PASS: kuttl/harness/smoke-sidecar (45.89s)
|         --- PASS: kuttl/harness/targetallocator-features (47.31s)
|         --- PASS: kuttl/harness/instrumentation-sdk (52.95s)
|         --- PASS: kuttl/harness/smoke-sidecar-other-namespace (60.10s)
|         --- PASS: kuttl/harness/statefulset-features (64.49s)
|         --- PASS: kuttl/harness/instrumentation-java-other-ns (74.12s)
|         --- PASS: kuttl/harness/instrumentation-python (111.61s)
|         --- PASS: kuttl/harness/instrumentation-nodejs (112.65s)
|         --- PASS: kuttl/harness/ingress (1.63s)
|         --- PASS: kuttl/harness/smoke-targetallocator (2.36s)
|         --- PASS: kuttl/harness/daemonset-features (3.17s)
|         --- PASS: kuttl/harness/smoke-simplest (3.22s)
|         --- PASS: kuttl/harness/smoke-statefulset (2.19s)
|         --- PASS: kuttl/harness/instrumentation-dotnet (126.61s)
|         --- PASS: kuttl/harness/instrumentation-java (58.52s)
|         --- PASS: kuttl/harness/instrumentation-java-multicontainer (163.34s)
|         --- PASS: kuttl/harness/instrumentation-python-multicontainer (165.12s)
|         --- PASS: kuttl/harness/instrumentation-dotnet-multicontainer (46.46s)
|         --- PASS: kuttl/harness/instrumentation-nodejs-multicontainer (156.03s)
|         --- PASS: kuttl/harness/autoscale (498.36s)
| PASS
[Test Operator Charts/operator-test]   ✅  Success - Main Run e2e tests
[Test Operator Charts/operator-test] ⭐ Run Post Setup
[Test Operator Charts/operator-test] ⭐ Run Post Set up Go
[Test Operator Charts/operator-test]   🐳  docker exec cmd=[node /var/run/act/actions/actions-setup-go@v3/dist/cache-save/index.js] user= workdir=
[Test Operator Charts/operator-test]   ✅  Success - Post Set up Go
[Test Operator Charts/operator-test] ⭐ Run Post Create kind cluster
[Test Operator Charts/operator-test]   🐳  docker exec cmd=[node /var/run/act/actions/[email protected]/cleanup.js] user= workdir=
| Deleting cluster "chart-testing" ...
[Test Operator Charts/operator-test]   ✅  Success - Post Create kind cluster
[Test Operator Charts/operator-test] ⭐ Run Post Set up chart-testing
[Test Operator Charts/operator-test]   🐳  docker cp src=/Users/birca/.cache/act/[email protected]/ dst=/var/run/act/actions/[email protected]/
[Test Operator Charts/operator-test]   ✅  Success - Post Set up chart-testing
[Test Operator Charts/operator-test] ⭐ Run Post actions/setup-python@v2
[Test Operator Charts/operator-test]   🐳  docker exec cmd=[node /var/run/act/actions/actions-setup-python@v2/dist/cache-save/index.js] user= workdir=
[Test Operator Charts/operator-test]   ✅  Success - Post actions/setup-python@v2
[Test Operator Charts/operator-test]   ✅  Success - Post Setup
[Test Operator Charts/operator-test] 🏁  Job succeeded
