
Helm tests failing with v0.71.0 operator #1574

Closed
Allex1 opened this issue Mar 13, 2023 · 15 comments · Fixed by #1582

Comments

@Allex1
Contributor

Allex1 commented Mar 13, 2023

Hi all,

I'm trying to upgrade the helm stack to use operator v0.71.0, and the tests are failing in a non-obvious way.

Cloning the 0.71 release tag, I get https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4396632296/jobs/7699210715#step:13:991

Cloning main (since I saw #1553 is not included in the release), I get https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4402557919/jobs/7710082951

I'm not very familiar with the operator tests, so I could use some help. @moh-osman3 can you have a look?

Thanks

@moh-osman3
Contributor

I'm unsure what's going on here; it's strange this didn't fail for the operator tests in this repo. From your second run (cloning main), it seems that a step is skipped (https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4402557919/jobs/7710082951#step:13:761). This test should be running this chart, which should have the desired behavior fields. Once this is installed in step 2, it should be validated by the script in step 3. But since 02-install.yaml doesn't get installed, this causes a segfault instead.

It's also strange that the 0.71.0 tests, which don't include my recent changes, fail as well; it seems scale-down doesn't happen in time despite the stabilization window being set to 15 seconds. I wonder if there's something obvious I'm missing here from when I added to these e2e tests. (cc @iblancasa @pavolloffay)
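For reference, the scale-down setting under discussion maps to the behavior stanza of an autoscaling/v2 HorizontalPodAutoscaler. A minimal sketch (the target names here are illustrative, taken from the test log; the operator generates the actual HPA):

```yaml
# Illustrative HPA showing the scaleDown stabilization window discussed above.
# Names and the scaleTargetRef are hypothetical examples, not the operator's exact output.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: simplest-set-utilization-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simplest-set-utilization-collector
  minReplicas: 1
  maxReplicas: 2
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 15  # when unset, Kubernetes defaults to 300s
```

If `behavior.scaleDown` is absent, the HPA falls back to the 300-second default window, which matches the slow scale-down described later in this thread.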

@iblancasa
Contributor

When cloning the 0.71 release tag

  1. For me, the error you're hitting sounds like a timeout. It takes some time for the HPA to scale down the collector, and maybe the value for stabilizationWindowSeconds is not small enough. Also, the tests run in parallel, which consumes more resources, so the HPA will take longer to scale down. Can you reproduce it consistently? And locally?

When cloning main

  1. I can see some errors related to namespace creation/deletion. That is a problem related directly to KUTTL. Running the tests sequentially can work around it.
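If running sequentially helps, kuttl's parallelism can be capped in the TestSuite configuration rather than on the command line. A sketch, with an illustrative test directory:

```yaml
# kuttl-test.yaml (sketch; testDirs path is an example)
apiVersion: kuttl.dev/v1beta1
kind: TestSuite
testDirs:
  - ./tests/e2e/
parallel: 1  # run test cases one at a time instead of kuttl's default parallelism
```

Running one case at a time reduces resource contention in the cluster, at the cost of a longer overall run.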

@moh-osman3
Contributor

@Allex1 I created a PR which adds nil checks and reduces the stabilization window. The nil checks should at least remove the segfault for your test based on main, and I reduced the stabilization window in case it is somehow too high.

Can you run the helm tests again based on my PR's branch? Hopefully the reason for the failure will become more obvious.

@Allex1
Contributor Author

Allex1 commented Mar 14, 2023

@moh-osman3 It currently fails: https://github.com/open-telemetry/opentelemetry-helm-charts/actions/runs/4414191149/jobs/7735581591

logger.go:42: 09:46:18 | autoscale/3-install | Incorrect scaleDown stabilization window found for HPA simplest-set-utilization-collector

I'm also running the tests locally.

@moh-osman3
Contributor

moh-osman3 commented Mar 14, 2023

Hmm, since it's failing on this step, it must have already validated the metrics correctly. This means metrics are being updated properly, but behavior is not, which would be a bug. I'm wondering what's different in how the helm repo sets up the environment for the operator e2e tests, i.e. why is this bug found in this test but not in the e2e tests the operator runs?

Behavior also seems to update properly in my own remote cluster testing.

@moh-osman3
Contributor

I reproduced this locally in the helm tests and now realize that there is not a bug with behavior updates (the behavior-update failure is because the helm test runs operator v0.71.0, which doesn't include my HPA behavior change). The primary issue is that scale-down is not happening as expected: the scaleDown stabilizationWindowSeconds is not taking effect, so scale-down takes the default 300s.

I am unsure why the scale-down setting is being ignored, but I noticed that the helm repo uses Kubernetes v1.23 while the operator repo uses Kubernetes v1.25. I don't think this is an operator bug, since these tests pass on Kubernetes v1.25.

@moh-osman3
Contributor

To fix this I removed the behavior update from the e2e tests and added a timeout of 360s to give the HPA enough time to scale down. @Allex1 can you try running this again with my latest PR updates?

@pavolloffay @iblancasa I'd appreciate some guidance on the best way forward for this issue. Is a timeout okay? Do you have any idea why this scale-down window is not being honored in Kubernetes v1.23?
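The 360s timeout mentioned above can be set once for the whole suite in the kuttl TestSuite config (a sketch; the testDirs path is illustrative):

```yaml
# kuttl-test.yaml (sketch)
apiVersion: kuttl.dev/v1beta1
kind: TestSuite
testDirs:
  - ./tests/e2e/
timeout: 360  # seconds per step; kuttl's default is 30
```

This matches the `--timeout` CLI flag quoted later in the thread; the TestSuite value serves as the suite-wide default.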

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

@moh-osman3 Thanks, I started the CI again. We can probably test against Kubernetes 1.25 in the helm repo as well.
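Since the CI uses a kind cluster, moving the helm repo to Kubernetes 1.25 could be done by pinning the node image; a sketch (the image tag is an example, and the matching digest should be taken from the release notes of the kind version in use):

```yaml
# kind cluster config (sketch; image tag is an example)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.25.3  # pin the digest recommended for your kind release
```

This would let the helm tests run against the same Kubernetes minor version the operator repo already tests with.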

@moh-osman3
Contributor

@Allex1 It still seems to fail, which is strange because the autoscale test passes locally for me. I'm really unsure what's going on.

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

Yeah, not sure either; it fails locally as well with the helm tests.

@iblancasa
Contributor

To fix this I removed the behavior update from the e2e tests and I added a timeout of 360s to give enough time for the hpa to scaledown. @Allex1 can you try running this again with my latest PR updates?

Since the problem is in the Helm project, it would make more sense to adjust it in their repository:

$ kuttl test --help | grep -a timeout
      --timeout int                   The timeout to use as default for TestSuite configuration. (default 30)

Regarding the problem: I'll try to reproduce it locally.

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

@iblancasa the timeout arg can be passed from the TestSuite config as well, which we already do.

Strangely enough, my PR that failed CI passes just fine locally when run via act after some tweaks:

| --- PASS: kuttl/harness/autoscale (498.10s)

@iblancasa
Contributor

@iblancasa the timeout arg can be passed from the testsuite config as well which we already did.

Strangely enough, my PR that failed CI passes just fine locally when run via act after some tweaks:

| --- PASS: kuttl/harness/autoscale (498.10s)

I know. What I wanted to avoid was making the modifications in this repository. I don't understand why the test takes so long to pass in your repository.

I just ran them locally and all the tests passed:

--- PASS: kuttl (384.51s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/smoke-simplest (19.05s)
        --- PASS: kuttl/harness/smoke-sidecar-other-namespace (24.92s)
        --- PASS: kuttl/harness/smoke-pod-annotations (35.70s)
        --- PASS: kuttl/harness/smoke-restarting-deployment (37.50s)
        --- PASS: kuttl/harness/statefulset-features (48.68s)
        --- PASS: kuttl/harness/instrumentation-python (98.93s)
        --- PASS: kuttl/harness/route (138.46s)
        --- PASS: kuttl/harness/instrumentation-java-other-ns (25.15s)
        --- PASS: kuttl/harness/instrumentation-nodejs (114.76s)
        --- PASS: kuttl/harness/instrumentation-java-multicontainer (148.22s)
        --- PASS: kuttl/harness/targetallocator-features (31.23s)
        --- PASS: kuttl/harness/instrumentation-python-multicontainer (145.86s)
        --- PASS: kuttl/harness/smoke-targetallocator (11.06s)
        --- PASS: kuttl/harness/ingress (6.84s)
        --- PASS: kuttl/harness/instrumentation-java (14.44s)
        --- PASS: kuttl/harness/smoke-statefulset (8.97s)
        --- PASS: kuttl/harness/daemonset-features (9.26s)
        --- PASS: kuttl/harness/instrumentation-dotnet (29.21s)
        --- PASS: kuttl/harness/instrumentation-dotnet-multicontainer (32.00s)
        --- PASS: kuttl/harness/instrumentation-nodejs-multicontainer (169.44s)
        --- PASS: kuttl/harness/smoke-sidecar (50.14s)
        --- PASS: kuttl/harness/instrumentation-sdk (8.17s)
        --- PASS: kuttl/harness/autoscale (384.16s)
$ git rev-parse HEAD
10918c501c0f6886701058afee719b29e5bdb276
$ cd opentelemetry-operator/ && git rev-parse HEAD
0e39ee77693146e0924da3ca474a0fe14dc30b3a
$ k version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.25.3

@Allex1
Contributor Author

Allex1 commented Mar 15, 2023

Locally it works for me as well with the opentelemetry-operator v0.72.0 tag:

|     --- PASS: kuttl/harness (0.00s)
|         --- PASS: kuttl/harness/route (2.17s)
|         --- PASS: kuttl/harness/smoke-restarting-deployment (34.73s)
|         --- PASS: kuttl/harness/smoke-pod-annotations (34.74s)
|         --- PASS: kuttl/harness/smoke-sidecar (45.89s)
|         --- PASS: kuttl/harness/targetallocator-features (47.31s)
|         --- PASS: kuttl/harness/instrumentation-sdk (52.95s)
|         --- PASS: kuttl/harness/smoke-sidecar-other-namespace (60.10s)
|         --- PASS: kuttl/harness/statefulset-features (64.49s)
|         --- PASS: kuttl/harness/instrumentation-java-other-ns (74.12s)
|         --- PASS: kuttl/harness/instrumentation-python (111.61s)
|         --- PASS: kuttl/harness/instrumentation-nodejs (112.65s)
|         --- PASS: kuttl/harness/ingress (1.63s)
|         --- PASS: kuttl/harness/smoke-targetallocator (2.36s)
|         --- PASS: kuttl/harness/daemonset-features (3.17s)
|         --- PASS: kuttl/harness/smoke-simplest (3.22s)
|         --- PASS: kuttl/harness/smoke-statefulset (2.19s)
|         --- PASS: kuttl/harness/instrumentation-dotnet (126.61s)
|         --- PASS: kuttl/harness/instrumentation-java (58.52s)
|         --- PASS: kuttl/harness/instrumentation-java-multicontainer (163.34s)
|         --- PASS: kuttl/harness/instrumentation-python-multicontainer (165.12s)
|         --- PASS: kuttl/harness/instrumentation-dotnet-multicontainer (46.46s)
|         --- PASS: kuttl/harness/instrumentation-nodejs-multicontainer (156.03s)
|         --- PASS: kuttl/harness/autoscale (498.36s)
| PASS
[Test Operator Charts/operator-test]   ✅  Success - Main Run e2e tests
[Test Operator Charts/operator-test] ⭐ Run Post Setup
[Test Operator Charts/operator-test] ⭐ Run Post Set up Go
[Test Operator Charts/operator-test]   🐳  docker exec cmd=[node /var/run/act/actions/actions-setup-go@v3/dist/cache-save/index.js] user= workdir=
[Test Operator Charts/operator-test]   ✅  Success - Post Set up Go
[Test Operator Charts/operator-test] ⭐ Run Post Create kind cluster
[Test Operator Charts/operator-test]   🐳  docker exec cmd=[node /var/run/act/actions/[email protected]/cleanup.js] user= workdir=
| Deleting cluster "chart-testing" ...
[Test Operator Charts/operator-test]   ✅  Success - Post Create kind cluster
[Test Operator Charts/operator-test] ⭐ Run Post Set up chart-testing
[Test Operator Charts/operator-test]   🐳  docker cp src=/Users/birca/.cache/act/[email protected]/ dst=/var/run/act/actions/[email protected]/
[Test Operator Charts/operator-test]   ✅  Success - Post Set up chart-testing
[Test Operator Charts/operator-test] ⭐ Run Post actions/setup-python@v2
[Test Operator Charts/operator-test]   🐳  docker exec cmd=[node /var/run/act/actions/actions-setup-python@v2/dist/cache-save/index.js] user= workdir=
[Test Operator Charts/operator-test]   ✅  Success - Post actions/setup-python@v2
[Test Operator Charts/operator-test]   ✅  Success - Post Setup
[Test Operator Charts/operator-test] 🏁  Job succeeded
