Scale upstreams tests reports errors because zone size for NGINX Plus upstream #2023

Closed
pleshakov opened this issue May 23, 2024 · 3 comments · Fixed by #2439
Labels: refined, size/small, tests

@pleshakov
Contributor

When the scale test runs against NGINX Plus with 648 upstream servers, it reports both NGF and NGINX Plus errors. At some point the upstream zone is too small to hold all the upstream servers, so NGF fails to update NGINX Plus through the API and falls back to reloading the configuration.

## Test TestScale_UpstreamServers

### Reloads

- Total: 3
- Total Errors: 0
- Average Time: 126ms
- Reload distribution:
	- 500ms: 3
	- 1000ms: 3
	- 5000ms: 3
	- 10000ms: 3
	- 30000ms: 3
	- +Infms: 3

### Event Batch Processing

- Total: 210
- Average Time: 93ms
- Event Batch Processing distribution:
	- 500ms: 209
	- 1000ms: 210
	- 5000ms: 210
	- 10000ms: 210
	- 30000ms: 210
	- +Infms: 210
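
The distribution lines above look like cumulative histogram buckets (each count includes everything at or below that bound), which is an assumption based on how the numbers never decrease. Under that assumption, a minimal Go sketch for turning them into per-interval counts could look like this; the `bucket` type and `perBucket` function are hypothetical names used only for illustration:

```go
package main

import "fmt"

// bucket is one line of the report: an upper-bound label and the cumulative
// count of observations at or below that bound (assumed, not confirmed).
type bucket struct {
	UpperBound string
	Cumulative int
}

// perBucket converts cumulative counts into the number of observations that
// fall into each individual interval.
func perBucket(cumulative []bucket) []int {
	counts := make([]int, len(cumulative))
	prev := 0
	for i, b := range cumulative {
		counts[i] = b.Cumulative - prev
		prev = b.Cumulative
	}
	return counts
}

func main() {
	// Event Batch Processing distribution copied from the report above.
	eventBatches := []bucket{
		{"500ms", 209}, {"1000ms", 210}, {"5000ms", 210},
		{"10000ms", 210}, {"30000ms", 210}, {"+Inf", 210},
	}
	fmt.Println(perBucket(eventBatches)) // prints [209 1 0 0 0 0]
}
```

Read this way, 209 of the 210 event batches finished within 500ms and one took between 500ms and 1000ms.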

### Errors

- NGF errors: 1
- NGF container restarts: 0
- NGINX errors: 2
- NGINX container restarts: 0

NGF log excerpt:

{"level":"error","ts":"2024-05-22T21:38:08Z","logger":"eventLoop.eventHandler","msg":"couldn't update upstream via the API, reloading configuration instead","batchID":227,"upstreamName":"scale_backend_80","error":"failed to update servers of scale_backend_80 upstream: failed to add 10.120.11.62:8080 server to scale_backend_80 upstream: expected 201 response, got 500. error.status=500; error.text=upstream memory exhausted; error.code=UpstreamOutOfMemory; request_id=0488143fe13f9042401627c559a66af1; href=https://nginx.org/en/docs/http/ngx_http_api_module.html","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).updateUpstreamServers\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:377\ngithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:204\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"}
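
For anyone triaging similar runs, a small sketch of how such a log line could be checked for this specific failure. It assumes the NGF log is one JSON object per line with the `level`, `upstreamName`, and `error` fields seen above; `logEntry` and `isZoneExhausted` are hypothetical helpers, not part of NGF, and the sample line in `main` is abbreviated from the excerpt:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds only the fields of the JSON log line that this check needs.
type logEntry struct {
	Level        string `json:"level"`
	UpstreamName string `json:"upstreamName"`
	Error        string `json:"error"`
}

// isZoneExhausted reports whether a log line records an error whose text
// carries the UpstreamOutOfMemory code, and returns the affected upstream.
func isZoneExhausted(line string) (string, bool) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return "", false // not a JSON log line
	}
	if e.Level == "error" && strings.Contains(e.Error, "error.code=UpstreamOutOfMemory") {
		return e.UpstreamName, true
	}
	return "", false
}

func main() {
	// Abbreviated copy of the excerpt above; elided parts are marked "...".
	line := `{"level":"error","upstreamName":"scale_backend_80","error":"... error.text=upstream memory exhausted; error.code=UpstreamOutOfMemory; ..."}`
	name, ok := isZoneExhausted(line)
	fmt.Println(name, ok) // prints scale_backend_80 true
}
```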

NGINX Plus log:

2024/05/22 21:38:08 [crit] 118#118: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"
2024/05/22 21:38:08 [notice] 25#25: signal 1 (SIGHUP) received from 6, reconfiguring
2024/05/22 21:38:08 [notice] 25#25: reconfiguring
2024/05/22 21:38:08 [crit] 25#25: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"
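
The two `[crit]` lines above account for the "NGINX errors: 2" count in the report. A rough sketch of counting such failures per upstream zone, assuming the message format shown in this log; the pattern and the `exhaustedZones` helper are illustrative, not taken from the test framework:

```go
package main

import (
	"fmt"
	"regexp"
)

// slabAllocFailure matches the critical message NGINX logged above when an
// upstream zone ran out of shared memory, capturing the zone name.
var slabAllocFailure = regexp.MustCompile(
	`\[crit\] .*ngx_slab_alloc\(\) failed: no memory in upstream zone "([^"]+)"`)

// exhaustedZones counts allocation failures per upstream zone.
func exhaustedZones(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, l := range lines {
		if m := slabAllocFailure.FindStringSubmatch(l); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Log lines copied from the NGINX Plus log above.
	lines := []string{
		`2024/05/22 21:38:08 [crit] 118#118: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"`,
		`2024/05/22 21:38:08 [notice] 25#25: signal 1 (SIGHUP) received from 6, reconfiguring`,
		`2024/05/22 21:38:08 [notice] 25#25: reconfiguring`,
		`2024/05/22 21:38:08 [crit] 25#25: ngx_slab_alloc() failed: no memory in upstream zone "scale_backend_80"`,
	}
	fmt.Println(exhaustedZones(lines)) // prints map[scale_backend_80:2]
}
```

The first failure comes from a worker process handling the API call; the second comes from the master process during the fallback reload, so the reload does not recover from the exhausted zone either.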

Results: https://github.com/nginxinc/nginx-gateway-fabric/blob/467fd76acebe746aacdf34426000a74e54fdda4b/tests/results/scale/edge/TestScale_UpstreamServers

Acceptance criteria:

  • Ensure scale upstreams test passes
pleshakov added a commit that referenced this issue May 23, 2024
Problem:
The scale test is not part of the GitHub Actions pipeline.

Solution:
- Add the NFR scale test to the GitHub Actions pipeline alongside the
  other NFR tests.
- Increase the size of the cluster used for NFR tests, as the scale
  test requires a bigger cluster.

Testing:
- Successfully run with NGINX -- #2002
- Successfully run with NGINX Plus -- #2017

Some scale test issues were discovered:
- #2023
- #2009

Closes #1927
@kate-osborn
Contributor

The NGINX Plus (N+) upstream scale test should only scale to 556 Pods.

From the old scale test doc:

Scale the deployment for that Service to 648 Pods for OSS and 556 Pods for Plus (these are the limits that the upstream zone size allows)
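
One way the test could respect these limits is to pick the target replica count from the NGINX flavor. The sketch below is only an illustration of that idea: the constants come from the old scale test doc quoted above, while `upstreamScaleTarget` and `clampReplicas` are hypothetical names, not functions from the NGF test suite:

```go
package main

import "fmt"

// Limits quoted from the old scale test doc: the number of upstream servers
// that the upstream zone size allows for each NGINX flavor.
const (
	maxUpstreamServersOSS  = 648
	maxUpstreamServersPlus = 556
)

// upstreamScaleTarget returns the Pod count to scale to for the given flavor.
func upstreamScaleTarget(plus bool) int {
	if plus {
		return maxUpstreamServersPlus
	}
	return maxUpstreamServersOSS
}

// clampReplicas caps a requested replica count at the flavor's limit.
func clampReplicas(requested int, plus bool) int {
	if limit := upstreamScaleTarget(plus); requested > limit {
		return limit
	}
	return requested
}

func main() {
	fmt.Println(clampReplicas(648, true))  // prints 556
	fmt.Println(clampReplicas(648, false)) // prints 648
}
```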

@mpstefan mpstefan added tests Pull requests that update tests needs-more-info Issue needs more information from creator labels Jun 5, 2024
@mpstefan
Collaborator

mpstefan commented Jun 5, 2024

Does this error prevent us from running the test? As @kate-osborn points out, if we scale to 556, does this error affect us?

@sjberman
Contributor

sjberman commented Jun 6, 2024

Issue seen again: #2110

Test still runs, just fails to reach scale.

@mpstefan mpstefan removed the needs-more-info Issue needs more information from creator label Jun 17, 2024
@mpstefan mpstefan added this to the v1.4.0 milestone Jun 17, 2024
@mpstefan mpstefan added refined Requirements are refined and the issue is ready to be implemented. size/small Estimated to be completed within ~2 days labels Jun 24, 2024
@mpstefan mpstefan modified the milestones: v1.4.0, v2.0.0 Jul 23, 2024
@bjee19 bjee19 self-assigned this Aug 19, 2024
@bjee19 bjee19 moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric Aug 19, 2024
@github-project-automation github-project-automation bot moved this from 👀 In Review to ✅ Done in NGINX Gateway Fabric Aug 23, 2024