
Haproxy crashes instance during scalability test due to high memory usage #8166

Closed
dianabarsan opened this issue Apr 4, 2023 · 15 comments
Labels: Priority: 1 - High (Blocks the next release), Type: Bug (Fix something that isn't working as intended)
Milestone: 4.2.0

Comments


dianabarsan commented Apr 4, 2023

Describe the bug
I started a scalability test with 100 concurrent users, and haproxy crashed while the test was running. The symptoms were:

  • api returned 500 errors for every request
  • the haproxy docker container was still running (and not restarting)
  • the last haproxy log line reported that all workers had been killed

To Reproduce
Steps to reproduce the behavior:

  1. Start scalability test against current master
  2. Watch as scalability test ends with errors

Expected behavior
Haproxy should not crash

Logs
Beginning of haproxy log - to check that we're indeed on the latest config:

Starting enhanced syslogd: rsyslogd.
global
  maxconn 150000
  spread-checks 5
  lua-load-per-thread /usr/local/etc/haproxy/parse_basic.lua
  lua-load-per-thread /usr/local/etc/haproxy/parse_cookie.lua
  lua-load-per-thread /usr/local/etc/haproxy/replace_password.lua
  log stdout len 65535 local2 debug
  tune.bufsize 1638400
  tune.http.maxhdr 1010

API Error:

REQ 20d65ffb-3c6a-4290-af76-5bd28d9a5cf3 35.178.234.114 ac2 GET /medic/_changes?style=all_docs&heartbeat=10000&since=0&limit=100 HTTP/1.0
2023-04-04 14:18:54 ERROR: Server error: RequestError: Error: connect ECONNREFUSED 10.0.1.11:5984
    at new RequestError (/api/node_modules/request-promise-core/lib/errors.js:14:15)
    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:87:29)
    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at self.callback (/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.onRequestError (/api/node_modules/request/request.js:877:8)
    at ClientRequest.emit (node:events:513:28)
    at Socket.socketErrorListener (node:_http_client:494:9)
    at Socket.emit (node:events:513:28)
    at emitErrorNT (node:internal/streams/destroy:157:8) {
  name: 'RequestError',
  message: 'Error: connect ECONNREFUSED 10.0.1.11:5984',
  cause: Error: connect ECONNREFUSED 10.0.1.11:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
    [stack]: 'Error: connect ECONNREFUSED 10.0.1.11:5984\n' +
      '    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)',
    [message]: 'connect ECONNREFUSED 10.0.1.11:5984',
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.11',
    port: 5984
  },
  error: Error: connect ECONNREFUSED 10.0.1.11:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
    [stack]: 'Error: connect ECONNREFUSED 10.0.1.11:5984\n' +
      '    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)',
    [message]: 'connect ECONNREFUSED 10.0.1.11:5984',
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.11',
    port: 5984
  },
} 
RES 20d65ffb-3c6a-4290-af76-5bd28d9a5cf3 35.178.234.114 ac2 GET /medic/_changes?style=all_docs&heartbeat=10000&since=0&limit=100 HTTP/1.0 500 - 1.629 ms

Haproxy end of log:

<150>Apr  4 13:30:20 haproxy[25]: 10.0.1.3,couchdb-3.local,200,1743,6680,0,GET,/medic/org.couchdb.user%3Aac2?,api,medic,'-',495,1546,232,'node-fetch/1.0 (+https://github.com/bitinn/node-fetch)'
<150>Apr  4 13:30:21 haproxy[25]: 10.0.1.3,couchdb-2.local,200,2081,7009,0,GET,/_users/org.couchdb.user%3Aac2?,api,medic,'-',576,2080,313,'node-fetch/1.0 (+https://github.com/bitinn/node-fetch)'
[NOTICE]   (1) : haproxy version is 2.6.12-f588462
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : Current worker (25) exited with code 137 (Killed)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM

Output of docker ps:

CONTAINER ID   IMAGE                                                         COMMAND                   CREATED       STATUS             PORTS                                                                      NAMES
0b2e118f87cc   public.ecr.aws/s5s3h4s7/cht-nginx:4.1.0-alpha                 "/docker-entrypoint.…"    5 hours ago   Up About an hour   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   compose_nginx_1
2f166344d02a   public.ecr.aws/s5s3h4s7/cht-api:4.1.0-alpha                   "/bin/bash /api/dock…"    5 hours ago   Up About an hour   5988/tcp                                                                   compose_api_1
bd47151e7926   public.ecr.aws/s5s3h4s7/cht-sentinel:4.1.0-alpha              "/bin/bash /sentinel…"    5 hours ago   Up About an hour                                                                              compose_sentinel_1
27959cce073d   public.ecr.aws/medic/cht-haproxy:4.1.0-alpha                  "/entrypoint.sh hapr…"    5 hours ago   Up About an hour   5984/tcp                                                                   compose_haproxy_1
53f9ec2c028a   public.ecr.aws/s5s3h4s7/cht-haproxy-healthcheck:4.1.0-alpha   "/bin/sh -c \"/app/ch…"   5 hours ago   Up About an hour                                                                              compose_healthcheck_1

Environment

  • Instance: scalability test instance
  • App: haproxy
  • Version: alpha ~4.2.0

Additional context
Manually restarting the haproxy container fixes the issue.

I don't have any insight into what the underlying issue is; I'll keep running tests to see how frequently this occurs.
In the worst-case scenario, we should find a way to signal the container to restart when the process is terminated.

It looks like this is caused by haproxy allocating resources for each opened connection; those resources are never freed, bloating haproxy memory usage to 17GB, which is more RAM than the test instance has in total.
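
One way to keep an eye on this while a test runs is to watch the container's live memory usage; a minimal sketch, using the haproxy container name from the docker ps output above:

  # stream live CPU/memory usage for the haproxy container
  docker stats compose_haproxy_1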

dianabarsan added the Type: Bug (Fix something that isn't working as intended) label Apr 4, 2023
@dianabarsan

This just happened again:

<149>Apr  4 14:58:20 haproxy[25]: Server couchdb-servers/couchdb-1.local is UP, reason: Layer7 check passed, code: 0, info: "via agent : up", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING]  (25) : Server couchdb-servers/couchdb-1.local is UP, reason: Layer7 check passed, code: 0, info: "via agent : up", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[NOTICE]   (1) : haproxy version is 2.6.12-f588462
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : Current worker (25) exited with code 137 (Killed)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM


dianabarsan commented Apr 5, 2023

This user seems to report the same issue: haproxy/haproxy#1984 , and it seems to affect the same haproxy version that we are using (2.6.x).
The suggestion is to upgrade to 2.7 and see if this occurs again. I'll try that.
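
For the test, the simplest route is probably bumping the base image; a hypothetical one-line change, assuming the cht-haproxy image builds FROM the upstream haproxy image:

  # hypothetical: bump the base image to the 2.7 series for the test build
  FROM haproxy:2.7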


dianabarsan commented Apr 5, 2023

Even with 2.7, I'm still seeing very high memory usage:
[screenshot: memory usage]

This has been reported multiple times. This seems very relevant to us: docker-library/haproxy#194

CPU usage spikes, memory seems constant.

@dianabarsan

The more I look into it, the more this seems to be a memory leak. I'll update the reproduction steps and issue description once I'm certain.

dianabarsan self-assigned this Apr 5, 2023

dianabarsan commented Apr 5, 2023

It looks like memory usage plateaus at 17.56GB. This was tested locally, as the scalability test instance only has 15GB of RAM, and haproxy exceeding that value is probably what causes the instance to become unresponsive. (Monitoring metrics pending.)

[screenshot: memory usage plateau]

This issue appears very relevant: docker-library/haproxy#194

Quoting docker-library/haproxy#194 (comment):

I think the root cause is HAProxy allocating resources for each connection, up to the maximum, and deriving that maximum (maxconn) from the (very high) kernel default file descriptor limit, which is the effective limit when the container runtime file limit is infinity.
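
That root cause is easy to sanity-check from the outside; a minimal sketch for inspecting the effective file descriptor limit inside the container (name taken from the docker ps output above):

  # effective open-file limit as seen by processes inside the haproxy container
  docker exec compose_haproxy_1 sh -c 'ulimit -n'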

dianabarsan added the Priority: 1 - High (Blocks the next release) label Apr 5, 2023
dianabarsan added this to the 4.2.0 milestone Apr 5, 2023
dianabarsan moved this to In Progress in Product Team Activities Apr 5, 2023
dianabarsan changed the title from "Haproxy crashes during scalability test" to "Haproxy crashes instance during scalability test due to high memory usage" Apr 5, 2023

dianabarsan commented Apr 6, 2023

I've traced the high memory usage to a config option that was added at some point in the archv3 work:
https://github.com/medic/cht-core/blob/master/haproxy/default_frontend.cfg#L13-L14

Without these settings, memory usage plateaus at ~2GB. I'm verifying by running the e2e tests and the scalability suite against it.
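
Based on the global section echoed at the top of this issue, the lines in question are presumably:

  tune.bufsize 1638400
  tune.http.maxhdr 1010

A rough worst-case estimate shows why the buffer size alone can explain the growth, assuming roughly two buffers (request and response) per connection:

  1,638,400 bytes × 2 buffers × 5,000 connections ≈ 16.4 GB

which is in the same ballpark as the ~17.5GB plateau observed above.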

@dianabarsan

The solution to this was to:

  1. Remove rsyslog, because:
  • it prevented the container from restarting when the process was killed by the OOM killer
  • on a forceful restart, it would crash the container with:
[NOTICE]   (1) : haproxy version is 2.6.13-234aa6d
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (1) : Exiting Master process...
[WARNING]  (20) : Proxy http-in stopped (cumulated conns: FE: 74, BE: 0).
[WARNING]  (20) : Proxy couchdb-servers stopped (cumulated conns: FE: 0, BE: 74).
[WARNING]  (20) : Proxy <HTTPCLIENT> stopped (cumulated conns: FE: 0, BE: 0).
rsyslogd: pidfile '/run/rsyslogd.pid' and pid 14 already exist.
If you want to run multiple instances of rsyslog, you need to specify
different pid files for them (-i option).
rsyslogd: run failed with error -3000 (see rsyslog.h or try https://www.rsyslog.com/e/3000 to learn what that number means)
[alert] 122/114035 (1) : parseBasic loaded
[alert] 122/114035 (1) : parseCookie loaded
[alert] 122/114035 (1) : replacePassword loaded
[alert] 122/114035 (1) : parseBasic loaded
[alert] 122/114035 (1) : parseCookie loaded
[alert] 122/114035 (1) : replacePassword loaded
[NOTICE]   (1) : New worker (20) forked
[NOTICE]   (1) : Loading success.
[NOTICE]   (1) : haproxy version is 2.6.13-234aa6d
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (1) : Exiting Master process...
[WARNING]  (20) : Proxy http-in stopped (cumulated conns: FE: 114, BE: 0).
[WARNING]  (20) : Proxy couchdb-servers stopped (cumulated conns: FE: 0, BE: 114).
[WARNING]  (20) : Proxy <HTTPCLIENT> stopped (cumulated conns: FE: 0, BE: 0).
  2. Lower tune.bufsize from 1638400 to 32768
  3. Reduce maxconn from 150000 to 60000
  4. Add a docker limit of 1GB of memory to the haproxy container.
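
A minimal sketch of what those changes amount to; exact file names and surrounding settings in cht-core may differ:

  # haproxy global section after the change
  global
    maxconn 60000
    tune.bufsize 32768

  # docker-compose template: cap the haproxy container's memory
  haproxy:
    deploy:
      resources:
        limits:
          memory: 1G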

dianabarsan moved this from In Progress to Ready for AT in Product Team Activities May 4, 2023
garethbowen moved this from Ready for AT to In Progress in Product Team Activities May 7, 2023
@garethbowen

Moved to "in progress" as we've decided to remove the "AT" statuses. @dianabarsan Please work with teammates until you're confident to merge this.

@mrjones-plip

@dianabarsan - I see @garethbowen approved the PR last week, but there's a QA ("assistance"!! ; ) request in the prior comment. Lemme know if you need a hand with that ✋

@dianabarsan

hey @mrjones-plip
Thanks for jumping in! I'm very close to confident about merging this already. Do you have any ideas on something specific to test?

@mrjones-plip

If you're good, I'm good!

Otherwise, my idea was nothing specific; I was gonna take a bit more of a black-box approach and throw ab or wrk at it before and after the fix to see how it behaves. I was also thinking of doing my own exploratory reading about tune.bufsize and maxconn.

@dianabarsan

Go ahead!

@mrjones-plip

Awesome! I'll work on this today and hopefully have something by the time you come back online tomorrow 🤞


mrjones-plip commented May 9, 2023

tl;dr - I think the improvements are great! We're now much more resilient against HA going down and serving 500s.


I had trouble reliably reproducing this with an unconstrained container on a 32GB bare-metal machine. HA proxy would float up to about 16GB of RAM use.

When I constrained the container to 6GB RAM, it was no problem to reproduce. Here are the haproxy changes I made in the compose file:

  haproxy:
    deploy:
      resources:
        limits:
          memory: 6G

In testing I always used the same command:

ab -s 300 -k -c 600 -n 800 https://INSTANCE_URL/api/v2/monitoring

While having a concurrency of 600 across 800 requests to our monitoring API is totally unreasonable, we should never irrecoverably crash.
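
For reference, the ab flags used above are:

  # -s 300   wait up to 300 seconds before a socket times out
  # -k       use HTTP keep-alive
  # -c 600   600 concurrent requests
  # -n 800   800 requests in total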

Additionally, I used lazydocker and followed the logs of the haproxy instance with docker logs -f HA_PROXY_CONTAINER.

Every single time I ran this command against my 4.1.0 instance it crashed in under 10 seconds. Every single time, all day every day:

[video: fail.good.720.mp4]

While hitting the API would return an HTTP 200, the stats were all -1 values. Further, all requests to /medic and the like returned 500. The only fix was to restart the HA Proxy service.

Using 4.1.0-8166-new-haproxy, I did see failure rates as high as 85% and as low as 10%. Most importantly, HA Proxy was rock solid and NEVER crashed:

[video: good.good.720.mp4]

Despite the errors, both the API and the web GUI (including /medic) were responsive before and after the test calls.

dianabarsan added a commit that referenced this issue May 10, 2023
…#8170)

Adds following haproxy and container changes: 
- lowers maxconn to 60000
- removes rsyslog
- lowers tune.buffsize to 2x default value
- adds max 1G memory limit to haproxy container in docker-compose template

#8166
dianabarsan moved this from In Progress to Done in Product Team Activities May 10, 2023
@dianabarsan

Merged to master.
