Caddy 2.6.3 high memory usage #5366

Closed

GerritKopp opened this issue Feb 10, 2023 · 17 comments

Labels: bug 🐞 Something isn't working

@GerritKopp

We are using Caddy on Ubuntu 20.04 on one of our servers as a reverse proxy for our Docker registry and regularly pull and push Docker images to and from this registry. The server initially had 8 GB of memory and everything worked fine with Caddy 2.6.2.

After updating to Caddy 2.6.3, the server regularly runs out of memory when building and publishing new Docker images via our CI system, and Caddy gets killed as a result (see below). This still happens even after we increased the server memory to 16 GB + 32 GB swap.

A downgrade to 2.6.2 fixes the issue.

Did some configuration defaults change in the new version? Is there a way to limit the memory usage? Is there maybe a memory leak in 2.6.3?

~$ dmesg -T | egrep -i 'killed process'
[Fri Feb 10 13:45:11 2023] Out of memory: Killed process 792 (caddy) total-vm:45205552kB, anon-rss:14974688kB, file-rss:0kB, shmem-rss:0kB, UID:997 pgtables:86664kB oom_score_adj:0
[Fri Feb 10 13:50:59 2023] Out of memory: Killed process 2876 (caddy) total-vm:45738576kB, anon-rss:14941248kB, file-rss:0kB, shmem-rss:0kB, UID:997 pgtables:86736kB oom_score_adj:0
@mohammed90
Member

mohammed90 commented Feb 10, 2023

Is it the build step that's running out of memory, or the Caddy server itself?

Please share the steps to reproduce. If it's the build, share your build process. If it's Caddy itself, fill out the form below. Also, please share the output of http://localhost:2019/debug/pprof.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍 I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```


### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```


### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```


## 2. Description

### 2a. What happens (briefly explain what is wrong)




### 2b. Why it's a bug (if it's not obvious)




### 2c. Log output

```
paste terminal output or logs here
```



### 2d. Workaround(s)




### 2e. Relevant links




## 3. Tutorial (minimal steps to reproduce the bug)




Helpful tips

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:
{ ... }

Open terminal and run Caddy:

$ caddy ...

Make an HTTP request:

$ curl ...

Notice that the result is ___ but it should be ___.

@mohammed90 mohammed90 added the needs info 📭 Requires more information label Feb 10, 2023
@GerritKopp
Author

1. Environment

1a. Operating system and version

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

1b. Caddy version (run caddy version or paste commit SHA)

v2.6.3 h1:QRVBNIqfpqZ1eJacY44I6eUC1OcxQ8D04EKImzpj7S8=

1c. Go version (if building Caddy from source; run go version)

n/a

Caddy is installed from the APT repository.

2. Description

2a. What happens (briefly explain what is wrong)

Server 1: Caddy runs as a reverse-proxy for a Docker registry / Sonatype Nexus Repository (both on the same server), configured via e.g. reverse_proxy localhost:8004.
Server 2: Builds and publishes Docker images to the registry on Server 1. While building new images, layers might also be pulled from Server 1.

Server 1 runs out of memory because Caddy 2.6.3 uses huge amounts of memory. Caddy then gets killed by the system.

~$ dmesg -T | egrep -i 'killed process'
[Fri Feb 10 13:45:11 2023] Out of memory: Killed process 792 (caddy) total-vm:45205552kB, anon-rss:14974688kB, file-rss:0kB, shmem-rss:0kB, UID:997 pgtables:86664kB oom_score_adj:0
[Fri Feb 10 13:50:59 2023] Out of memory: Killed process 2876 (caddy) total-vm:45738576kB, anon-rss:14941248kB, file-rss:0kB, shmem-rss:0kB, UID:997 pgtables:86736kB oom_score_adj:0

2b. Why it's a bug (if it's not obvious)

The same process from 2a worked fine with Caddy 2.6.2, when Server 1 had only 8 GB of RAM.
The process from 2a fails with Caddy 2.6.3 with 8 GB of RAM and even with 48 GB (16 GB + 32 GB swapfile).

2c. Log output

There is no log output after Caddy is started until it gets killed by the system. These are the lines from journalctl:

Feb 10 13:41:08 dev-3.redacted.domain systemd[1]: Started Caddy.
Feb 10 13:45:13 dev-3.redacted.domain systemd[1]: caddy.service: Main process exited, code=killed, status=9/KILL
Feb 10 13:45:13 dev-3.redacted.domain systemd[1]: caddy.service: Failed with result 'signal'.

2d. Workaround(s)

A downgrade to Caddy 2.6.2 fixes the issue.
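
For reference, pinning the older package looks roughly like this (the exact version string in the APT repository may differ):

sudo apt-get install --allow-downgrades caddy=2.6.2
# keep apt from upgrading it again until a fixed release is available
sudo apt-mark hold caddy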

2e. Relevant links

n/a

3. Tutorial (minimal steps to reproduce the bug)

Unfortunately, I currently have no simple steps to reproduce the issue outside of our environment.

Output of http://localhost:2019/debug/pprof:

Count  Profile
268    allocs
0      block
0      cmdline
27     goroutine
268    heap
0      mutex
0      profile
13     threadcreate
0      trace

@mohammed90
Member

mohammed90 commented Feb 10, 2023

My hunch is that Caddy is buffering the request and/or response in the reverse-proxy handler due to the changes in #5289. Try disabling buffering by setting flush_interval to -1. Could you try that and report back?

@GerritKopp
Author

I have changed the config from

  reverse_proxy localhost:8004

to

reverse_proxy localhost:8004 {
    flush_interval -1
}

and restarted Caddy. It doesn't seem to have an effect. The memory still gets filled up pretty quickly when starting the process again.

@mholt
Member

mholt commented Feb 10, 2023

@GerritKopp What is your full, unredacted config? What is the output of top (or htop) just before a crash?

Can you please collect a heap profile? That is what we mean when we talk about that debug/pprof endpoint -- not just the counts next to the links :) For example, for a 30-second heap profile, add ?seconds=30 to that link. If you could also provide the dump of goroutines, that's often very helpful.
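
For example (adjust the address if your admin endpoint is not on the default localhost:2019):

# 30-second heap (delta) profile
curl -o heap.pprof 'http://localhost:2019/debug/pprof/heap?seconds=30'

# dump of all goroutines
curl -o goroutines.txt 'http://localhost:2019/debug/pprof/goroutine?debug=2'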

@GerritKopp
Author

To reproduce the results faster, the server was configured with 16 GB RAM and no swap file. Caddy 2.6.2 had no issues with the same setup with only 8 GB.

I tried to collect all the requested metrics. Since Caddy gets killed, the heap-delta profile had to be stopped early. I ran top in batch mode, starting while Caddy was idle and continuing until after Caddy was killed.

The goroutines dump is from another run performing the same tasks, when the server was nearly maxed out on memory again.

All files in an archive: top-output-and-heap-profile.zip

Caddyfile (with redacted credentials):

(error_pages) {
  handle_errors {
    header content-type "text/html"
    respond * "" {http.error.status_code} {
      import error_snippet
      close
    }
  }
}

(error_snippet) {
body "<!DOCTYPE html>
<html>
  <title>Error {http.error.status_code}</title>
  <p>Error {http.error.status_code}</p>
</html>
"
}

entwicklung-3.t2med.eu {
  root * /usr/share/caddy
  file_server

  import error_pages

  handle_path /monitoring/prometheus* {
    rewrite * /metrics{path}
    reverse_proxy localhost:9100
    basicauth * {
      hidden-user hidden-password
    }
  }
}

nexus.t2med.eu {
  reverse_proxy localhost:8002
  import error_pages
}

docker.t2med.eu {
  reverse_proxy localhost:8003
  import error_pages
}

upload.docker.t2med.eu {
  reverse_proxy localhost:8004
  import error_pages
}

I hope this information is useful.

@mholt
Member

mholt commented Feb 10, 2023

Thanks! Yeah that is way more insight than before.

Goroutine dump is extremely small -- looks like you're only serving a small handful of requests, but at least one of those requests is causing your server to spiral:

(pprof) top
Showing nodes accounting for 3.31GB, 100% of 3.32GB total
Dropped 13 nodes (cum <= 0.02GB)
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
    3.31GB   100%   100%     3.31GB   100%  bytes.growSlice
         0     0%   100%     3.31GB   100%  bytes.(*Buffer).ReadFrom
         0     0%   100%     3.31GB   100%  bytes.(*Buffer).grow
         0     0%   100%     3.31GB 99.91%  github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).ServeHTTP
         0     0%   100%     3.31GB   100%  github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).enforcementHandler
         0     0%   100%     3.31GB   100%  github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).wrapPrimaryRoute.func1
         0     0%   100%     3.31GB   100%  github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Subroute).ServeHTTP
         0     0%   100%     3.31GB   100%  github.com/caddyserver/caddy/v2/modules/caddyhttp.HandlerFunc.ServeHTTP
         0     0%   100%     3.31GB   100%  github.com/caddyserver/caddy/v2/modules/caddyhttp.wrapMiddleware.func1.1
         0     0%   100%     3.31GB   100%  github.com/caddyserver/caddy/v2/modules/caddyhttp.wrapRoute.func1.1

Something is buffering wildly.

Can you please have another go at this, but:

  • Enable debug and access logs? Put debug in the global options at the top, and then log in all of your site blocks (sketch below).
  • Do you think you could also grab the full goroutine stack dump just before a crash? It's at the URI /debug/pprof/goroutine?debug=2.
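
Something like this is what I mean for the logging (a sketch; repeat log in each of your site blocks):

{
  debug
}

docker.t2med.eu {
  # per-site access log
  log

  reverse_proxy localhost:8003
  import error_pages
}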

What kinds of requests is this server getting? Are clients uploading or downloading large files? What are the headers of the requests like? If you could post the unredacted logs here that would be extremely helpful 😊 Thank you!

@mholt
Member

mholt commented Feb 10, 2023

The reverse proxy is buffering the body... (pprof graph attachment: profile001)

Here's the relevant code, I think:

// if enabled, buffer client request; this should only be
// enabled if the upstream requires it and does not work
// with "slow clients" (gunicorn, etc.) - this obviously
// has a perf overhead and makes the proxy at risk of
// exhausting memory and more susceptible to slowloris
// attacks, so it is strongly recommended to only use this
// feature if absolutely required, if read timeouts are
// set, and if body size is limited
if (h.BufferRequests || isChunkedRequest(req)) && req.Body != nil {
req.Body, req.ContentLength = h.bufferedBody(req.Body)
req.Header.Set("Content-Length", strconv.FormatInt(req.ContentLength, 10))
}

These must be chunked requests...
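
A quick way to exercise that code path would be to stream a body of unknown length through the proxy, since curl then falls back to chunked transfer encoding (just a sketch; the upload path is made up):

# curl -T - reads the body from stdin, so it cannot set Content-Length and
# sends Transfer-Encoding: chunked instead; on 2.6.3 the proxy buffers it all
head -c 1G /dev/zero | curl -T - https://upload.docker.t2med.eu/test-chunked-upload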

@mholt
Member

mholt commented Feb 10, 2023

Ok, culprit is this PR here: #5289

@u5surf Would you be able to help look into this? We will likely need to release a hotfix in v2.6.4 for this.

(This is also my oversight in code review, as we should not be buffering chunked requests. The reason they are chunked is because they are large. They should not be buffered.)

Edit: Wow, I just noticed that @mohammed90 caught it hours before I did. Sorry that slipped by my scrolling. (The GitHub app is... a bit janky with comments. I'm on my desktop now.) Thank you Mohammed!

@GerritKopp
Author

Are more detailed logs still needed? The earliest I could look into the necessary configuration changes to collect them would be next Monday.

@mholt
Member

mholt commented Feb 10, 2023

@GerritKopp Nope, sorry -- I should have let you know that I think we have enough info now for a hotfix.

@GerritKopp
Author

No problem. Looking forward to the fix. Thanks!

mholt added a commit that referenced this issue Feb 10, 2023
Mostly reverts 845bc4d (#5289)

Adds warning for unsafe config.

Deprecates unsafe properties in favor of simpler, safer designed ones.
@mholt
Member

mholt commented Feb 10, 2023

I have a proposed patch in #5367.

@u5surf Would you be able to take a look? I still want to incorporate your fix but maybe in an opt-in sort of way. Or if we can find a better fix. Apparently not all chunked encodings hang! (see @GerritKopp's use case.)
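
The rough shape I have in mind is opt-in, size-limited buffering, something along these lines in the Caddyfile (the directive name and size here are only a sketch, not final):

reverse_proxy localhost:8004 {
  # hypothetical: buffer at most this much of the request body in memory,
  # rather than buffering every chunked request without bound
  request_buffers 4MiB
}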

Super bonus points if you could test the patch to see if it works for you, @GerritKopp -- thank you 😊 we will release v2.6.4 right away once verified.

@mholt mholt added bug 🐞 Something isn't working and removed needs info 📭 Requires more information labels Feb 11, 2023
@mholt mholt added this to the v2.6.4 milestone Feb 11, 2023
@mholt mholt closed this as completed in 4b119a4 Feb 12, 2023
@GerritKopp
Author

Unfortunately, I could not test this earlier. The issue is fixed for me in 2.6.4. 👍

@mholt
Member

mholt commented Feb 16, 2023

Oh good :) Thanks!

@karl-cardenas-coding

Thank you so much for fixing this, @mholt. I had a container go into crash loops due to resource utilization, and we suspected Caddy, but we weren't sure if it was a memory leak in our UI that could be causing the issue. I'm happy to share that with Caddy 2.6.4, all the issues went away 🚀

@mholt
Member

mholt commented Mar 1, 2023

Great to hear! Thanks for the feedback.
