ingress-gce-404-server-with-metrics causes OOM #1426

jupblb · 2021-05-05T14:50:50Z

We encountered a scenario where 404-server-with-metrics can cause an OOM (out-of-memory) condition. This is probably caused by log messages being partially retained in memory. When someone sends a lot of requests to the cluster (e.g. a botnet scanning for vulnerabilities), the number of log messages being written surges. Example:

...
I0505 11:27:49.607462 1 server-with-metrics.go:243] response 404 (backend NotFound), service rules for [ /header.html ] non-existent
I0505 11:27:49.707176 1 server-with-metrics.go:243] response 404 (backend NotFound), service rules for [ /q79w_38jg__.shtml ] non-existent
I0505 11:27:49.707220 1 server-with-metrics.go:243] response 404 (backend NotFound), service rules for [ /gk/public_html/ ] non-existent
...

This in turn may cause the container to hit its memory limit.

/cc @mborsz


mborsz · 2021-05-05T17:06:50Z

In fact, it's not logs being kept in memory. Those look good.

I ran the following experiment:

Modified the code to add /debug/pprof/heap
Ran the 'curl' test from the README.md
Checked docker stats (memory usage was ~400MiB)
Fetched the pprof heap profile

It looks like the vast majority of the memory is allocated in these lines:

ingress-gce/cmd/404-server-with-metrics/server-with-metrics.go

Lines 99 to 107 in b1a7452

    
    go func() {
        for {
            select {
            case <-server.idleChannel:
            case <-time.After(*idleLogTimer):
                klog.Infof("No connection requests received for 1 hour\n")
            }
        }
    }()

It looks like on each server.idleChannel update (which happens on every request) we allocate a new time.Timer, which then lives for the next *idleLogTimer (1h by default).

This matches the documentation of time.After (src: https://golang.org/pkg/time/#After):

The underlying Timer is not recovered by the garbage collector until the timer fires. If efficiency is a concern, use NewTimer instead and call Timer.Stop if the timer is no longer needed.

spencerhance · 2021-05-05T17:42:00Z

cc: @skmatti @vbannai

vbannai · 2021-07-22T06:21:37Z

It might be better to allocate the timer using NewTimer() and call timer.Stop() when the idleChannel receives a value. That way the timers are cleaned up as requests are received, keeping the memory footprint down.

k8s-triage-robot · 2021-11-03T19:09:04Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2021-12-03T20:07:28Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-01-02T20:13:51Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-01-02T20:14:10Z

@k8s-triage-robot: Closing this issue.


nokernel · 2022-11-22T15:35:38Z

/remove-lifecycle rotten

nokernel · 2022-11-22T15:36:24Z

/reopen

k8s-ci-robot · 2022-11-22T15:36:29Z

@nokernel: You can't reopen an issue/PR unless you authored it or you are a collaborator.


nokernel · 2022-11-22T15:36:44Z

This issue still exists.

ratelle · 2022-11-22T16:46:38Z

I'd like to assign this issue to myself and work on a small PR to fix it.

jupblb · 2022-11-22T17:22:22Z

/reopen

k8s-ci-robot · 2022-11-22T17:22:27Z

@jupblb: Reopened this issue.


k8s-triage-robot · 2023-02-20T18:06:44Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-03-22T18:34:29Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2023-04-21T18:45:37Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2023-04-21T18:45:42Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


swetharepakula added the kind/bug label Aug 5, 2021

k8s-ci-robot added the lifecycle/stale label Nov 3, 2021

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Dec 3, 2021

k8s-ci-robot closed this as completed Jan 2, 2022

k8s-ci-robot removed the lifecycle/rotten label Nov 22, 2022

k8s-ci-robot reopened this Nov 22, 2022

k8s-ci-robot added the lifecycle/stale label Feb 20, 2023

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Mar 22, 2023

k8s-ci-robot closed this as not planned Apr 21, 2023
