List of endpoints inside probe can grow over time #3645

Closed
bboreham opened this issue Jul 1, 2019 · 4 comments · Fixed by #3661
Labels
performance Excessive resource usage and latency; usually a bug or chore

Comments

bboreham (Collaborator) commented Jul 1, 2019

Another picture showing report sizes over time, similar to #3576

[image: graph of report sizes over time]

It's not clear what the trigger is, but looking inside the reports, the number of Endpoints grows by a few every time, reaching about 36,000 after 8 days. Restarting the probe brings the size back down to normal levels.

conntrack -L on the node lists around 8,000 connections, mostly in TIME_WAIT.
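For reference, the same per-state breakdown can be produced from a Go program by reading /proc/net/nf_conntrack directly instead of running the conntrack CLI. This is an illustrative sketch (it assumes the nf_conntrack module is loaded so that file exists), not part of the Scope probe:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// Counts TCP conntrack entries per state (ESTABLISHED, TIME_WAIT, ...),
// roughly the same information `conntrack -L` prints.
func main() {
    f, err := os.Open("/proc/net/nf_conntrack")
    if err != nil {
        fmt.Fprintln(os.Stderr, "cannot read conntrack table:", err)
        os.Exit(1)
    }
    defer f.Close()

    counts := map[string]int{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        // TCP lines look like: "ipv4 2 tcp 6 117 TIME_WAIT src=... dst=..."
        if len(fields) > 5 && fields[2] == "tcp" {
            counts[fields[5]]++
        }
    }
    for state, n := range counts {
        fmt.Printf("%-12s %d\n", state, n)
    }
}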

bboreham added the performance label on Jul 1, 2019

bboreham (Collaborator, Author) commented Jul 4, 2019

I suspect all the probes that are affected are using conntrack rather than ebpf. E.g. I see this in the logs:

<probe> ERRO: 2019/07/01 13:39:49.650529 tcp tracer received event with timestamp 726896371146595 even though the last timestamp was 726896371186140. Stopping the eBPF tracker.
<probe> WARN: 2019/07/01 13:39:50.818972 ebpf tracker died, restarting it
<probe> ERRO: 2019/07/01 13:42:41.287671 tcp tracer received event with timestamp 727068008148899 even though the last timestamp was 727068008151558. Stopping the eBPF tracker.
<probe> WARN: 2019/07/01 13:42:41.816997 ebpf tracker died again, gently falling back to proc scanning 
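For context, the fallback in these logs is driven by a monotonicity check on event timestamps: once an event arrives that is older than the previous one, the eBPF tracker stops itself. A minimal sketch of that kind of guard, using hypothetical names rather than the actual Scope/tcptracer-bpf code:

package tracker

import "fmt"

// tcpTracker is a hypothetical stand-in for the eBPF connection tracker.
type tcpTracker struct {
    lastTimestamp uint64
}

// handleEvent rejects out-of-order events; returning this error is what
// produces the "Stopping the eBPF tracker" message seen in the logs above.
func (t *tcpTracker) handleEvent(timestamp uint64) error {
    if timestamp < t.lastTimestamp {
        return fmt.Errorf("received event with timestamp %d even though the last timestamp was %d",
            timestamp, t.lastTimestamp)
    }
    t.lastTimestamp = timestamp
    return nil
}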

bboreham (Collaborator, Author) commented Jul 5, 2019

Still happening after #3648.
Right now I am out of ideas as to how we can be leaking connections.
I wonder if we should do a periodic resync, e.g. once an hour, which would mask whatever is really causing it.
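For illustration, such a resync could be a goroutine that periodically rebuilds the tracked-connection set from a full walk. All names below are hypothetical, not Scope's real API:

package tracker

import (
    "sync"
    "time"
)

// connectionTracker is a hypothetical stand-in for the probe's endpoint state.
type connectionTracker struct {
    mu    sync.Mutex
    conns map[string]struct{}
}

// replaceAll swaps in a freshly scanned connection set, discarding anything
// the incremental tracker may have leaked.
func (t *connectionTracker) replaceAll(fresh map[string]struct{}) {
    t.mu.Lock()
    t.conns = fresh
    t.mu.Unlock()
}

// resyncLoop rebuilds state from a full walk (e.g. conntrack or /proc) once
// per interval, masking a slow leak in the incremental updates.
func resyncLoop(t *connectionTracker, walk func() map[string]struct{}, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            t.replaceAll(walk())
        case <-stop:
            return
        }
    }
}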

bboreham (Collaborator, Author) commented Jul 9, 2019

I had an idea to improve the first problem ("ebpf tracker died"): iovisor/gobpf#42 (comment)

bboreham (Collaborator, Author) commented Aug 2, 2019

#3653 made a big improvement, according to my stats, but I am still seeing fall-back to conntrack in a few cases, followed by constant growth over hours.
