Conntrack loses events due to ENOBUFS #1137

2opremio · 2016-03-07T12:56:37Z

<probe> ERRO: 2016/03/07 12:21:21.337873 conntrack stderr:WARNING: We have hit ENOBUFS! We are losing events.
<probe> ERRO: 2016/03/07 12:21:21.338081 conntrack stderr:This message means that the current netlink socket buffer size is too small.
<probe> ERRO: 2016/03/07 12:21:21.338189 conntrack stderr:Please, check --buffer-size in conntrack(8) manpage.
<probe> ERRO: 2016/03/07 12:21:21.338211 conntrack stderr:conntrack v1.4.3 (conntrack-tools): Operation failed: No buffer space available
<probe> ERRO: 2016/03/07 12:21:21.338141 conntrack error: EOF
<probe> INFO: 2016/03/07 12:21:21.338307 contrack exiting

Full logs: https://gist.github.com/janwillies/54083a50358718a4fb21
Context: https://weaveworks.slack.com/archives/scope-public/p1457352064000743


2opremio · 2016-03-07T13:00:03Z

User was running kernel 4.3.5-300

tomwilkie · 2016-03-08T14:27:30Z

This is kinda by design; on a system with a high rate of connections we can't keep up with conntrack and it will fail. We degrade gracefully, falling back to polling.

2opremio · 2016-03-08T14:29:09Z

Wouldn't it be reasonable to dynamically adjust the buffer size?

tomwilkie · 2016-03-08T14:55:34Z

In the worst case there will always be too many events to keep up with. It's reasonable to fall back to polling.

2opremio · 2016-03-08T16:10:31Z

I think it would be worth confirming with the user whether they had actually reached the point at which we want to fall back to polling, or whether we would rather extend the buffers a bit further.

Also, it would be helpful to print a friendlier error with an explanation similar to the one in this ticket.
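To make "extend the buffers a bit further" concrete, here is a minimal sketch of one possible policy: detect the ENOBUFS warning in conntrack's stderr and double the buffer size up to a cap before the next restart. The function names, the doubling policy, and the cap are all assumptions for illustration, not what the probe does.

```python
# Marker text taken from the conntrack warning quoted in this issue.
ENOBUFS_MARKER = "We have hit ENOBUFS"


def hit_enobufs(stderr_lines):
    """Return True if any captured conntrack stderr line reports ENOBUFS."""
    return any(ENOBUFS_MARKER in line for line in stderr_lines)


def next_buffer_size(current, max_size):
    """Hypothetical growth policy: double the buffer, never exceeding max_size."""
    return min(current * 2, max_size)
```

Once `next_buffer_size` returns the cap and ENOBUFS still occurs, falling back to polling remains the only option, which matches the worst-case argument above.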

2opremio · 2016-05-11T16:55:21Z

I've seen this again in our service:

<probe> ERRO: 2016/05/11 16:54:00.105107 conntrack stderr:WARNING: We have hit ENOBUFS! We are losing events.
<probe> ERRO: 2016/05/11 16:54:00.105178 conntrack stderr:This message means that the current netlink socket buffer size is too small.
<probe> ERRO: 2016/05/11 16:54:00.105200 conntrack stderr:Please, check --buffer-size in conntrack(8) manpage.
<probe> ERRO: 2016/05/11 16:54:00.105211 conntrack stderr:conntrack v1.4.3 (conntrack-tools): Operation failed: No buffer space available
<probe> ERRO: 2016/05/11 16:54:00.203215 conntrack error: EOF
<probe> INFO: 2016/05/11 16:54:00.203289 contrack exiting
<probe> ERRO: 2016/05/11 16:54:00.203360 conntrack error: exit status 1
<probe> ERRO: 2016/05/11 16:54:00.583857 conntrack stderr:WARNING: We have hit ENOBUFS! We are losing events.
<probe> ERRO: 2016/05/11 16:54:00.584331 conntrack stderr:This message means that the current netlink socket buffer size is too small.
<probe> ERRO: 2016/05/11 16:54:00.584588 conntrack stderr:Please, check --buffer-size in conntrack(8) manpage.
<probe> ERRO: 2016/05/11 16:54:00.590787 conntrack stderr:conntrack v1.4.3 (conntrack-tools): Operation failed: No buffer space available
<probe> ERRO: 2016/05/11 16:54:00.778920 conntrack error: EOF
<probe> INFO: 2016/05/11 16:54:00.779309 contrack exiting
<probe> ERRO: 2016/05/11 16:54:00.779656 conntrack error: exit status 1

rade · 2016-08-25T01:15:07Z

Note that when encountering this (or any other) error we immediately spawn another conntrackWalker, so we generally won't fall back to polling for long. If spikes are frequent, though, a larger buffer would be better. Plus the message is alarming to users. So we should consider:

1. making the buffer size configurable
2. using a higher default value (than provided by /proc/sys/net/core/rmem_default)
3. printing a friendlier error, noting the impact, current limit, and how to change it
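A rough sketch of the first and third suggestions, assuming the buffer size is passed straight through to conntrack's `--buffer-size` option (the one the warning itself points at). The helper names and the option name in the usage example are hypothetical and do not reflect the probe's actual flags or wording.

```python
def conntrack_argv(buffer_size):
    """Build the command line for conntrack in event mode with an explicit buffer size."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be a positive number of bytes")
    return ["conntrack", "-E", "--buffer-size", str(buffer_size)]


def friendly_enobufs_message(buffer_size, option_name):
    """Explain the impact, the current limit, and how to change it.

    option_name is whatever setting the caller exposes for the buffer size;
    it is a placeholder here, not a real probe flag.
    """
    return (
        "conntrack could not keep up with the rate of connection events and "
        "some were dropped; connection tracking falls back to polling until "
        "conntrack is restarted. "
        f"Current netlink buffer size: {buffer_size} bytes. "
        f"Increase it with {option_name} if this happens often."
    )
```

For example, `friendly_enobufs_message(4096, "--example.buffer-size")` yields one log line that replaces the four raw stderr lines shown earlier in this issue.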

tomwilkie · 2016-09-26T19:09:26Z

(1) done in #1896

rade · 2016-11-11T14:14:18Z

The error message we log from conntrack is now rather misleading since it suggests that the buffer size can be adjusted in /proc/sys/net/core/rmem_default, which won't help since we have a hard-coded default value.

Perhaps the default value should be read from /proc/sys/net/core/rmem_default? Though that won't help once we do (2).
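For reference, a minimal sketch of reading the default from /proc/sys/net/core/rmem_default, assuming the file holds a single decimal byte count. The function names and the fallback value are hypothetical and are not the probe's hard-coded default.

```python
def parse_rmem_default(text):
    """Parse the contents of rmem_default: one decimal byte count."""
    value = int(text.strip())
    if value <= 0:
        raise ValueError("rmem_default must be positive")
    return value


def default_buffer_size(path="/proc/sys/net/core/rmem_default", fallback=208 * 1024):
    """Return the kernel's default receive buffer size, or fallback if unreadable.

    The fallback is an arbitrary illustrative value.
    """
    try:
        with open(path) as f:
            return parse_rmem_default(f.read())
    except (OSError, ValueError):
        return fallback
```

As noted above, this only matters while the default tracks the kernel setting; once a higher fixed default is used, the value read here would just be a lower bound.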

rade · 2017-08-14T11:05:07Z

We increased the default buffer size significantly in #2739.

I don't think it's worth doing anything more here.

2opremio changed the title ~~Conntrack loses events and fails~~ Conntrack loses events and exits Mar 7, 2016

2opremio added this to the 0.14.0 milestone Mar 7, 2016

tomwilkie removed this from the 0.14.0 milestone Mar 17, 2016

rade mentioned this issue Jul 4, 2016

more accurate connection tracking #1637

Closed

rade added the bug Broken end user or developer functionality; not working as the developers intended it label Jul 4, 2016

mindfulmonk mentioned this issue Aug 17, 2016

conntrack issues and timeouts #1814

Closed

rade changed the title ~~Conntrack loses events and exits~~ Conntrack loses events due to ENOBUFS Aug 25, 2016

rade added the accuracy Incorrect information is being shown to the user; usually a bug label Jan 11, 2017

rade closed this as completed Aug 14, 2017
