Fallback to proc when ebpf timestamps are wrong #2336

Merged. 3 commits merged into weaveworks:master from schu/timestamp-fallback on Mar 21, 2017.

Conversation

schu (Contributor) commented on Mar 15, 2017

cc @alban

@@ -99,7 +100,11 @@ var lastTimestampV4 uint64

func tcpEventCbV4(e tracer.TcpV4) {
	if lastTimestampV4 > e.Timestamp {
		log.Errorf("ERROR: late event!\n")
		// A kernel bug can cause the timestamps to be wrong, e.g. Ubuntu with Linux 4.4.0-47.68
		// See https://github.com/iovisor/bcc/issues/790#issuecomment-263704235
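This hunk anchors the review on the timestamp monotonicity check: events must arrive with non-decreasing timestamps, and a violation means the kernel clock source feeding the tracer can't be trusted. A minimal sketch of that check, with hypothetical names (`tracker`, `onEvent`, `isDead`) rather than the PR's exact code:

```go
package tracker

import "sync/atomic"

// tracker keeps the newest event timestamp seen so far and a sticky
// "dead" flag that is set the first time an event arrives out of order.
type tracker struct {
	lastTimestamp uint64 // nanosecond timestamp of the newest event seen
	dead          uint32 // 1 once an out-of-order event has been observed
}

// onEvent is assumed to be called from a single event-reading goroutine;
// the atomics only exist so other goroutines can poll isDead safely.
func (t *tracker) onEvent(ts uint64) {
	if ts < atomic.LoadUint64(&t.lastTimestamp) {
		// Out-of-order event: the kernel timestamps are unreliable,
		// so stop trusting eBPF and let the caller fall back to /proc.
		atomic.StoreUint32(&t.dead, 1)
		return
	}
	atomic.StoreUint64(&t.lastTimestamp, ts)
}

func (t *tracker) isDead() bool { return atomic.LoadUint32(&t.dead) == 1 }
```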

schu force-pushed the schu/timestamp-fallback branch 2 times, most recently from cba42e4 to 44c7c11 on March 15, 2017 15:49
schu changed the title from "[WIP] Fallback to proc when ebpf timestamps are wrong" to "Fallback to proc when ebpf timestamps are wrong" on Mar 16, 2017

schu (Contributor, Author) commented on Mar 16, 2017

Works in a manual test like schu@6febb7d (https://circleci.com/gh/schu/scope/308). For an integration test we would have to prepare and use a separate node with an older kernel without the fix. Not worth the effort IMHO.

		t.ebpfTracker = nil
	} else {
		t.performEbpfTrack(rpt, hostNodeID)
		return
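This hunk is the fallback decision itself: a dead eBPF tracker is discarded, and the report falls through to proc scanning. A hedged sketch of that control flow, with hypothetical types and method names standing in for the probe's real ones:

```go
package fallback

// ebpfTracker and reporter are illustrative stand-ins for the probe's
// real types; only the shape of the decision matters here.
type ebpfTracker struct{ dead bool }

func (t *ebpfTracker) isDead() bool { return t.dead }

type reporter struct {
	ebpf *ebpfTracker // nil once eBPF has been abandoned
}

func (r *reporter) reportConnections() {
	if r.ebpf != nil {
		if r.ebpf.isDead() {
			// The tracker saw out-of-order timestamps; discard it so
			// every later report takes the proc path below.
			r.ebpf = nil
		} else {
			r.ebpfTrack()
			return
		}
	}
	r.procTrack() // fallback: walk /proc for connections
}

func (r *reporter) ebpfTrack() { /* consume eBPF events (elided) */ }
func (r *reporter) procTrack() { /* scan /proc (elided) */ }
```

Nil-ing the tracker makes the fallback permanent for the probe's lifetime: once eBPF has produced unreliable data there is no point retrying it.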

@@ -99,7 +100,10 @@ var lastTimestampV4 uint64

func tcpEventCbV4(e tracer.TcpV4) {
	if lastTimestampV4 > e.Timestamp {
		log.Errorf("ERROR: late event!\n")
		// See https://github.com/weaveworks/scope/issues/2334
		log.Debugf("tcp tracer received event with timestamp %v even though the last timestamp was %v. Stopping the eBPF tracker.", e.Timestamp, lastTimestampV4)

2opremio (Contributor) commented on Mar 17, 2017

> For an integration test we would have to prepare and use a separate node with an older kernel without the fix. Not worth the effort IMHO.

You could mock the failure and add a unit test.

EDIT: only if it's not too much work.

schu added 2 commits on March 17, 2017 14:43

.. and avoid nil pointer dereference. It can happen that `getWalkedProcPid` is called before the first `performWalk` has finished.
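The commit message describes a startup race: a lookup can run before the first walk has populated the walker's state. A sketch of one way to guard that window, assuming hypothetical names (`procWalker`, `pidsByAddr`); the PR's actual fix may differ in detail:

```go
package procwalk

import "sync"

// procWalker caches the result of the last /proc walk. The map is nil
// until the first walk completes, so lookups must tolerate that state.
type procWalker struct {
	mu         sync.Mutex
	pidsByAddr map[string]int // nil until the first walk has finished
}

func (w *procWalker) performWalk() {
	walked := map[string]int{} // filled from /proc (elided)
	w.mu.Lock()
	w.pidsByAddr = walked
	w.mu.Unlock()
}

func (w *procWalker) getWalkedProcPid(addr string) (int, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pidsByAddr == nil {
		// The first walk hasn't finished yet: report "unknown" instead
		// of pretending we already know the address isn't there.
		return 0, false
	}
	pid, ok := w.pidsByAddr[addr]
	return pid, ok
}
```

Distinguishing "no walk has completed yet" from "address not found" lets callers treat the early window as missing data rather than hitting a nil value downstream.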
schu force-pushed the schu/timestamp-fallback branch from 44c7c11 to 5262e07 on March 17, 2017 13:43

schu (Contributor, Author) commented on Mar 17, 2017

Updated.

> You could mock the failure and add a unit test

Mocking the eBPF part would require too much work in my opinion (we would have to mock most of it). I have added a simple test to check that we set the tracker to dead after receiving events out of order.
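A sketch of what such a test can look like, reusing the hypothetical `tracker` type from the earlier sketch (the PR's real test exercises the probe's actual tracker state):

```go
package tracker

import "testing"

// Feed the callback events with a decreasing timestamp and assert the
// tracker marks itself dead, which is what triggers the proc fallback.
func TestOutOfOrderEventKillsTracker(t *testing.T) {
	tr := &tracker{}
	tr.onEvent(200) // in order: accepted
	tr.onEvent(100) // out of order: must flip the dead flag
	if !tr.isDead() {
		t.Fatal("expected tracker to be dead after an out-of-order event")
	}
}
```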

2opremio merged commit d64d66e into weaveworks:master on Mar 21, 2017

2opremio (Contributor) commented on Mar 21, 2017

This didn't work as expected. This is what I got when testing it in WeaveCloud's dev environment (the ebpf error continues in a loop):

time="2017-03-21T12:20:45Z" level=info msg="publishing to: https://<elided>@frontend.dev.weave.works, https://<elided>@cloud.weave.works, scope.kube-system.svc.cluster.local:80" 
<probe> INFO: 2017/03/21 12:20:45.750334 command line args: --logtostderr=true --mode=probe --no-app=true --probe.conntrack.buffersize=2097152 --probe.docker=true --probe.docker.bridge=cbr0 --probe.ebpf.connections=true --probe.http.listen=:4041 --probe.kubernetes=tru
e --probe.resolver=10.0.0.10:53 --weave=false https://<elided>@frontend.dev.weave.works https://<elided>@cloud.weave.works scope.kube-system.svc.cluster.local:80
<probe> INFO: 2017/03/21 12:20:45.750461 probe starting, version 2a61656, ID 2a3f76c3ed315deb
<probe> INFO: 2017/03/21 12:20:45.829637 Control connection to frontend.dev.weave.works starting
<probe> INFO: 2017/03/21 12:20:45.866573 Control connection to cloud.weave.works starting
<probe> INFO: 2017/03/21 12:20:45.892792 Control connection to scope.kube-system.svc.cluster.local starting
<probe> INFO: 2017/03/21 12:20:47.819510 kubernetes: targeting api server https://10.0.0.1:443
<probe> ERRO: 2017/03/21 12:20:47.897545 tcp tracer received event with timestamp 9164623054914167 even though the last timestamp was 9164623056153635. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:47.926527 tcp tracer received event with timestamp 9164623054136294 even though the last timestamp was 9164623055631102. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:48.168299 plugins: problem loading: no such file or directory
<probe> INFO: 2017/03/21 12:20:48.170005 Profiling data being exported to :4041
<probe> INFO: 2017/03/21 12:20:48.170117 go tool pprof http://:4041/debug/pprof/{profile,heap,block}
<probe> ERRO: 2017/03/21 12:20:48.171029 tcp tracer received event with timestamp 9164624052746315 even though the last timestamp was 9164624055753098. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:48.205917 tcp tracer received event with timestamp 9164624052912341 even though the last timestamp was 9164624053316166. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:48.337777 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 4194304 bytes.
<probe> ERRO: 2017/03/21 12:20:48.652453 tcp tracer received event with timestamp 9164624054080901 even though the last timestamp was 9164624055925277. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:49.094904 tcp tracer received event with timestamp 9164625053710135 even though the last timestamp was 9164625055879401. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:49.111708 tcp tracer received event with timestamp 9164625053335726 even though the last timestamp was 9164625054386441. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:49.378834 tcp tracer received event with timestamp 9164625052644795 even though the last timestamp was 9164625055487794. Stopping the eBPF tracker.
<probe> WARN: 2017/03/21 12:20:49.530015 ebpf tracker died, gently falling back to proc scanning
<probe> ERRO: 2017/03/21 12:20:49.554308 tcp tracer received event with timestamp 9164625056086104 even though the last timestamp was 9164625056308378. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:49.578644 tcp tracer received event with timestamp 9164625052604987 even though the last timestamp was 9164625056236728. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:49.598553 tcp tracer received event with timestamp 9164625052491960 even though the last timestamp was 9164625052604987. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:49.752892 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 4194304 bytes.
<probe> ERRO: 2017/03/21 12:20:50.341305 tcp tracer received event with timestamp 9164626052415625 even though the last timestamp was 9164626053456793. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:50.456562 tcp tracer received event with timestamp 9164626053328268 even though the last timestamp was 9164626054166952. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:50.495565 tcp tracer received event with timestamp 9164626053263961 even though the last timestamp was 9164626054711229. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:50.539413 tcp tracer received event with timestamp 9164626052791124 even though the last timestamp was 9164626053263961. Stopping the eBPF tracker.
<probe> INFO: 2017/03/21 12:20:51.430468 Publish loop for cloud.weave.works starting
<probe> INFO: 2017/03/21 12:20:51.430899 Publish loop for scope.kube-system.svc.cluster.local starting
<probe> INFO: 2017/03/21 12:20:51.431330 Publish loop for frontend.dev.weave.works starting
<probe> ERRO: 2017/03/21 12:20:51.604695 tcp tracer received event with timestamp 9164627054285071 even though the last timestamp was 9164627055437225. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:52.992805 tcp tracer received event with timestamp 9164628053474331 even though the last timestamp was 9164628056281821. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:53.004402 tcp tracer received event with timestamp 9164628054192927 even though the last timestamp was 9164628054916996. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:53.059531 tcp tracer received event with timestamp 9164629052464334 even though the last timestamp was 9164629055395006. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:53.106566 tcp tracer received event with timestamp 9164629052790759 even though the last timestamp was 9164629055350544. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:53.262504 tcp tracer received event with timestamp 9164629053085192 even though the last timestamp was 9164629053624826. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:53.582277 tcp tracer received event with timestamp 9164629054868485 even though the last timestamp was 9164629055992184. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:53.594538 tcp tracer received event with timestamp 9164629052317434 even though the last timestamp was 9164629055942911. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.192466 tcp tracer received event with timestamp 9164630053260029 even though the last timestamp was 9164630054782341. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.265299 tcp tracer received event with timestamp 9164630054651380 even though the last timestamp was 9164630055673123. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.325498 tcp tracer received event with timestamp 9164630054710055 even though the last timestamp was 9164630054996914. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.385495 tcp tracer received event with timestamp 9164630053596825 even though the last timestamp was 9164630055770522. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.479220 tcp tracer received event with timestamp 9164630052901363 even though the last timestamp was 9164630054585827. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.482689 tcp tracer received event with timestamp 9164630052403622 even though the last timestamp was 9164630052901363. Stopping the eBPF tracker.
<probe> ERRO: 2017/03/21 12:20:54.675139 tcp tracer received event with timestamp 9164630052662181 even though the last timestamp was 9164630055610166. Stopping the eBPF tracker.
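What stands out in the log is that "Stopping the eBPF tracker" keeps being reported for every late event, even after the "ebpf tracker died, gently falling back to proc scanning" warning shows the fallback already happened. A sketch (hypothetical names, not the probe's real API) of making the stop sticky and one-shot, so the callback goes quiet after the first failure:

```go
package tracker

import (
	"log"
	"sync"
	"sync/atomic"
)

// stoppableTracker logs the failure exactly once and then silently
// drops every further event instead of re-reporting the stop.
type stoppableTracker struct {
	lastTimestamp uint64
	dead          uint32
	stopOnce      sync.Once
}

func (t *stoppableTracker) onEvent(ts uint64) {
	if atomic.LoadUint32(&t.dead) == 1 {
		return // already dead: ignore further events silently
	}
	if last := atomic.LoadUint64(&t.lastTimestamp); ts < last {
		t.stopOnce.Do(func() {
			log.Printf("tcp tracer: out-of-order timestamp (%d < %d); stopping the eBPF tracker", ts, last)
			atomic.StoreUint32(&t.dead, 1)
		})
		return
	}
	atomic.StoreUint64(&t.lastTimestamp, ts)
}
```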

schu deleted the schu/timestamp-fallback branch on March 21, 2017 16:00