Goroutine panic in docker.parseContainerStats #8692

Closed · trygvis opened this issue Jan 14, 2021 · 44 comments

Labels: area/docker, bug (unexpected problem or unintended behavior)

trygvis commented Jan 14, 2021

I'm getting a panic when running Telegraf v1.15 and v1.17:

Stacktrace

2021-01-14T13:03:28Z I! Starting Telegraf 1.17.0
2021-01-14T13:03:28Z I! Using config file: /etc/telegraf/telegraf.conf
2021-01-14T13:03:28Z I! Loaded inputs: cpu disk diskio docker kernel mem processes swap system
2021-01-14T13:03:28Z I! Loaded aggregators: 
2021-01-14T13:03:28Z I! Loaded processors: 
2021-01-14T13:03:28Z I! Loaded outputs: influxdb
2021-01-14T13:03:28Z I! Tags enabled: host=xxx
2021-01-14T13:03:28Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"xxx", Flush Interval:10s
panic: runtime error: slice bounds out of range [:12] with capacity 0

goroutine 163 [running]:
github.com/influxdata/telegraf/plugins/inputs/docker.parseContainerStats(0xc000592480, 0x348a6c0, 0xc0002643a0, 0xc00037d860, 0xc000c0a2c0, 0x40, 0x1, 0xc0006d4936, 0x5)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:677 +0x23e5
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainerInspect(0xc0005bc600, 0xc000c0a2c0, 0x40, 0xc0000c9fc0, 0x1, 0x4, 0xc000afb660, 0xd, 0xc000c08190, 0x47, ...)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:568 +0x20e
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainer(0xc0005bc600, 0xc000c0a2c0, 0x40, 0xc0000c9fc0, 0x1, 0x4, 0xc000afb660, 0xd, 0xc000c08190, 0x47, ...)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:495 +0x910
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather.func1(0xc000afbb80, 0xc0005bc600, 0x348a6c0, 0xc0002643a0, 0xc000c0a2c0, 0x40, 0xc0000c9fc0, 0x1, 0x4, 0xc000afb660, ...)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:217 +0xbd
created by github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:215 +0x3f8

Relevant source code:

// If we have OnlineCPUs field, then use it to restrict stats gathering to only Online CPUs
// (https://github.com/moby/moby/commit/115f91d7575d6de6c7781a96a082f144fd17e400)
var percpuusage []uint64
if stat.CPUStats.OnlineCPUs > 0 {
	percpuusage = stat.CPUStats.CPUUsage.PercpuUsage[:stat.CPUStats.OnlineCPUs]
} else {
	percpuusage = stat.CPUStats.CPUUsage.PercpuUsage
}

It seems that a check that the data is actually present would be useful, even though it might hide other problems. Perhaps a warning is in order.
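For illustration, a minimal sketch of such a guard as a hypothetical helper (this is not the fix that was later merged); it clamps the slice bound to the data that is actually present:

// restrictToOnline limits percpu to the first onlineCPUs entries, but never
// slices past the data that is actually there, so an empty percpu_usage
// array (capacity 0) cannot cause an out-of-range panic.
func restrictToOnline(percpu []uint64, onlineCPUs int) []uint64 {
	if onlineCPUs > 0 && onlineCPUs <= len(percpu) {
		return percpu[:onlineCPUs]
	}
	return percpu
}

Called as restrictToOnline(stat.CPUStats.CPUUsage.PercpuUsage, int(stat.CPUStats.OnlineCPUs)), it falls back to whatever the daemon returned instead of panicking, and the mismatch would be the place to emit the suggested warning.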

Environment

Linux

OS: Debian unstable.

Linux akili 5.10.0-1-amd64 #1 SMP Debian 5.10.4-1 (2020-12-31) x86_64 GNU/Linux

Docker

Client:
 Version:           20.10.0+dfsg2
 API version:       1.41
 Go version:        go1.15.6
 Git commit:        7287ab3
 Built:             Mon Dec 14 12:39:22 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.0+dfsg2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.15.6
  Git commit:       eeddea2
  Built:            Mon Dec 14 12:39:22 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3~ds1
  GitCommit:        1.4.3~ds1-1
 runc:
  Version:          1.0.0~rc92+dfsg1
  GitCommit:        1.0.0~rc92+dfsg1-5
 docker-init:
  Version:          0.19.0
  GitCommit:        
ivorybilled added the area/docker, bug (unexpected problem or unintended behavior), and ready labels Jan 14, 2021
Aladex (Contributor) commented Jan 15, 2021

I think it should be something like:

var percpuusage []uint64
if stat.CPUStats.OnlineCPUs > 0 {
	percpuusage = stat.CPUStats.CPUUsage.PercpuUsage[:stat.CPUStats.OnlineCPUs - 1]
} else {
	percpuusage = stat.CPUStats.CPUUsage.PercpuUsage
}

trygvis (Author) commented Jan 15, 2021

Just upgraded to Docker version 20.10.2+dfsg1, same issue.

Aladex (Contributor) commented Jan 18, 2021

@trygvis
Hi. Can you show the groups of the telegraf user? Do you run this Telegraf agent from Docker, or on bare metal as a service? Can you show your config section for the Docker plugin?

trygvis (Author) commented Jan 18, 2021

Telegraf is running in a Docker container through Docker Compose:

version: "3"
services:
  telegraf:
    image: telegraf:1.17
    privileged: true
    network_mode: host
    volumes:
      - /etc/XXX/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys:/rootfs/sys:ro
      - /proc:/rootfs/proc:ro
      - /etc:/rootfs/etc:ro
    command:
      - sh
      - -c
      - apt update && apt install -y --install-recommends=no smartmontools; exec telegraf
    environment:
      INFLUX_URL: "..."
      INFLUX_SKIP_DATABASE_CREATION: "true"
      HOST_PROC: "/rootfs/proc"
      HOST_SYS: "/rootfs/sys"
      HOST_ETC: "/rootfs/etc"
      HOST_MOUNT_PREFIX: "/rootfs"

Telegraf.conf:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = "akili"
  omit_hostname = false

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.docker]]



[[outputs.influxdb]]
  urls = ["$INFLUX_URL"]
  skip_database_creation = false

trygvis (Author) commented Jan 20, 2021

@Aladex Is there anything else you need from me to debug?

Aladex (Contributor) commented Jan 20, 2021

@Aladex Is there anything else you need from me to debug?

No, thanks. I will try with your config.

Aladex (Contributor) commented Jan 20, 2021

What about the default docker stats?

docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"

or just

docker stats

What is the output of these commands? Try it with sudo too.

trygvis (Author) commented Jan 20, 2021

I'm getting the same output with and without sudo; however, my user is in the docker group.

$ docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" --no-stream
CONTAINER      CPU %     MEM USAGE / LIMIT
bf18f31f73fa   0.00%     44.28MiB / 30.9GiB
44b177c78824   0.03%     37.49MiB / 30.9GiB
39b9df87ea3f   0.99%     354.5MiB / 30.9GiB
572d4a16c40f   0.00%     27MiB / 30.9GiB
$ id
uid=1000(trygvis) gid=1000(trygvis) groups=1000(trygvis),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev),101(systemd-journal),109(netdev),112(bluetooth),116(scanner),125(docker),128(libvirt)

trygvis (Author) commented Jan 26, 2021

@Aladex Any progress on this issue? Did my output help or do you need more tests?

sjwang90 removed the ready label Jan 29, 2021
Satforst commented:

Hi.
I have the same problem with my Telegraf.
It happened after I updated Fedora 30 to 31.
I hope you are on the right track to solving it :)

Aladex (Contributor) commented Jan 29, 2021

Hi. Can you check this bug on bare metal, not from Docker? Just start the binary with your config on your server or PC.

pensionado commented:

I have this problem on bare metal. Do you need any documentation?

Satforst commented Feb 12, 2021

The problem stays the same; it doesn't matter whether Telegraf is in a container or on bare metal.
I also tried deploying a new InfluxDB 2.0 instance and there it's OK, but I don't want two databases just because Telegraf has a bug.

Aladex (Contributor) commented Feb 12, 2021

Sorry, but I can't reproduce this bug.

@ssoroka, who can help?

Satforst commented:

I "solved" this problem by running the container via docker run and commenting out the Docker plugin in telegraf.conf. Since then, it works, except for the Docker plugin.
If you want more details, feel free to ask ;)

Ulrar commented Feb 13, 2021

I've got the exact same issue: Telegraf running in Docker, and with the Docker plugin enabled I get the same trace.
If I disable the Docker plugin it runs fine, but of course without any of the Docker metrics being published.

giesmininkas commented:

I am also encountering the same issue.
Docker engine version 20.10.3
Telegraf running in docker, image tag 1.17.2-alpine

trygvis (Author) commented Feb 15, 2021

@Aladex Is it possible to fix the out-of-bounds problem even if you can't reproduce it? It is stopping me from getting any Docker statistics; anything would be better than nothing.

giesmininkas commented:

This problem appeared for me when I ran a new instance on my PC. The exact same configuration on my other machine is running just fine.

It has the same Docker and Telegraf versions, and almost the same Telegraf config (except for a few paths and the currently disabled docker input). The only difference I can think of is that the one which is working fine was upgraded from older versions (Docker and Telegraf), while this issue manifested after a clean install.

I am running Debian Buster on the "good" machine, and Debian Sid on the other. Same Docker from their official apt repo (deb [arch=amd64] https://download.docker.com/linux/debian buster stable).

Mohan-cloud commented:

I am also having the same issue. InfluxDB 2.0.3 and Telegraf v1.17.2 running in Docker.

Tmd11a commented Feb 23, 2021

I'll add that I am having this issue as well: InfluxDB 1.8.4 and Telegraf 1.17.3.

giesmininkas commented Feb 26, 2021

Still having the same issue with Docker 20.10.4 and Telegraf 1.17.3:

telegraf    | 2021-02-26T15:53:11Z I! Starting Telegraf 1.17.3
telegraf    | 2021-02-26T15:53:11Z I! Using config file: /etc/telegraf/telegraf.conf
telegraf    | 2021-02-26T15:53:11Z I! Loaded inputs: cpu disk docker mem net system temp
telegraf    | 2021-02-26T15:53:11Z I! Loaded aggregators: 
telegraf    | 2021-02-26T15:53:11Z I! Loaded processors: 
telegraf    | 2021-02-26T15:53:11Z I! Loaded outputs: influxdb
telegraf    | 2021-02-26T15:53:11Z I! Tags enabled: host=desktop
telegraf    | 2021-02-26T15:53:11Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"desktop", Flush Interval:10s
telegraf    | panic: runtime error: slice bounds out of range [:12] with capacity 0
telegraf    | 
telegraf    | goroutine 53 [running]:
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.parseContainerStats(0xc000718b40, 0x3556c60, 0xc0005990a0, 0xc00067d0e0, 0xc000662400, 0x40, 0x100, 0xc00038e810, 0x5)
telegraf    |   /go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:677 +0x23e5
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainerInspect(0xc0002bc400, 0xc000662400, 0x40, 0xc000437180, 0x1, 0x4, 0xc00038e4a0, 0x16, 0xc000434280, 0x47, ...)
telegraf    |   /go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:568 +0x265
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainer(0xc0002bc400, 0xc000662400, 0x40, 0xc000437180, 0x1, 0x4, 0xc00038e4a0, 0x16, 0xc000434280, 0x47, ...)
telegraf    |   /go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:495 +0x910
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather.func1(0xc000667050, 0xc0002bc400, 0x3556c60, 0xc0005990a0, 0xc000662400, 0x40, 0xc000437180, 0x1, 0x4, 0xc00038e4a0, ...)
telegraf    |   /go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:217 +0xbd
telegraf    | created by github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather
telegraf    |   /go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:215 +0x3f8
Client: Docker Engine - Community
 Version:           20.10.4
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        d3cb89e
 Built:             Thu Feb 25 07:05:24 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.4
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Thu Feb 25 07:03:20 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

LylianPerson-Gay commented:

Same issue on Linux with:

Client: Docker Engine - Community
 Version:           20.10.0
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        7287ab3
 Built:             Tue Dec  8 18:59:40 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.0
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       eeddea2
  Built:            Tue Dec  8 18:57:45 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

and Telegraf 1.15.2

Changing the versions of Telegraf and Docker many times doesn't fix the issue.

Tmd11a commented Mar 4, 2021

Has anyone had success with a lower version of Docker?

valodzka commented Mar 4, 2021

Same issue. I think that it might be related to running docker with cgroups v2.

valodzka commented Mar 4, 2021

I ran curl --unix-socket /var/run/docker.sock http:/v1.21/containers/my-container/stats; results below.
From the server with issues:

"cpu_stats": {
"cpu_usage": {
  "total_usage": 72982239176000,
  "usage_in_kernelmode": 2933702214000,
  "usage_in_usermode": 70048536961000
},
"system_cpu_usage": 1504145660000000,
"online_cpus": 8,
"throttling_data": {
  "periods": 0,
  "throttled_periods": 0,
  "throttled_time": 0
}
},

From good server:

"cpu_stats": {
"cpu_usage": {
  "total_usage": 50770201933162,
  "percpu_usage": [
    4025208597196,
    4660386142943,
    4377374787337,
    4339088988196,
    3446537002338,
    4557378246693,
    4473931637526,
    3924435808393,
    4208383217677,
    4047214108661,
    4641274011094,
    4068989385108
  ],
  "usage_in_kernelmode": 555510000000,
  "usage_in_usermode": 50201420000000
},
"system_cpu_usage": 81986869040000000,
"online_cpus": 12,
"throttling_data": {
  "periods": 0,
  "throttled_periods": 0,
  "throttled_time": 0
}
},
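For anyone comparing their own daemon's responses, here is a small standalone sketch (not part of Telegraf; the struct mirrors only the fields shown above) that reads one stats object from stdin and reports whether percpu_usage is present:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal subset of the stats payload above: just enough to see the
// difference between the two servers.
type stats struct {
	CPUStats struct {
		CPUUsage struct {
			PercpuUsage []uint64 `json:"percpu_usage"`
		} `json:"cpu_usage"`
		OnlineCPUs uint32 `json:"online_cpus"`
	} `json:"cpu_stats"`
}

func main() {
	var s stats
	if err := json.NewDecoder(os.Stdin).Decode(&s); err != nil {
		fmt.Fprintln(os.Stderr, "decode:", err)
		os.Exit(1)
	}
	fmt.Printf("online_cpus=%d len(percpu_usage)=%d\n",
		s.CPUStats.OnlineCPUs, len(s.CPUStats.CPUUsage.PercpuUsage))
}

Piping the curl output above into it should print online_cpus=8 len(percpu_usage)=0 for the failing server and a non-zero length for the good one, which is exactly the combination that makes the slice at docker.go:677 go out of range.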

RobertBerger commented:

It seems I have the same issue with Telegraf 1.17.3 on an embedded system (32-bit Arm). It also looks like it only appears when I run Telegraf from within a Docker container, not when I run it natively. I am still investigating and will add more info as I proceed.

docker --version
Docker version 20.10.3, build 41b3ea7e47
docker-compose up telegraf
Building with native build. Learn about native build in Compose here: https://docs.docker.com/go/compose-native-build/
influxdb is up-to-date
Starting telegraf ... done
Attaching to telegraf
telegraf    | 2021-03-07T09:54:01Z I! Starting Telegraf 1.17.3
telegraf    | 2021-03-07T09:54:01Z I! Using config file: /etc/telegraf/telegraf.conf
telegraf    | 2021-03-07T09:54:01Z I! Loaded inputs: cpu disk docker mem mqtt_consumer net swap system
telegraf    | 2021-03-07T09:54:01Z I! Loaded aggregators: 
telegraf    | 2021-03-07T09:54:01Z I! Loaded processors: 
telegraf    | 2021-03-07T09:54:01Z I! Loaded outputs: influxdb (2x)
telegraf    | 2021-03-07T09:54:01Z I! Tags enabled: host=2755d374f0f7
telegraf    | 2021-03-07T09:54:01Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"2755d374f0f7", Flush Interval:10s
telegraf    | 2021-03-07T09:54:01Z I! [inputs.mqtt_consumer] Connected [tcp://mqttbrk1.res.training:1883]
telegraf    | panic: runtime error: slice bounds out of range [:4] with capacity 0
telegraf    | 
telegraf    | goroutine 116 [running]:
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.parseContainerStats(0x4366340, 0x283d988, 0x40562e0, 0x4056cc0, 0x43920c0, 0x40, 0x1, 0x406f410, 0x5)
telegraf    | 	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:677 +0x1ed8
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainerInspect(0x3d58400, 0x43920c0, 0x40, 0x4056c60, 0x1, 0x4, 0x40fa560, 0x17, 0x3c30780, 0x47, ...)
telegraf    | 	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:568 +0x35c
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainer(0x3d58400, 0x43920c0, 0x40, 0x4056c60, 0x1, 0x4, 0x40fa560, 0x17, 0x3c30780, 0x47, ...)
telegraf    | 	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:495 +0x6f0
telegraf    | github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather.func1(0x407e0a0, 0x3d58400, 0x283d988, 0x40562e0, 0x43920c0, 0x40, 0x4056c60, 0x1, 0x4, 0x40fa560, ...)
telegraf    | 	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:217 +0x68
telegraf    | created by github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather
telegraf    | 	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:215 +0x3b4
telegraf exited with code 2

In case you need more info please let me know.

RobertBerger commented:

It also bombs when run outside of a container!

root@multi-v7-ml:/lib/systemd# /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
2021-03-07T10:59:33Z I! Starting Telegraf 1.17.3
2021-03-07T10:59:33Z I! Loaded inputs: cpu disk diskio docker kernel mem net processes swap system 
2021-03-07T10:59:33Z I! Loaded aggregators:  
2021-03-07T10:59:33Z I! Loaded processors:  
2021-03-07T10:59:33Z I! Loaded outputs: influxdb 
2021-03-07T10:59:33Z I! Tags enabled: host=multi-v7-ml 
2021-03-07T10:59:33Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"multi-v7-ml", Flush Interval:10s 
2021-03-07T10:59:33Z D! [agent] Initializing plugins 
2021-03-07T10:59:33Z D! [agent] Connecting outputs 
2021-03-07T10:59:33Z D! [agent] Attempting connection to [outputs.influxdb] 
2021-03-07T10:59:33Z D! [agent] Successfully connected to outputs.influxdb 
2021-03-07T10:59:33Z D! [agent] Starting service inputs 
panic: runtime error: slice bounds out of range [:4] with capacity 0

goroutine 97 [running]:
github.com/influxdata/telegraf/plugins/inputs/docker.parseContainerStats(0x473e000, 0x283d988, 0x40eccc0, 0x409d120, 0x3c23840, 0x40, 0x1, 0x47001f0, 0x5)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:677 +0x1ed8
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainerInspect(0x3cc0000, 0x3c23840, 0x40, 0x409d0c0, 0x1, 0x4, 0x3c21660, 0x17, 0x3e0a320, 0x47, ...)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:568 +0x35c
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).gatherContainer(0x3cc0000, 0x3c23840, 0x40, 0x409d0c0, 0x1, 0x4, 0x3c21660, 0x17, 0x3e0a320, 0x47, ...)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:495 +0x6f0
github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather.func1(0x40ab550, 0x3cc0000, 0x283d988, 0x40eccc0, 0x3c23840, 0x40, 0x409d0c0, 0x1, 0x4, 0x3c21660, ...)
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:217 +0x68
created by github.com/influxdata/telegraf/plugins/inputs/docker.(*Docker).Gather
	/go/src/github.com/influxdata/telegraf/plugins/inputs/docker/docker.go:215 +0x3b4

RobertBerger commented Mar 7, 2021

@valodzka

Same issue. I think that it might be related to running docker with cgroups v2.

Same issue if I pass cgroup_no_v1=net_cls,net_prio on the kernel command line.

# cat /proc/cmdline 
console=ttymxc1,115200n8 cgroup_no_v1=net_cls,net_prio ip=192.168.42.71:192.168.42.1:192.168.42.254:255.255.255.0:imx6q-phytec-mira-rdk-nand-tig:eth0:off nfsroot=192.168.42.1:/opt/poky/tig/mira-rootfs,v3,tcp noinitrd nohlt panic=1 earlyprintk=ttymxc1,115200n8

... but not passing cgroup_no_v1=net_cls,net_prio on the kernel command line seems to fix the problem.

RobertBerger commented:

I am not sure the values docker stats reports here can be correct:

# docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" --no-stream
CONTAINER      CPU %     MEM USAGE / LIMIT
0c9eb000fd11   0.00%     0B / 0B
bb1070e275ff   0.00%     0B / 0B
2683f63f9b06   0.00%     0B / 0B
# docker stats
CONTAINER ID   NAME       CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O   PIDS
0c9eb000fd11   influxdb   0.00%     0B / 0B             0.00%     83.1MB / 15.9MB   0B / 0B     0
bb1070e275ff   grafana    0.00%     0B / 0B             0.00%     965kB / 353kB     0B / 0B     0
2683f63f9b06   mqtt       0.00%     0B / 0B             0.00%     0B / 0B           0B / 0B     0

RobertBerger commented Mar 7, 2021

Without the change to the kernel command line I get the infamous:

cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation

but now it looks like docker stats starts to work again:

# docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" --no-stream
CONTAINER      CPU %     MEM USAGE / LIMIT
2755d374f0f7   0.94%     13.07MiB / 989.9MiB
0c9eb000fd11   100.57%   217.2MiB / 989.9MiB
bb1070e275ff   0.05%     17.69MiB / 989.9MiB
2683f63f9b06   0.10%     1.617MiB / 989.9MiB
# docker stats
CONTAINER ID   NAME       CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O   PIDS
2755d374f0f7   telegraf   0.08%     13.02MiB / 989.9MiB   1.32%     35.5kB / 103kB    0B / 0B     10
0c9eb000fd11   influxdb   0.50%     200.3MiB / 989.9MiB   20.24%    733kB / 146kB     0B / 0B     14
bb1070e275ff   grafana    0.23%     17.72MiB / 989.9MiB   1.79%     14.2kB / 4.27kB   0B / 0B     13
2683f63f9b06   mqtt       0.13%     1.621MiB / 989.9MiB   0.16%     0B / 0B           0B / 0B     2

A preliminary check looks pretty good now. Seems not to crash anymore ;)

trygvis (Author) commented Mar 8, 2021

@Aladex have you looked at @RobertBerger's findings about cgroup2?

Aladex (Contributor) commented Mar 17, 2021

Hi all.
This is not a problem in the Telegraf software; read this documentation: https://docs.docker.com/config/containers/runmetrics/
But I think it needs to be handled; we should write a log message with information about this.

valodzka commented:

This is 100% a bug in Telegraf. Not only are Docker metrics not collected; Telegraf stops working entirely in such cases (nothing is collected).

RobertBerger commented Mar 17, 2021

@valodzka

This is 100% a bug in Telegraf. Not only are Docker metrics not collected; Telegraf stops working entirely in such cases (nothing is collected).

I tend to completely agree with you!

CONTAINER      CPU %     MEM USAGE / LIMIT
0c9eb000fd11   0.00%     0B / 0B
bb1070e275ff   0.00%     0B / 0B
2683f63f9b06   0.00%     0B / 0B

seems to throw the error. So I guess the only thing that needs to be done is to catch the 0/NULL case in the Docker plugin. It is 100% reproducible with Telegraf running natively or in Docker.

Even better would be to check for cgroup v1 vs. cgroup v2 and read the appropriate fields in each case.
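As a rough way to tell the two hierarchies apart, a sketch in Go (it assumes the conventional /sys/fs/cgroup mount point; cgroup.controllers exists only at the root of a unified v2 hierarchy):

package main

import (
	"fmt"
	"os"
)

func main() {
	// cgroup.controllers only exists at the root of a unified (v2) hierarchy.
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroups v2 (unified hierarchy)")
	} else {
		fmt.Println("cgroups v1 (or hybrid without a unified root)")
	}
}

On the hosts reported in this thread, the unified (v2) case appears to be the one where the daemon omits percpu_usage and Telegraf panics.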

Aladex (Contributor) commented Mar 17, 2021

OK. I will reproduce this bug on a VM with Fedora 20 and try to fix it.

Aladex (Contributor) commented Mar 17, 2021

@valodzka
Can you give me the full output of the good and bad examples from this command: curl --unix-socket /var/run/docker.sock http:/v1.21/containers/my-container/stats ?

valodzka commented:

Bad:

{"read":"2021-03-17T17:15:06.569130418Z","preread":"2021-03-17T17:15:05.567625913Z","pids_stats":{"current":208,"limit":40000},"blkio_stats":{"io_service_bytes_recursive":[{"major":8,"minor":16,"op":"read","value":1105920},{"major":8,"minor":16,"op":"write","value":30867456},{"major":8,"minor":0,"op":"read","value":4415488},{"major":8,"minor":0,"op":"write","value":30867456},{"major":9,"minor":2,"op":"read","value":5521408},{"major":9,"minor":2,"op":"write","value":30867456}],"io_serviced_recursive":null,"io_queue_recursive":null,"io_service_time_recursive":null,"io_wait_time_recursive":null,"io_merged_recursive":null,"io_time_recursive":null,"sectors_recursive":null},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":1365434314000,"usage_in_kernelmode":95167360000,"usage_in_usermode":1270266953000},"system_cpu_usage":5599036030000000,"online_cpus":8,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":1361713580000,"usage_in_kernelmode":95137879000,"usage_in_usermode":1266575700000},"system_cpu_usage":5599028050000000,"online_cpus":8,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":49330290688,"stats":{"active_anon":48853880832,"active_file":2187264,"anon":48852156416,"anon_thp":0,"file":348835840,"file_dirty":1622016,"file_mapped":2568192,"file_writeback":135168,"inactive_anon":1286144,"inactive_file":344186880,"kernel_stack":2457600,"pgactivate":0,"pgdeactivate":0,"pgfault":749133,"pglazyfree":0,"pglazyfreed":0,"pgmajfault":0,"pgrefill":0,"pgscan":0,"pgsteal":0,"shmem":2838528,"slab":27287552,"slab_reclaimable":21188608,"slab_unreclaimable":6098944,"sock":1335296,"thp_collapse_alloc":0,"thp_fault_alloc":0,"unevictable":0,"workingset_activate":297,"workingset_nodereclaim":0,"workingset_refault":297},"limit":67338993664},"name":"/my-container","id":"53afe38e569f4ee127518da710571471ba3a6a0bc0297c8d7dace9ecfc8095b2","networks":{"eth0":{"rx_bytes":3022668306,"rx_packets":1266304,"rx_errors":0,"rx_dropped":0,"tx_bytes":408605192,"tx_packets":1349661,"tx_errors":0,"tx_dropped":0}}}

Good:

{"read":"2021-03-17T17:15:58.93062148Z","preread":"0001-01-01T00:00:00Z","pids_stats":{"current":187,"limit":40000},"blkio_stats":{"io_service_bytes_recursive":[{"major":9,"minor":0,"op":"Read","value":6303744},{"major":9,"minor":0,"op":"Write","value":0},{"major":9,"minor":0,"op":"Sync","value":6303744},{"major":9,"minor":0,"op":"Async","value":0},{"major":9,"minor":0,"op":"Discard","value":0},{"major":9,"minor":0,"op":"Total","value":6303744},{"major":8,"minor":16,"op":"Read","value":4063232},{"major":8,"minor":16,"op":"Write","value":0},{"major":8,"minor":16,"op":"Sync","value":4063232},{"major":8,"minor":16,"op":"Async","value":0},{"major":8,"minor":16,"op":"Discard","value":0},{"major":8,"minor":16,"op":"Total","value":4063232},{"major":8,"minor":0,"op":"Read","value":16609280},{"major":8,"minor":0,"op":"Write","value":0},{"major":8,"minor":0,"op":"Sync","value":16609280},{"major":8,"minor":0,"op":"Async","value":0},{"major":8,"minor":0,"op":"Discard","value":0},{"major":8,"minor":0,"op":"Total","value":16609280},{"major":9,"minor":2,"op":"Read","value":14368768},{"major":9,"minor":2,"op":"Write","value":0},{"major":9,"minor":2,"op":"Sync","value":14368768},{"major":9,"minor":2,"op":"Async","value":0},{"major":9,"minor":2,"op":"Discard","value":0},{"major":9,"minor":2,"op":"Total","value":14368768}],"io_serviced_recursive":[{"major":9,"minor":0,"op":"Read","value":1539},{"major":9,"minor":0,"op":"Write","value":0},{"major":9,"minor":0,"op":"Sync","value":1539},{"major":9,"minor":0,"op":"Async","value":0},{"major":9,"minor":0,"op":"Discard","value":0},{"major":9,"minor":0,"op":"Total","value":1539},{"major":8,"minor":16,"op":"Read","value":593},{"major":8,"minor":16,"op":"Write","value":0},{"major":8,"minor":16,"op":"Sync","value":593},{"major":8,"minor":16,"op":"Async","value":0},{"major":8,"minor":16,"op":"Discard","value":0},{"major":8,"minor":16,"op":"Total","value":593},{"major":8,"minor":0,"op":"Read","value":1753},{"major":8,"minor":0,"op":"Write","value":0},{"major":8,"minor":0,"op":"Sync","value":1753},{"major":8,"minor":0,"op":"Async","value":0},{"major":8,"minor":0,"op":"Discard","value":0},{"major":8,"minor":0,"op":"Total","value":1753},{"major":9,"minor":2,"op":"Read","value":807},{"major":9,"minor":2,"op":"Write","value":0},{"major":9,"minor":2,"op":"Sync","value":807},{"major":9,"minor":2,"op":"Async","value":0},{"major":9,"minor":2,"op":"Discard","value":0},{"major":9,"minor":2,"op":"Total","value":807}],"io_queue_recursive":[],"io_service_time_recursive":[],"io_wait_time_recursive":[],"io_merged_recursive":[],"io_time_recursive":[],"sectors_recursive":[]},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":1208252893044,"percpu_usage":[144388446646,148134255847,143649095560,150737744275,154310602613,148953881001,170814121289,147264813476],"usage_in_kernelmode":71080000000,"usage_in_usermode":1149920000000},"system_cpu_usage":46327519310000000,"online_cpus":8,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":0,"usage_in_kernelmode":0,"usage_in_usermode":0},"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":27403276288,"max_usage":27403542528,"stats":{"active_anon":25431363584,"active_file":7544832,"cache":201891840,"dirty":946176,"hierarchical_memory_limit":9223372036854771712,"hierarchical_memsw_limit":0,"inactive_anon":1696595968,"inactive_file":191827968,"mapped_file":2027520,"pgfault":5650326,"pgmajfault":363,"pgpgin":6312042,"pgpgout
":684309,"rss":27124793344,"rss_huge":4284481536,"total_active_anon":25431363584,"total_active_file":7544832,"total_cache":201891840,"total_dirty":946176,"total_inactive_anon":1696595968,"total_inactive_file":191827968,"total_mapped_file":2027520,"total_pgfault":5650326,"total_pgmajfault":363,"total_pgpgin":6312042,"total_pgpgout":684309,"total_rss":27124793344,"total_rss_huge":4284481536,"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0},"limit":67376185344},"name":"/my-container","id":"3344721b1ea05c15901b4485988b3fefa1ddf13b0496e56914bc31816afe5a86","networks":{"eth0":{"rx_bytes":1502871296,"rx_packets":678436,"rx_errors":0,"rx_dropped":0,"tx_bytes":281872106,"tx_packets":673401,"tx_errors":0,"tx_dropped":0}}}

valodzka commented:

As a workaround, with 1.18 the following config should fix this issue:

[[inputs.docker]]
  perdevice = false
  perdevice_include = []
  total = true

Aladex (Contributor) commented Mar 18, 2021

As a workaround, with 1.18 the following config should fix this issue:

[[inputs.docker]]
  perdevice = false
  perdevice_include = []
  total = true

No. With this config you will switch off any kind of CPU stats. Or you need to list all the other metric classes except cpu from the Docker socket in this array. And don't forget about the other CPU metrics from this dictionary:
"cpu_stats": { "cpu_usage": { "total_usage": 1365434314000, "usage_in_kernelmode": 95167360000, "usage_in_usermode": 1270266953000 }, "system_cpu_usage": 5599036030000000, "online_cpus": 8, "throttling_data": { "periods": 0, "throttled_periods": 0, "throttled_time": 0 } },

Aladex (Contributor) commented Mar 23, 2021

Hi all.
The fix is merged in the master branch.

helenosheaa self-assigned this Mar 23, 2021
helenosheaa (Member) commented:

I'm going to close this issue due to the fix that was merged. If the issue is still present, please reopen it.

JoshKeegan (Contributor) commented:

Thanks for the fix.
For anyone else coming across this who is still using the old, deprecated perdevice = true setting: you need to migrate to perdevice_include for the fix to work.

e.g.

## Whether to report for each container per-device blkio (8:0, 8:1...),
## network (eth0, eth1, ...) and cpu (cpu0, cpu1, ...) stats or not.
## Usage of this setting is discouraged since it will be deprecated in favor of 'perdevice_include'.
## Default value is 'true' for backwards compatibility, please set it to 'false' so that 'perdevice_include' setting 
## is honored.
perdevice = false

## Specifies for which classes a per-device metric should be issued
## Possible values are 'cpu' (cpu0, cpu1, ...), 'blkio' (8:0, 8:1, ...) and 'network' (eth0, eth1, ...)
## Please note that this setting has no effect if 'perdevice' is set to 'true'
perdevice_include = ["cpu", "blkio", "network"]
