CPU usage on dockerd 100% #246

Open
runningman84 opened this issue Dec 13, 2016 · 47 comments

@runningman84

If I start logspout-logstash using this docker-compose file, the CPU usage spikes to 100% and stays there...

  logspout:
    image: bekt/logspout-logstash
    mem_limit: 256m
    restart: on-failure
    depends_on:
      - logstash
    environment:
      ROUTE_URIS: logstash://logstash:5000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

before starting logspout

 7015 root      20   0 1423204  80956  29440 S 106.3  1.3   1:25.72 dockerd       

after starting logspout

 7015 root      20   0 1423204  80972  29440 S   0.3  1.3   1:58.57 dockerd  
@garceri

garceri commented Jan 26, 2017

Experiencing the same issue here:

$logspout --version
v3.2-dev-custom

$docker version
Client:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built: Wed Oct 26 23:26:11 2016
OS/Arch: linux/amd64

Server:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built: Wed Oct 26 23:26:11 2016
OS/Arch: linux/amd64

[root@ip-XXXXXXX rancher]# docker info
Containers: 56
Running: 44
Paused: 0
Stopped: 12
Images: 89
Server Version: 1.12.3
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local rancher-nfs
Network: null bridge overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.24-rancher
Operating System: RancherOS v0.7.1
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 14.94 GiB
Name: ip-XXXXXXXXXXX.ec2.internal
ID: DVSV:NLW5:VD5U:UOYY:FIOB:4RTH:7NGQ:GIBO:KKTR:VPYO:J6A2:4SZE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8

Logspout container logs with DEBUG enabled show a constant stream of:

2017/01/26 17:37:44 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:45 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:45 pump.pumpLogs(): 25635c95f804 stopped
2017/01/26 17:37:45 pump.pumpLogs(): 25635c95f804 started
2017/01/26 17:37:46 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:46 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:46 pump.pumpLogs(): a27d4bfb3a41 stopped
2017/01/26 17:37:46 pump.pumpLogs(): a27d4bfb3a41 started
2017/01/26 17:37:46 pump.pumpLogs(): 32e734dcf238 stopped
2017/01/26 17:37:46 pump.pumpLogs(): 32e734dcf238 started
2017/01/26 17:37:47 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:47 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:48 pump.pumpLogs(): 495069bafc93 stopped
2017/01/26 17:37:48 pump.pumpLogs(): a27d4bfb3a41 stopped
2017/01/26 17:37:48 pump.pumpLogs(): 495069bafc93 started
2017/01/26 17:37:48 pump.pumpLogs(): a27d4bfb3a41 started

@mattatcha
Member

@pythianarceri Are you able to provide a docker-compose.yml I can use to reproduce the issue?

@garceri

garceri commented Jan 26, 2017

We use a custom-built image compiled with the logstash output module:

logspout:
  image: pythiant9shared/logspout-logstash
  environment:
    DEBUG: '1'
    ROUTE_URIS: logstash://logstash:5000
  stdin_open: true
  external_links:
  - logstash/logstash-collector:logstash
  volumes:
  - /var/run/docker.sock:/var/run/docker.sock
  tty: true
  labels:
    io.rancher.scheduler.global: 'true'
    io.rancher.container.hostname_override: container_name

@garceri

garceri commented Jan 31, 2017

Any luck getting this fixed, or at least identifying the issue? Is it normal for the pumps to be started/stopped all the time?
We are running under RancherOS 1.3.3, in case that matters.

@mattdodge

Same thing here. I end up seeing a lot of these messages in /var/log/docker.log

time="2017-03-13T21:36:26.491710944Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value"
time="2017-03-13T21:36:26.514911587Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value"
time="2017-03-13T21:36:26.547530212Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value"

It's happened a couple of times. Sometimes the "invalid character" is different, but the gist of the log message is the same. We see dockerd at around 100% CPU and a constant stream of those errors.

@byronwilliams

Hi all,

We've managed to reproduce the same issue as @pythianarceri reported on Jan 26. To add some detail: running docker logs -f {ContainerID} does not actually follow the logs; it exits straight away. That is consistent with pumpLogs starting and stopping, since the HTTP connection is being closed for some reason.

This is reproducible on the latest master commit (7c22df4) as well as 513b74c and 8b98313.

Our container runs a few processes inside, all of which have PYTHONUNBUFFERED=1.

Whilst I consider this to be a bug in our container not writing to stdout and stderr correctly, it also makes our docker nodes perform really badly when the containers are scheduled.

Could we add some kind of counter when the pump starts, so that if it restarts {n} times within a second it backs off?

# docker version
Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.4
 Git commit:   60ccb22
 Built:        Mon Mar 13 18:07:51 UTC 2017
 OS/Arch:      linux/arm

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.4
 Git commit:   60ccb22
 Built:        Mon Mar 13 18:07:51 UTC 2017
 OS/Arch:      linux/arm
 Experimental: false

@woutor

woutor commented May 3, 2017

Same issue here. Reverted to the v3.1 docker image, and logging and CPU usage are back to normal.

@sebgl

sebgl commented May 9, 2017

Same issue; we also reverted to the v3.1 docker image.

We are running Docker version 17.04.0-ce, build 4845c56, and deploying our logspout container with the new swarm mode ('docker stack deploy'). We're not using logstash; we forward logs directly to a remote syslog.

Here is the content of our docker-compose.yml file:

version: '3'

services:
  logspout:
    image: gliderlabs/logspout
    environment:
      SYSLOG_STRUCTURED_DATA: "- $SYSLOG_CONF"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - 10888:80
    command: syslog+tls://$LOGS_ENDPOINT
    deploy:
      mode: global

@gaieges

gaieges commented May 31, 2017

I have the same problem, using the most recent master image (ef55cfcceda7) and version v3.3-dev on both docker 17.04.0-ce and 1.13.1.

It's fairly difficult to reproduce, and it seems to occur only after a period of time - I'm not sure whether it's a result of complex log messages (or potentially unexpected charsets), or just the volume of messages shipped off.

While it's at high CPU usage, I have also seen a number of weird messages like the ones mattdodge described.

@woutor @sebgl after you reverted back to version 3.1 - did you see the problem pop up again?

@woutor

woutor commented May 31, 2017

@gaieges: Nope, with 3.1 the CPU usage remains normal.

@gdurr

gdurr commented Jun 7, 2017

Same issue here with Docker 17.05.0-ce and logspout 3.3. We can't go back to 3.1 because we need the label filtering feature.

@avoinea

avoinea commented Jun 7, 2017

Same issue here. Reverted to the v3.1 docker image, and logging and CPU usage are back to normal.

@michaelshobbs
Member

I'm using v3.3 in production with papertrail. Is there anybody experiencing this issue that is NOT using logstash? I'm thinking the problem might be in that module...

@gaieges

gaieges commented Jun 7, 2017

@michaelshobbs I don't use the logstash module, I just use the tcp+syslog output.

@michaelshobbs
Member

Interesting... anybody have any thoughts on this one?

@SqiSch

SqiSch commented Jun 16, 2017

Same issue here, but I cannot confirm the high CPU load - only a lot of log lines in /var/log/syslog, and the pump.pumpLogs() messages when debugging is enabled.

I am also using the logstash adapter.

.....
165+02:00" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
Jun 16 11:55:18 xyz docker[11404]: time="2017-06-16T11:55:18.307515574+02:00" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
Jun 16 11:55:18 xyz docker[11404]: time="2017-06-16T11:55:18.432245150+02:00" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
....

@mrbobbytables

We encountered the same issue today, randomly, on one of our hosts. Unfortunately I do not have any more info than what others have reported.
Docker 17.03.1-ce
RancherOS v1.0.2
logspout configured to send to syslog+tcp

@sfrique

sfrique commented Jun 28, 2017

Hello folks,

I was having the same problem:

Jun 28 14:07:20 xxx docker[7686]: time="2017-06-28T14:07:20.892703572Z" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"
Jun 28 14:07:20 xxx docker[7686]: time="2017-06-28T14:07:20.894972987Z" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"

It turned out to be a corrupt json-file.log; it happened when the disk filled up, so the json-file.log got truncated.
I had something like this in my json-file.log after the disk space issue was fixed and docker was running again.

{"log":"    2017/06/28 07:46:00 [INFO] consul: adding WAN ser{"log":"    2017/06/28 07:46:00 [INFO] consul: adding WAN server xxx.dc1 (Addr: 10.240.0.21:8300) (DC: dc1)\n","stream":"stdout","time":"2017-06-28T07:46:00.5738544Z"}

So it isn't valid JSON, and it breaks docker logs; somehow it makes logspout enter a loop and makes docker print those lines.

If you run docker logs $container on the container with the corrupt json-file.log, you will get one line of the error.

So look for corrupted or badly formatted json-file.log files.

I just removed that line and everything was working normally again.

Hope that helps.

@gaieges

gaieges commented Jun 28, 2017

On @sfrique's note - I wonder if using the journald log driver (where possible) instead of json-file would help with this?

@alhafoudh

@gaieges I tried with journald, same problem

@pauln

pauln commented Jul 21, 2017

We're seeing the same thing - 100% CPU usage when running logspout (pumping to logstash). Enabling logspout's debug logs showed rapid repetition of log entries such as:

2017/07/20 23:57:20 pump.pumpLogs(): fb6033455fd0 stopped
2017/07/20 23:57:20 pump.pumpLogs(): fb6033455fd0 started

Initially, we were seeing this on most of our hosts, with two container IDs showing up on each. The containers in question were Rancher's metadata and DNS containers. We upgraded Rancher from 1.6.2 to 1.6.5 and upgraded all the system services, and logspout now seems to run happily on most of our hosts. However, we're still getting the stopped/started log spam (and the corresponding CPU load) on our DB hosts - with only a single container ID flapping: mongodb.

@garceri

garceri commented Jul 21, 2017

The scenario described by @pauln is exactly what we are experiencing (we haven't had a chance to upgrade to 1.6.5 yet). Out of curiosity, @pauln, what docker version are you running on your hosts?

@pauln

pauln commented Jul 23, 2017

@garceri, we're currently running Docker 17.03 CE, though we're looking at upgrading to 17.06 CE in the near future.

@robcza

robcza commented Jul 26, 2017

My thoughts on this issue:

  • it is quite serious, because it can affect the whole host and the performance of other containers
  • according to other people's reports, it does not affect older logspout versions -> it seems it could be prevented at the logspout level. What is the difference between the versions?
  • malformed logs seem to be the trouble. How do they emerge? Could it be prevented at the docker daemon level?
  • can I help by gathering any data while this happens? Running with DEBUG: true shows just a constant stream of pump start/stop:
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c stopped
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c started
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c stopped
2017/07/26 07:09:48 pump.pumpLogs(): 21c90bb7ac5c started

@michaelshobbs
Member

@robcza thanks for the additional details. Can you provide a full debug log, from start to stop, of a logspout container that has entered this state? Looking for any other clues here.

To your question about the difference between versions: we tag each release, so you can browse the history by comparing tags. Additionally, we have a CHANGELOG.md with convenient links to compare each version.

@ko-christ

Same for me on docker 17.03.1-ce, using the image bekt/logspout-logstash:latest.

I have managed to capture a full debug log since the logspout container was created.
There is a constant start/stop stream for c21bde2e91e7.
I have had this issue on other machines as well, with various services.

# docker logs elk-logspout 
2017/07/28 10:50:57 setting allowTTY to: false
# logspout v3.3-dev-custom by gliderlabs
# adapters: raw udp tcp logstash
# options : debug:True persist:/mnt/routes
# jobs    : http[]:80 pump routes
# routes  :
#   ADAPTER	ADDRESS		CONTAINERS	SOURCES	OPTIONS
#   logstash+tcp  prod00:1111      map[]
2017/07/28 10:50:57 pump.Run(): using inactivity timeout:  0s
2017/07/28 10:50:57 pump.pumpLogs(): 4dd58c94144a ignored: environ ignore
2017/07/28 10:50:57 pump.pumpLogs(): 741c807703d4 started
2017/07/28 10:50:57 pump.pumpLogs(): 03c007defa61 started
2017/07/28 10:50:57 pump.pumpLogs(): 0398b9b5542e started
2017/07/28 10:50:57 pump.pumpLogs(): 0b1b04fd17a9 started
2017/07/28 10:50:57 pump.pumpLogs(): 18b4f33a6b1d started
2017/07/28 10:50:57 pump.pumpLogs(): 18cd5d3b88cb started
2017/07/28 10:50:57 pump.pumpLogs(): dfebc3118ce3 started
2017/07/28 10:50:57 pump.pumpLogs(): e12708a548c4 started
2017/07/28 10:50:57 pump.pumpLogs(): 78423dfc9805 started
2017/07/28 10:50:57 pump.pumpLogs(): 872c498a0c2f started
2017/07/28 10:50:57 pump.pumpLogs(): 2f78048ee356 started
2017/07/28 10:50:57 pump.pumpLogs(): bcedcc0cd5d2 started
2017/07/28 10:50:57 pump.pumpLogs(): 0b8e0146d61c started
2017/07/28 10:50:57 pump.pumpLogs(): a1c7966d4ad3 started
2017/07/28 10:50:57 pump.pumpLogs(): d49788286a32 started
2017/07/28 10:50:57 pump.pumpLogs(): 41c5e05d0019 started
2017/07/28 10:50:57 pump.pumpLogs(): ce7030af1846 started
2017/07/28 10:50:57 pump.pumpLogs(): 78cff52b2cc8 started
2017/07/28 10:50:57 pump.pumpLogs(): c5b69f63a3e8 started
2017/07/28 10:50:57 pump.pumpLogs(): 52ab8eb7e542 started
2017/07/28 10:50:57 pump.pumpLogs(): f26b009798f4 started
2017/07/28 10:50:57 pump.pumpLogs(): 7b1f7d90859a started
2017/07/28 10:50:57 pump.pumpLogs(): a5b864e479b7 started
2017/07/28 10:50:57 pump.pumpLogs(): 08581c7c3829 started
2017/07/28 10:50:57 pump.pumpLogs(): 9e4497a0dcaf started
2017/07/28 10:50:57 pump.pumpLogs(): 227ca66d7f83 started
2017/07/28 10:50:57 pump.pumpLogs(): 6f6b327b7b9e started
2017/07/28 10:50:57 pump.pumpLogs(): 846ba10f0ee7 started
2017/07/28 10:50:57 pump.pumpLogs(): e672bc2d0fe2 started
2017/07/28 10:50:57 pump.pumpLogs(): f1d48ee0211f started
2017/07/28 10:50:57 pump.pumpLogs(): f8fdca2aff57 started
2017/07/28 10:50:57 pump.pumpLogs(): 5856ed01b7ff started
2017/07/28 10:50:57 pump.pumpLogs(): 57e77510b9de started
2017/07/28 10:50:57 pump.pumpLogs(): a05333be7c0f started
2017/07/28 10:50:57 pump.pumpLogs(): dcab95f7ca2f started
2017/07/28 10:50:57 pump.pumpLogs(): f28130fa353a started
2017/07/28 10:50:57 pump.pumpLogs(): 464d4d0b24bd started
2017/07/28 10:50:57 pump.pumpLogs(): 139fccfc578f started
2017/07/28 10:50:57 pump.pumpLogs(): 341f311f43c0 started
2017/07/28 10:50:57 pump.pumpLogs(): f08a73de23ab started
2017/07/28 10:50:57 pump.pumpLogs(): 7e88c606f51b started
2017/07/28 10:50:57 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:50:57 pump.pumpLogs(): 52fe6826666a started
2017/07/28 10:51:02 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:02 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:06 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:06 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:09 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:09 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:11 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:11 pump.pumpLogs(): c21bde2e91e7 started
2017/07/28 10:51:14 pump.pumpLogs(): c21bde2e91e7 stopped
2017/07/28 10:51:14 pump.pumpLogs(): c21bde2e91e7 started

@michaelshobbs
Member

Thanks @ko-christ for the extra detail! Next time this happens, it would be great to see if there's anything odd about the output of the container with the "thrashing" log pump (i.e. docker logs c21bde2e91e7 in this particular case).

Same request to others who have noticed the stop/start thrashing symptom.

@shahar-davidson

Hi all,
We've just encountered this issue recently (logspout starts and CPU jumps to ~100%).
Here are my insights after investigating this for a couple of hours:
In our environment, this was caused by malformed JSON output in the container logs (when using json-file logging).
What apparently happened is that the machine ran out of space (we had a clear indication of that) and log lines could not be written properly - so we got JSON log lines that were cut off in the middle.
When that happens, docker will print the following whenever someone tries to fetch the logs (even a simple docker logs command will reproduce it):

"Error streaming logs: invalid character 'l' after object key:value pair"

Apparently, docker will print this error and stop returning log lines, even if there were many more lines after the corrupt JSON line.
On our machine we had about a dozen services constantly logging, and this JSON line corruption was found in most of them.
It seems that docker encounters this and emits an error ( https://github.com/moby/moby/blob/8af4db6f002ac907b6ef8610b237879dfcaa5b7a/daemon/logs.go#L109 ), yet logspout immediately starts requesting logs again and, again, docker bombards the docker service logs with the above error message (tens per second).

Possible workarounds for such an issue (I haven't verified these options yet):

  1. Setting the BACKLOG env var and restarting services (see the sketch after this list)
  2. Stopping the service and moving/truncating the existing log files so that they start off clean.
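
For what it's worth, a minimal sketch of option 1, assuming (as the name suggests) that BACKLOG=false makes logspout skip the existing backlog and only stream new log lines; the route URI here is just an example:

docker run -d --name logspout \
  -e BACKLOG=false \
  -v /var/run/docker.sock:/var/run/docker.sock \
  gliderlabs/logspout \
  syslog+tcp://logs.example.com:514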

So this actually seems to be a docker issue. The best way to deal with it, IMHO, is to have docker detect such cases, roll the log file, and put the corrupt log file aside (while emitting a meaningful warning about it).

One thing that puzzles me is why reverting to v3.1 "fixes" this issue (I haven't tried reverting to v3.1 yet; I'm using v3.3).

@pythiancastaneda

Hello,
I am having the same issue. The Docker log is flooded with messages like this:

time="2017-08-30T22:30:31.922624624Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:31.977376555Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.172320949Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.570219033Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.662642497Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 
time="2017-08-30T22:30:32.860254549Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 

And the docker logs for bekt/logspout-logstash:master are:

2017/08/30 22:18:33 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:33 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:34 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:34 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:34 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:34 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:34 pump.Run() event: 42ad09d40141 network:connect
2017/08/30 22:18:34 pump.Run() event: orient-backu volume:mount
2017/08/30 22:18:34 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:35 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:35 pump.Run() event: orient-datab volume:mount
2017/08/30 22:18:35 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:35 pump.Run() event: orient-log volume:mount
2017/08/30 22:18:35 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:35 pump.Run() event: rancher-cni volume:mount
2017/08/30 22:18:35 pump.Run() event: efe51a4c5ed0 volume:mount
2017/08/30 22:18:36 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:36 pump.Run() event: 58a80e107fc1 start
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:36 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:36 pump.pumpLogs(): 58a80e107fc1 started
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:36 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:37 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:37 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:37 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:37 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:37 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:38 pump.pumpLogs(): be68d546188d stopped
2017/08/30 22:18:38 pump.pumpLogs(): be68d546188d started
2017/08/30 22:18:38 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:38 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:38 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:38 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:38 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:38 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:39 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:39 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:39 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:39 pump.pumpLogs(): f6fddb18f835 started
2017/08/30 22:18:39 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:40 pump.pumpLogs(): 9ab5e8cc0c36 stopped
2017/08/30 22:18:40 pump.pumpLogs(): 9972da45bf76 stopped
2017/08/30 22:18:40 pump.pumpLogs(): 9972da45bf76 started
2017/08/30 22:18:40 pump.pumpLogs(): 9ab5e8cc0c36 started
2017/08/30 22:18:40 pump.pumpLogs(): f6fddb18f835 stopped
2017/08/30 22:18:40 pump.pumpLogs(): f6fddb18f835 started

I have trimmed some of the docker json logs and restarted the containers, but it is not helping; once the logspout container starts, it makes dockerd eat almost all of the CPU.

Thanks in advance!

@michaelshobbs
Member

@mattatcha thoughts on this given the new info?

@sfrique

sfrique commented Aug 31, 2017

@pythiancastaneda To figure out which container is giving you trouble:

# tail -f /var/log/syslog | grep docker
In a second terminal:
# docker logs <container>   (for each container you have)

That way you will see in the logs:

time="2017-08-30T22:30:32.860254549Z" level=error msg="Error streaming logs: invalid character '\\x00' looking for beginning of value" 

And then you can remove the log file or the bad line.

Hope that helps.

I recently had the same problem; I did the steps above and it fixed the issue.

@stevecalvert mentioned this issue Sep 1, 2017
@leriel

leriel commented Sep 15, 2017

Thanks for the update @stevecalvert! In my case the issue was the same: corrupt container logs. For me, 3.2.2 was also logging an additional error; posting it just in case.

logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d started
logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d stopped with error: Unrecognized input header: 3
logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d started
logspout_1  | 2017/09/15 17:42:43 pump.pumpLogs(): 0e1c957edf2d stopped with error: Unrecognized input header: 3

The only way for me to fix the issue, as for others, was to remove and recreate the container.

@konvergence

konvergence commented Sep 20, 2017

@leriel
I get the same issue with logspout 3.2.2 and the gelf plugin.

But I'm not sure I understand the workaround...

Which container do you recreate? Logspout, or the containers with corrupted json-file logs?

Is there any way to correct the logspout code?

@konvergence

Don't know if there is any relationship...
but I get this message if I try to tail one of the *-json.log files reported by logspout in debug mode:

tail: inotify cannot be used, reverting to polling: Too many open files
This seems to be a default limit on Ubuntu? Could that explain some of the corrupted json-file logs?
see https://nefaria.com/2014/08/tail-inotify-resources-exhausted/
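
If it is the inotify limit, a quick way to check and raise it (the numbers below are only example values; persist them in /etc/sysctl.conf if they help):

# current limits
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# raise them temporarily (example values)
sudo sysctl -w fs.inotify.max_user_instances=512
sudo sysctl -w fs.inotify.max_user_watches=524288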

@konvergence

Another piece of info...
When I purge all the json-file logs that logspout reports as "stopped" and restart it, CPU usage on dockerd stays flat.

Here is my code:

for containerid in $(docker logs $(docker ps -a -q --filter "label=io.rancher.stack_service.name=graylog2/logspout") 2>&1 | grep stopped | cut -d " " -f 4 | sort -u); do
    ls -l /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
    > /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
done

@konvergence

One more piece of information:
I use docker engine 17.03.2-ce. Last week we had a power shutdown on the nodes.
In my case, I think that was the cause of the corrupted json-file logs (Error streaming logs: invalid character 'l' after object key:value pair).
Do you know if this issue can be solved by a new docker-ce release?

@konvergence

Here is some news...
but I don't know if there is a relationship.

I got all the container IDs that logspout reports as stopped:
docker logs $(docker ps -a -q --filter "label=io.rancher.stack_service.name=graylog/logspout") 2>&1| grep stopped | cut -d " " -f 4 | sort -u

c062de60016b
c4d6b098968c
f8f18bba6203

I did indeed find '\x00' in each corrupted file, and the files are big:


for containerid in $(docker ps -a -q); do
    grep -Pa '\x00' /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
    if [ $? -eq 0 ]; then
        echo ${containerid}
        ls -lh /var/lib/docker/containers/${containerid}*/${containerid}*-json.log
    fi
done


{"log":"Creating volume folders...\n","stream":"stdout","time":"2017-09-16T15:50:34.379859731Z"}
c4d6b098968c
-rw-r----- 1 root root 564M Sep 21 12:11 /var/lib/docker/containers/c4d6b098968c475a82ed7ff34cc4e50344f097155f10d0d9d07528437b4417e7/c4d6b098968c475a82ed7ff34cc4e50344f097155f10d0d9d07528437b4417e7-json.log
{"log":"Creating volume folders...\n","stream":"stdout","time":"2017-09-16T15:50:33.076131673Z"}
c062de60016b
-rw-r----- 1 root root 478M Sep 21 12:11 /var/lib/docker/containers/c062de60016bd13faeebceb0301c121b35a0dfbc620048584a3d06c3d0449761/c062de60016bd13faeebceb0301c121b35a0dfbc620048584a3d06c3d0449761-json.log
{"log":"Creating volume folders...\n","stream":"stdout","time":"2017-09-16T15:50:47.996912867Z"}
f8f18bba6203
-rw-r----- 1 root root 806M Sep 21 12:11 /var/lib/docker/containers/f8f18bba62032e188dd75f76b9cc061936e804995289d7406b89d43e02d7cc36/f8f18bba62032e188dd75f76b9cc061936e804995289d7406b89d43e02d7cc36-json.log

@pierreozoux

pierreozoux commented Oct 19, 2017

We experienced the same issue yesterday.
We don't use logspout, but this is the closest issue we found, with this kind of error in the logs:

Jun 28 14:07:20 xxx docker[7686]: time="2017-06-28T14:07:20.892703572Z" level=error msg="Error streaming logs: invalid character 'l' after object key:value pair"

The invalid character can be anything in our case.

We are still investigating, but would like to see if we can find the issue together.
Based on our current information, it looks like one of our containers is outputting some kind of JSON, under specific conditions, that dockerd doesn't like at all. Didn't find any related moby issue though :/

We'll let you know of our findings.

Our setup is different though: docker 1.12 + k8s 1.7.5, with the scalyr agent ingesting logs from dockerd.

@mjablecnik

Hello, I have:

$ docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:48 2017
 OS/Arch:      linux/amd64
 Experimental: false

System: Debian 9.1

logspout setup in Rancher:
screenshot

and in syslog, dockerd errors:

Oct 24 10:03:39 web4-vultr dockerd[31377]: time="2017-10-24T10:03:39.819294736+01:00" level=error msg="Error streaming logs: invalid character '\x00' looking for beginning of value" container=db62afc089ec method="(*Daemon).ContainerLogs" module=daemon
Oct 24 10:03:41 web4-vultr dockerd[31377]: time="2017-10-24T10:03:41.950449384+01:00" level=error msg="Error streaming logs: invalid character '\x00' looking for beginning of value" container=db62afc089ec method="(*Daemon).ContainerLogs" module=daemon
Oct 24 10:03:42 web4-vultr dockerd[31377]: time="2017-10-24T10:03:42.717452402+01:00" level=error msg="Error streaming logs: invalid character '\x00' looking for beginning of value" container=6488521fdc8f method="(*Daemon).ContainerLogs" module=daemon

and CPU usage is very high.

I don't know why, but downgrading logspout to v3.1 fixes it.

@rhuddleston

rhuddleston commented Oct 30, 2017

We're on AWS and see this issue regularly. We have ~100 EC2 instances, and when AWS suddenly retires or shuts down an instance, the server restarts but some of the docker JSON log files end up corrupt.

I can see where the corruption is, and on which line, by doing the following:

for FILE in /var/lib/docker/containers/*/*-json.log; do cat $FILE | jq . > /dev/null; done

I used to just shut down the container that corresponded to the corrupt log(s), manually remove the corrupted (half-written) JSON line, then restart the container, and all was well. With newer versions of docker that no longer fixes it; I now have to remove all the containers and re-deploy them. I think this is really a Docker issue: if Docker detects a half-written file, it needs to handle it correctly. Basically 100% of the time AWS restarts a node unexpectedly, this happens for us. Typically it's faster to completely replace the node than to spend a bunch of time trying to fix this by hand.
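
If it helps, here is a small variation on the loop above that also prints which file is the corrupt one (jq exits non-zero on a parse error, but does not name the file):

for FILE in /var/lib/docker/containers/*/*-json.log; do
  # jq chokes on the truncated/corrupt line; print the file name when it does
  jq . "$FILE" > /dev/null 2>&1 || echo "corrupt: $FILE"
done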

@ghost

ghost commented Dec 12, 2017

We have the same issue, but found the offending log files using the grep below.

grep -P '\x00' /var/lib/docker/containers/**/*json.log

After manually removing the null "\x00" characters from the log, the issue went away.
It seems to be a docker issue (docker/for-linux#140) related to unexpected host/daemon restarts.
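
A minimal sketch of that cleanup, assuming you stop the affected container first and keep a backup (the path below is a placeholder for the corrupt file found by the grep above):

LOG=/var/lib/docker/containers/<container-id>/<container-id>-json.log
cp "$LOG" "$LOG.bak"
# tr drops only the NUL bytes; everything else is left untouched
tr -d '\000' < "$LOG.bak" > "$LOG"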

@ppiccolo

ppiccolo commented Feb 14, 2018

Me too, same issue... any suggestions?

@shahar-davidson

We resolved to use #321 as a workaround for the time being (note that BACKLOG should not be set, or should be set to true).

@mahnunchik

I've run into 100% CPU usage and this error:

Mar 23 16:48:52 burger dockerd[569]: time="2018-03-23T16:48:52.971499270+02:00" level=error msg="Error streaming logs: EOF" container=b1dc4286ba72 method="(*Daemon).ContainerLogs" module=daemon

on the cadvisor container.

My config:

version: '2.3'

services:

  logspout:
    image: gliderlabs/logspout:v3.2.4
    container_name: logspout
    command: syslog+tcp://logs-01.loggly.com:514
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      TAIL: 10

avoinea added a commit to eea/eea.rancher.catalog that referenced this issue Jun 14, 2018
@arunesh90

Any update on this? Just had it happen on a server, which became pretty unresponsive because of it.

@kamaev

kamaev commented Aug 5, 2018

+1 for rolling back to 3.1.

I've just force pushed ca675f3 and manually added:

- glide.lock
- glide.yaml
- Dockerfile
- build.sh
- and also /adapters/multiline/multiline.go to use multiline

from the latest master to build an image.
The new container works well.

@biguphpc

biguphpc commented Jan 22, 2021

I just wanted to confirm this issue with the latest CE version of Docker (19.03.12) and logspout (syslog+tls). We've had production servers crashing because of this. Dockerd was using 1600% CPU and then became unresponsive on a 32-CPU server. I noticed a very high number of goroutines showing in docker info in debug mode. Stopping the logspout container and waiting a few hours resolved the issue, and the dockerd CPU was back to normal (along with the number of goroutines). I tried workarounds like TAIL=0 without success. Reverting to v3.1 solves the issue for now.

We had about 30 containers running on that machine; logging was heavy but not excessive. It's easily reproducible with the latest logspout version: starting logspout with a lot of containers logging shows dockerd using high CPU almost immediately, and it increases over time.

Maybe the newer versions query the docker daemon differently than v3.1 and flood it with requests, leading to the high number of goroutines and high CPU?
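
For anyone who wants to watch that goroutine count (it shows up in docker info when the daemon runs with debug enabled), a rough way to track it while reproducing the problem:

# print the goroutine count once a minute
while true; do
  docker info 2>/dev/null | grep -i goroutines
  sleep 60
done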
