Journalbeat Stops Reading Journals on Journald Rotation #9533

Closed
joelika opened this issue Dec 13, 2018 · 44 comments

@joelika

joelika commented Dec 13, 2018

I've been evaluating journalbeat to send data to Logstash, and journalbeat stops sending data daily, which I assume is tied to journald rotating its journals. A quick restart of journalbeat gets data flowing again.

  • Version: 6.5.2
  • Operating System: RHEL 7.6

My journald storage mode is set to auto, and I've tried it both with persistent storage set up and without. I've also tried it with the journal location specified in journalbeat and without.
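
For reference, the journald setting in question (a sketch of the standard /etc/systemd/journald.conf; the exact contents of the affected hosts' config are an assumption):

# /etc/systemd/journald.conf
[Journal]
Storage=auto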

@kfalconer

Also seeing the same issue with:

  • Version: 6.5.3
  • Operating System: Ubuntu 16.04

@kristinn

I'm able to reproduce using master with 952d922 as the newest commit.

This has been causing us lots of issues since we are processing around 10 million records a day. The "workaround" is to restart journalbeat every 5 minutes.

@cachedout
Contributor

Possibly related to this: coreos/go-systemd#233

If that's the case, they were waiting on the Go 1.11 release earlier this year, so they may now be able to merge their pending fix. I nudged the upstream maintainers to check.

kvch added a commit that referenced this issue Jan 28, 2019
Previously sd_journal_wait was not used. From now on all changes to journals are detected.

I also added custom seccomp policy to Journalbeat.

Closes #9533
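
In essence, the fix makes the reader block on sd_journal_wait when it reaches the end of the journal. A minimal sketch of such a loop, assuming the github.com/coreos/go-systemd/sdjournal bindings (illustrative only, not Journalbeat's actual code; building it requires cgo and the libsystemd headers):

package main

import (
	"fmt"
	"log"

	"github.com/coreos/go-systemd/sdjournal"
)

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		log.Fatal(err)
	}
	defer j.Close()

	for {
		n, err := j.Next()
		if err != nil {
			log.Fatal(err)
		}
		if n == 0 {
			// At the end of the journal: block until new entries arrive or
			// the journal files change (rotation shows up as
			// SD_JOURNAL_INVALIDATE). Wait wraps sd_journal_wait, the call
			// this fix introduces; without it the reader can stall on the
			// old file after rotation.
			j.Wait(sdjournal.IndefiniteWait)
			continue
		}
		entry, err := j.GetEntry()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(entry.Fields["MESSAGE"])
	}
}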
kvch added a commit to kvch/beats that referenced this issue Feb 1, 2019
Previously sd_journal_wait was not used. From now on all changes to journals are detected.

I also added custom seccomp policy to Journalbeat.

Closes elastic#9533
(cherry picked from commit cead4b6)
@SpComb

SpComb commented Feb 4, 2019

Currently testing this as a workaround, installed in /etc/cron.hourly/journalbeat-rotate:

#!/bin/sh

set -ue

# journald assigns a new File ID to system.journal when it rotates, so a
# changed ID means journalbeat may be stuck reading the old file.
SYSTEM_JOURNAL_FILE_ID=$(journalctl --file /var/log/journal/*/system.journal --header | awk -F ': ' '/File ID/ { print $2 }')

if [ ! -e /var/lib/journalbeat/system-journal.id ] || [ "$SYSTEM_JOURNAL_FILE_ID" != "$(cat /var/lib/journalbeat/system-journal.id)" ]; then
  systemctl try-restart journalbeat && echo "$SYSTEM_JOURNAL_FILE_ID" > /var/lib/journalbeat/system-journal.id
fi

kvch added a commit that referenced this issue Feb 4, 2019
…entries change (#10485)

Cherry-pick of PR #9994 to 6.x branch. Original message:

Previously sd_journal_wait was not used. From now on all changes to journals are detected.

I also added custom seccomp policy to Journalbeat.

Closes #9533
@kvch
Contributor

kvch commented Feb 12, 2019

After testing journalbeat on several systemd versions, it turns out that this is not a journalbeat bug. The beat is affected by the same issue as journalctl, since it uses the same C API: systemd/systemd#474
That issue was fixed and released in systemd v233, so journalbeat currently requires at least systemd v233.
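
To check which systemd version a host is running (a standard command; the sample output below is what a RHEL/CentOS 7 box like the original reporter's would print):

$ systemctl --version | head -n1
systemd 219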

@JustinHead

JustinHead commented Feb 12, 2019

We are experiencing this, or some other issue that causes journalbeat to stop sending logs, on Ubuntu 18.04, which uses systemd v237. As in the other reports, restarting journalbeat gets it moving again.

@mickymiek

Same issue here with systemd v238. Restarting journalbeat "solves" the issue.

@kvch
Contributor

kvch commented Feb 14, 2019

@JustinHead @mickymiek Do you know what triggers log shipping to stop (e.g. rotation, vacuuming)?
I would like to track down whether the issue you are facing is due to a bug in the C API or in Journalbeat.

@jbguerraz

Hello @kvch ,

@mickymiek and I are now investigating the case. We believe it could be related to rotation, but we have no evidence yet.

During the investigation, probably after deleting journal files and then running journalctl --vacuum-time=1seconds, we started to see tons of "[input] input/input.go:167 Error while reading event: error while waiting for event: operation not permitted" in the logs, caused by inotify_init1(O_NONBLOCK|O_CLOEXEC) = -1 EPERM (Operation not permitted)

@kvch
Contributor

kvch commented Feb 14, 2019

This might be due to an incorrect seccomp policy. Could you please try turning off the seccomp policy to test this theory?

seccomp.enabled: false

@jbguerraz

jbguerraz commented Feb 14, 2019

Yes, this is related to seccomp; disabling it works around the permission issue.
BTW, we're running CoreOS, the same case as #10605, so it is probably covered by your #10593.
We're now monitoring it, to make sure the "stops reading journals" problem is really related to that seccomp issue.

philandstuff added a commit to alphagov/verify-infrastructure that referenced this issue Feb 14, 2019
This removes the journalbeat restart cron job for prometheus only.

My goal is to let it run for a bit and observe if we're actually
hitting this issue: elastic/beats#9533

i.e. I plan to see if we stop sending logs, and whether that coincides with journal rotation.

I picked prometheus for this because a) it has a different cloudinit
file, and b) it's less critical than everything else
@philandstuff

I'm experiencing this issue. I'm running Ubuntu 18.04.2 (systemd v237).

Could the issue be that the journalbeat binaries are built on a Debian stretch image, and stretch has libsystemd v232? I wonder if it might be possible to use libsystemd from backports (which is v239) to get the fixed version?

@suhailpatel

suhailpatel commented Feb 17, 2019

👋, we faced a similar issue (operation not permitted) on CoreOS's systemd (we reported #10605) and found the cause was that the seccomp policy was missing ppoll (confirmed using strace).

We've been running with #10593 which fixes the issue 🙌
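
For anyone wanting to reproduce that confirmation, a rough sketch of the strace approach (the exact flags here are our assumption, not from the original report):

# Attach to the running beat and watch for syscalls the seccomp filter
# rejects; EPERM on ppoll/inotify_init1 points at a missing whitelist entry.
sudo strace -f -p "$(pidof journalbeat)" -e trace=ppoll,inotify_init1 2>&1 | grep EPERM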

@jbguerraz

@kvch confirmed, no more issues since we disabled seccomp. Gonna go with the master branch since #10593 got merged. Thx!

@tovern

tovern commented Feb 22, 2019

I'm having the same trouble with Journalbeat 6.6.1 on Ubuntu 16.04. Setting "seccomp.enabled: false" has made no difference for me unfortunately. The debug logs show nothing useful, but the journal timestamps indicate that this is happening whenever the journal is rotated.

@kvch
Contributor

kvch commented Feb 22, 2019

What do you mean by the same issue? Two problems are reported in this issue.
What is your systemd version? If it is v232 or below, journalbeat fails to read entries after rotation because of a bug in systemd.
If you are seeing operation not permitted errors, that is due to a bug in journalbeat, which is going to be fixed in the next release. As a workaround, you can set seccomp.enabled to false.

@kvch
Contributor

kvch commented Mar 6, 2019

I have opened a new issue so that the two problems are tracked separately: #11110

@hamelg

hamelg commented Mar 6, 2019

Unfortunately, not. The systemd API call we are using right now has the bug. But in a future release, we would like to refactor the code to remove this dependency.

Sorry, but my question was about the bug in the systemd API (<= v232) used by journalbeat.
CentOS/RHEL 7.6 runs systemd v219. Is there a chance of seeing a fix in the near future?

@kvch
Contributor

kvch commented Mar 6, 2019

This is the issue to track that problem. It is definitely something we want to fix in an upcoming release.

@hamelg

hamelg commented Mar 6, 2019

So, its status shouldn't be closed.

@kvch kvch reopened this Mar 6, 2019
@kvch
Contributor

kvch commented Mar 6, 2019

Reopened.

@liqw33d

liqw33d commented Mar 18, 2019

It looks like coreos/go-systemd#279 fixes this problem.
I re-compiled journalbeat with this patch applied and now there are no problems when the systemd journal is rotated. The test has been running for ~2 days and all messages were processed.
No other regressions were found during this testing.
The tested system is Ubuntu 16.04 (xenial) and Go is 1.11.

@kvch
Contributor

kvch commented Mar 19, 2019

How did you apply the patch? We use github.com/coreos/go-systemd/sdjournal, not github.com/coreos/go-systemd/journal. Did you copy the latter package as well?
In the meantime I am looking into it.

@kvch
Contributor

kvch commented Mar 19, 2019

Also, what's your systemd version?

@kvaps

kvaps commented Mar 19, 2019

@kvch I just replaced the whole go-systemd directory with a newer version.
You can try one of my docker images:

kvaps/journalbeat:7.0.0-fix9533
kvaps/journalbeat:8.0.0-fix9533
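
Roughly, the rebuild looks like this (a sketch assuming a GOPATH-era checkout of elastic/beats; the exact paths and make targets are assumptions and may differ):

# Sketch only: swap the vendored go-systemd for the patched upstream,
# then rebuild journalbeat.
cd "$GOPATH/src/github.com/elastic/beats"
rm -rf vendor/github.com/coreos/go-systemd
git clone https://github.com/coreos/go-systemd vendor/github.com/coreos/go-systemd
make -C journalbeat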

@kvaps

kvaps commented Mar 19, 2019

Agree. This fix is working fine for me!

what's your systemd version?

- systemd 229 (xenial)
- systemd 232 (stretch)
- systemd 237 (bionic)
- systemd 241 (stretch)
go version go1.11.5 linux/amd64
journalbeat version 8.0.0 (amd64), libbeat 8.0.0 [258c1c8419ec945967f8a9e8e9d92e166f0bb3f7 built 2019-03-19 01:41:30 +0000 UTC]

@liqw33d

liqw33d commented Mar 20, 2019

Also, what's your systemd version?

The systemd version is:
229-4ubuntu21.17 (xenial)
I applied the mentioned diff to go-systemd (master) and compiled journalbeat (master) with its dependencies.
I don't have exact steps to reproduce it now.

  "system_info": {
    "build": {
      "commit": "005bb608b010fa777db527738ad27b5f40086986",
      "libbeat": "8.0.0",
      "time": "2019-03-16T22:51:04.000Z",
      "version": "8.0.0"
    }
  }

@vquie

vquie commented Mar 20, 2019

When will this be available through the official elastic repo?

@kvch
Contributor

kvch commented Mar 21, 2019

@liqw33d @kvaps Thank you for verifying the fix!

@vquiering Once the fix is released in a new version of github.com/coreos/go-systemd, I will update the vendored folder.

@hamelg

hamelg commented Mar 21, 2019

I re-compiled journalbeat with this patch applied and now there are no problems when the systemd journal is rotated. The test has been running for ~2 days and all messages were processed.

So the fix is to recompile journalbeat? The patch you mentioned only touches github.com/coreos/go-systemd/journal (for writing to systemd's logging service, journald), while journalbeat uses github.com/coreos/go-systemd/sdjournal (for reading from journald by wrapping its C API). Please, can you clarify?

@kvaps

kvaps commented Mar 21, 2019

@hamelg yes, exactly: you just need to recompile journalbeat with the updated go-systemd

@stanxing

stanxing commented Mar 22, 2019

yes, exactly: you just need to recompile journalbeat with the updated go-systemd

@kvaps I'm a little confused. Why does it work to update a package that journalbeat doesn't depend on?

@kvch
Contributor

kvch commented Mar 22, 2019

This is what I am investigating. My hunch is that the init function of that package connects to the systemd journal's file descriptor, which lets Journalbeat see when, and what, changes are made to the journal it is reading from.

@kvch
Contributor

kvch commented Mar 22, 2019

I did more investigation on the matter. It turns out that the problem was fixed in Journalbeat, not in its dependency github.com/coreos/go-systemd/sdjournal. The upcoming 6.7 and 7.0.0-rc releases work with systemd v232.

Do you mind verifying my findings once the release is available?

@stanxing

The upcoming 6.7 and 7.0.0-rc releases work with systemd v232.

If I may ask, when are the 6.7 and 7.0.0-rc releases due?

@praseodym
Contributor

6.7 has just been released.

@stanxing

stanxing commented Mar 27, 2019

The 6.7 release works fine for me on Ubuntu 16.04 with systemd version 229. But I found some errors like the following:

2019-03-27T17:52:36.001+0800    ERROR   [input] input/input.go:167      Error while reading event: failed to get realtime timestamp: 99  {"id": "a6a84c06-f11d-4f37-9d8f-3251d731d235"}

Does anyone have ideas? My feeling is that it happens when the journal rotates.

@hamelg

hamelg commented Mar 27, 2019

I confirm 6.7 has fixed the issue on CentOS 7.6 with systemd version 219.

@hamelg

hamelg commented Mar 27, 2019

But I found some errors like the following:

2019-03-27T17:52:36.001+0800    ERROR   [input] input/input.go:167      Error while reading event: failed to get realtime timestamp: 99  {"id": "a6a84c06-f11d-4f37-9d8f-3251d731d235"}

Same here; I see this error each time the journal rotates.

@kvch
Contributor

kvch commented Mar 27, 2019

@hamelg Do you mind opening a separate issue for that? Thanks for the confirmation.

@kvch
Contributor

kvch commented Mar 27, 2019

Fixed in 6.7 and on master.

@kvch kvch closed this as completed Mar 27, 2019
NicholasPCole added a commit to NicholasPCole/droplet-connectivity-testing that referenced this issue Mar 29, 2019
This should no longer be necessary with the 6.7.0 release, per
elastic/beats#9533.
DStape pushed a commit to DStape/beats that referenced this issue Aug 20, 2019
Previously sd_journal_wait was not used. From now on all changes to journals are detected.

I also added custom seccomp policy to Journalbeat.

Closes elastic#9533
DStape pushed a commit to DStape/beats that referenced this issue Aug 20, 2019
…ournal entries change (elastic#10485)

Cherry-pick of PR elastic#9994 to 6.x branch. Original message:

Previously sd_journal_wait was not used. From now on all changes to journals are detected.

I also added custom seccomp policy to Journalbeat.

Closes elastic#9533