
Add support for network volumes in Filebeat #5876

Open · jsoref opened this issue Dec 14, 2017 · 22 comments
Labels: enhancement · Filebeat · Team:Elastic-Agent


@jsoref (Contributor) commented Dec 14, 2017

  • Version: 5.6.5
  • Operating System: Ubuntu 16.04.3 LTS (xenial)
  • Steps to Reproduce:
  1. Use zfs (compression=lz4) w/ lots of file systems and occasionally create new ones
  2. Have filebeat enabled
    1. Filebeat is configured to look at /path/to/logs/logfile-*.txt
    2. w/ an include_lines: designed to capture a tiny subset of the log file content
    3. w/ the goal that it would be able to pick up any old files it may have missed
  3. Have a bunch of services which
    1. generate log files named /path/to/logs/logfile-STUFF-YYYY-MM-DD.txt
    2. themselves manage the dates
    3. have log files in places where users expect them to be
  4. I'm not
    1. particularly inclined to move them just to make filebeat happy
    2. interested in setting up arbitrary logrotate just to make filebeat happy
  5. add additional volumes to the zpool (using zfs create pool/...)
  6. reboot
  7. Filebeat appears to reprocess the files
    1. ends up getting fairly stuck on the largest ones
      • each is >4GB (uncompressed; they're text log files and compress fairly nicely)
      • none of the other files in the entire directory are
    2. stuck is defined as chewing 100% of a CPU core for incredibly long periods of time
      • so much so that my host suggests I upgrade the processing power of the machine.

Here's some Python 2 I used to look at the registry (because I was curious):

>>> import json
>>> def loadj(n):
...  f=open(n)
...  j=json.load(f)
...  f.close()
...  return j
...
>>> registry=loadj("/var/lib/filebeat/registry")
>>> files83={a['source']:a for a in registry if a['FileStateOS']['device']==83}
>>> files91={a['source']:a for a in registry if a['FileStateOS']['device']==91}
>>> files91_not83 = [a for a in files91 if not a in files83]
>>> files_83_less_91=[[files83[a],files91[a]] for a in files83 if a in files91 and files83[a]['offset'] < files91[a]['offset']]
>>> files_91_less_83=[[files83[a],files91[a]] for a in files83 if a in files91 and files91[a]['offset'] < files83[a]['offset']]
>>> map(len,[files91, files83, files91_not83, files_83_less_91, files_91_less_83])
[2719, 2639, 80, 67, 13]
>>> offset_check=[[a[0]['source'][-14:],a[0]['offset'],a[1]['offset']] for a in files_91_less_83]
>>> offset_check
[[u'2017-11-13.txt', 16381394402, 5175661735], [u'2017-11-07.txt', 8495794954, 5221706109], [u'2017-11-06.txt', 7292885516, 5286674534], [u'2017-12-10.txt', 8619599723, 5246991550], [u'2017-12-11.txt', 10040266323, 5221623440], [u'2017-11-15.txt', 19147829467, 5194955114], [u'2017-12-04.txt', 34748715454, 5155669419], [u'2017-12-09.txt', 7199195022, 5332507314], [u'2017-11-10.txt', 12397086731, 5205001580], [u'2017-11-09.txt', 10995929363, 5209715654], [u'2017-11-11.txt', 13719864719, 5184755763], [u'2017-12-08.txt', 5778746908, 5321055999], [u'2017-11-08.txt', 9710451263, 5220711257]]

Essentially, at some point in time my log files were on a volume which was assigned deviceid 83. At some point the system rebooted, and now the deviceid for the same volume is 91. After that point, the system ran for a bit over a day and 80 new files (~75 from the second day) appeared.

The files I care most about are the ones in offset_check -- they're all >4GB, and filebeat had them all open. I believe it was slowly making progress through them, but it was doing a really bad job of it.

My understanding is that device ids are not guaranteed past reboot, and any process trying to use them past that point is "doing it wrong". I believe that filebeat is in this category.

Expected result (conceptually):

if [ /proc/1 -nt /var/lib/filebeat/registry ]; then
 # /proc/1 is recreated at boot, so it being newer than the registry
 # means the machine has rebooted since the registry was last written
 echo "Do not rely on Device ID or INode as they are no longer meaningful"
fi

I don't have any particular opinion about what one should do if a volume is unmounted and remounted while filebeat is running. Offhand, I think that if the device id changes you probably can't rely on the inode either. Personally, I'd expect the process to consider the datestamp of the file: if the file hasn't changed since the last time it was seen, then it should normally be treated as the same file. If the file has a different inode but the same device id as the last time something looked, then it's reasonable to think it may have changed.
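A minimal sketch of that heuristic in Python (the function names, parameters, and thresholds here are hypothetical, not Filebeat's actual code):

import os

PROC_BOOT_MARKER = "/proc/1"  # recreated at boot, so its mtime approximates boot time

def device_ids_trustworthy(registry_path):
    # If the registry predates the last boot, recorded device ids (and
    # quite possibly inodes) may have been reassigned since.
    return os.stat(registry_path).st_mtime >= os.stat(PROC_BOOT_MARKER).st_mtime

def looks_like_same_file(path, recorded_mtime, recorded_size):
    # Fallback identity check for when device/inode can't be trusted:
    # a file whose mtime is unchanged since it was last seen, and which
    # hasn't shrunk below the recorded size, is treated as the same file.
    st = os.stat(path)
    return st.st_mtime == recorded_mtime and st.st_size >= recorded_size

The idea is simply to gate device/inode comparisons on whether the registry survived a reboot, and fall back to file metadata when it didn't.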

exekias added the bug and Filebeat labels on Dec 14, 2017
@exekias (Contributor) commented Dec 14, 2017

I guess this is happening because ZFS sets a new device ID when mounting the partition?

@jsoref (Contributor, Author) commented Dec 14, 2017

Each volume gets a device ID on a first-come, first-served basis; device IDs are assigned dynamically by the running system without any persistence.

The implementation here is linear (1, 2, 3, 4, ...), but there's really no particular requirement for that, and absolutely nothing requires persistence across boots.

@jsoref (Contributor, Author) commented Dec 17, 2017

I should probably also note that you're only recording the minor id, but that alone doesn't identify a device: even on a running system, you need major+minor to uniquely identify a device.
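For reference, the major/minor pair (and inode) a path currently maps to can be read like this in Python (the log path is just the example from above):

import os

st = os.stat("/path/to/logs/logfile-2017-11-13.txt")
print(os.major(st.st_dev), os.minor(st.st_dev), st.st_ino)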

@ruflin (Contributor) commented Dec 19, 2017

I think this is an issue specific to ZFS and not a general "doing it wrong" ;-) So far it has worked really well for most file systems, where the combination of device + inode is the unique identifier of a file (Windows has 3 identifiers). The issue you are describing reminds me a lot of some "interesting" behaviour on shared file systems, and is the reason we recommend installing filebeat on the edge node. But for ZFS I think this can't be applied.

The main question is which methods we have to identify a file over its lifetime:

  1. Based on path
  2. Based on unique identifiers
  3. Based on fingerprint

All have their pros and cons. Currently we do option 2 with the most common identifiers. We also discussed options 1 and 3 in the past. Option 1 would work really well in cases where files are not moved / rotated / renamed and the file name is the unique identifier. Option 3 we discussed for cases where the unique identifiers from option 2 do not stay the same, as on shared volumes: we would identify a file based on a hash of a subset of the content of the file. One additional option you brought up above is to take option 2 but only enforce a subset of the identifiers.

It seems for your case only option 3 would work, as the path of the same file changes over time?
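As a rough illustration of option 3, and only a sketch rather than anything Filebeat implements, a fingerprint could be a hash of a fixed-length prefix of the file:

import hashlib

def fingerprint(path, length=1024):
    # Hash the first `length` bytes of the file. Files shorter than
    # `length` can't be fingerprinted reliably yet: their prefix is
    # still growing, so the hash would change as they are written.
    with open(path, "rb") as f:
        head = f.read(length)
    if len(head) < length:
        return None
    return hashlib.sha256(head).hexdigest()

The obvious trade-off is that freshly created files shorter than the prefix length can't be identified yet, and anything that rewrites the prefix makes the file look new.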

ruflin added the enhancement label and removed the bug label on Dec 19, 2017
@ruflin (Contributor) commented Dec 19, 2017

@exekias I removed the bug label and changed it to enhancement, as the above behaviour is, from my point of view, expected and by design. I was not aware that ZFS behaves like the above, so we should probably add a note about this to our docs.

@jsoref (Contributor, Author) commented Dec 19, 2017

This isn't limited to zfs.
If you use Google Compute Engine (or AWS, or Azure, or Linode, or...) and dynamically add/remove physical storage, the device nodes you get will be reassigned minor ids. It should also happen with classical hot plugging.

I just tested w/ GCE. I had a /dev/sda1 (/); I attached a disk (which provided /dev/sdb1) and mounted it as /media. Then I unmounted it, created a new disk, attached it, partitioned+formatted it, and mounted it as /media (it became /dev/sdb1 and had the same major/minor as the previous /dev/sdb1). Then I reattached the previous disk (it became /dev/sdc1) and mounted it as /mnt; this time it had a new minor number (because the newer disk was sitting in its old minor slot ...).
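One way to watch that reassignment during such an experiment, a small Python sketch using the mount points from the test above (run it before and after each remount):

import os

for mountpoint in ("/", "/media", "/mnt"):
    st = os.stat(mountpoint)
    print(mountpoint, os.major(st.st_dev), os.minor(st.st_dev))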

Really, anything that involves any amount of dynamism is problematic.

[It would probably apply to nfs, samba, afs, fuse, but you'd wave that off...]

@ruflin (Contributor) commented Dec 20, 2017

Thanks a lot for putting in all this effort and testing the different systems. It's definitely a problem we need to start tackling more actively in filebeat. I think the most important step on our side is to make it pluggable how files are compared, so that any of the 3 options mentioned above could be used and more could be added. First steps have been made, but we are not there yet.

You are definitely right about the other network systems you mentioned. We are aware of this limitation, see https://www.elastic.co/guide/en/beats/filebeat/current/faq.html#filebeat-network-volumes. Let me quickly explain why I replaced the bug label with an enhancement label, in case that concerns you. If we treated it as a bug, it would mean we should / must backport fixes to older branches because it is broken. But from a support perspective we know it doesn't work and we don't recommend it. That doesn't mean we should not add this feature.

@jsoref I suggest renaming the title to something like "Add support for network volumes in Filebeat" or "Add additional file identification mechanisms to Filebeat" to be more explicit about what we need to add.
For your specific ZFS use case, would it be enough to just ignore the device id or can the inode also change?

jsoref changed the title from "/var/lib/filebeat/registry relies on deviceid inappropriately" to "Add support for network volumes in Filebeat" on Dec 20, 2017
@jsoref (Contributor, Author) commented Dec 20, 2017

I don't have a particularly good sense of the story for inodes.

https://lists.freebsd.org/pipermail/freebsd-hackers/2010-February/030746.html seems to indicate that inodes are figments created as requested.

https://github.com/zfsonlinux/zfs is the project that handles the Linux kernel implementation.
#zfsonlinux on freenode.

Afaict, these things are probably moderately reliable until a computer reboots or a device detaches, and entirely unreliable after either of those events occurs. Again, the algorithm I suggested, of relying on these pieces of information only up to the point where the system has rebooted, should work for most cases.

I don't have any advice for how to deal w/ the case where physical devices come/go while a system is running. And, fwiw that's probably going to happen much more often. (This morning we hot swapped a disk on a physical server because it failed. This afternoon I started talking about plans for various migrations between systems, some models could involve ejecting disks and moving them to other systems.)

@ruflin (Contributor) commented Dec 27, 2017

@jsoref Thanks for updating the title. Swapping the physical disk is also a very interesting use case. I'm kind of surprised we haven't been hit by this issue yet (or we just didn't hear of it).

@ph FYI, as you are currently doing quite a bit of cleanup / work on filebeat.

@ghost commented May 29, 2018

@ruflin Any update on this?

We face the same issue with nfs volumes mounted in a Kubernetes pod.

@ph (Contributor) commented May 29, 2018

@bquartier It's still something we are planning to do to improve our story on shared FS; we haven't started working on it yet.

@jsoref (Contributor, Author) commented May 29, 2018

@bquartier: thanks for the note (we're considering playing w/ kubernetes, so you've saved me a check...)

@mcwm6 commented May 20, 2019

@ruflin Any updates on this enhancement?

@botelastic bot commented Jul 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

botelastic bot added the Stalled and needs_team labels on Jul 8, 2020
@jsoref (Contributor, Author) commented Jul 8, 2020

that's very unhelpful.

botelastic bot removed the Stalled label on Jul 8, 2020
jsoriano added the Team:Elastic-Agent label on May 10, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/agent (Team:Agent)

botelastic bot removed the needs_team label on May 10, 2021
@botelastic bot commented May 10, 2022

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

botelastic bot added the Stalled label on May 10, 2022
@jsoref (Contributor, Author) commented May 10, 2022

I see no evidence that anything has actually improved and would rather a human point to actual progress.
🗿

@botelastic bot commented May 10, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

botelastic bot added the Stalled label on May 10, 2023
@jsoref (Contributor, Author) commented May 10, 2023

I see no evidence that anything has actually improved and would rather a human point to actual progress.
🗿

botelastic bot removed the Stalled label on May 10, 2023
@botelastic bot commented May 9, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

botelastic bot added the Stalled label on May 9, 2024
@jsoref (Contributor, Author) commented May 9, 2024

I see no evidence that anything has actually improved and would rather a human point to actual progress.
