Identity tracking of files in Filebeat inputs #13492
Comments
We have an issue, but it doesn't involve a network or shared volume. The volume is local and does not change, but the instance does. USE CASE: We use terraform to script our AWS infrastructure. The log files and filebeat repo are stored on a non-root volume, e.g. /data, mounted on a device that accesses an AWS volume different from the root volume. When the instance is rebuilt, the non-root AWS volume is preserved and remounted to the same dir on the new instance. So inodes are stable, but the device ID changes when the instance is rebuilt. In my use case, inode+device ID is problematic, and simply using the inode would be sufficient. |
Using only the inode can lead to false positives when users have multiple mount points to different devices. The inode+fingerprint option would use fingerprinting to handle possible conflicts. Given that the interface is pluggable, users can provide other implementations if required. Regarding fingerprinting complexity: the question is what 'n' is. It has not been discussed yet how the fingerprint will be computed.
I wonder where/how this would be used. We also need to compute the |
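To make the trade-off between the identity options concrete, here is a minimal sketch in Go. The `fileState` type and the three key functions are hypothetical names made up for illustration, not Filebeat code, and the fingerprint is treated as an opaque string because the thread leaves its computation open.

```go
package main

import "fmt"

// fileState is a hypothetical record of what is known about a tracked file.
type fileState struct {
	Inode       uint64
	Device      uint64
	Fingerprint string // opaque here; how it is computed is left open in the thread
}

// inodeKey identifies a file by inode only. It survives a device ID change,
// but two files on different devices can share the same inode number.
func inodeKey(s fileState) string {
	return fmt.Sprintf("inode:%d", s.Inode)
}

// inodeDeviceKey identifies a file by inode and device ID. It avoids
// cross-device collisions, but changes whenever the device ID changes.
func inodeDeviceKey(s fileState) string {
	return fmt.Sprintf("inode:%d-device:%d", s.Inode, s.Device)
}

// inodeFingerprintKey identifies a file by inode and fingerprint, using the
// fingerprint to resolve inode collisions without relying on the device ID.
func inodeFingerprintKey(s fileState) string {
	return fmt.Sprintf("inode:%d-fingerprint:%s", s.Inode, s.Fingerprint)
}
```

With these keys, the same file seen before and after an instance rebuild (same inode, new device ID) keeps its `inodeKey` and `inodeFingerprintKey` but gets a new `inodeDeviceKey`, while two unrelated files that happen to share an inode number collide only under `inodeKey`.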
Your point is valid, @urso. My use case is a specific case where the device id is not stable, but I think it should be. The simple solution is to ignore the device id, since there is only one device. I think that the device id(s) should be the same if the same volume(s) are attached. Perhaps there could be an option on Linux to use a UUID or PARTUUID as the device id, or at least to generate the device id from it? (see the bash command blkid) |
Good idea. We should look into this. Can you check if the UUID stays stable for your use case? |
Yes, in my case, the UUID is stable. To illustrate: I terminated the AWS instance, and restarted it. ( BEFORE - instance "A":
AFTER - instance "B":
Note: using terraform, I can guarantee that the same AWS volume will be used when the AWS instance is terminated. Since I don't run fdisk on the volume on subsequent startups, the UUID remains the same. |
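As a rough illustration of the UUID idea, the sketch below looks up a filesystem UUID for a block device in Go. It assumes a Linux system where udev maintains symlinks under /dev/disk/by-uuid, which is common but not guaranteed; the function name `deviceUUID` and its parameters are hypothetical and not part of Filebeat.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// deviceUUID scans byUUIDDir (assumed to be /dev/disk/by-uuid on Linux) for a
// symlink that resolves to the given device and returns the link name, which
// is the filesystem UUID.
func deviceUUID(byUUIDDir, device string) (string, error) {
	entries, err := os.ReadDir(byUUIDDir)
	if err != nil {
		return "", fmt.Errorf("listing %s: %w", byUUIDDir, err)
	}

	// Resolve the device path as well, in case it is itself a symlink.
	want, err := filepath.EvalSymlinks(device)
	if err != nil {
		want = device
	}

	for _, entry := range entries {
		target, err := filepath.EvalSymlinks(filepath.Join(byUUIDDir, entry.Name()))
		if err != nil {
			continue // skip dangling or unreadable links
		}
		if target == want {
			return entry.Name(), nil
		}
	}
	return "", fmt.Errorf("no UUID link for %s in %s", device, byUUIDDir)
}
```

A caller could then combine the returned UUID with the inode in place of the device ID. Whether such links exist for NFS or CIFS mounts is a separate question that this sketch does not answer.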
Thank you for testing. This looks very promising. I think we should also investigate whether we can use the UUID by default and have an automated upgrade path for users coming from an older registry file. I'm not really sure about the UUID in the presence of NFS or CIFS, especially if the server is Windows or a very old system. Anyway, inode+UUID should definitely be an option. |
I added it to the list of possible options. But it still needs more investigation. @johnhoughton-v Thank you for the suggestion! |
I would also like to chime in on this issue. Until now we do not know exactly why the device ID changes in our case, but the description @johnhoughton-v provided in the comment above sounds reasonable for our case as well (even though we are not on AWS but on the infrastructure of a local cloud provider). I will try to find out if the UUID generated by
I have a suggestion for another approach for a device ID alternative. What about a special marker file (on *nix this could be a hidden file), which sits next to the files that are indexed by filebeat (or maybe filebeat could traverse the path up to the root to find such a file)? The content would be a unique id that is used instead of the device ID. For the situations @johnhoughton-v and we face, this would solve the problem as well, because in our cases the block device with all its contents stays the same, but the system (maybe due to a rebuild of the instance) generates a new device ID for the same block device. Therefore, this marker file would be sufficient to identify the block device / file system as the same. |
For reference, related issues:
As well as discussions on discuss: |
I kindly ask again, how I can help to get some progress on this issue. |
We just had a significant production incident caused by this issue, so I decided to check back in to see if there has been any progress. Can anyone provide an update on where this stands? |
@johnhoughton-v For me this topic is also still relevant, but there has been no progress, no reaction, nothing for quite some time, even though I provided a PR and offered to help. So I do not really have any hope on this topic. |
Creating a marker file sounds interesting, but would be a little difficult to coordinate between multiple inputs with the current architecture (I think we can improve on this in the future). For now it might be easier if we ask for the marker file to exist already before we start collecting. All in all I'd prefer if Beats would not need write access to files or directories. |
Why not make the marker file something that the user provides? Then no write access is necessary.
If we provided the name of the file in the config, Filebeat could read the contents of that file.
We could populate that file with some unique value, like a UUID of the device. The marker file could be read once on startup of Filebeat.
… On May 25, 2020, at 9:23 AM, Steffen Siering ***@***.***> wrote:
Creating a marker file sounds interesting, but would be a little difficult to coordinate between multiple inputs with the current architecture (I think we can improve on this in the future). For now it might be easier if we ask for the marker file to exist already before we start collecting. All in all I'd prefer if Beats would not need write access to files or directories.
|
Yeah, this is what I had in mind by saying "For now it might be easier if we ask for the marker file to exist already before we start collecting". A random UUID might be good enough. We will definitely need to document examples on how to create the 'device' file. If the file is missing we would not collect from the directory until it is present. The thing I wonder is: do we want the file to be present in each directory, or just somewhere in the parent directories? |
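The "somewhere in the parent directories" variant of the marker file lookup can be sketched as follows. This is only an illustration under assumptions: the names `findMarker` and `readFromDisk`, and the idea of passing the reader as a function, are hypothetical and do not describe how Filebeat implements or would implement this.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// readFromDisk reads a candidate marker file and reports whether it exists
// and could be read. No write access is needed.
func readFromDisk(path string) (string, bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", false
	}
	return string(data), true
}

// findMarker starts in dir and walks up towards the filesystem root until it
// finds a file called name with non-empty content. The trimmed content is
// returned as the stable ID to use in place of the device ID.
func findMarker(dir, name string, read func(path string) (string, bool)) (string, bool) {
	dir = filepath.Clean(dir)
	for {
		if content, ok := read(filepath.Join(dir, name)); ok {
			if id := strings.TrimSpace(content); id != "" {
				return id, true
			}
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			// Reached the root without finding a marker: the caller
			// would not collect from this directory yet.
			return "", false
		}
		dir = parent
	}
}
```

A call such as `findMarker("/data/logs/app", ".marker", readFromDisk)` would check /data/logs/app, then /data/logs, then /data, then /. Requiring the file in each directory instead would simply mean checking the first candidate only.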
I think that a marker/sentinel file would be fine from our/the user end. Here are a few thoughts:
|
I have started to work on this feature. The first PR is still in progress, but it can be tracked here: #18748 |
can we close this issue? |
Follow up in #19990 |
@kvch Thank you for your effort. This new feature is highly appreciated and I am looking forward to testing it in our environment. |
@breml Thank you for your kind words! I am looking forward to your feedback. |
The current file identification of Filebeat is limited and does not support network shares well.
Right now inode and device id are used to tell files apart. But from time to time device id changes on such shares, so Filebeat rereads already processed files.
As there are many options to track file identity and there is no silver bullet to fit all use cases, this should be configurable.
Possible choices:
All choices have their advantages and disadvantages:
Suggested configuration format (by @urso):
As there might be different requirements in special use cases, we intend to provide a pluggable interface so users can write their own identity tracker.
I propose the following interface:
where SameFile is able to support both the existing os.SameFile check and/or fingerprinting the contents of the files.
Further issues we need to address:
@urso @ph @faec WDYT?
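The interface proposed in this issue did not survive in the text above, so the following is not a reconstruction of it. It is only a hedged sketch of how a SameFile-style comparison could cover both the os.SameFile check and fingerprinting; the names `FileState`, `Identifier`, `osIdentifier`, `fingerprintIdentifier`, `allOf` and `anyOf` are all hypothetical.

```go
package main

import "os"

// FileState is a hypothetical snapshot of a file as seen by an input.
type FileState struct {
	Info        os.FileInfo // may be nil if the file was not stat'ed
	Fingerprint string      // may be empty if no fingerprint was computed
}

// Identifier is a hypothetical pluggable identity tracker.
type Identifier interface {
	SameFile(a, b FileState) bool
}

// osIdentifier delegates to the existing os.SameFile check.
type osIdentifier struct{}

func (osIdentifier) SameFile(a, b FileState) bool {
	if a.Info == nil || b.Info == nil {
		return false
	}
	return os.SameFile(a.Info, b.Info)
}

// fingerprintIdentifier compares content fingerprints only.
type fingerprintIdentifier struct{}

func (fingerprintIdentifier) SameFile(a, b FileState) bool {
	return a.Fingerprint != "" && a.Fingerprint == b.Fingerprint
}

// allOf reports a match only if every wrapped identifier agrees ("and").
type allOf []Identifier

func (ids allOf) SameFile(a, b FileState) bool {
	for _, id := range ids {
		if !id.SameFile(a, b) {
			return false
		}
	}
	return len(ids) > 0
}

// anyOf reports a match if at least one wrapped identifier agrees ("or").
type anyOf []Identifier

func (ids anyOf) SameFile(a, b FileState) bool {
	for _, id := range ids {
		if id.SameFile(a, b) {
			return true
		}
	}
	return false
}
```

Composing small identifiers this way is one possible design for the "and/or" requirement; a single implementation with internal flags would work as well.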