
Add content-based hash to the filebeat registry for detecting inode reuse #11277

Closed
urso opened this issue Mar 16, 2019 · 4 comments

Comments


urso commented Mar 16, 2019

Filebeat currently tracks file identity by inode and device id. When a file is opened, we check whether it is a new or an old file based on this metadata. If we detect that it is an old file, we jump to the old offset.
Unfortunately, some scenarios and distributions create false positives, such that we assume a new file is actually an old file and jump right into the middle of the new file. As a result, logs might be lost or we might produce parse errors.

So to mitigate this, we could store a hash value based on the file's first N bytes in addition to the inode and device id. This is not meant as a replacement for inode + device id, but as an extension, so as to detect inode reuse.
The number of bytes must not be too small, so hashing doesn't break if a file header is in place, yet we don't want to process too many bytes either. So to make it somewhat dynamic (and grow the hash) we can store the tuple (inode, device id, hash, N), with N == the number of bytes used to compute the hash. Once the file grows, we can increase N and update the hash, up until 4096 bytes (an arbitrary value ;) ). Because N is part of the tuple, we do not invalidate old tuples.

@jordansissel
Contributor

> to mitigate this we could store a hash value based on the file's first N bytes

@guyboertje did some research on this a year or so ago for the same concern in the Logstash file input. He might have some notes about the implementation(s) he came up with and the weird edge cases (like your example of a file header, etc.).

@kvch kvch self-assigned this Mar 16, 2019
@guyboertje

I am happy to talk about this.

IIRC there are some design challenges around delaying the "hashing" (I used the term fingerprinting at the time) when the content is growing but is still smaller than the size needed for reliable hashing.

I also considered using two smaller N values with one taken from the start and another taken from an offset into the content to act as a tie-breaker (in the event that two files shared the same 'preamble' or a hash collision occurs).

My reading of the internets at the time suggested that the FNV algorithm was a good all-round candidate for a non-cryptographic hash function.


urso commented Mar 25, 2019

@jordansissel @guyboertje Yeah, I remember discussions about having different implementations of file identity, e.g. based on inode, path names only, or fingerprinting. I think these are still interesting for different use cases.

The idea of this issue is to introduce a simple heuristic for inode-based file identity in cases of inode reuse only. A tie-breaker, so to speak.

> there are some design challenges around delaying the "hashing" (I used the term fingerprinting at the time) when the content is growing but smaller than the size needed for reliable hashing.

True. One idea could be for the reader to return a tuple (content, fingerprint, fingerprint-kind, N bytes used). Depending on file size, we can dynamically change the fingerprint-kind. The fingerprint will not change anymore once N == 4096, for example.
FNV should be good enough. One can easily switch from an 8-bit to a 64-bit hash. I'm not really concerned about hash collisions, but more about common file headers (think of a CSV header).

@kvch kvch removed their assignment Oct 7, 2021
@nimarezainia
Contributor

I am not certain this is still a big issue for us, especially since #19990, so I will close it until we have a reason to reopen it.

5 participants