Add content-based hash to the filebeat registry for detecting inode reuse #11277
@guyboertje did some research on this a year or so ago for the same concern in the Logstash file input. He might have some notes about the implementation(s) he came up with and weird edge cases (like your example of a file header, etc).
I am happy to talk about this. IIRC there are some design challenges around delaying the "hashing" (I used the term fingerprinting at the time) when the content is growing but still smaller than the size needed for reliable hashing. I also considered using two smaller N values, one taken from the start and another taken from an offset into the content, to act as a tie-breaker (in the event that two files share the same 'preamble' or a hash collision occurs). My reading of the internets at the time seemed to indicate that the FNV algorithm was a good all-round candidate for a non-crypto hash function.
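The FNV idea above can be sketched with Go's standard `hash/fnv` package. The `fingerprint` helper, the 16-byte prefix length, and the "report false until enough content exists" behavior are illustrative assumptions for this discussion, not code from Logstash or Filebeat:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fingerprint returns an FNV-1a (64-bit) hash of the first n bytes of data.
// If the file has not yet grown to n bytes, it reports false so the caller
// can delay fingerprinting — the design challenge mentioned above.
// Hypothetical sketch; names and sizes are illustrative.
func fingerprint(data []byte, n int) (uint64, bool) {
	if len(data) < n {
		return 0, false
	}
	h := fnv.New64a()
	h.Write(data[:n])
	return h.Sum64(), true
}

func main() {
	head := []byte("2019-02-14T10:00:01Z service starting\n")
	if fp, ok := fingerprint(head, 16); ok {
		fmt.Printf("fingerprint: %016x\n", fp)
	}
}
```

A two-value tie-breaker as described would simply call `fingerprint` a second time over a slice starting at some fixed offset into the content.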
@jordansissel @guyboertje Yeah, I remember discussions about having different implementations of file identity, e.g. based on inode, path names only, or fingerprinting. I think these are still interesting for different use cases. The idea of this issue is to introduce a simple heuristic for inode-based file identity in cases of inode reuse only. A tie-breaker, so to speak.
True. One idea for the reader could be to return a tuple.
I am not certain this is still a big issue for us, especially since #19990, so I will close it until we have a reason to reopen it.
Filebeat currently tracks file identity by inode and device ID. When a file is opened, we check whether it is a new or an old file based on this metadata. If we detect an old file, we jump to the stored offset.
Unfortunately, some scenarios and distributions create false positives: we assume a new file is actually an old file and jump right into the middle of the new file. As a result, logs might be lost or parse errors produced.
To mitigate this, we could store a hash of the file's first N bytes in addition to the inode and device ID. This is not meant as a replacement for inode + device ID, but as an extension, to detect inode reuse.
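A minimal sketch of such a check, assuming a hypothetical `RegistryEntry` record and FNV-1a as the hash function (this is not Filebeat's actual registry schema or code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// RegistryEntry is a hypothetical registry record extended with a
// content hash; field names are illustrative only.
type RegistryEntry struct {
	Inode, Device uint64
	Hash          uint64 // FNV-1a over the file's first N bytes
	N             int    // number of bytes the hash covers
	Offset        int64  // last read position
}

func hashPrefix(data []byte, n int) uint64 {
	h := fnv.New64a()
	h.Write(data[:n])
	return h.Sum64()
}

// sameFile reports whether the on-disk file matches the registry entry.
// Matching inode+device alone is not enough: if the stored prefix hash
// differs, the inode was most likely reused by a new file.
func sameFile(e RegistryEntry, inode, device uint64, head []byte) bool {
	if e.Inode != inode || e.Device != device {
		return false
	}
	if len(head) < e.N {
		return false // file is shorter than the hashed prefix: treat as new
	}
	return hashPrefix(head, e.N) == e.Hash
}

func main() {
	old := []byte("old file contents that were rotated away")
	e := RegistryEntry{Inode: 42, Device: 1, N: 16, Hash: hashPrefix(old, 16), Offset: 120}
	fmt.Println(sameFile(e, 42, 1, old))                                       // same file: resume at Offset
	fmt.Println(sameFile(e, 42, 1, []byte("brand new file reusing inode 42"))) // inode reuse: start at 0
}
```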
The number of bytes must not be too small, so that hashing still works if a file header is present, yet we don't want to process too many bytes either. To make it somewhat dynamic (and let the hash grow), we can store the tuple (inode, device id, hash, N), with N being the number of bytes used to compute the hash. As the file grows we can increase N and update the hash, up to 4096 bytes (arbitrary value ;) ). Because N is part of the tuple, we do not invalidate old tuples.
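The growing-hash scheme could look roughly like the following sketch; `entry`, `growHash`, and the cap constant are assumed names for illustration, not part of any actual Filebeat implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const maxHashBytes = 4096 // arbitrary cap, as in the proposal

// entry mirrors the proposed registry tuple (inode, device id, hash, N).
// Names are illustrative, not Filebeat's actual registry schema.
type entry struct {
	Inode, Device uint64
	Hash          uint64
	N             int
}

func hashPrefix(data []byte, n int) uint64 {
	h := fnv.New64a()
	h.Write(data[:n])
	return h.Sum64()
}

// growHash re-fingerprints over a longer prefix once the file has grown,
// capped at maxHashBytes. Because N is stored with the hash, an older
// (inode, device, hash, N) tuple stays valid: it can always be re-checked
// against the first N bytes of the file.
func growHash(e *entry, head []byte) {
	n := len(head)
	if n > maxHashBytes {
		n = maxHashBytes
	}
	if n <= e.N {
		return // prefix already covered; nothing to update
	}
	e.N = n
	e.Hash = hashPrefix(head, n)
}

func main() {
	e := entry{Inode: 42, Device: 1}
	growHash(&e, []byte("first line\n"))
	fmt.Println(e.N) // hash covers the 11 bytes seen so far
	growHash(&e, []byte("first line\nsecond line\n"))
	fmt.Println(e.N) // grown to cover 23 bytes
}
```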