
Add content-based hash to the filebeat registry for detecting inode reuse #11277

Closed
urso opened this issue Mar 16, 2019 · 4 comments

Comments


urso commented Mar 16, 2019

Filebeat currently tracks file identity by inode and device id. When a file is opened, we check whether it is a new or an old file based on this metadata. If we detect that it is an old file, we jump to the old offset.
Unfortunately, some scenarios and distributions create false positives, such that we assume a new file is actually an old file and jump right into the middle of the new file. As a result, logs might be lost or we might produce parse errors.

So to mitigate this, we could store a hash value based on the file's first N bytes in addition to the inode and device id. This is not meant as a replacement for inode + device id, but as an extension, so as to detect inode reuse.
The number of bytes must not be too small, so hashing doesn't break if a file header is in place, yet we don't want to process too many bytes either. So to make it somewhat dynamic (and grow the hash) we can store the tuple (inode, device id, hash, N), with N == the number of bytes used to compute the hash. Once the file grows, we can increase N and update the hash, up until 4096 bytes (an arbitrary value ;) ). Because N is part of the tuple, we do not invalidate old tuples.

@jordansissel
Contributor

> to mitigate this we could store a hash value based on the file's first N bytes

@guyboertje did some research on this a year or so ago for the same concern in the Logstash file input. He might have some notes about the implementation(s) he came up with and the weird edge cases (like your example of a file header, etc.).

@kvch kvch self-assigned this Mar 16, 2019
@guyboertje

I am happy to talk about this.

IIRC there are some design challenges around delaying the "hashing" (I used the term fingerprinting at the time) when the content is growing but is still smaller than the size needed for reliable hashing.

I also considered using two smaller N values with one taken from the start and another taken from an offset into the content to act as a tie-breaker (in the event that two files shared the same 'preamble' or a hash collision occurs).

My reading of the internets at the time suggested that the FNV algorithm was a good all-round candidate for a non-cryptographic hash function.


urso commented Mar 25, 2019

@jordansissel @guyboertje Yeah, I remember discussions about having different implementations of file identity, e.g. based on inode, path names only, or fingerprinting. I think these are still interesting for different use cases.

The idea of this issue is to introduce a simple heuristic for inode-based file identity in cases of inode reuse only. A tie-breaker, so to speak.

> there are some design challenges around delaying the "hashing" (I used the term fingerprinting at the time) when the content is growing but smaller than the size needed for reliable hashing.

True. One idea could be for the reader to return a tuple (content, fingerprint, fingerprint-kind, N bytes used). Depending on file size, we can dynamically change the fingerprint-kind. The fingerprint will not change anymore once N == 4096, for example.
FNV should be good enough. One can easily switch from an 8-bit to a 64-bit hash. I'm not really concerned about hash collisions, but more about common file headers (think of a CSV header).

@kvch kvch removed their assignment Oct 7, 2021
@nimarezainia
Contributor

I am not certain this is still a big issue for us, especially since #19990, so I will close it until we have a reason to reopen it.

5 participants