The problem with fingerprinters #2701

MOZGIII · 2020-05-28T13:17:08Z

Currently, we have two file fingerprinters:

Fingerprinter::DevInode - uses file inode number to compute the checksum
Fingerprinter::Checksum - uses a few first bytes from the file to compute the checksum

You'd think the Fingerprinter::DevInode is the preferred one, but its usage can cause problems: upon the log file rotation, if the file is truncated instead of moved and recreated, and then written from scratch, the inode fingerprinter will fail to properly distinguish between the two. This will lead to incorrect behavior at the checkpoints, and the system will stop reading the logs.

Fingerprinter::Checksum mostly works, but it has a nasty problem - when files are too small, the fingerprinter will skip them, and we won't ever get data from them. This is usually not a big deal when collecting logs from the long-running services - they usually log plenty, but it doesn't work for one-time jobs that write just a tiny summary of the operation result to the log file.

Currently, there's no solution we can offer that would just work.

Let's discuss our options to solve this problem.

Potentially relevant issues: #2163.

The text was updated successfully, but these errors were encountered:

lukesteensen · 2020-05-28T14:42:30Z

Checksum is the preferred strategy, not DevInode. It does require by default that files are at least 256 bytes, but that is configurable.

MOZGIII · 2020-05-28T15:08:43Z

Do you mean there's no issue here?

Checksum is the preferred strategy, not DevInode.

Why is that btw? Is it due to issues with the DevInode I outlined above? I genuinely assume using inodes would be better designed here, assuming we solve the truncate problem.

It does require by default that files are at least 256 bytes, but that is configurable.

Reducing the byte size of the Fingerprinter::Checksum might not be a practical solution for some cases. Obviously, using the size of 0 disables the fingerprinter altogether - and some users might go with that setting, since they care about files being picked up, but not so much about what this setting means and why it has to be properly set.
The value of 26, which is what we get when we do date --rfc-3339=s | wc -c seems like a more viable option for logs (date is likely to be in every line-delimited log file format, so even a single log line will work). Maybe we should adjust our default. But still, logs for scripts that write just ok or done to the logs will never be picked up.

We have a solution that requires manual configuration and serious insight. Users have to estimate the meaningful amount of bytes that contribute to the checksum based on their use case. Incorrect estimates will result in lost log events.
This doesn't seem like a good UX, and it's not in line with Vector's sane defaults policy.
The problem is that it might not even be possible to come up with a single default that would be a good fit. We might need to step up the approach.

binarylogic · 2020-05-28T15:27:38Z

Oh, the good ol' fingerprinting conversation 😄. @MOZGIII we settled on the checksum strategy because:

The behavior is consistent across operating systems and file systems, such as Windows.
inode is full of weird edge case behavior. Ex: inodes can be reused.
checksum doesn't care how you rotate your files. AKA it works with copytruncate.

Therefore, checksum felt like the sane default. Yes, there are edge cases where it won't work, like the ones you mentioned, but those aren't the cases we are typically solving for. We are solving for the typical log file -- Vector is an observability utility.

If you have a silver bullet, I'm all for it. Otherwise, we might consider requiring users to choose a strategy instead of defaulting to one. If you have improvements you'd like to make please offer them in a succinct constructive manner. If we agree there are improvements to be had, we'll need an RFC. Just because this same discussion keeps popping up and an RFC would prevent that going forward.

lukesteensen · 2020-05-28T15:32:47Z

Why is that btw?

Checksums are more direct, more robust, less dependent on platform-specific implementation details, etc, etc.

The value of 26, which is what we get when we do date --rfc-3339=s | wc -c seems like a more viable option for logs (date is likely to be in every line-delimited log file format, so even a single log line will work). Maybe we should adjust our default.

Absolutely not. That would require making two massive assumptions:

Log lines are formatted with RFC3339 dates as the first component (almost certainly not true)
A timestamp precise to the second is sufficient to uniquely identify log files (also certainly not true)

We have a solution that requires manual configuration and serious insight.

I very strongly disagree with this. The vast, vast majority of use cases for log shipping involve files that quickly grow past 256 bytes.

It would be great if we a solution that handled everything that checksumming handled while also taking care of logs that consist solely of ok or done. I am unaware of what that solution would be. The most likely change we make here it to add a separate mode to the file source that explicitly does not account for file rotation and simply uses the file path.

MOZGIII · 2020-05-28T16:22:41Z

Checksums are more direct, more robust, less dependent on platform-specific implementation details, etc, etc.

Checksums don't work for all files - that's a major downside, in my view - much more problematic than everything that you've listed combined.
Also, I'm not sure if I agree with the core perception - that they're more direct and robust. They're platform-specific for sure (the values are, not the approach since inodes exist on every platform supported by Vector), but I can't imagine anyone it's very would be switching platforms and expect the log collection to pick up where it left off.
So, this doesn't seem as obvious and trivial to be that checksums are clearly better. I'd appreciate it if we discuss this in more detail. If you're interested of course. It's a bit offtopic.

I very strongly disagree with this. The vast, vast majority of use cases for log shipping involve files that quickly grow past 256 bytes.

I understand where you're coming from, and I even mostly agree. But this is just choosing to ignore the problem - the cases where checksum fingerprinter doesn't work. I've stumbled upon this in practice too much to get annoyed to a point I created this issue.

I just want to fix it 😄
The problem is real, and hurts the most when the file source (or file server) is used for files of mixed nature. One such occasion, in particular, is with k8s.

I agree there's no solution in sight in the framework that's been built so far.

The most likely change we make here it to add a separate mode to the file source that explicitly does not account for file rotation and simply uses the file path.

Interesting. Wouldn't DevInode fingerprinter be the same or better in this case?

I also have another idea.

What if we treat files that aren't long enough as if they have a zero checksum? Then we'd need a mechanism to transition such files from the 0 checksum to a non-zero checksum once they're grown long enough, to avoid emitting the logs twice.

MOZGIII · 2020-05-28T16:34:45Z

I somehow missed #2701 (comment).

Oh, the good ol' fingerprinting conversation

😄 I know this has been discussed a lot before, but apparently things are still not there yet.

@MOZGIII we settled on the checksum strategy because:

The behavior is consistent across operating systems and file systems, such as Windows.

inode is full of weird edge case behavior. Ex: inodes can be reused.

checksum doesn't care how you rotate your files. AKA it works with copytruncate.

Therefore, checksum felt like the sane default. Yes, there are edge cases where it won't work, like the ones you mentioned, but those aren't the cases we are typically solving for. We are solving for the typical log file -- Vector is an observability utility.

Ok, some concrete points! Thx!

(1) shoudn't be an issue since Windows has inodes too;
(2) right, I didn't thought about it, inodes are indeed not that great; some abstraction around them would be needed to avoid the quirks;
(3) right, and it's great! but it's kinda broken. 😦

If you have a silver bullet, I'm all for it. Otherwise, we might consider requiring users to choose a strategy instead of defaulting to one. If you have improvements you'd like to make please offer them in a succinct constructive manner. If we agree there are improvements to be had, we'll need an RFC. Just because this same discussion keeps popping up and an RFC would prevent that going forward.

No silver bullets (so far). Just some ideas (a couple) I'd like to discuss, pre RFC.

lukesteensen · 2020-05-28T19:46:31Z

Checksums don't work for all files - that's a major downside, in my view - much more problematic than everything that you've listed combined.

If you have a solution that works for all situations I would be very eager to hear about it. All of the implementations I'm aware of have tradeoffs and we've chosen checksums because the downside (requiring that files be a certain configurable size) impacts the least important use case.

I can't imagine anyone it's very would be switching platforms and expect the log collection to pick up where it left off.

To be clear, no one expects that and that is not at all what I meant by platform-independent. There are meaningful differences in behavior across platforms and filesystems. This causes problems with tools that use device/inode fingerprinting that must be documented and worked around.

But this is just choosing to ignore the problem - the cases where checksum fingerprinter doesn't work.

We do not ignore the problem. We provide multiple ways to adjust the behavior if your use case is not well supported by the defaults.

The problem is real, and hurts the most when the file source (or file server) is used for files of mixed nature. One such occasion, in particular, is with k8s.

Please be more specific than "files of mixed nature". Obviously we want the tool to work well for as many circumstances as possible. Are you stating that k8s deployments routinely produce log files that are less than 256 bytes in size?

Wouldn't DevInode fingerprinter be the same or better in this case?

No, because of issues like inode reuse.

What if we treat files that aren't long enough as if they have a zero checksum? Then we'd need a mechanism to transition such files from the 0 checksum to a non-zero checksum once they're grown long enough, to avoid emitting the logs twice.

Then we would consider all such files to be the same and you couldn't have more than one of them. A zero checksum would not uniquely identify any of them.

(1) shoudn't be an issue since Windows has inodes too

You should not take the mere presence of inodes as evidence that there are no meaningful behavior differences here.

(2) right, I didn't thought about it, inodes are indeed not that great; some abstraction around them would be needed to avoid the quirks

That is not a problem that is solved by adding abstraction.

(3) right, and it's great! but it's kinda broken.

There is a very meaningful difference between being broken and making necessary tradeoffs.

MOZGIII · 2020-06-23T08:14:33Z

Ok, I think I figured a way to introduce a significant improvement with minimal efforts: #2890.

binarylogic · 2020-08-07T15:46:25Z

Closing this in favor of #2926.

MOZGIII changed the title ~~Truncate-aware inode file fingerprinter~~ The problem with fingerprinters May 28, 2020

binarylogic added meta: idea Anything in the idea phase. Needs further discussion and consensus before work can begin. source: file Anything `file` source related meta: feedback Anything related to customer/user feedback. labels May 28, 2020

MOZGIII mentioned this issue Jun 23, 2020

The ability to checksum by the first line at the file server #2890

Closed

MOZGIII mentioned this issue Jul 1, 2020

Expose the ability to checksum the first line to the users #2926

Closed

binarylogic closed this as completed Aug 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The problem with fingerprinters #2701

The problem with fingerprinters #2701

MOZGIII commented May 28, 2020

lukesteensen commented May 28, 2020

MOZGIII commented May 28, 2020

binarylogic commented May 28, 2020 •

edited

Loading

lukesteensen commented May 28, 2020

MOZGIII commented May 28, 2020

MOZGIII commented May 28, 2020

lukesteensen commented May 28, 2020

MOZGIII commented Jun 23, 2020

binarylogic commented Aug 7, 2020

The problem with fingerprinters #2701

The problem with fingerprinters #2701

Comments

MOZGIII commented May 28, 2020

lukesteensen commented May 28, 2020

MOZGIII commented May 28, 2020

binarylogic commented May 28, 2020 • edited Loading

lukesteensen commented May 28, 2020

MOZGIII commented May 28, 2020

MOZGIII commented May 28, 2020

lukesteensen commented May 28, 2020

MOZGIII commented Jun 23, 2020

binarylogic commented Aug 7, 2020

binarylogic commented May 28, 2020 •

edited

Loading