The ability to checksum by the first line at the file server #2890
Comments
As far as I understand, there is exactly one circumstance where this does not work. Requiring files to be a certain minimum size is a known tradeoff. If there are other things you consider "suboptimal" please explain further.
What happens if you read and there is no newline character yet? What if the first character is a newline? Should we require a minimum number of bytes before the newline to be considered valid? What about files with common headers?
Again, please explain these major issues so that we can evaluate your proposal against them.
Is there something specific about k8s that is creating many files smaller than the default 256B limit? Please explain further so that we can have an informed discussion before proposing specific changes.
Correct. The smallest message in k8s consists of 40 bytes with one log format, and 62 bytes with the other. Both log formats allow files to be less than 256 bytes. This case is precisely what I hit in the E2E tests, and thus the current fingerprinting fails there.

My solution - reading only the first line - has very similar properties to the checksum solution that we already have, but it doesn't have that tradeoff. It does require another set of tradeoffs, but as long as the conditions are met, it will just work properly every time. The said conditions are known to be true in the k8s use case.

Now, more about the properties of checksumming the first line (and the required conditions).
Effectively, most of the properties are common with the current length-based checksum algorithm, but without the downside of ignoring small non-empty files. This solution would be a good fit for the k8s source.

I hope this message is not too big, and at the same time explains my point of view on the matter reasonably. There are a few other improvements that, I figured, we can apply to the fingerprinting design; this one is critical for achieving better trade-offs for k8s.
To summarize all of that, it sounds like @MOZGIII just wants to guarantee that the timestamp is included in the fingerprint since that is likely to be unique. Two solutions:
You're absolutely right! I already attempted to use the first approach, but in the docker log format the timestamp doesn't come first, so a fixed-size prefix isn't guaranteed to cover it.

I prepared a concrete implementation of this proposal at #2904. There's a test suite that demonstrates and ensures the properties that we'd want from a checksum.
I thought about it more, and the summary isn't quite correct.
This is, in essence, the tradeoff of the proposed checksum implementation, while the current one has a slightly different one: in practice, it means that if a file stays smaller than 256 bytes, it is never picked up at all.
I understand that there are valid files smaller than 256 bytes and that it is your position that ignoring those makes our implementation incorrect.
Every solution will work every time if you assume that all of its required conditions are met 😄
This needs to be made more clear. Are we talking about a solution specifically for the k8s logs source? Are we talking about something that's intended to be more general-purpose? Is it intended to replace the existing checksum algorithm? Without a clear understanding of the intended scope, it's impossible to evaluate changes like this.
This ignores our existing support for gzipped files.
To be clear, you're claiming that log files smaller than 256 bytes are the common case? We all understand that they are valid and possible, but our tradeoff was made on the assumption that it is extremely uncommon for real files to remain that small for any meaningful amount of time. If you disagree, it's far more helpful to explain why than to repeatedly claim that we're ignoring a "common case".
Statements like this are not helpful. Please be willing to accept that our position is the result of logical reasoning from a reasonable set of assumptions. If you can't understand why something is the way it is, ask for clarification. The implication that we've made unreasonable decisions to this point is a serious distraction from your actual proposal.
This is where I see the largest technical problem with your proposal. The combination of a maximum line length and the fact that timestamps do not come first in these formats means you need to set the max larger than the largest log message you expect to see. How do you make that decision without meaningful correctness tradeoffs?
Another practical problem is when the first event is too long:

```json
{"log":"... more than 256 symbols, easily possible with nested json ...", "time": "...", ... }
```

This demonstrates the case where the current checksum wouldn't read far enough into the file to cover the timestamp, meaning the checksum will be generated solely on the payload of the log message. This may be very problematic, as in k8s it's a usual case when multiple containers emit identical messages.
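To make the failure mode concrete, here is a toy Rust snippet (purely illustrative, not from the actual implementation): two docker-format lines that differ only in their trailing `time` field look identical to any fingerprint computed over just the first 256 bytes.

```rust
fn main() {
    // A payload long enough that the `time` field falls past the 256-byte mark.
    let payload = "x".repeat(300);
    let a = format!(r#"{{"log":"{}","stream":"stdout","time":"2020-06-19T10:00:00Z"}}"#, payload);
    let b = format!(r#"{{"log":"{}","stream":"stdout","time":"2020-06-19T11:11:11Z"}}"#, payload);

    // Any checksum over only the first 256 bytes sees the same input for both
    // files, producing a fingerprint collision despite distinct timestamps.
    assert_eq!(&a.as_bytes()[..256], &b.as_bytes()[..256]);
}
```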
Oh, I missed this, true. 🤔 It explains a lot. 😄
No, definitely not replace the current checksum, not without more confidence and discussions.
Initially - just for the k8s source. Once we discuss it in detail - maybe we can expose it as an option to the users of the file source too.
I've left my system running k8s E2E tests with the implementation from #2904 in a loop, and so far no failures 🎉
Thanks @MOZGIII, you should have led with that example 😄. Vector aims to "just work" for the common observability use case. That is:
That's it. We are not concerned about solving strange edge cases by default, such as:
Solutions for you:
Hope that helps organize this discussion and move it forward.
Real world data point - we have a number of Python data processing jobs that chuck out a single short "Starting work" message at the start before spending a few hours doing data processing without (hopefully!) outputting anything until "Done" at the end. Getting that initial message to come through is reassuring. I suspect most would manage the 256-byte minimum with the expected final message, but I wouldn't like to insist on it, and always getting something through by the end would be mandatory.

While we could work around this by printing out "Lorem ipsum" filler at the start, I think everyone would regard this as really a bit rubbish, and it would lead to a poor impression of the new system. I was surprised as to why the messages weren't coming through until I read the Vector output logs at INFO/DEBUG level. My colleagues who didn't just implement a new logging system will have no such clue & will waste time looking for bugs that aren't there.
Thanks, @tyrken, that's a useful data point. We definitely want use cases like that to just work as much as possible.
Where are you getting this data?
If we're confident that k8s implementations enforce that upper limit, then it seems like we can be reasonably confident we'll avoid situations where a long line means we don't include a timestamp in our checksum.

I'm not opposed to adding an internal mode specifically for k8s if we can really rely on these properties to avoid making equivalent tradeoffs to the current checksumming implementation. I just wish you'd made this clearer earlier in the conversation 😄

As I understand it, the best argument for a new mode is roughly as follows:
If that's all true, then the result is that we can reliably cover all k8s logs (even small ones) by checksumming the first line, and I'm fine to proceed with adding an internal-only mode for that (if/how to present it to users is another discussion).

The biggest question I'm unsure of is around compressed files. My limited reading of the k8s docs implies that files are rotated via normal system means (e.g. logrotate) and not via some k8s-specific behavior. Does our implementation somehow ensure that we won't encounter rotated files that have been compressed?
Sorry about that, I was thinking it would make sense even as a generic solution, thus I didn't focus on the k8s use cases initially - even though I need it specifically for k8s.
Yep, that's the case.
This is a really good question. The rotated logs may or may not be gzipped. More specifically:
To offer compression support in the first-line checksum, we can just uncompress the gzip stream and read the first line of the compressed file (in streaming mode). However, compressed rotated files might be one of the things that we don't need to bother to support.

Either way, I suggest we count this as something we can add later if we see the need for it. Our implementation currently only picks up the uncompressed log files, so the compressed rotated ones won't be read.

@lukesteensen please take a look at the proposed implementation when you have time: #2904
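For reference, a minimal sketch of the streaming approach mentioned above, using the `flate2` crate (my assumption; the function name and signature are hypothetical, not necessarily how #2904 would do it):

```rust
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Stream-decompress a rotated `.gz` file just far enough to read its first
// line; the rest of the file is never inflated.
fn first_line_of_gzipped(path: &Path) -> io::Result<Vec<u8>> {
    let mut reader = BufReader::new(GzDecoder::new(File::open(path)?));
    let mut line = Vec::new();
    reader.read_until(b'\n', &mut line)?;
    Ok(line)
}
```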
The issue would be if we tried to read from a compressed file as if it were an uncompressed file. I agree we don't need to worry about full support for compression right now.
This seems sufficient to ensure we won't attempt to read any compressed file. If we're confident in that I think we can move forward here.
Under some circumstances, doing a checksum just by the fixed amount of bytes becomes very suboptimal.
To work around this, we could implement a checksumming process that would read the first line - that is, up to the first `\n` character (or a specified ceiling in bytes) - and checksum that. It is a very reasonable measure for line-separated logs, and solves the major issues with the pure bytes-oriented approach.

I figured this would help significantly with the proper implementation of the k8s source.
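For illustration, a minimal sketch of what such a first-line fingerprint could look like (the names are hypothetical, and std's `DefaultHasher` stands in for whichever checksum function would actually be used):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, BufRead, BufReader, Read};
use std::path::Path;

/// Fingerprint a file by its first line: the bytes up to the first `\n`,
/// capped at `max_line_length`. Returns `Ok(None)` while no complete line
/// has been written yet, so the caller retries later instead of hashing
/// partial data.
fn first_line_fingerprint(path: &Path, max_line_length: u64) -> io::Result<Option<u64>> {
    let mut reader = BufReader::new(File::open(path)?).take(max_line_length);
    let mut buf = Vec::new();
    reader.read_until(b'\n', &mut buf)?;

    let terminated = buf.last() == Some(&b'\n');
    if !terminated && (buf.len() as u64) < max_line_length {
        // First line is still incomplete and under the ceiling: defer.
        return Ok(None);
    }

    // DefaultHasher is a stand-in; real code would want a stable checksum.
    let mut hasher = DefaultHasher::new();
    hasher.write(&buf);
    Ok(Some(hasher.finish()))
}
```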
Current code for reference:
https://github.com/timberio/vector/blob/c293c492e97a7249822be4907f6bab84414dae7d/lib/file-source/src/file_server.rs#L391-L409
Ref #2701.