Expose the ability to checksum the first line to the users #2926

Closed
MOZGIII opened this issue Jul 1, 2020 · 3 comments
Labels
  • have: should (We should have this feature, but is not required. It is medium priority.)
  • needs: approval (Needs review & approval before work can begin.)
  • source: file (Anything `file` source related)
  • type: enhancement (A value-adding code change that enhances its existing functionality.)

Comments

@MOZGIII
Contributor

MOZGIII commented Jul 1, 2020

Now that #2904 is merged, a logical next step is to start the discussion on exposing this functionality to users.

Relevant references:

There are a few details that #2904 doesn't cover and that we probably want to settle before exposing this functionality to users:

  • support for compressed files
  • skipping file headers (i.e. checksumming the second line instead of the first)
  • how do we teach users about the properties and use cases of this checksum mode and how it compares to the others
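
To make the first two points concrete, here is a minimal sketch of a line-aware checksum that can skip header lines. It is an illustration only, not the code from #2904: the function name, the `skip_lines` and `max_line_len` parameters, the use of CRC32, and the choice to return `None` until a complete line is available are all assumptions.

```python
import zlib
from typing import Optional


def fingerprint_first_line(path: str, skip_lines: int = 0,
                           max_line_len: int = 4096) -> Optional[int]:
    """Checksum the first line that follows `skip_lines` header lines.

    Hypothetical helper: returns None while that line is still being
    written (no trailing newline and shorter than `max_line_len`), so the
    caller can retry once more data has arrived.
    """
    with open(path, "rb") as f:
        for _ in range(skip_lines):
            header = f.readline()
            if not header.endswith(b"\n"):
                return None  # a header line is missing or incomplete
        # Read at most `max_line_len` bytes of the line to fingerprint.
        line = f.readline(max_line_len)
    if not line.endswith(b"\n") and len(line) < max_line_len:
        return None  # the line is incomplete, too early to fingerprint
    # CRC32 is a stand-in; any stable checksum would do for this sketch.
    return zlib.crc32(line)
```

With `skip_lines=0` the checksum covers the first line; with `skip_lines=1` it covers the second, which is the header-skipping case from the list above.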

@binarylogic added the have: should, needs: approval, source: file, and type: enhancement labels on Aug 7, 2020

@binarylogic
Contributor

I do like this change for the reasons discussed, but I want to think carefully about the UX here. This is exactly the kind of decision I do not want to present to the user. Checkpointing within the file source is already confusing and this would make it even more so. I wish there was a way to combine this strategy with the current checksum strategy so it "just works" for small and large files.

@MOZGIII
Contributor Author

MOZGIII commented Aug 7, 2020

As far as I understand, the only case in which we read binary files is when we're processing compressed data. The only meaningful approach I know of for compressed log files works only if the compression algorithm permits streaming the uncompressed data as it arrives.

What if we just do the checksumming on the decompressed stream when the file is compressed? We can then use the line-aware fingerprinter. I think the resulting solution would just work for any meaningful case, covering all the existing cases but with less painful tradeoffs.
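
A rough sketch of that idea, assuming gzip as the compression format and detection by its two magic bytes. The helper names, the CRC32 checksum, and the `max_line_len` limit are hypothetical and are not taken from #2904 or #5215.

```python
import gzip
import zlib
from typing import BinaryIO, Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_decompressed(raw: BinaryIO) -> BinaryIO:
    """Wrap `raw` in a streaming gzip reader if it starts with the gzip magic."""
    head = raw.read(len(GZIP_MAGIC))
    raw.seek(0)
    if head == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=raw, mode="rb")
    return raw


def fingerprint_stream(raw: BinaryIO, max_line_len: int = 4096) -> Optional[int]:
    """Checksum the first decompressed line, or None if it is not complete yet."""
    stream = open_decompressed(raw)
    try:
        line = stream.readline(max_line_len)
    except EOFError:
        # The compressed stream ended before a full line could be decoded.
        return None
    if not line.endswith(b"\n") and len(line) < max_line_len:
        return None
    # CRC32 is a stand-in for whatever checksum the fingerprinter uses.
    return zlib.crc32(line)
```

Because the checksum is computed over decompressed bytes, a plain file and its gzip-compressed copy produce the same fingerprint in this sketch, so the same line-aware logic serves both cases.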

@binarylogic
Copy link
Contributor

This is done. #5215 should have closed this.
