read .xz file by requested block #12

Open
jtmoon79 opened this issue Aug 8, 2022 · 7 comments
Labels
difficult A difficult problem; a major coding effort or difficult algorithm to perfect enhancement New feature or request file parser

Comments

@jtmoon79
Owner

jtmoon79 commented Aug 8, 2022

Problem

An .xz file is entirely read during BlockReader::new.
This may cause problems for very large compressed files: the s4 program holds the entire uncompressed file in memory, which may exhaust available memory.

The crate lzma-rs does not provide an API such as xz_decompress_with_options, which would allow limiting the bytes returned per call. It only provides xz_decompress, which decompresses the entire file in one call. See gendx/lzma-rs#110

Solution

Read an .xz file per block request, as done for normal files.


Update: see Issue #283

Meta-Issue #182

@jtmoon79 jtmoon79 added the enhancement New feature or request label Aug 8, 2022
jtmoon79 added a commit that referenced this issue Aug 8, 2022
@jtmoon79 jtmoon79 changed the title read xz file by block read xz file by requested block Aug 9, 2022
@jtmoon79
Owner Author

jtmoon79 commented Aug 9, 2022

Similar to Issue #13

@jtmoon79
Owner Author

jtmoon79 commented Sep 20, 2022

The current code: https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/readers/blockreader.rs#L932-L943

It uses https://github.com/gendx/lzma-rs/releases/tag/v0.2.0

The problem is due to lzma-rs crate not providing the uncompressed file size. But uncompressed file size must be known before BlockReader::new returns.

The xz format description reads

Uncompressed Size
This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

So a decent partial fix is to manually check if the uncompressed size is available; that is, check without using lzma-rs, by jumping to the relevant bit and byte offsets and processing the raw data.


The format has the caveat...

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

In this sense, the current hacky implementation is guaranteed to be correct.
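
A minimal sketch of that manual check, for illustration only. It assumes the layout given in the xz format description: a 12-byte Stream Header, then a Block Header whose Block Flags byte uses bit 0x40 for "Compressed Size present" and bit 0x80 for "Uncompressed Size present", with both sizes stored as multibyte integers. The names `decode_multibyte` and `first_block_uncompressed_size` are hypothetical, no CRC32 is verified, and the returned value is only what the header claims (see the caveat quoted above).

```rust
/// Magic bytes that begin every .xz Stream Header.
const XZ_MAGIC: [u8; 6] = [0xFD, b'7', b'z', b'X', b'Z', 0x00];
/// Stream Header length: 6 magic bytes + 2 Stream Flags bytes + 4 CRC32 bytes.
const STREAM_HEADER_LEN: usize = 12;

/// Decode an xz "multibyte integer": 7 bits per byte, least significant
/// group first, high bit set when another byte follows, at most 9 bytes.
/// Returns the value and the number of bytes consumed.
fn decode_multibyte(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().take(9).enumerate() {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            // A zero continuation byte is a non-minimal encoding; reject it.
            if byte == 0 && i > 0 {
                return None;
            }
            return Some((value, i + 1));
        }
    }
    None
}

/// Hypothetical helper: return the Uncompressed Size claimed by the first
/// Block Header, or `Ok(None)` when the field is absent.
/// The CRC32 fields are NOT checked, so the value is unverified.
fn first_block_uncompressed_size(data: &[u8]) -> Result<Option<u64>, &'static str> {
    if data.len() < STREAM_HEADER_LEN || !data.starts_with(&XZ_MAGIC) {
        return Err("not an xz stream");
    }
    let rest = &data[STREAM_HEADER_LEN..];
    let size_byte = *rest.first().ok_or("truncated stream")?;
    if size_byte == 0 {
        // 0x00 here is the Index Indicator: the stream holds no Blocks.
        return Err("stream has no blocks");
    }
    // Real Block Header size is (encoded size + 1) * 4 bytes.
    let header_len = (usize::from(size_byte) + 1) * 4;
    if rest.len() < header_len {
        return Err("truncated block header");
    }
    let header = &rest[..header_len];
    let flags = header[1];
    let mut pos = 2;
    if flags & 0x40 != 0 {
        // Skip the optional Compressed Size field.
        let (_, used) = decode_multibyte(&header[pos..]).ok_or("bad compressed size")?;
        pos += used;
    }
    if flags & 0x80 == 0 {
        return Ok(None);
    }
    let (size, _) = decode_multibyte(&header[pos..]).ok_or("bad uncompressed size")?;
    Ok(Some(size))
}
```

When the function returns `Ok(None)`, the only remaining option is to decompress the Block to learn its size. A multi-block file would also need the Index, or a walk over every Block Header, to get the total; this sketch looks at the first Block only.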

jtmoon79 added a commit that referenced this issue Oct 20, 2022
Add windows-latest in os matrix.
Workaround git-for-windows/git#2803
using `git config core.protectNTFS false`

Issue #12
@jtmoon79
Owner Author

jtmoon79 commented Jan 30, 2023

Another way this issue manifests is reading too many blocks for files without syslines.

File eipp.log.xz has decompressed content like

Package: software-properties-common
Architecture: all
Version: 0.99.9.8
APT-ID: 71737
Status: installed
Depends: ca-certificates, gir1.2-glib-2.0, gir1.2-packagekitglib-1.0 (>= 1.1.0-2), packagekit, python-apt-common (>= 0.9), python3, python3-dbus, python3-gi, python3-requests-unixsocket, python3-software-properties (= 0.99.9.8), python3:any
Breaks: python-software-properties (<< 0.85), python3-software-properties (<< 0.85)

Package: liberror-perl
Architecture: all
Version: 0.17029-1
APT-ID: 2280
Multi-Arch: foreign
Status: installed
Depends: perl:any

Package: libpng16-16
Architecture: amd64
Version: 1.6.37-2
APT-ID: 3339
Multi-Arch: same
Status: installed
Depends: libc6 (>= 2.29), zlib1g (>= 1:1.2.11)

s4 reads all 4 blocks (counted after decompression) from this file.

• s4 /var/log/apt/eipp.log.xz  -s
WARNING: no syslines found "/var/log/apt/eipp.log.xz"

Files:

File: /var/log/apt/eipp.log.xz (XZ) MimeGuess(["application/x-xz"])
  Summary Printed:
      bytes          0
      lines          0
      syslines       0
      datetime first None Found
      datetime last  None Found
  Summary Processed:
      file size compressed   31592 (0x7B68) (bytes)
      file size uncompressed 201425 (0x312D1) (bytes)
      bytes          201425
      bytes total    201425
      block size     65535 (0xFFFF)
      blocks         4
      blocks total   4
      blocks high    4
      lines          2334
      lines high     2334
      syslines       0
      syslines high  0

Notice Summary Processed: blocks 4.
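
The block count in the summary follows from the uncompressed size and the block size. A small sketch of that arithmetic; `block_count` is a hypothetical helper name, not a function from s4:

```rust
/// Number of fixed-size blocks needed to cover `file_sz` bytes
/// (ceiling division). `block_sz` must be non-zero.
fn block_count(file_sz: u64, block_sz: u64) -> u64 {
    assert!(block_sz > 0, "block size must be non-zero");
    (file_sz + block_sz - 1) / block_sz
}
```

With the values above, `block_count(201425, 65535)` is 4, while the compressed size of 31592 bytes would fit in a single block.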

For plain log files, the BlockZero analysis would stop processing after the zeroth block (first block) did not have any apparent syslines, e.g. Summary Processed: blocks 1.

For very large files, this is a lot of overhead for naught, and may cause problems where computer memory is constrained.

Reading too many blocks increases likelihood of an errant match, e.g. a datetime string within some message that is mistakenly interpreted as a sysline. (should be fixed; only zeroth block is analyzed for datetime substrings).

@jtmoon79 jtmoon79 added the difficult A difficult problem; a major coding effort or difficult algorithm to perfect label May 21, 2023
@jtmoon79 jtmoon79 changed the title read xz file by requested block read .xz file by requested block May 21, 2023
@jtmoon79
Copy link
Owner Author

jtmoon79 commented Mar 24, 2024

Update: see Issue #283


A good solution for this Issue and Issue #13 would be having a "sequential read mode" for SyslogProcessor that is also handed down to SyslineReader, to LineReader, and to BlockReader.

In "sequential read mode", there is no binary search for syslines, only reading the file from start to finish. This would allow "progressive" dropping of data at different points. The BlockReader would, during the search for datetime filter A, somehow know to drop Blocks from N - 2 ago, or something like that. Essentially, Blocks (and Lines and Syslines) would be dropped during the phase of finding the first syslog message acceptable to datetime filter A.

This should be relatively clean to implement. There would be two paths for searching for the datetime filter A, binary and linear/sequential.
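
A rough sketch of the block-dropping idea, under simplifying assumptions. `BlockWindow` and `read_sequential` are hypothetical names and not part of the existing readers; the window simply keeps the most recent `keep` blocks, whereas the real design would tie dropping to the datetime filter A search. Any `std::io::Read` source stands in for the file.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical holder for sequential mode: retains only the most
/// recent `keep` blocks and drops older ones.
struct BlockWindow {
    /// Pairs of (block offset, block data), oldest first.
    blocks: VecDeque<(u64, Vec<u8>)>,
    keep: usize,
}

impl BlockWindow {
    fn new(keep: usize) -> Self {
        BlockWindow {
            blocks: VecDeque::new(),
            keep,
        }
    }

    /// Store a block, then drop the oldest blocks beyond `keep`.
    fn push(&mut self, offset: u64, data: Vec<u8>) {
        self.blocks.push_back((offset, data));
        while self.blocks.len() > self.keep {
            self.blocks.pop_front();
        }
    }
}

/// Read `reader` from start to finish in blocks of `block_sz` bytes,
/// pushing each block into `window`. Returns the number of blocks read.
fn read_sequential<R: Read>(
    mut reader: R,
    block_sz: u64,
    window: &mut BlockWindow,
) -> io::Result<u64> {
    let mut offset: u64 = 0;
    loop {
        let mut block = Vec::new();
        // `take` caps this read at one block; `read_to_end` fills it
        // even when the underlying reader returns short reads.
        let n = reader.by_ref().take(block_sz).read_to_end(&mut block)?;
        if n == 0 {
            return Ok(offset);
        }
        window.push(offset, block);
        offset += 1;
    }
}
```

With `keep` set to 2, memory use stays near two blocks regardless of file size, which is the property the sequential path is after.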

...

except for this one complicating detail from my comment above:

The problem is due to lzma-rs crate not providing the uncompressed file size. But uncompressed file size must be known before BlockReader::new returns.

I should just grab that raw data myself. It would simplify stuff.
From xz format definition 1.2.0

3.1.4. Uncompressed Size

The Uncompressed Size field contains the size of the Block after uncompressing.
...
It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

Maybe just decompress the entire file once without saving it, to get the uncompressed size. Currently, the entire file is read once and saved during BlockReader::new.

This proposed implementation means the entire file is read twice, at most. However, the amount of runtime memory required would be a constant of the BlockSz, instead of at least the size of the uncompressed file. I think that's a smarter trade-off.
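
A minimal sketch of the "decompress once without saving" pass. It assumes a streaming decoder that implements `std::io::Read` is available, which lzma-rs 0.2.0 does not offer according to this issue; the sketch therefore takes any reader, and `count_bytes` is a hypothetical name.

```rust
use std::io::{self, Read};

/// Count the bytes a reader yields while holding only one buffer of
/// `block_sz` bytes. Passing a streaming decoder as `reader` would give
/// the uncompressed size without storing the uncompressed data.
fn count_bytes<R: Read>(mut reader: R, block_sz: usize) -> io::Result<u64> {
    // Guard against a zero-length buffer, which would look like EOF.
    let mut buf = vec![0u8; block_sz.max(1)];
    let mut total: u64 = 0;
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(total), // end of stream
            Ok(n) => total += n as u64,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

The buffer is reused on every iteration, so peak memory is one BlockSz rather than the size of the uncompressed file, at the cost of decoding the file a second time when the blocks are actually needed.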


Also, I could delete one bullet point from the README.md

@jtmoon79
Owner Author

jtmoon79 commented May 5, 2024

Cannot read the xz file in chunks/blocks. The crate lzma-rs does not provide API xz_decompress_with_options. See gendx/lzma-rs#110

Consider https://docs.rs/xz2/latest/xz2/read/struct.XzDecoder.html

jtmoon79 added a commit that referenced this issue May 6, 2024
Attempt to parse more of the XZ header and block #0 header.
Unfortunately, I couldn't get this working entirely. Leaving
the code in place as it does function.
The intent was to compensate for lzma-rs reading the entire file
during xz_decompress. However, that's a larger problem, see
gendx/lzma-rs#110

Issue #12
Issue 283
@jtmoon79
Owner Author

jtmoon79 commented May 31, 2024

#283 refactors the handling of .xz files. However, the problem of reading the entire file during open remains.
