read `.xz` file by requested block #12

jtmoon79 · 2022-08-08T22:38:12Z

Problem

An .xz file is read and decompressed in its entirety during BlockReader::new.
For very large compressed files this is a problem: the s4 program holds the entire uncompressed file in memory, which may use too much memory.

The crate lzma-rs does not provide an API such as xz_decompress_with_options that would allow limiting the bytes returned per call. It only provides xz_decompress, which decompresses the entire file in one call. See gendx/lzma-rs#110

Solution

Read an .xz file per block request, as done for normal files.

Update: see Issue #283

Meta-Issue #182


jtmoon79 · 2022-08-09T00:18:54Z

Similar to Issue #13

jtmoon79 · 2022-09-20T21:31:33Z

The current code: https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/readers/blockreader.rs#L932-L943

It uses https://github.com/gendx/lzma-rs/releases/tag/v0.2.0

The problem is due to lzma-rs crate not providing the uncompressed file size. But uncompressed file size must be known before BlockReader::new returns.

The xz format description reads

Uncompressed Size
This field is present only if the appropriate bit is set in
the Block Flags field (see Section 3.1.2).

So a decent partial fix is to manually check whether the uncompressed size is available. That is, check without using lzma-rs, by jumping to the relevant bit and byte offsets and processing the raw data.
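A minimal sketch of that manual check, using only the standard library. It assumes the layout given in the xz file format description: a 12-byte Stream Header, then a Block Header whose second byte is the Block Flags, where bit 0x40 marks the Compressed Size field as present and bit 0x80 marks the Uncompressed Size field as present. The function names are hypothetical and not taken from blockreader.rs. No CRC32 is verified, and only the first Block is inspected.

```rust
/// Magic bytes that begin every .xz Stream Header.
const XZ_MAGIC: [u8; 6] = [0xFD, b'7', b'z', b'X', b'Z', 0x00];
/// Stream Header length: magic (6) + Stream Flags (2) + CRC32 (4).
const STREAM_HEADER_LEN: usize = 12;

/// Decode an xz "multibyte integer": 7 bits per byte, least-significant
/// group first, high bit set on every byte except the last, at most 9 bytes.
/// Returns the value and the number of bytes consumed.
fn decode_multibyte(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().take(9).enumerate() {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            // A zero byte may only terminate a one-byte encoding.
            if byte == 0 && i != 0 {
                return None;
            }
            return Some((value, i + 1));
        }
    }
    None
}

/// Hypothetical helper: return the Uncompressed Size declared in the header
/// of the first Block, or `None` if the field is absent or the data is not
/// recognizable as an .xz stream.
fn first_block_declared_size(data: &[u8]) -> Option<u64> {
    if !data.starts_with(&XZ_MAGIC) {
        return None;
    }
    let header = data.get(STREAM_HEADER_LEN..)?;
    // Byte 0 is the encoded Block Header Size; 0x00 here is the Index
    // Indicator instead, meaning the stream holds no Blocks.
    if *header.first()? == 0 {
        return None;
    }
    let flags = *header.get(1)?;
    let mut pos = 2;
    if flags & 0x40 != 0 {
        // Compressed Size field is present; skip over it.
        let (_, used) = decode_multibyte(header.get(pos..)?)?;
        pos += used;
    }
    if flags & 0x80 != 0 {
        let (size, _) = decode_multibyte(header.get(pos..)?)?;
        return Some(size);
    }
    None
}
```

A `None` result means the caller still has to fall back to decompressing. A `Some` result is only a declared size, which the format description warns may be invalid.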

The format has the caveat...

It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

In this sense, the current hacky implementation, which decompresses the entire file, is guaranteed to obtain the correct uncompressed size.

Add windows-latest in os matrix. Workaround git-for-windows/git#2803 using `git config core.protectNTFS false` Issue #12

jtmoon79 · 2023-01-30T21:17:48Z

Another way this issue manifests is reading too many blocks for files without syslines.

File eipp.log.xz has decompressed content like

Package: software-properties-common
Architecture: all
Version: 0.99.9.8
APT-ID: 71737
Status: installed
Depends: ca-certificates, gir1.2-glib-2.0, gir1.2-packagekitglib-1.0 (>= 1.1.0-2), packagekit, python-apt-common (>= 0.9), python3, python3-dbus, python3-gi, python3-requests-unixsocket, python3-software-properties (= 0.99.9.8), python3:any
Breaks: python-software-properties (<< 0.85), python3-software-properties (<< 0.85)

Package: liberror-perl
Architecture: all
Version: 0.17029-1
APT-ID: 2280
Multi-Arch: foreign
Status: installed
Depends: perl:any

Package: libpng16-16
Architecture: amd64
Version: 1.6.37-2
APT-ID: 3339
Multi-Arch: same
Status: installed
Depends: libc6 (>= 2.29), zlib1g (>= 1:1.2.11)

s4 reads all 4 blocks (counted after decompression) from this file.

• s4 /var/log/apt/eipp.log.xz  -s
WARNING: no syslines found "/var/log/apt/eipp.log.xz"

Files:

File: /var/log/apt/eipp.log.xz (XZ) MimeGuess(["application/x-xz"])
  Summary Printed:
      bytes          0
      lines          0
      syslines       0
      datetime first None Found
      datetime last  None Found
  Summary Processed:
      file size compressed   31592 (0x7B68) (bytes)
      file size uncompressed 201425 (0x312D1) (bytes)
      bytes          201425
      bytes total    201425
      block size     65535 (0xFFFF)
      blocks         4
      blocks total   4
      blocks high    4
      lines          2334
      lines high     2334
      syslines       0
      syslines high  0

Notice Summary Processed: blocks 4.

For plain log files, the BlockZero analysis would stop processing after the zeroth block (first block) did not have any apparent syslines, e.g. Summary Processed: blocks 1.

For very large files, this is a lot of overhead for naught, and may cause problems where computer memory is constrained.
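The block counts in the summary follow from ceiling division of the uncompressed size by the block size. The sketch below reproduces that arithmetic for the numbers reported above (201425 bytes, block size 65535). The function names are hypothetical and are not s4's own.

```rust
/// Number of fixed-size blocks needed to hold `filesz` bytes
/// (ceiling division).
fn count_blocks(filesz: u64, blocksz: u64) -> u64 {
    assert!(blocksz > 0, "block size must be non-zero");
    filesz / blocksz + u64::from(filesz % blocksz != 0)
}

/// Byte range `[start, end)` covered by block `index`, clamped to the
/// file size so the last block may be shorter than `blocksz`.
fn block_range(index: u64, filesz: u64, blocksz: u64) -> (u64, u64) {
    let start = (index * blocksz).min(filesz);
    let end = (start + blocksz).min(filesz);
    (start, end)
}
```

With these inputs `count_blocks(201425, 65535)` is 4, and the last block covers only bytes 196605 to 201425. Stopping after the zeroth block would have skipped three of the four blocks.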

~~Reading too many blocks increases likelihood of an errant match, e.g. a datetime string within some message that is mistakenly interpreted as a sysline.~~ (should be fixed; only zeroth block is analyzed for datetime substrings).

jtmoon79 · 2023-05-21T20:45:41Z

TODO: look into the lzma-rs feature "Expose a new raw_decoder API".

jtmoon79 · 2024-03-24T03:36:39Z

Update: see Issue #283

A good solution for this Issue and Issue #13 would be having a "sequential read mode" for SyslogProcessor that is also handed down to SyslineReader, to LineReader, and to BlockReader.

In "sequential read mode", there is no binary search for syslines, only reading the file from start to finish. This would allow "progressive" dropping of data at different points. The BlockReader would, during the search for datetime filter A, somehow know to drop Blocks from N - 2 ago... or something like that. Essentially, it's during the phase of finding the first syslog message acceptable to datetime filter A that Blocks would be dropped while searching (and Lines, Syslines).

This should be relatively clean to implement. There would be two paths for searching for the datetime filter A, binary and linear/sequential.
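A rough sketch of the block-dropping idea, assuming a forward-only source that implements `std::io::Read`. The type `SequentialBlocks`, its methods, and the `keep` parameter are hypothetical illustrations and not part of the existing BlockReader. How many blocks to retain is left open above ("N - 2 ago... or something like that"), so it is a parameter here.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical forward-only block reader that retains only the most
/// recently read `keep` blocks and drops anything older.
struct SequentialBlocks<R: Read> {
    reader: R,
    blocksz: u64,
    keep: usize,
    /// (block index, block data), oldest first.
    window: VecDeque<(u64, Vec<u8>)>,
    next_index: u64,
}

impl<R: Read> SequentialBlocks<R> {
    fn new(reader: R, blocksz: u64, keep: usize) -> Self {
        assert!(blocksz > 0 && keep > 0);
        Self { reader, blocksz, keep, window: VecDeque::new(), next_index: 0 }
    }

    /// Read the next block; returns its index, or `None` at end of input.
    fn advance(&mut self) -> io::Result<Option<u64>> {
        let mut block = Vec::new();
        // `take` caps this read at one block's worth of bytes.
        self.reader.by_ref().take(self.blocksz).read_to_end(&mut block)?;
        if block.is_empty() {
            return Ok(None);
        }
        let index = self.next_index;
        self.next_index += 1;
        self.window.push_back((index, block));
        // Drop blocks that fell out of the retention window.
        while self.window.len() > self.keep {
            self.window.pop_front();
        }
        Ok(Some(index))
    }

    /// Look up a block that is still held; dropped blocks return `None`.
    fn get(&self, index: u64) -> Option<&[u8]> {
        self.window
            .iter()
            .find(|(i, _)| *i == index)
            .map(|(_, data)| data.as_slice())
    }
}
```

Memory use is then bounded by roughly `keep` times the block size, at the cost of being unable to revisit dropped blocks, which is why this path rules out binary search.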

...

except for this one complicating detail from my comment above:

The problem is due to lzma-rs crate not providing the uncompressed file size. But uncompressed file size must be known before BlockReader::new returns.

I should just grab that raw data myself. It would simplify stuff.
From xz format definition 1.2.0

3.1.4. Uncompressed Size

The Uncompressed Size field contains the size of the Block after uncompressing.
...
It should be noted that the only reliable way to determine
the real uncompressed size is to uncompress the Block,
because the Block Header and Index fields may contain
(intentionally or unintentionally) invalid information.

Maybe just decompress the entire file once without saving it, to get the uncompressed size. Currently, the entire file is read once and saved during BlockReader::new.

This proposed implementation means the entire file is read twice, at most. However, the amount of runtime memory required would be a constant of the BlockSz, instead of at least the size of the uncompressed file. I think that's a smarter trade-off.
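One way the "decompress once without saving" pass could look is a writer that counts bytes and discards them. This is a sketch only: `CountingSink` and `measure_uncompressed_size` are hypothetical names, and the decoder is passed in as a closure so the block stays standard-library only. It assumes the decoder streams its output into an `impl Write` rather than accumulating it internally, which is not verified here for lzma-rs.

```rust
use std::io::{self, Write};

/// A writer that discards everything written to it and only counts bytes.
#[derive(Default)]
struct CountingSink {
    count: u64,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.count += buf.len() as u64;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Hypothetical helper: run `decode` against a counting sink and return
/// how many bytes it produced. `decode` stands in for any decompressor
/// that writes its output to an `io::Write`.
fn measure_uncompressed_size<F>(decode: F) -> io::Result<u64>
where
    F: FnOnce(&mut CountingSink) -> io::Result<()>,
{
    let mut sink = CountingSink::default();
    decode(&mut sink)?;
    Ok(sink.count)
}
```

The sink itself holds a single `u64`, so any memory cost of this pass comes from the decoder, not from storing the output.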

Also, I could delete one bullet point from the README.md

Entire .xz files are read into memory before printing (read .xz file by requested block #12)

jtmoon79 · 2024-05-05T06:12:53Z

Cannot read the xz file in chunks/blocks. The crate lzma-rs does not provide API xz_decompress_with_options. See gendx/lzma-rs#110

Consider https://docs.rs/xz2/latest/xz2/read/struct.XzDecoder.html
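The linked XzDecoder wraps a reader and itself implements `std::io::Read`, so in principle it could be consumed one block at a time. The sketch below is generic over any `Read` and uses only the standard library. `read_block` is a hypothetical helper, and wiring it to xz2 is an assumption that has not been tried here.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper: fill `block` from `reader`, tolerating short reads.
/// Returns the number of bytes placed in `block`; fewer than `block.len()`
/// means the stream ended.
fn read_block<R: Read>(reader: &mut R, block: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < block.len() {
        match reader.read(&mut block[filled..]) {
            Ok(0) => break, // end of stream
            Ok(n) => filled += n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```

Because such a decoder only moves forward, a request for block N still requires decoding blocks 0 through N - 1 first. This fits a sequential read path better than random access.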

Attempt to parse more of the XZ header and block #0 header. Unfortunately, I couldn't get this working entirely. Leaving the code in place as it does function. The intent was to compensate for lzma-rs reading the entire file during xz_decompress. However, that's a larger problem, see gendx/lzma-rs#110 Issue #12 Issue #283

jtmoon79 · 2024-05-31T07:08:21Z

#283 refactors the handling of .xz files. However, the problem of reading the entire file during an open remains.

jtmoon79 added the enhancement New feature or request label Aug 8, 2022

jtmoon79 mentioned this issue Aug 8, 2022

read tar file by requested block #13

Open

jtmoon79 added a commit that referenced this issue Aug 8, 2022

blockreader.rs NFC comment Issue #12

bc41128

Issue #12

jtmoon79 changed the title ~~read xz file by block~~ read xz file by requested block Aug 9, 2022

jtmoon79 added a commit that referenced this issue Oct 20, 2022

rust.yml allow windows-latest, git config core.protectNTFS false

e82e902

Add windows-latest in os matrix. Workaround git-for-windows/git#2803 using `git config core.protectNTFS false` Issue #12

jtmoon79 added a commit that referenced this issue Oct 20, 2022

rust.yml allow windows-latest, git config core.protectNTFS false

e35faa1

Add windows-latest in os matrix. Workaround git-for-windows/git#2803 using `git config core.protectNTFS false` Issue #12

jtmoon79 added a commit that referenced this issue Oct 20, 2022

rust.yml allow windows-latest, git config core.protectNTFS false

2598719

Add windows-latest in os matrix. Workaround git-for-windows/git#2803 using `git config core.protectNTFS false` Issue #12

jtmoon79 added the difficult A difficult problem; a major coding effort or difficult algorithm to perfect label May 21, 2023

jtmoon79 mentioned this issue May 21, 2023

Support processing LZMA and LZMA2 format .lz files #128

Open

jtmoon79 changed the title ~~read xz file by requested block~~ read .xz file by requested block May 21, 2023

jtmoon79 mentioned this issue Aug 29, 2023

improve memory usage for archived or compressed files #182

Open

jtmoon79 added the file parser label Nov 5, 2023

jtmoon79 mentioned this issue Apr 20, 2024

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

Closed
