Do not try to uncompress pages that are not compressed #1

papanikge · 2024-03-27T12:57:03Z

Panther has been running fraugster/parquet-go in production for some months now, ingesting TBs of data.

Some customers reported that they got the following error:

snappy: corrupt input

After some investigation, we found that the read function of the V2 DataPage never checked the IsCompressed flag (already present in DataPageHeaderV2).

More context: when Parquet files are compressed, the compression is applied at the page layer. Parquet supports compression per page (as shown by the DataPageHeaderV2 IsCompressed field, which comes directly from the thrift definition). The library detects the compression type (called CompressionCodec) and passes it down to the newBlockReader level. However, it still needs to check whether that specific page is actually compressed, and that check was missing.

FWIW, I double-checked this against parquet-go/parquet-go and confirmed that they don't try to decompress such pages.
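The per-page check described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `pageHeaderV2`, `decodePage`, and `gzipBytes` are hypothetical names, and gzip from the Go standard library stands in for snappy so the example has no external dependencies.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// pageHeaderV2 is a hypothetical stand-in for the thrift-generated
// DataPageHeaderV2; only the flag relevant to this fix is modeled.
type pageHeaderV2 struct {
	IsCompressed bool
}

// decodePage returns the page values. The file-level codec (gzip here,
// as an assumption standing in for snappy) is applied only when the
// page header says this specific page is compressed.
func decodePage(h pageHeaderV2, payload []byte) ([]byte, error) {
	if !h.IsCompressed {
		// The page was written uncompressed: hand the bytes back untouched.
		return payload, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("decompress page: %w", err)
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// gzipBytes compresses data so the example can build a compressed page.
func gzipBytes(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	plain := []byte("plain values")

	// Uncompressed page in a file whose codec is set: the flag prevents
	// the decompressor from ever seeing the raw bytes.
	out, err := decodePage(pageHeaderV2{IsCompressed: false}, plain)
	fmt.Println(string(out), err)

	// Compressed page: decompressed as usual.
	out, err = decodePage(pageHeaderV2{IsCompressed: true}, gzipBytes(plain))
	fmt.Println(string(out), err)

	// Ignoring the flag (treating the raw page as compressed) fails,
	// which is the analogue of the "corrupt input" error reported above.
	_, err = decodePage(pageHeaderV2{IsCompressed: true}, plain)
	fmt.Println(err != nil)
}
```

The last call mirrors the unit test mentioned below: the same bytes decode fine when the flag is false and fail when the flag is forced to true.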

Unit tests added
Full test run (and screenshot)
Run all unit tests with the race detector on
Run the linters locally via golangci-lint run

Ran all the tests with https://github.com/apache/parquet-testing

[Note: I can try adding a file to https://github.com/apache/parquet-testing before trying to merge this upstream]

Ran all unit tests with the race detector on

Added a unit test
... that passes with IsCompressed set to false, but breaks if I set IsCompressed to true, because the input is not snappy

kouknick

LGTM! That must have been really hard to debug 🐙
Great job!

kouknick · 2024-03-27T14:29:11Z

page_v2.go

@@ -122,6 +122,10 @@ func (dp *dataPageReaderV2) read(r io.Reader, ph *parquet.PageHeader, codec parq
 		}
 	}

+	if !ph.DataPageHeaderV2.IsCompressed {


Wow. I guess we can create an issue on GitHub and ask them why they don't do this. Or, even better, create a PR to handle this.

Yup, that's the next step ;-)
I'll import this into our own prod first though, to battle test it.

papanikge · 2024-03-27T14:50:55Z

Tagged with v0.12.0-panther1

papanikge self-assigned this Mar 27, 2024

Do not try to uncompress pages that are not compressed

e7e10ff

papanikge force-pushed the pap-uncompressed-data-page-v2 branch from 1ba8c9a to e7e10ff Compare March 27, 2024 13:56

papanikge requested a review from a team March 27, 2024 13:57

kouknick approved these changes Mar 27, 2024

papanikge merged commit 0cd0b3d into master Mar 27, 2024

papanikge deleted the pap-uncompressed-data-page-v2 branch March 27, 2024 14:50

This was referenced Apr 9, 2024

Library tries to uncompress pages even if declared uncompressed fraugster/parquet-go#102

Open

Do not try to uncompress pages that are not compressed fraugster/parquet-go#103

Open
