Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not try to uncompress pages that are not compressed #1

Merged
merged 1 commit into from
Mar 27, 2024

Conversation

papanikge
Copy link

@papanikge papanikge commented Mar 27, 2024

Panther is running fraugster/parguet-go in production for some months now ingesting TBs of data.

Some customers reported that they got the following error:

snappy: corrupt input

After some investigation we can see that in the read function of the the V2 DataPage, the flag (already present in DataPageHeaderV2) was not checked.

More context: Parquet files - when compressed - are so in the page layer. Parquet supports compression per page, (as shown from the DataPageHeaderV2 IsCompressed field, which comes directly from the thrift definition). The library detects the compression type (called CompressionCodec) and passes that down to the newBlockReader level. However it still needs to check if that specific page is indeed compressed, and that was missing.

FWIW, I doubled check this with parquet-go/parquet-goparquet-go/parquet-go and confirmed that they don't try to decompress that.

  • Unit tests added
  • Full test run (and screenshot)
  • Run all unit tests with the race detector on
  • Run the linters locally via golangci-lint run

Ran all the tests with https://github.com/apache/parquet-testing

image

[Note: I can try adding a file into https://github.com/apache/parquet-testing before trying merging this into upstream]

Ran all unit tests with the race detector on

image

Added a unit tests
... that passes with false, but breaks if I do IsCompressed => true because the input is not snappy

@papanikge papanikge self-assigned this Mar 27, 2024
@papanikge papanikge force-pushed the pap-uncompressed-data-page-v2 branch from 1ba8c9a to e7e10ff Compare March 27, 2024 13:56
@papanikge papanikge requested a review from a team March 27, 2024 13:57
Copy link

@kouknick kouknick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! That must have been really hard to debug 🐙
Great job!

@@ -122,6 +122,10 @@ func (dp *dataPageReaderV2) read(r io.Reader, ph *parquet.PageHeader, codec parq
}
}

if !ph.DataPageHeaderV2.IsCompressed {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. I guess we can create an issue in github and ask them why don't they do this. Or even better create a PR to handle this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's the next step ;-)
I'll import this into our own prod first though, to battle test it.

@papanikge papanikge merged commit 0cd0b3d into master Mar 27, 2024
@papanikge papanikge deleted the pap-uncompressed-data-page-v2 branch March 27, 2024 14:50
@papanikge
Copy link
Author

Tagged with v0.12.0-panther1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants