Verify compression type in Parquet reader (#10610)
Closes #10602

This PR adds a compression type check for each chunk in the input file.
The reader throws if an unsupported compression type is used.
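
The check described above can be sketched outside of cuDF as follows. This is a minimal illustration under stated assumptions, not the library's code: `Codec`, `ChunkDesc`, `decodable_codecs`, and `validate_chunk_codecs` are hypothetical stand-ins, the enum values follow the Parquet format's `CompressionCodec` ids, and `std::runtime_error` takes the place of the `CUDF_EXPECTS` macro.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the Parquet codec ids (values as in the Parquet format spec).
enum Codec : std::int8_t { UNCOMPRESSED = 0, SNAPPY = 1, GZIP = 2, LZO = 3, BROTLI = 4, LZ4 = 5, ZSTD = 6 };

// Hypothetical stand-in for the reader's per-chunk descriptor; only the codec matters here.
struct ChunkDesc {
  std::int8_t codec;
};

// Codecs this sketch assumes the reader can decompress (mirrors the `codecs` array in the diff).
constexpr std::array<std::int8_t, 3> decodable_codecs{GZIP, SNAPPY, BROTLI};

// A codec is acceptable if the data is uncompressed or a decompressor is listed for it.
bool is_codec_supported(std::int8_t codec)
{
  if (codec == UNCOMPRESSED) return true;
  return std::find(decodable_codecs.begin(), decodable_codecs.end(), codec) !=
         decodable_codecs.end();
}

// Reject the whole read up front if any chunk uses an unsupported codec.
void validate_chunk_codecs(std::vector<ChunkDesc> const& chunks)
{
  bool const all_supported =
    std::all_of(chunks.begin(), chunks.end(), [](ChunkDesc const& chunk) {
      return is_codec_supported(chunk.codec);
    });
  if (!all_supported) throw std::runtime_error("Unsupported compression type");
}
```

Validating every chunk before any decompression starts means a file that mixes supported and unsupported codecs fails as a whole rather than partway through the read.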

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - https://github.com/brandon-b-miller
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #10610
vuule authored Apr 7, 2022
1 parent fb03c8b commit 018924f
Showing 3 changed files with 20 additions and 0 deletions.
13 changes: 13 additions & 0 deletions cpp/src/io/parquet/reader_impl.cu
@@ -1179,6 +1179,19 @@ rmm::device_buffer reader::impl::decompress_page_data(
                    codec_stats{parquet::SNAPPY, 0, 0},
                    codec_stats{parquet::BROTLI, 0, 0}};

  auto is_codec_supported = [&codecs](int8_t codec) {
    if (codec == parquet::UNCOMPRESSED) return true;
    return std::find_if(codecs.begin(), codecs.end(), [codec](auto& cstats) {
             return codec == cstats.compression_type;
           }) != codecs.end();
  };
  CUDF_EXPECTS(std::all_of(chunks.begin(),
                           chunks.end(),
                           [&is_codec_supported](auto const& chunk) {
                             return is_codec_supported(chunk.codec);
                           }),
               "Unsupported compression type");

  for (auto& codec : codecs) {
    for_each_codec_page(codec.compression_type, [&](size_t page) {
      auto page_uncomp_size = pages[page].uncompressed_page_size;
Binary file not shown.
7 changes: 7 additions & 0 deletions python/cudf/cudf/tests/test_parquet.py
@@ -2420,3 +2420,10 @@ def test_parquet_reader_decimal_columns():
    expected = pd.read_parquet(buffer, columns=["col3", "col2", "col1"])

    assert_eq(actual, expected)


def test_parquet_reader_unsupported_compression(datadir):
    fname = datadir / "spark_zstd.parquet"

    with pytest.raises(RuntimeError):
        cudf.read_parquet(fname)
