[FEA] Add Parquet transcoding to `cpp/examples` #15344

GregoryKimball · 2024-03-19T23:21:19Z

Is your feature request related to a problem? Please describe.
Recently we added a libcudf example for processing nested data types. The deduplication example uses a command line interface to receive a filename, perform some relational algebra and output basic timing data to the console.

Let's add an example that performs Parquet file transcoding. The example can read a Parquet file that contains a single column, write it back out using a specified encoding and compression, and then read the new file. Finally, the example can confirm that the data is the same between the first and second reads.

Describe the solution you'd like

Here is a snippet showing how the example might be called:
./parquet_io ~/in.pq ~/out.pq DELTA_BYTE_ARRAY ZSTD
where the parameters represent input_filepath, output_filepath, column_encoding and compression_type. Valid encodings include DICTIONARY, PLAIN, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, plus (soon to be) BYTE_STREAM_SPLIT.
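As a rough illustration of the command-line handling described above, here is a minimal sketch of mapping the `column_encoding` argument to an enum value. It uses only the C++ standard library. `encoding_choice` and `parse_encoding` are hypothetical names made up for this sketch, not libcudf types, and folding the input to upper case is an assumption about how lenient the CLI should be.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the writer's encoding type. The enumerators mirror
// the encodings listed in this issue; this is not the libcudf enum itself.
enum class encoding_choice {
  DICTIONARY,
  PLAIN,
  DELTA_BINARY_PACKED,
  DELTA_LENGTH_BYTE_ARRAY,
  DELTA_BYTE_ARRAY
};

// Returns the matching encoding, or std::nullopt if the name is not recognized.
// The name is upper-cased first so that "plain" and "PLAIN" are both accepted
// (an assumption of this sketch).
std::optional<encoding_choice> parse_encoding(std::string name)
{
  for (char& c : name) {
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  }

  static const std::unordered_map<std::string, encoding_choice> known{
    {"DICTIONARY", encoding_choice::DICTIONARY},
    {"PLAIN", encoding_choice::PLAIN},
    {"DELTA_BINARY_PACKED", encoding_choice::DELTA_BINARY_PACKED},
    {"DELTA_LENGTH_BYTE_ARRAY", encoding_choice::DELTA_LENGTH_BYTE_ARRAY},
    {"DELTA_BYTE_ARRAY", encoding_choice::DELTA_BYTE_ARRAY}};

  auto const it = known.find(name);
  if (it == known.end()) { return std::nullopt; }
  return it->second;
}
```

This only rejects names that are not on the list. It does not check whether the encoding suits the column type, which matches the scope proposed in this issue.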

The example will print to console the time elapsed for (1) the initial read, (2) the write, and (3) the second read.
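One possible way to report the three timings is a small wall-clock helper, sketched below with the C++ standard library only. `time_stage` is a hypothetical name for this sketch, and it assumes the callable returns only after its work (including any GPU work) has finished; otherwise the measured time would be too short.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical helper: runs `stage`, prints its wall-clock duration to the
// console, and returns the elapsed time in milliseconds.
template <typename Stage>
double time_stage(std::string const& label, Stage&& stage)
{
  auto const start = std::chrono::steady_clock::now();
  std::forward<Stage>(stage)();  // assumed to block until the stage is complete
  auto const stop = std::chrono::steady_clock::now();

  std::chrono::duration<double, std::milli> const elapsed = stop - start;
  std::cout << label << ": " << elapsed.count() << " ms\n";
  return elapsed.count();
}
```

The example could then wrap each step in a lambda, for instance `time_stage("initial read", [&] { /* read the input file */ });`, and do the same for the write and the second read.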

The example does not need to verify that the requested encoding is valid for the column type (e.g. a string column with DELTA_BINARY_PACKED or an int64 column with DELTA_BYTE_ARRAY). Let's only operate on the first column to keep things simple.

Describe alternatives you've considered
Use a cuDF-python example instead, but I'd rather have a C++ example for each feature that we write a blog about.


etseidl · 2024-03-20T05:40:51Z

Currently if an invalid encoding is requested the parquet writer will print a warning and fall back to the default (dictionary in most cases).

This PR adds a new example `parquet_io` to `libcudf/cpp/examples` instrumenting reading and writing parquet files with different column encodings (same for all columns for now) and compressions to close #15344. The example may be elaborated and/or evolved as needed. #15348 should be merged before this PR to get all CMake updates needed to successfully build and run this example.

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Ray Douglass (https://github.com/raydouglass)

URL: #15420

GregoryKimball added the feature request, libcudf, and cuIO labels Mar 19, 2024

GregoryKimball added this to the Helps libcudf C++ integrations milestone Mar 19, 2024

GregoryKimball assigned mhaseeb123 Mar 19, 2024

GregoryKimball added this to libcudf Mar 19, 2024

mhaseeb123 mentioned this issue Apr 1, 2024

Adding parquet transcoding example #15420

Merged

3 tasks

rapids-bot bot closed this as completed in #15420 May 13, 2024

GregoryKimball removed this from libcudf Jul 22, 2024
