[FEA] Add Parquet transcoding to cpp/examples #15344

Closed
GregoryKimball opened this issue Mar 19, 2024 · 1 comment · Fixed by #15420
Labels: cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)


GregoryKimball commented Mar 19, 2024

Is your feature request related to a problem? Please describe.
Recently we added a libcudf example for processing nested data types. The deduplication example uses a command line interface to receive a filename, perform some relational algebra and output basic timing data to the console.

Let's add an example that performs Parquet file transcoding. The example can read a Parquet file that contains a single column, write it back out using a specified encoding and compression, and then read the new file. Finally, the example can confirm that the data from the first and second reads is the same.

Describe the solution you'd like

Here is a snippet showing how the example might be called:
./parquet_io ~/in.pq ~/out.pq DELTA_BYTE_ARRAY ZSTD
where the parameters represent input_filepath, output_filepath, column_encoding and compression_type. Valid encodings include DICTIONARY, PLAIN, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, plus (soon to be) BYTE_STREAM_SPLIT.
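As a rough illustration of the command line above, the four positional parameters could be parsed as sketched below. This is only a sketch under stated assumptions: `Encoding`, `Compression`, `Args`, `parse_encoding`, `parse_compression` and `parse_args` are hypothetical names invented here, the enums are local stand-ins rather than libcudf's own types, and the compression list is an illustrative subset (only ZSTD is named in this issue).

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-ins for the library's enums; a real example would use libcudf's types.
enum class Encoding {
  DICTIONARY, PLAIN, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY,
  BYTE_STREAM_SPLIT
};
enum class Compression { NONE, SNAPPY, ZSTD };  // illustrative subset (assumption)

// Maps an encoding name from the command line to the enum, or nullopt if unknown.
inline std::optional<Encoding> parse_encoding(std::string_view name) {
  static const std::unordered_map<std::string_view, Encoding> table = {
    {"DICTIONARY", Encoding::DICTIONARY},
    {"PLAIN", Encoding::PLAIN},
    {"DELTA_BINARY_PACKED", Encoding::DELTA_BINARY_PACKED},
    {"DELTA_LENGTH_BYTE_ARRAY", Encoding::DELTA_LENGTH_BYTE_ARRAY},
    {"DELTA_BYTE_ARRAY", Encoding::DELTA_BYTE_ARRAY},
    {"BYTE_STREAM_SPLIT", Encoding::BYTE_STREAM_SPLIT}};
  auto const it = table.find(name);
  if (it == table.end()) { return std::nullopt; }
  return it->second;
}

// Maps a compression name from the command line to the enum, or nullopt if unknown.
inline std::optional<Compression> parse_compression(std::string_view name) {
  static const std::unordered_map<std::string_view, Compression> table = {
    {"NONE", Compression::NONE}, {"SNAPPY", Compression::SNAPPY}, {"ZSTD", Compression::ZSTD}};
  auto const it = table.find(name);
  if (it == table.end()) { return std::nullopt; }
  return it->second;
}

struct Args {
  std::string input_filepath;
  std::string output_filepath;
  Encoding column_encoding;
  Compression compression_type;
};

// Expects: <program> input_filepath output_filepath column_encoding compression_type.
// Returns nullopt on a wrong argument count or an unrecognized name.
inline std::optional<Args> parse_args(int argc, const char* const* argv) {
  if (argc != 5) { return std::nullopt; }
  auto const encoding    = parse_encoding(argv[3]);
  auto const compression = parse_compression(argv[4]);
  if (!encoding || !compression) { return std::nullopt; }
  return Args{argv[1], argv[2], *encoding, *compression};
}
```

A `main(int argc, char** argv)` could call `parse_args(argc, argv)` and print a usage message when it returns `std::nullopt`. Note that this only rejects names it does not recognize; it does not check whether an encoding suits the column type.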

The example will print to console the time elapsed for (1) the initial read, (2) the write, and (3) the second read.

The example does not need to verify whether the requested encoding is valid for the column type (invalid combinations include a string column with DELTA_BINARY_PACKED or an int64 column with DELTA_BYTE_ARRAY). Let's only operate on the first column to keep things simple.

Describe alternatives you've considered
Use a cuDF-python example instead, but I'd rather have a C++ example for each feature that we write a blog about.

GregoryKimball added the cuIO, feature request, and libcudf labels Mar 19, 2024

etseidl commented Mar 20, 2024

Currently, if an invalid encoding is requested, the Parquet writer will print a warning and fall back to the default encoding (dictionary in most cases).

rapids-bot bot pushed a commit that referenced this issue May 13, 2024
This PR adds a new example `parquet_io` to `libcudf/cpp/examples` instrumenting reading and writing Parquet files with different column encodings (same for all columns for now) and compressions to close #15344. The example may be elaborated and/or evolved as needed. #15348 should be merged before this PR to get all CMake updates needed to successfully build and run this example.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Ray Douglass (https://github.com/raydouglass)

URL: #15420
GregoryKimball removed this from libcudf Jul 22, 2024