Is your feature request related to a problem? Please describe.
Recently we added a libcudf example for processing nested data types. The deduplication example uses a command line interface to receive a filename, perform some relational algebra and output basic timing data to the console.
Let's add an example that performs Parquet file transcoding. The example can read a Parquet file that contains a single column, write it back using a specified encoding and compression, and then read the file again. Finally, the example can confirm that the data from the first and second reads is identical.
Describe the solution you'd like
Here is a snippet showing how the example might be invoked:
./parquet_io ~/in.pq ~/out.pq DELTA_BYTE_ARRAY ZSTD
where the parameters represent input_filepath, output_filepath, column_encoding and compression_type. Valid encodings include DICTIONARY, PLAIN, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, plus BYTE_STREAM_SPLIT (soon to be supported).
The example will print to the console the elapsed time for (1) the initial read, (2) the write, and (3) the second read.
The example does not need to verify whether the requested encoding is valid for the column type (e.g. a string column with DELTA_BINARY_PACKED or an int64 column with DELTA_BYTE_ARRAY). Let's only operate on the first column to keep things simple.
Describe alternatives you've considered
Use a cuDF-python example instead, but I'd rather have a C++ example for each feature that we write a blog about.
This PR adds a new example, `parquet_io`, to `libcudf/cpp/examples`, demonstrating reading and writing Parquet files with different column encodings (the same for all columns for now) and compression types, to close #15344. The example may be elaborated and/or evolved as needed. #15348 should be merged before this PR to get all the CMake updates needed to successfully build and run this example.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Ray Douglass (https://github.com/raydouglass)
URL: #15420