[FEA] Ability to control field ID used for columns during Parquet write #10376

jlowe · 2022-03-01T16:04:11Z

Is your feature request related to a problem? Please describe.
Parquet columns include a field_id identifier, and Spark 3.3 recently added improved support for this feature. See SPARK-38094 for details. To support this Spark feature in the RAPIDS Accelerator, we need the ability to specify a column's field_id when a Parquet file is written and field IDs have been specified in the schema being written.

Describe the solution you'd like
column_in_metadata could have two additional fields and a new set method, e.g.:

bool _has_parquet_field_id = false;
int32_t _parquet_field_id;

column_in_metadata& set_parquet_field_id(int32_t field_id)
{
  _has_parquet_field_id = true;
  _parquet_field_id = field_id;
  return *this;
}

During the Parquet write, if a column_in_metadata indicates it has a field ID setting, then the specified field ID is used when encoding the column metadata in the Parquet footer. Otherwise, the field ID is left unspecified.
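The behavior described above could be sketched roughly as follows. This is a standalone illustration, not libcudf code: `column_in_metadata_sketch`, `schema_element_sketch`, and `build_footer_schema` are hypothetical names, and `std::optional` is used here in place of the separate bool and int32_t fields proposed above. The only external fact assumed is that `field_id` is an optional 32-bit integer on a Parquet schema element.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the writer-side column metadata (not the libcudf class).
class column_in_metadata_sketch {
 public:
  explicit column_in_metadata_sketch(std::string name) : _name(std::move(name)) {}

  // Records the field ID and returns *this so setters can be chained.
  column_in_metadata_sketch& set_parquet_field_id(int32_t field_id)
  {
    _parquet_field_id = field_id;
    return *this;
  }

  bool is_parquet_field_id_set() const { return _parquet_field_id.has_value(); }
  int32_t get_parquet_field_id() const { return _parquet_field_id.value(); }
  std::string const& name() const { return _name; }

 private:
  std::string _name;
  std::optional<int32_t> _parquet_field_id;  // empty means "no field ID requested"
};

// Hypothetical, simplified view of one schema entry in the Parquet footer.
struct schema_element_sketch {
  std::string name;
  std::optional<int32_t> field_id;  // stays empty when the caller did not set one
};

// Copies the field ID into the footer entry only when the caller specified it.
std::vector<schema_element_sketch> build_footer_schema(
  std::vector<column_in_metadata_sketch> const& columns)
{
  std::vector<schema_element_sketch> schema;
  schema.reserve(columns.size());
  for (auto const& col : columns) {
    schema_element_sketch element{col.name(), std::nullopt};
    if (col.is_parquet_field_id_set()) { element.field_id = col.get_parquet_field_id(); }
    schema.push_back(std::move(element));
  }
  return schema;
}
```

With this shape, a column whose setter was never called produces a footer entry with no field ID at all, rather than a default such as 0, which matches the "left unspecified" requirement.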

Additional context
See the field_id in the Parquet Schema Definition

Closes #10375
Closes #10376

This PR enables column `field_id` control in the parquet writer. When writing a parquet file, users can specify a column's `field_id` via `column_in_metadata.set_parquet_field_id()`. JNI bindings and unit tests are added as well.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jason Lowe (https://github.com/jlowe)
- Vukasin Milovanovic (https://github.com/vuule)
- Devavret Makkar (https://github.com/devavret)

URL: #10504

jlowe added the feature request, Needs Triage, libcudf, cuIO, and Spark labels Mar 1, 2022

jlowe mentioned this issue Mar 1, 2022

[FEA] Support field id meta in Parquet writing #10375

Closed

devavret self-assigned this Mar 14, 2022

GregoryKimball assigned PointKernel and GregoryKimball and unassigned devavret, PointKernel and GregoryKimball Mar 23, 2022

PointKernel mentioned this issue Mar 24, 2022

Add column field ID control in parquet writer #10504

Merged

rapids-bot bot closed this as completed in #10504 Apr 15, 2022

bdice removed the Needs Triage label Mar 4, 2024
