[FEA] Ability to control field ID used for columns during Parquet write #10376

Closed
jlowe opened this issue Mar 1, 2022 · 0 comments · Fixed by #10504
Labels: cuIO, feature request, libcudf, Spark

jlowe commented Mar 1, 2022

Is your feature request related to a problem? Please describe.
Parquet columns include a field_id identifier, and Spark 3.3 recently added improved support for this feature. See SPARK-38094 for details. To support this Spark feature in the RAPIDS Accelerator, we need to be able to specify the value written as a column's field_id whenever the schema being written specifies field IDs.
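
To illustrate why a stable field_id is useful to a reader, here is a minimal sketch of looking up a column by ID rather than by name, so that a renamed column can still be found. The `schema_column` struct and the `find_by_field_id` helper are hypothetical names made up for this example; they are not part of libcudf, Spark, or the Parquet libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a Parquet schema element:
// a column name plus an optional field_id.
struct schema_column {
  std::string name;
  std::optional<int32_t> field_id;
};

// Returns the index of the first column whose field_id equals `id`,
// or std::nullopt when no column carries that ID.
std::optional<std::size_t> find_by_field_id(std::vector<schema_column> const& file_schema,
                                            int32_t id)
{
  for (std::size_t i = 0; i < file_schema.size(); ++i) {
    auto const& fid = file_schema[i].field_id;
    if (fid.has_value() && *fid == id) { return i; }
  }
  return std::nullopt;
}

// Example: a column keeps field_id 7 regardless of what it is named,
// while a column written without an ID can never match by ID.
inline void field_id_lookup_example()
{
  std::vector<schema_column> const file_schema{
    {"user_name", 7}, {"age", 8}, {"legacy", std::nullopt}};
  assert(find_by_field_id(file_schema, 7).value() == 0);
  assert(!find_by_field_id(file_schema, 99).has_value());
}
```

A lookup like this only works if the writer recorded the IDs in the first place, which is what this request is about.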

Describe the solution you'd like
column_in_metadata could have two additional fields and a new set method, e.g.:

bool _has_parquet_field_id = false;
int32_t _parquet_field_id;

column_in_metadata& set_parquet_field_id(int32_t field_id)
{
  _has_parquet_field_id = true;
  _parquet_field_id = field_id;
  return *this;
}

During the Parquet write, if a column_in_metadata indicates that it has a field ID set, the specified field ID is used when encoding the column metadata in the Parquet footer. Otherwise, the field ID is left unspecified.
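
The behavior described above could be sketched roughly as follows. This is a standalone mock-up, not libcudf code: `column_in_metadata_sketch`, `footer_schema_element`, and `encode_schema_element` are hypothetical names, and the real writer's footer encoding is not shown. The unset case is modeled with an empty `std::optional`, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Simplified stand-in for column_in_metadata; only the proposed members are modeled.
class column_in_metadata_sketch {
  std::string _name;
  bool _has_parquet_field_id = false;
  int32_t _parquet_field_id  = 0;  // only meaningful when the flag above is true

 public:
  explicit column_in_metadata_sketch(std::string name) : _name(std::move(name)) {}

  // Proposed setter: records the ID and marks it as present.
  column_in_metadata_sketch& set_parquet_field_id(int32_t field_id)
  {
    _has_parquet_field_id = true;
    _parquet_field_id     = field_id;
    return *this;
  }

  std::string const& name() const { return _name; }
  bool has_parquet_field_id() const { return _has_parquet_field_id; }
  int32_t parquet_field_id() const { return _parquet_field_id; }
};

// Hypothetical footer schema entry; an empty field_id means "not written".
struct footer_schema_element {
  std::string name;
  std::optional<int32_t> field_id;
};

// Copies the field ID into the footer entry only when the caller set one.
footer_schema_element encode_schema_element(column_in_metadata_sketch const& col)
{
  footer_schema_element out{col.name(), std::nullopt};
  if (col.has_parquet_field_id()) { out.field_id = col.parquet_field_id(); }
  return out;
}

// Example: one column with an explicit ID, one without.
inline void field_id_write_example()
{
  column_in_metadata_sketch with_id{"a"};
  with_id.set_parquet_field_id(42);
  column_in_metadata_sketch without_id{"b"};
  assert(encode_schema_element(with_id).field_id.value() == 42);
  assert(!encode_schema_element(without_id).field_id.has_value());
}
```

Keeping a separate "has" flag, as proposed, lets every int32_t value remain usable as an ID instead of reserving one as a sentinel for "unset".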

Additional context
See the field_id in the Parquet Schema Definition

@jlowe added the feature request, Needs Triage, libcudf, cuIO, and Spark labels Mar 1, 2022
@devavret self-assigned this Mar 14, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 15, 2022
Closes #10375
Closes #10376

This PR enables column `field_id` control in the Parquet writer. When writing a Parquet file, users can specify a column's `field_id` via `column_in_metadata.set_parquet_field_id()`. JNI bindings and unit tests are added as well.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Devavret Makkar (https://github.com/devavret)

URL: #10504
@bdice removed the Needs Triage label Mar 4, 2024