Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Parquet writer unable to set nullability to false on columns with no nulls #13010

Closed
vuule opened this issue Mar 25, 2023 · 3 comments
Closed
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@vuule
Copy link
Contributor

vuule commented Mar 25, 2023

Parquet writer seems too stringent with its checks to write columns as not nullable.
The following cases fail to write when set_nullability(false) is called on the input metadata:

  • Column with no nulls;
  • Sliced column that only has nulls outside of the slice.

This behaviors limits the use of the nullable option.

Repro code:

  column_wrapper<int32_t> ages_col{{1, 2, 3}};
  auto valids = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i > 1; });
  cudf::test::structs_column_wrapper col5{{ages_col}, valids};

  table_view expected({col5});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_nullability(false);

  auto expected_slice = cudf::slice(expected, {1, 3});

  auto filepath = temp_env->get_temp_filepath("SlicedTable.parquet");
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected_slice)
      .metadata(&expected_metadata);
  cudf::io::write_parquet(out_opts);

To test the case where the column has no nulls, modify valids to always return true.

@vuule vuule added bug Something isn't working Needs Triage Need team to review and classify and removed Needs Triage Need team to review and classify labels Mar 25, 2023
@vuule vuule self-assigned this Mar 25, 2023
@vuule
Copy link
Contributor Author

vuule commented Mar 25, 2023

Issue partially blocks a Python feature #12952 (part of the PR got reverted in #12996)

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. cuIO cuIO issue labels Apr 2, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 12, 2023
… as non-nullable (#13675)

Issue #7654, #13010

Writers have a strict check for nullability when applying the user metadata's nullability options, because checking the actual number of nulls is (was) not cheap.
Since we now create all columns with a known number of nulls, the `null_count` check became cheap and we have no reason to prevent columns without nulls to be written as non-nullable.
This PR changes the condition to allow this case. 
The PR does not address the issue with sliced columns, where it's not possible to write sliced column as non-nullable, even if the slice has no nulls. That check is still not cheap :)

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13675
@GregoryKimball
Copy link
Contributor

I believe this is closed by #13675

@vuule
Copy link
Contributor Author

vuule commented Feb 15, 2024

@GregoryKimball #13675 does not address the sliced column case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

2 participants