[BUG] new parquet writer code checks for nullable not has_nulls #7654

Closed

revans2 opened this issue Mar 19, 2021 · 13 comments

Labels: bug (Something isn't working) · cuIO (cuIO issue) · libcudf (Affects libcudf (C++/CUDA) code)

Comments

revans2 (Contributor) commented Mar 19, 2021

Describe the bug
New parquet code was added to support writing nested types. This is great, but it broke the Java build. While fixing the Java build I found that the new code checks nullable on all of the columns to see if it matches what was set when the writer was initially configured. But Spark can tell that validity is not needed in some cases where cudf apparently cannot, and cudf will sometimes add a validity column when it is not needed. Because nullable only checks whether a validity mask exists, not whether there are actually any nulls, we can run into a situation where Spark says there will be no nulls but cudf blows up because it thinks that there might be.
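For illustration, a minimal sketch of the distinction at the heart of this issue, using the public `cudf::column_view` accessors (this is not code from the writer itself):

```cpp
// A column can carry a validity mask (nullable() == true) while containing
// zero nulls (has_nulls() == false). The writer checks the former; Spark
// reasons about the latter.
#include <cudf/column/column_view.hpp>

bool writer_check(cudf::column_view const& col) {
  return col.nullable();  // true whenever a validity mask exists
}

bool actual_nulls(cudf::column_view const& col) {
  return col.has_nulls();  // true only when null_count() > 0
}
```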

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Mar 19, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 19, 2021
This fixes the Java build, but it required breaking changes to do it. I'll put up a corresponding change in the rapids plugin shortly.

#7654 was found as a part of this.

This is also not the final API that we will want. We need to redo how we configure the builders so that they can take advantage of the new APIs properly.

Authors:
  - Robert (Bobby) Evans (@revans2)

Approvers:
  - Jason Lowe (@jlowe)
  - Raza Jafri (@razajafri)

URL: #7655
@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 26, 2021
github-actions (bot)

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

razajafri (Contributor)

This is still a valid issue

harrism (Member) commented Apr 26, 2021

@vuule is this on your radar?

vuule (Contributor) commented Apr 26, 2021

CC @devavret

devavret (Contributor)

Do you mean this:

```cpp
bool col_nullable = [&]() {
  if (single_write_mode) {
    return col->nullable();
  } else {
    if (col_meta.is_nullability_defined()) {
      if (col_meta.nullable() == false) {
        CUDF_EXPECTS(
          col->nullable() == false,
          "Mismatch in metadata prescribed nullability and input column nullability. "
          "Metadata for nullable input column cannot prescribe nullability = false");
      }
      return col_meta.nullable();
    } else {
      // For chunked write, when not provided nullability, we assume the worst case
      // scenario that all columns are nullable.
      return true;
    }
  }
}();
```

If so, I'm not sure this is really a bug. We have to check whether there can be nulls; otherwise we'd break downstream in the reader. So it either has to be has_nulls() or nullable(). The former incurs a kernel call for each column, which is why the latter was chosen, with the understanding that if the user knew there wouldn't be nulls, they'd drop the mask before passing it in.

revans2 (Contributor, Author) commented May 25, 2021

Sorry about the delay in the response. I understand why nullable was chosen, but the issue I run into is that it is very easy to have something that is nullable but has no nulls. This happens quite often in Spark. When we analyze the query we can tell that something will never contain a null, but the last cudf call was not able to make the same decision and ended up including a validity column. For example, we could filter out all of the nulls from a column. In Spark we know that all of the nulls are gone, but cudf just called filter on it; it has no knowledge that the boolean mask is correlated with the validity column in any way. It just knows that the input has nulls, so there is the possibility that the output will have nulls, and it needs to allocate a validity mask. If cudf doesn't do this, then I have to figure out a way to walk through all of the columns and child columns and strip out the validity mask for things that I know don't need it, and then I would also want to run a kernel on it to be sure that there was no bug in that code.
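A minimal sketch of the scenario described above, assuming `cudf::apply_boolean_mask` from `<cudf/stream_compaction.hpp>` behaves as described at the time of this discussion: even when the mask removes every null row, the output columns keep a validity mask, so `nullable()` stays true while `null_count()` drops to zero.

```cpp
// Filter out all null rows; the result may still report nullable() == true
// because apply_boolean_mask cannot know the boolean mask is correlated
// with the input's validity column.
#include <cudf/column/column_view.hpp>
#include <cudf/stream_compaction.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <memory>

std::unique_ptr<cudf::table> drop_null_rows(cudf::table_view const& input,
                                            cudf::column_view const& not_null_mask) {
  // Each output column may keep its validity mask even though every
  // surviving row is valid (null_count() == 0).
  return cudf::apply_boolean_mask(input, not_null_mask);
}
```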

vuule (Contributor) commented May 25, 2021

IMO we can think of this as a perf vs. memory trade-off. For ORC/Parquet ATM we generally lean toward reducing memory use, so it would make sense to accept a slight perf overhead to avoid allocating the null masks.
I've just realized that the memory-use impact would be ~1%, so I'm not sure my reasoning stands. Still, for your consideration :)
We could also make it a writer option, but that seems like overkill to me.

devavret (Contributor)

@revans2 Then how about a libcudf API (maybe called prune_masks) that takes a table and removes the null masks from all columns that have no nulls? This API could be optimized to call a single kernel for the whole table rather than a separate kernel per column. Filter your tables through it before passing them to anything; it would be a one-stop optimized solution for all cuIO writers and more (joins, etc.).

@vuule This doesn't need extra memory because this is the writer. It might create a slightly larger file, but if there are no nulls then all the definition values will be 1, and running them through RLE will all but remove them. Think 8 bytes per 1 MB page.
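For illustration, a per-column sketch of what the proposed (hypothetical) prune_masks could look like; the single-kernel batching suggested above and recursion into child columns of nested types are left out.

```cpp
// Hypothetical sketch of prune_masks: drop the validity mask from any owned
// column that is nullable but contains no nulls. A real implementation would
// fuse the per-column checks into one kernel and recurse into children.
#include <cudf/column/column.hpp>
#include <rmm/device_buffer.hpp>
#include <memory>
#include <vector>

void prune_masks(std::vector<std::unique_ptr<cudf::column>>& columns) {
  for (auto& col : columns) {
    if (col->nullable() && col->null_count() == 0) {
      col->set_null_mask(rmm::device_buffer{}, 0);  // release the unused mask
    }
  }
}
```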

revans2 (Contributor, Author) commented May 26, 2021

I think a prune_masks would be fine. To be clear, my concern is not about the size of the file or how much memory/time it takes to compute; it is about matching Spark's output. This is not a high priority right now, because if the data is read back in by Spark it will still treat the column as nullable even if all of the input files say that it is not nullable. I am less sure about other tools, so I want to match Spark's output as closely as possible.

github-actions (bot) commented Feb 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

jlowe (Contributor) commented Jul 5, 2023

This issue is still relevant; I ran into it again recently. It seems related to #13010.

Note that now that null counts are required to be computed at column construction (per the recent null_count changes), there is no longer a potential kernel-launch cost to checking has_nulls. That seems like the preferable choice: allow columns with no nulls (validity buffer or not) to satisfy a schema calling for no nulls.
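A sketch of the check this suggests, assuming the post-change API where `null_count()` is precomputed at construction:

```cpp
// With null counts computed at column construction, has_nulls() is a cached
// read, so a column with a validity buffer but zero nulls can satisfy a
// schema that prescribes no nulls.
#include <cudf/column/column_view.hpp>

bool meets_non_nullable_schema(cudf::column_view const& col) {
  return !col.has_nulls();  // cheap now: null_count() is precomputed
}
```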

vuule (Contributor) commented Jul 7, 2023

Opened #13675 to fix this limitation.
@jlowe The improvement does not cover sliced columns, is that needed for your use case?

jlowe (Contributor) commented Jul 10, 2023

> The improvement does not cover sliced columns, is that needed for your use case?

The Java cudf API currently does not provide a way to zero-copy slice columns (it has an interface to slice, but it copies the result to separate columns). Therefore we should be fine with that limitation.

rapids-bot bot pushed a commit that referenced this issue Jul 12, 2023
… as non-nullable (#13675)

Issue #7654, #13010

Writers have a strict check for nullability when applying the user metadata's nullability options, because checking the actual number of nulls was not cheap at the time.
Since we now create all columns with a known number of nulls, the `null_count` check became cheap, and there is no reason to prevent columns without nulls from being written as non-nullable.
This PR changes the condition to allow this case.
The PR does not address the issue with sliced columns, where it's not possible to write a sliced column as non-nullable even if the slice has no nulls. That check is still not cheap :)
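A sketch of the relaxed condition (illustrative only, not the exact diff from #13675):

```cpp
// When metadata prescribes a non-nullable column, require only that the
// column actually contains no nulls, rather than that it has no validity
// mask at all.
#include <cudf/column/column_view.hpp>
#include <cudf/utilities/error.hpp>

void enforce_non_nullable(cudf::column_view const& col) {
  CUDF_EXPECTS(col.null_count() == 0,
               "Metadata prescribes nullability = false, but the input column "
               "contains nulls");
}
```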

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13675