[BUG] new parquet writer code checks for nullable, not has_nulls #7654
Comments
This fixes the Java build, but it required breaking changes to do it. I'll put up a corresponding change in the rapids plugin shortly. #7654 was found as a part of this. This is also not the final API that we will want. We need to redo how we configure the builders so that they can take advantage of the new APIs properly.

Authors:
- Robert (Bobby) Evans (@revans2)

Approvers:
- Jason Lowe (@jlowe)
- Raza Jafri (@razajafri)

URL: #7655
This issue has been labeled
This is still a valid issue.
@vuule is this on your radar?
CC @devavret
Do you mean this: `cudf/cpp/src/io/parquet/writer_impl.cu`, lines 388 to 406 at 3c050bb?
If so, I'm not sure this is really a bug. We have to check whether there can be nulls, otherwise we'd break downstream in the reader. So it either has to be
Sorry about the delay in responding. I understand why `nullable` was chosen, but the issue I run into is that it is very easy to have a column that is nullable but has no nulls. This happens quite often in Spark: when we analyze the query we can tell that a column will never contain a null, but the last cudf call was not able to make the same decision and ended up including a validity column. For example, we could filter out all of the nulls from a column. In Spark we know that all of the nulls are gone, but cudf just called
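To illustrate the distinction being discussed, here is a minimal sketch (not code from this thread), assuming the libcudf `column_view` and stream-compaction APIs:

```cpp
// Minimal sketch (not from the thread): a column can report nullable() == true
// even when it contains no nulls, e.g. after filtering the nulls out.
#include <cudf/column/column_view.hpp>
#include <cudf/stream_compaction.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

bool nullable_but_null_free(cudf::column_view const& col)
{
  // drop_nulls removes the null rows, but the result typically keeps a
  // validity mask, so nullable() stays true while null_count() drops to 0.
  auto filtered = cudf::drop_nulls(cudf::table_view({col}), {0});
  auto result   = filtered->view().column(0);
  return result.nullable() && result.null_count() == 0;
}
```

A writer check based on `nullable()` rejects such a column even though it would serialize identically to a truly non-nullable one.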
IMO we can think of this as a perf vs. memory trade-off. For ORC/Parquet, ATM we generally lean on the side of reducing memory use, so it would make sense to have a slight perf overhead to avoid allocating the null masks.
@revans2 Then how about a libcudf API (maybe called

@vuule This doesn't need extra memory because this is the writer. It might create a slightly larger file. But if there are no nulls then all the definition values will be 1, and running them through RLE will all but remove them. Think 8 bytes per 1 MB page.
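To make the size argument concrete, here is a toy illustration (not the actual Parquet encoder; the real RLE/bit-packing hybrid differs in detail, but it behaves the same way for a constant stream):

```cpp
// Illustration only: a toy run-length encoding of definition levels.
// For a null-free column every definition level is 1, so an entire page
// collapses to a single run -- a few bytes, independent of row count.
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<uint32_t, uint8_t>> rle_encode(std::vector<uint8_t> const& def_levels)
{
  std::vector<std::pair<uint32_t, uint8_t>> runs;  // (run length, value)
  for (auto level : def_levels) {
    if (!runs.empty() && runs.back().second == level) {
      ++runs.back().first;
    } else {
      runs.emplace_back(1u, level);
    }
  }
  return runs;  // all-ones input -> one run, e.g. {(1'000'000, 1)}
}
```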
I think a
This issue has been labeled
This issue is still relevant; I ran into it again recently. Seems related to #13010. Note that now that null counts are required to be computed on column construction (per the recent `null_count` changes), there's no longer a potential kernel-launch cost to checking `has_nulls`. Seems like that would be the preferable choice: allow columns with no nulls (validity buffer or not) to meet the criteria of a schema calling for no nulls.
The Java cudf API currently does not provide a way to zero-copy slice columns (it has an interface to slice, but it copies the result to separate columns). Therefore we should be fine with that limitation.
… as non-nullable (#13675) Issue #7654, #13010

Writers have a strict check for nullability when applying the user metadata's nullability options, because checking the actual number of nulls was not cheap. Since we now create all columns with a known number of nulls, the `null_count` check became cheap and we have no reason to prevent columns without nulls from being written as non-nullable. This PR changes the condition to allow this case.

The PR does not address the issue with sliced columns, where it's not possible to write a sliced column as non-nullable even if the slice has no nulls. That check is still not cheap :)

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Mark Harris (https://github.com/harrism)
- Nghia Truong (https://github.com/ttnghia)

URL: #13675
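A rough sketch of the kind of condition change described, using a hypothetical helper rather than the actual `writer_impl.cu` code:

```cpp
// Hypothetical helper sketching the condition change described above
// (not the actual writer_impl.cu code). The writer validates a column
// against user metadata that declares it non-nullable.
#include <cudf/column/column_view.hpp>
#include <stdexcept>

void check_non_nullable(cudf::column_view const& col)
{
  // Before: reject any column that merely *could* hold nulls.
  // if (col.nullable()) { throw std::invalid_argument("column is nullable"); }

  // After: only reject columns that actually contain nulls. null_count() is
  // stored at construction time, so this check is as cheap as nullable().
  if (col.null_count() > 0) {
    throw std::invalid_argument("column with nulls cannot be written as non-nullable");
  }
}
```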
Describe the bug
New parquet code was added to support writing nested types. This is great, but it broke the Java build. As part of fixing the Java build I found that the new code checks `nullable` on all of the columns to see if it matches what was set when the writer was initially configured. But Spark can tell that validity is not needed in some cases where cudf apparently cannot, and cudf will add a validity column in some cases when it is not needed. Because `nullable` only checks whether there is a validity column, and not whether there are actually any nulls, we can run into a situation where Spark tells us that there will be no nulls, but cudf blows up because it thinks that there might be.
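A rough illustration of the reported scenario; this assumes a recent libcudf writer API (`table_input_metadata` / `set_nullability`), which may differ from the API at the time of the issue:

```cpp
// Rough illustration of the reported scenario, assuming a recent libcudf
// parquet writer API; details may differ from the API at the time of the issue.
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>
#include <string>

void write_non_nullable(cudf::table_view const& table, std::string const& path)
{
  // Spark has already proven column 0 contains no nulls, so it declares the
  // schema non-nullable -- even though the cudf column may still carry a
  // validity mask (nullable() == true, null_count() == 0).
  cudf::io::table_input_metadata metadata{table};
  metadata.column_metadata[0].set_nullability(false);

  auto options = cudf::io::parquet_writer_options::builder(
                   cudf::io::sink_info{path}, table)
                   .metadata(metadata)
                   .build();

  // With the original nullable()-based check this throws even though the
  // column holds zero nulls; with a null_count()-based check it succeeds.
  cudf::io::write_parquet(options);
}
```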