[BUG] new parquet writer code checks for nullable not has_nulls #7654

Closed

revans2 opened this issue Mar 19, 2021 · 13 comments

Labels: bug (Something isn't working) · cuIO (cuIO issue) · libcudf (Affects libcudf (C++/CUDA) code)

Comments

revans2 (Contributor) commented Mar 19, 2021

Describe the bug
New parquet code was added to support writing nested types. This is great, but it broke the Java build. While fixing the Java build I found that the new code checks nullable on all of the columns to see if it matches what was set when the writer was initially configured. But Spark can tell that validity is not needed in some cases where cudf apparently cannot, and cudf will sometimes add a validity column when it is not needed. Because nullable only checks whether a validity mask exists, not whether there are actually any nulls, we can run into a situation where Spark says there will be no nulls but cudf blows up because it thinks that there might be.
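For illustration, a minimal sketch of the distinction at the heart of this issue, using the public `cudf::column_view` accessors (this is not code from the writer itself):

```cpp
// A column can carry a validity mask (nullable() == true) while containing
// zero nulls (has_nulls() == false). The writer checks the former; Spark
// reasons about the latter.
#include <cudf/column/column_view.hpp>

bool writer_check(cudf::column_view const& col) {
  return col.nullable();  // true whenever a validity mask exists
}

bool actual_nulls(cudf::column_view const& col) {
  return col.has_nulls();  // true only when null_count() > 0
}
```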

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Mar 19, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 19, 2021
This fixes the Java build, but it required breaking changes to do it. I'll put up a corresponding change in the rapids plugin shortly.

#7654 was found as a part of this.

This is also not the final API that we will want. We need to redo how we configure the builders so that they can take advantage of the new APIs properly.

Authors:
  - Robert (Bobby) Evans (@revans2)

Approvers:
  - Jason Lowe (@jlowe)
  - Raza Jafri (@razajafri)

URL: #7655
@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 26, 2021
github-actions (bot)

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

razajafri (Contributor)

This is still a valid issue

harrism (Member) commented Apr 26, 2021

@vuule is this on your radar?

vuule (Contributor) commented Apr 26, 2021

CC @devavret

devavret (Contributor)

Do you mean this:

```cpp
bool col_nullable = [&]() {
  if (single_write_mode) {
    return col->nullable();
  } else {
    if (col_meta.is_nullability_defined()) {
      if (col_meta.nullable() == false) {
        CUDF_EXPECTS(
          col->nullable() == false,
          "Mismatch in metadata prescribed nullability and input column nullability. "
          "Metadata for nullable input column cannot prescribe nullability = false");
      }
      return col_meta.nullable();
    } else {
      // For chunked write, when not provided nullability, we assume the worst case
      // scenario that all columns are nullable.
      return true;
    }
  }
}();
```

If so, I'm not sure this is really a bug. We have to check whether there can be nulls; otherwise we'd break downstream in the reader. So it either has to be has_nulls() or nullable(). The former incurs a kernel call for each column, which is why the latter was chosen, with the understanding that if the user knew there wouldn't be nulls, they'd drop the mask before passing it in.

revans2 (Contributor, Author) commented May 25, 2021

Sorry about the delay in the response. I understand why nullable was chosen, but the issue I run into is that it is very easy to have something that is nullable but has no nulls. This happens quite often in Spark. When we analyze the query we can tell that something will never contain a null, but the last cudf call was not able to make the same decision and ended up including a validity column. For example, we could filter out all of the nulls from a column. In Spark we know that all of the nulls are gone, but cudf just called filter on it; it has no knowledge that the boolean mask is correlated with the validity column in any way. It just knows that the input has nulls, so there is the possibility that the output will have nulls, and it needs to allocate a validity mask. If cudf doesn't do this, then I have to figure out a way to walk through all of the columns and child columns and strip out the validity mask for things that I know don't need it, and then I would also want to run a kernel on it to be sure that there was no bug in that code.
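A minimal sketch of the scenario described above, assuming `cudf::apply_boolean_mask` from `<cudf/stream_compaction.hpp>` behaves as described at the time of this discussion: even when the mask removes every null row, the output columns keep a validity mask, so `nullable()` stays true while `null_count()` drops to zero.

```cpp
// Filter out all null rows; the result may still report nullable() == true
// because apply_boolean_mask cannot know the boolean mask is correlated
// with the input's validity column.
#include <cudf/column/column_view.hpp>
#include <cudf/stream_compaction.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <memory>

std::unique_ptr<cudf::table> drop_null_rows(cudf::table_view const& input,
                                            cudf::column_view const& not_null_mask) {
  // Each output column may keep its validity mask even though every
  // surviving row is valid (null_count() == 0).
  return cudf::apply_boolean_mask(input, not_null_mask);
}
```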

vuule (Contributor) commented May 25, 2021

IMO we can think of this as a perf vs. memory trade-off. For ORC/Parquet ATM we generally lean toward reducing memory use, so it would make sense to accept a slight perf overhead to avoid allocating the null masks.
I've just realized that the memory-use impact would be ~1%, so I'm not sure my reasoning stands. Still, for your consideration :)
We could also make it a writer option, but that seems like overkill to me.

devavret (Contributor)

@revans2 Then how about a libcudf API (maybe called prune_masks) that takes a table and removes the null masks from all columns that have no nulls? This API could be optimized to call a single kernel for the whole table rather than a separate kernel per column. Filter your tables through it before passing them to anything; it would be a one-stop optimized solution for all cuIO writers and more (joins, etc.).

@vuule This doesn't need extra memory because this is the writer. It might create a slightly larger file, but if there are no nulls then all the definition values will be 1, and running them through RLE will all but remove them. Think 8 bytes per 1 MB page.
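For illustration, a per-column sketch of what the proposed (hypothetical) prune_masks could look like; the single-kernel batching suggested above and recursion into child columns of nested types are left out.

```cpp
// Hypothetical sketch of prune_masks: drop the validity mask from any owned
// column that is nullable but contains no nulls. A real implementation would
// fuse the per-column checks into one kernel and recurse into children.
#include <cudf/column/column.hpp>
#include <rmm/device_buffer.hpp>
#include <memory>
#include <vector>

void prune_masks(std::vector<std::unique_ptr<cudf::column>>& columns) {
  for (auto& col : columns) {
    if (col->nullable() && col->null_count() == 0) {
      col->set_null_mask(rmm::device_buffer{}, 0);  // release the unused mask
    }
  }
}
```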

revans2 (Contributor, Author) commented May 26, 2021

I think a prune_masks would be fine. To be clear, my concern is not about the size of the file or how much memory/time it takes to compute; it is about matching Spark's output. This is not a high priority right now, because if the data is read back in by Spark it will still treat the column as nullable even if all of the input files say that it is not nullable. I am less sure about other tools, so I want to match Spark's output as closely as possible.

github-actions (bot) commented Feb 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

jlowe (Contributor) commented Jul 5, 2023

This issue is still relevant; I ran into it again recently. It seems related to #13010.

Note that now that null counts are required to be computed at column construction (per the recent null_count changes), there is no longer a potential kernel-launch cost to checking has_nulls. That seems like the preferable choice: allow columns with no nulls (validity buffer or not) to satisfy a schema calling for no nulls.
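A sketch of the check this suggests, assuming the post-change API where `null_count()` is precomputed at construction:

```cpp
// With null counts computed at column construction, has_nulls() is a cached
// read, so a column with a validity buffer but zero nulls can satisfy a
// schema that prescribes no nulls.
#include <cudf/column/column_view.hpp>

bool meets_non_nullable_schema(cudf::column_view const& col) {
  return !col.has_nulls();  // cheap now: null_count() is precomputed
}
```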

vuule (Contributor) commented Jul 7, 2023

Opened #13675 to fix this limitation.
@jlowe The improvement does not cover sliced columns, is that needed for your use case?

jlowe (Contributor) commented Jul 10, 2023

> The improvement does not cover sliced columns, is that needed for your use case?

The Java cudf API currently does not provide a way to zero-copy slice columns (it has an interface to slice, but it copies the result to separate columns). Therefore we should be fine with that limitation.

rapids-bot bot pushed a commit that referenced this issue Jul 12, 2023
… as non-nullable (#13675)

Issue #7654, #13010

Writers have a strict check for nullability when applying the user metadata's nullability options, because checking the actual number of nulls was not cheap at the time.
Since we now create all columns with a known number of nulls, the `null_count` check became cheap, and there is no reason to prevent columns without nulls from being written as non-nullable.
This PR changes the condition to allow this case.
The PR does not address the issue with sliced columns, where it's not possible to write a sliced column as non-nullable even if the slice has no nulls. That check is still not cheap :)
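A sketch of the relaxed condition (illustrative only, not the exact diff from #13675):

```cpp
// When metadata prescribes a non-nullable column, require only that the
// column actually contains no nulls, rather than that it has no validity
// mask at all.
#include <cudf/column/column_view.hpp>
#include <cudf/utilities/error.hpp>

void enforce_non_nullable(cudf::column_view const& col) {
  CUDF_EXPECTS(col.null_count() == 0,
               "Metadata prescribes nullability = false, but the input column "
               "contains nulls");
}
```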

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13675