Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add in asserts when column views are created to verify that array nulls are empty #5430

Closed
revans2 opened this issue May 5, 2022 · 4 comments · Fixed by #8517
Closed
Assignees
Labels
P1 Nice to have for release reliability Features to improve reliability or bugs that severly impact the reliability of the plugin task Work required that improves the product but is not user facing

Comments

@revans2
Copy link
Collaborator

revans2 commented May 5, 2022

There is some upcoming work in CUDF where to be able to do some complicated operations correctly on arrays/lists we need to be 100% positive that if a list is null then the start and end offsets for that row point to the same index. i.e. a null list must be backed by an empty list. Arrow does not have this same requirement. It is expensive to always check that this is true and it is also expensive to fix it when it is not true, so CUDF has decided to take the route of assuming that it will always be true for these operations and making sure that this will be true for the output of any operation from CUDF so long as the input to it also follows those guidelines.

Now we need to make sure that we are always going to follow those guidelines. We plan on doing this by having an check that can be enabled/disabled for testing every time we create column_view/column of lists and throw an exception if it is not true. CUDF plans to do some of this themselves to make sure that they can guarantee that their processing does the right thing. Hopefully we can use their same functionality when we run tests. If not, then we are going to need to add something ourselves that can be turned on or off when the JVM starts up to do these same kinds of checks.

Then we need to fix anywhere that we find issues. I don't know how slow this is going to be for testing. If it is too slow we may only do it nightly.

At a minimum we are also going to have to explicitly check/clean up any data we get externally, if CUDF does not end up doing it for us. This is any data we get from arrow and just import in ourselves.

@revans2 revans2 added ? - Needs Triage Need team to review and classify task Work required that improves the product but is not user facing labels May 5, 2022
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 10, 2022
@sameerz sameerz added the P1 Nice to have for release label Jul 19, 2022
@sameerz
Copy link
Collaborator

sameerz commented Sep 27, 2022

This is a prerequisite to sorting nesting lists.

@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Jan 10, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Jan 12, 2023

Now that some time has passed I found time to look at the CUDF implementations of the checks/fixes.

Inside of cudf/copying.hpp there are three new APIs

We should add CUDF APIs for all of these APIs so we can use them as needed. The problem is with fixing and finding issues. Ideally we would have an assertion when we create a ColumnView or ColumnVector that would throw if has_nonempty_nulls returned true. The problem is that once we turn this on we need a way to disable it temporarily to let tests pass, at least until we can fix all of the places that we have issues. There is also the problem of explicitly allowing specific APIs to create a bad ColumnView (not ColumnVector) instances if we decide that for performance reasons it is okay to do this in the short term and then fix it with purge_nonempty_nulls afterwards.

I think to make all of this work we need to have the assert live in a separate class. this is because the java command line arguments -ea and -da can take specific packages and classes as arguments, but not specific functions within a class. This way we can turn them on and off separate from all other assertions.

As for having an API that lets us create a bad ColumnView. It think we can tackle that when we come to it. I can see us creating a special factory method separate from the constructor or a separate class all together that would disable specific operations that try to do comparisons. But I am not really happy with either so I would prefer that we don't implement either of them up front.

As a side note I did a quick test doing a hacked up version of this (no JNI, I just changed an API that gets used in many places when we create a ColumnVector/ColumnView, but not all of them) and I found that the following tests fail. There are likely more that would fail if my change covered all ColumnViews and ColumnVectors.

FAILED ../../src/main/python/cast_test.py::test_cast_nested[Array(Struct(('a', Byte),('b', Byte)))-ArrayType(StringType(), True)]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[false-Array(Map(Byte(not_null),Date))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[true-Array(Map(Byte(not_null),Date))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[false-Array(Struct(['child0', Byte],['child1', String],['child2', Date]))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[false-Array(Array(Short))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[true-Array(Array(Null))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[true-Array(Struct(['child0', Byte],['child1', String],['child2', Date]))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[false-Array(Array(Null))]
FAILED ../../src/main/python/cast_test.py::test_cast_array_to_string[true-Array(Array(Short))]

We might also want to add in calls to purge before we do some critical operations, like sort, aggregations, upper bound, lower bound, and comparisons. But this should only be done if we add in support for nested types that include LISTs as the key that we are doing a comparison on before we have completed cleaning up all of the operators.

We also need to be careful about getting data back from python as this is not an arrow requirement. We might need to call purge on anything returned from the arrow APIs.

@ttnghia
Copy link
Collaborator

ttnghia commented Jun 28, 2023

@razajafri Is there anything we need to do to close this?

@razajafri
Copy link
Collaborator

There is a PR for this that depends on a cudf issue that I have filed today rapidsai/cudf#13638

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Nice to have for release reliability Features to improve reliability or bugs that severly impact the reliability of the plugin task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants