[FEA] Add in asserts when column views are created to verify that array nulls are empty #5430
Comments
This is a prerequisite to sorting nested lists.
Now that some time has passed I found time to look at the CUDF implementations of the checks/fixes. Inside of cudf/copying.hpp there are three new APIs [...]

We should add CUDF APIs for all of these APIs so we can use them as needed. The problem is with fixing and finding issues. Ideally we would have an assertion when we create a [...]

I think to make all of this work we need to have the assert live in a separate class. This is because the Java command line arguments [...]

As for having an API that lets us create a bad [...]

As a side note I did a quick test doing a hacked up version of this (no JNI, I just changed an API that gets used in many places when we create a ColumnVector/ColumnView, but not all of them) and I found that the following tests fail. There are likely more that would fail if my change covered all ColumnViews and ColumnVectors.
We might also want to add in calls to purge before we do some critical operations, like sort, aggregations, upper bound, lower bound, and comparisons. But this should only be done if we add support for nested types that include LISTs in the comparison key before we have finished cleaning up all of the operators. We also need to be careful about data we get back from Python, as this is not an Arrow requirement. We might need to call purge on anything returned from the Arrow APIs.
@razajafri Is there anything we need to do to close this?
There is a PR for this that depends on a cudf issue that I filed today: rapidsai/cudf#13638
There is some upcoming work in CUDF where, to be able to do some complicated operations correctly on arrays/lists, we need to be 100% positive that if a list is null then the start and end offsets for that row point to the same index, i.e. a null list must be backed by an empty list. Arrow does not have this requirement. It is expensive to always check that this is true, and it is also expensive to fix it when it is not, so CUDF has decided to assume that it will always be true for these operations, and to make sure that it is true for the output of any CUDF operation so long as the input also follows those guidelines.
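To make the requirement concrete, here is a minimal sketch of the check in plain Python. It models a list column as an offsets array plus a per-row validity list; it is not the CUDF API, and the function name `non_empty_null_rows` is made up for illustration.

```python
from typing import List, Sequence


def non_empty_null_rows(offsets: Sequence[int], validity: Sequence[bool]) -> List[int]:
    """Return the indices of null rows whose offsets still span child elements.

    Plain-Python model only: offsets[i]..offsets[i + 1] is the child range of
    row i, and validity[i] is False when row i is null.
    """
    if len(offsets) != len(validity) + 1:
        raise ValueError("offsets must have exactly one more entry than validity")
    return [
        row
        for row, is_valid in enumerate(validity)
        if not is_valid and offsets[row] != offsets[row + 1]
    ]


# Row 1 is null but its offsets span child elements 2..5, so it is reported.
violations = non_empty_null_rows([0, 2, 5, 5], [True, False, True])

# Row 1 is null and its start and end offsets match, so nothing is reported.
clean = non_empty_null_rows([0, 2, 2, 5], [True, False, True])
```

An empty result means every null row is backed by an empty list, which is the condition described above.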
Now we need to make sure that we always follow those guidelines. We plan to do this with a check, which can be enabled or disabled for testing, that runs every time we create a column_view/column of lists and throws an exception if the guideline is violated. CUDF plans to do some of this itself so that it can guarantee that its processing does the right thing. Hopefully we can use that same functionality when we run tests. If not, then we are going to need to add something ourselves that can be turned on or off when the JVM starts up to do these same kinds of checks.
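A rough sketch of what a switchable creation-time check could look like, written in Python rather than Java to keep it short. Everything here is an assumption for illustration: the `ListColumnView` class is a toy stand-in, and the `LIST_NULL_CHECK_ENABLED` environment variable plays the role of whatever startup flag we would actually use.

```python
import os
from typing import Mapping, Optional, Sequence

# Hypothetical switch name; stands in for a flag read once at startup.
CHECK_ENV_VAR = "LIST_NULL_CHECK_ENABLED"


def checks_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the switch is explicitly set to "true"."""
    return environ.get(CHECK_ENV_VAR, "false").lower() == "true"


class ListColumnView:
    """Toy stand-in for a list column view: offsets plus per-row validity."""

    def __init__(self, offsets: Sequence[int], validity: Sequence[bool],
                 check: Optional[bool] = None) -> None:
        self.offsets = list(offsets)
        self.validity = list(validity)
        # Fall back to the process-wide switch when the caller does not decide.
        if check is None:
            check = checks_enabled()
        if check:
            self._assert_null_rows_empty()

    def _assert_null_rows_empty(self) -> None:
        for row, is_valid in enumerate(self.validity):
            start, end = self.offsets[row], self.offsets[row + 1]
            if not is_valid and start != end:
                raise AssertionError(
                    f"null list at row {row} spans offsets {start}..{end}"
                )
```

With the check off, construction costs nothing extra, which matters if the check turns out to be too slow to run outside of nightly tests.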
Then we need to fix any place where we find issues. I don't know how much this is going to slow down testing. If it is too slow we may only run it nightly.
At a minimum we are also going to have to explicitly check/clean up any data we get externally, if CUDF does not end up doing it for us. This means any data we get from Arrow and import ourselves.
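For the clean-up side, a minimal plain-Python sketch of what purging amounts to: rebuild the offsets and child data so that every null row becomes empty. This only models the idea on Python lists; it is not the CUDF implementation, and the name `purge_non_empty_nulls` is used here purely for illustration.

```python
from typing import List, Sequence, Tuple


def purge_non_empty_nulls(
    offsets: Sequence[int],
    validity: Sequence[bool],
    child: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Rebuild offsets and child data so that every null row is empty.

    Valid rows keep their child elements in order; the elements that were
    hidden behind null rows are dropped.
    """
    new_offsets = [0]
    new_child: List[int] = []
    for row, is_valid in enumerate(validity):
        if is_valid:
            new_child.extend(child[offsets[row]:offsets[row + 1]])
        # For a null row nothing was appended, so start == end for that row.
        new_offsets.append(len(new_child))
    return new_offsets, new_child


# Row 1 is null but hides child elements 3, 4, 5 behind it.
fixed_offsets, fixed_child = purge_non_empty_nulls(
    [0, 2, 5, 6], [True, False, True], [1, 2, 3, 4, 5, 6]
)
# fixed_offsets == [0, 2, 2, 3] and fixed_child == [1, 2, 6]
```

Note that this copies the child data, which is why it is expensive to do unconditionally and why it makes sense to limit it to data imported from outside.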