-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added documentation for assertion on non-empty nulls [skip ci] #8271
Conversation
Signed-off-by: Raza Jafri <[email protected]>
Signed-off-by: Raza Jafri <[email protected]>
docs/compatibility.md
Outdated
@@ -15,6 +15,14 @@ problem. Please also look at the current list of | |||
[bugs](https://github.com/NVIDIA/spark-rapids/issues?q=is%3Aopen+is%3Aissue+label%3Abug) which are | |||
typically incompatibilities that we have not yet addressed. | |||
|
|||
## Non-empty nulls | |||
|
|||
The SQL plugin doesn't support nested types to have non-empty nulls and if data being processed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure that this belongs to compatibility.md
The reason is that this is internal and only for the tests (actually this text doesn't mention testing.. I would make sure we qualify that this is only for tests).
Would CONTRIBUTING.md
be a better fit for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't internal. Anyone passing data through our plugin that contains non-empty nulls will see this AssertionError
unless they turn off assertions using the above mentioned Java command line parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can the user control this???? Assertions are also off by default in java. You have to enable them with -ea
or something similar. This is not a user facing problem.
If they pass in arrow data and we get an error because of this, then we have a BUG in our arrow translation. If they use the java APIs directly to write UDFs, then I can see it, but it should be a part of the docs for the CUDF java APIs. Not here, and especially not as the first thing in the compatibility docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are asking two questions here.
- Is adding this documentation needed at all?
I think it does until we have fixed the mergeAndSetValidity
method and everywhere else where we can be introducing non-empty nulls. I wasn't sure if my fix for mergeAndSetValidity
will make it in before 23.06 is released and until it does I imagine a user who has enabled assertions and consuming arrow data will see this.
- Where does this belong?
I wasn't sure where to put it, I think it belongs in compatibility because it has something to do with consuming external data which can be considered valid by some 3rd party tools like Arrow. We can put it wherever we have a consensus.
Another question that arises is if the user doesn't turn on asserts how does the user know that this could result in bad results if they consume from e.g. Arrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertions are off by default. The only time that a user would see this is if they are running unit tests against our code or enable assertions themselves for some reason and happen to run into one of these problems.
I think this needs to be in an FAQ and not compatibility. This is not a difference between Apache Spark with and without our plugin. Turning on our plugin will still produce the same results as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be in an FAQ
I would argue this doesn't meet the "frequently" part of FAQ. Agree with Bobby that the plugin will do the same thing as Apache Spark here.
What, specifically, are we worried about here that users really need to know about? What is the use case where users would encounter it? If there are no use cases, then I don't see why we need to put this in user-facing documentation.
@@ -15,6 +15,14 @@ problem. Please also look at the current list of | |||
[bugs](https://github.com/NVIDIA/spark-rapids/issues?q=is%3Aopen+is%3Aissue+label%3Abug) which are | |||
typically incompatibilities that we have not yet addressed. | |||
|
|||
## Non-empty nulls | |||
|
|||
The SQL plugin doesn't support nested types to have non-empty nulls and if the data being processed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"non-empty nulls" sounds cryptic. Can you rephrase or illustrate with an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. I can't think of another word, let me explain with an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not something that user needs to worry about or we have done our job incorrectly (aka we have a bug). The user has no control over non-empty nulls. So telling them about this is confusing. The assertions should also be off by default in production and only enabled during our tests. This documentation should be part of developer docs. Not here.
This doc change is not needed as assertions are disabled by default |
This PR attempts at updating documentation to add recently added assertion on non-empty.
fixes #8238