-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: renamed modulus to modulo; updated modulo operator defintion #583
Conversation
Renamed modulus to modulo as modulus is the length of a vector and this function describes calculating the remainder after interger division. Added options and documentation for the modulo operator as it is not consistently defined across mathematics and results are implementation/patform dependent. Refs: substrait-io#353
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just |
I was wondering the same. Not exactly query engines, but this shows what different programming languages do: https://en.wikipedia.org/wiki/Modulo#In_programming_languages |
Since this is a breaking change, we'll need 2 SMC approvals |
Per the Wikipedia link, I believe that ANSI SQL uses truncate division. So, presumably, any query engine compliant with the specification would use Truncate. However, the Python % operator utilizes Floored division, while the Rust % Operator utilizes Truncated division. Therefore, care would need to be taken to ensure that a Substrait producer written in Python and a Substrait consumer written in Rust consistently perform the Modulo operation. |
Thanks for the example cases. Rust, datafusion, sqlite, postgres, sql server, mysql, spark, and velox all use truncate. Let's update the PR to limit the choices to truncate and floor. Given there are no systems that use ceil/round/euclidian I think the extra information would be confusing to users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want on_domain_error
then we will need to add floating point implementations. Also see my change from my non-review comment about removing ceiling
, round
, and euclidian
.
extensions/functions_arithmetic.yaml
Outdated
on_domain_error: | ||
values: [ NAN, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is for i8
and on_domain_error
does not make sense here.
on_domain_error: | |
values: [ NAN, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think modulo is undefined for a divisor of 0 and args of +/- inf.
extensions/functions_arithmetic.yaml
Outdated
on_domain_error: | ||
values: [ NAN, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on_domain_error: | |
values: [ NAN, ERROR ] |
… EUCLIDIAN, and ROUND
Thanks for the feedback. I just have a couple questions.
Because of how drastic the results can be when performing these different types of modulo, I think the distinctions are important for Data Science, Statistics, and Machine Learning applications. For example, a Data Scientist might specifically call for a Ceil Modulo to be performed from their python data frame library with real implications for the results. |
Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used The current PR's proposed This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).
Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer |
Ok, I see what you are saying about NaN not being relevant for int types. Thank you for clarifying that. However, I think there is a difference between overflow and out-of-domain for modulo. Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right? I could also imagine an argument for MAX_INT being a return value in these cases.
I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra. |
Sadly, it seems we are not so lucky. Postgres, Datafusion, SQL server, velox raise an error Even ignoring the invalid case we still have to consider error and null options.
This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307. If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work. |
Does this mean that a field to specify domain errors is warranted separately from the overflow error we have discussed?
Per the referenced topic it seems like modulo is exactly the pedantic case that @jvanstraten mentions.
I completely understand. I am relatively new to contributing to open-source. My attempt to resolve the divide-by-zero behavior highlighted these in nuances modulo. Would it then make sense to bring this aspect of the discussion somewhere else? |
For now how does this look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep both overflow
and on_domain_error
but change NAN
to NULL
since we are dealing with integer kernels which cannot represent NAN
.
extensions/functions_arithmetic.yaml
Outdated
overflow: | ||
values: [ SILENT, SATURATE, ERROR ] | ||
on_domain_error: | ||
values: [ NAN, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values: [ NAN, ERROR ] | |
values: [ NULL, ERROR ] |
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Okay. Thank you! |
extensions/functions_arithmetic.yaml
Outdated
values: [ TRUNCATE, FLOOR ] | ||
overflow: | ||
values: [ SILENT, SATURATE, ERROR ] | ||
on_domain_error: values: [ NULL, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this should be on two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My Bad Thanks.
It seems like NULL is not valid for the YAML Linter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be an annoying yaml / yaml-validator gotcha. NULL
is being interpreted as the data type "null" and not as a string.
extensions/functions_arithmetic.yaml
Outdated
overflow: | ||
values: [ SILENT, SATURATE, ERROR ] | ||
on_domain_error: | ||
values: [ NULL, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values: [ NULL, ERROR ] | |
values: [ "NULL", ERROR ] |
The one thing we haven't talked about is whether the name change justifies breaking backwards compatibility. We have a community meeting coming up on Wednesday and I think we should bring this up there. |
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference. |
If you're comfortable with the old name then let's keep it. There are systems out there which are relying on the old name and so the breaking change would be a bit inconvenient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me now. Thanks for all your work! @EpsilonPrime did you want to take another look?
BREAKING CHANGE: Renamed modulus to modulo.
Added options and documentation for the modulo operator as defined in math and comp sci.
Refs: #353