fix: renamed modulus to modulo; updated modulo operator definition #583

jordanvieler · 2023-12-13T01:08:59Z

BREAKING CHANGE: Renamed modulus to modulo.

Added options and documentation for the modulo operator as defined in math and comp sci.

Refs: #353

Renamed modulus to modulo as modulus is the length of a vector and this function describes calculating the remainder after integer division. Added options and documentation for the modulo operator as it is not consistently defined across mathematics and results are implementation/platform dependent. Refs: substrait-io#353

CLAassistant · 2023-12-13T01:09:05Z

All committers have signed the CLA.

github-actions · 2023-12-13T01:09:21Z

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

westonpace · 2023-12-13T18:38:01Z

This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate behavior?

richtia · 2023-12-13T19:08:32Z

> This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate behavior?

I was wondering the same. Not exactly query engines, but this shows what different programming languages do: https://en.wikipedia.org/wiki/Modulo#In_programming_languages

richtia · 2023-12-13T19:10:02Z

Since this is a breaking change, we'll need 2 SMC approvals

jordanvieler · 2023-12-13T23:51:20Z

Per the Wikipedia link, I believe that ANSI SQL uses truncate division. So, presumably, any query engine compliant with the specification would use Truncate. However, the Python % operator utilizes Floored division, while the Rust % Operator utilizes Truncated division. Therefore, care would need to be taken to ensure that a Substrait producer written in Python and a Substrait consumer written in Rust consistently perform the Modulo operation.
This table from "Division and Modulus for Computer Scientists" by Daan Leijen illustrates the differences in results from Truncate (T), Floor (F), and Euclidean (E) division. q is the quotient, and r is the remainder.
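
As a rough illustration of the three definitions named above, here is a minimal Python sketch for integers. The function names `mod_truncate`, `mod_floor`, and `mod_euclid` are made up for this example and are not part of Substrait or any library; each returns only the remainder r and assumes a nonzero divisor.

```python
def mod_truncate(a: int, b: int) -> int:
    """Remainder when the quotient is rounded toward zero (T)."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q


def mod_floor(a: int, b: int) -> int:
    """Remainder when the quotient is rounded toward negative infinity (F)."""
    return a % b  # Python's % already uses floored division


def mod_euclid(a: int, b: int) -> int:
    """Remainder that always satisfies 0 <= r < |b| (E)."""
    return a % abs(b)


# Print a, b, then the T, F, and E remainders for each sign combination.
for a, b in [(7, 3), (-7, 3), (7, -3), (-7, -3)]:
    print(a, b, mod_truncate(a, b), mod_floor(a, b), mod_euclid(a, b))
```

The three definitions agree when both operands are positive and diverge once a sign is negative: for -7 and 3, truncation gives -1 while floor and Euclidean both give 2.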

westonpace · 2023-12-14T14:32:54Z

Thanks for the example cases.

Rust, datafusion, sqlite, postgres, sql server, mysql, spark, and velox all use truncate.
Python, pandas, and cudf are the only examples I can find that use floor, but those are probably enough to justify it.

Let's update the PR to limit the choices to truncate and floor. Given that no systems use ceil/round/Euclidean, I think the extra information would be confusing to users.

westonpace

If we want on_domain_error then we will need to add floating point implementations. Also see my change from my non-review comment about removing ceiling, round, and euclidian.

extensions/functions_arithmetic.yaml

westonpace · 2023-12-14T14:35:56Z

+          on_domain_error:
+            values: [ NAN, ERROR ]


This implementation is for i8 and on_domain_error does not make sense here.

Suggested change

-          on_domain_error:
-            values: [ NAN, ERROR ]

I think modulo is undefined for a divisor of 0 and args of +/- inf.

extensions/functions_arithmetic.yaml

westonpace · 2023-12-14T14:36:13Z

+          on_domain_error:
+            values: [ NAN, ERROR ]


Suggested change

-          on_domain_error:
-            values: [ NAN, ERROR ]

jordanvieler · 2023-12-14T15:07:06Z

Thanks for the feedback. I just have a couple questions.

How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?
While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?

Because of how drastic the results can be when performing these different types of modulo, I think the distinctions are important for Data Science, Statistics, and Machine Learning applications. For example, a Data Scientist might specifically call for a Ceil Modulo to be performed from their Python data frame library, with real implications for the results.

westonpace · 2023-12-14T15:44:15Z

> How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?

Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR for integer operations and on_domain_error=NAN, ERROR for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR (to handle min_int/-1) and on_domain_error=SILENT,ERROR (for division by zero). We went with the simplest approach to match what was currently being offered by available engines.

The current PR's proposed on_domain_error will not work for integer implementations however because NAN is not possible.

This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).
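
To make the two failure modes concrete, here is a hedged Python sketch of 8-bit integer division with an explicit overflow policy. The function `divide_i8` and its `overflow` parameter are invented for illustration and are not taken from any engine. Modelling SILENT as two's-complement wrap-around is an assumption, and division by zero is deliberately left to Python's own exception to show that it is a separate case from overflow.

```python
I8_MIN, I8_MAX = -128, 127


def divide_i8(a: int, b: int, overflow: str = "ERROR") -> int:
    """Truncating i8 division with an explicit overflow policy.

    Division by zero is a domain error rather than an overflow; this sketch
    simply lets Python raise ZeroDivisionError for it.
    """
    q = abs(a) // abs(b)  # raises ZeroDivisionError when b == 0
    if (a < 0) != (b < 0):
        q = -q
    if I8_MIN <= q <= I8_MAX:
        return q
    # For i8 inputs the only out-of-range quotient is -128 / -1 == 128.
    if overflow == "ERROR":
        raise OverflowError(f"{a} / {b} does not fit in i8")
    if overflow == "SATURATE":
        return I8_MAX
    if overflow == "SILENT":
        # Assumption: wrap around as two's-complement hardware would.
        return (q - I8_MIN) % 256 + I8_MIN
    raise ValueError(f"unknown overflow option: {overflow}")
```

With these assumptions, `divide_i8(-128, -1, "SATURATE")` returns 127 and `divide_i8(-128, -1, "SILENT")` wraps back to -128, while `divide_i8(1, 0)` raises regardless of the overflow setting.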

> While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?

Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.

jordanvieler · 2023-12-14T16:29:23Z

> Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR for integer operations and on_domain_error=NAN, ERROR for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR (to handle min_int/-1) and on_domain_error=SILENT,ERROR (for division by zero). We went with the simplest approach to match what was currently being offered by available engines.
>
> The current PR's proposed on_domain_error will not work for integer implementations however because NAN is not possible.
>
> This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).

Ok, I see what you are saying about NaN not being relevant for int types. Thank you for clarifying that. However, I think there is a difference between overflow and out-of-domain for modulo. Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right? I could also imagine an argument for MAX_INT being a return value in these cases.

> Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.

I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.

westonpace · 2023-12-14T18:36:42Z

> Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right?

Sadly, it seems we are not so lucky.

Postgres, Datafusion, SQL server, velox raise an error
Sqlite, spark, and MySQL return null
Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)

Even ignoring the invalid case we still have to consider error and null options.
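
A minimal sketch of the two behaviours that remain once the pandas/cudf case is set aside. The helper `modulus`, its `on_domain_error` parameter, and the option strings are hypothetical names chosen for this example, with Python's `None` standing in for a SQL null. A truncating remainder is assumed because most of the engines listed earlier use it.

```python
from typing import Optional


def modulus(a: int, b: int, on_domain_error: str = "ERROR") -> Optional[int]:
    """Truncating remainder with a selectable divide-by-zero behavior."""
    if b == 0:
        if on_domain_error == "NULL":
            return None  # mirrors the engines that return null
        if on_domain_error == "ERROR":
            raise ZeroDivisionError("modulus by zero")
        raise ValueError(f"unknown on_domain_error option: {on_domain_error}")
    # Truncated remainder: magnitude |a| mod |b|, sign follows the dividend.
    r = abs(a) % abs(b)
    return -r if a < 0 else r
```

Keeping the return type as "integer or null" in both modes avoids the type change that makes the upcast-to-float behaviour invalid as Substrait.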

> I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.

This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307.

If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.

jordanvieler · 2023-12-14T19:29:39Z

> Sadly, it seems we are not so lucky.
>
> Postgres, Datafusion, SQL server, velox raise an error
> Sqlite, spark, and MySQL return null
> Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)
>
> Even ignoring the invalid case we still have to consider error and null options.

Does this mean that a field to specify domain errors is warranted separately from the overflow error we have discussed?

> This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307.

Per the referenced topic it seems like modulo is exactly the pedantic case that @jvanstraten mentions.

> If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.

I completely understand. I am relatively new to contributing to open-source. My attempt to resolve the divide-by-zero behavior highlighted these nuances in modulo. Would it then make sense to bring this aspect of the discussion somewhere else?

jordanvieler · 2023-12-14T19:40:33Z

For now how does this look?

westonpace

Let's keep both overflow and on_domain_error but change NAN to NULL since we are dealing with integer kernels which cannot represent NAN.

extensions/functions_arithmetic.yaml

westonpace · 2023-12-14T22:57:08Z

+          overflow:
+            values: [ SILENT, SATURATE, ERROR ]
+          on_domain_error:
+            values: [ NAN, ERROR ]


Suggested change

-            values: [ NAN, ERROR ]
+            values: [ NULL, ERROR ]

jordanvieler · 2023-12-15T16:38:16Z

Okay. Thank you!

EpsilonPrime · 2023-12-18T20:55:55Z

extensions/functions_arithmetic.yaml

+            values: [ TRUNCATE, FLOOR ]
+          overflow:
+            values: [ SILENT, SATURATE, ERROR ]
+          on_domain_error: values: [ NULL, ERROR ]


Looks like this should be on two lines.

My bad, thanks.

jordanvieler · 2023-12-19T03:15:58Z

It seems like NULL is not valid for the YAML Linter.

westonpace

Seems to be an annoying yaml / yaml-validator gotcha. NULL is being interpreted as the data type "null" and not as a string.

extensions/functions_arithmetic.yaml

westonpace · 2023-12-19T03:19:48Z

+          overflow:
+            values: [ SILENT, SATURATE, ERROR ]
+          on_domain_error:
+            values: [ NULL, ERROR ]


Suggested change

-            values: [ NULL, ERROR ]
+            values: [ "NULL", ERROR ]

westonpace · 2023-12-19T03:20:51Z

The one thing we haven't talked about is whether the name change justifies breaking backwards compatibility. We have a community meeting coming up on Wednesday and I think we should bring this up there.

jordanvieler · 2023-12-19T03:39:52Z

Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.

westonpace · 2023-12-19T03:45:02Z

> Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.

If you're comfortable with the old name then let's keep it. There are systems out there which are relying on the old name and so the breaking change would be a bit inconvenient.

westonpace

This looks good to me now. Thanks for all your work! @EpsilonPrime did you want to take another look?

jordanvieler added 2 commits December 12, 2023 14:01

fix: corrected line length, changed modulus to modulo

6f5a986

jordanvieler requested review from jacques-n, cpcloud, westonpace, EpsilonPrime and vbarua as code owners December 13, 2023 01:09

jordanvieler changed the title ~~fix: Renamed modulus to modulo. Updated modulo operator defintion~~ fix: renamed modulus to modulo; updated modulo operator defintion Dec 13, 2023

richtia self-requested a review December 13, 2023 01:22

richtia previously approved these changes Dec 13, 2023

westonpace requested changes Dec 14, 2023

fix: renaimed quotient to division_type for clarity. removed CEILING,…

debc8f0

… EUCLIDIAN, and ROUND

jordanvieler dismissed richtia’s stale review via debc8f0 December 14, 2023 19:39

westonpace requested changes Dec 14, 2023

jordanvieler and others added 3 commits December 14, 2023 17:23

Update extensions/functions_arithmetic.yaml

b142555

Co-authored-by: Weston Pace <[email protected]>

Update extensions/functions_arithmetic.yaml

4e40ebf

Co-authored-by: Weston Pace <[email protected]>

Update extensions/functions_arithmetic.yaml

d4c91f5

Co-authored-by: Weston Pace <[email protected]>

jordanvieler and others added 3 commits December 14, 2023 17:24

Update extensions/functions_arithmetic.yaml

20ac546

Co-authored-by: Weston Pace <[email protected]>

Update extensions/functions_arithmetic.yaml

a8c8685

Co-authored-by: Weston Pace <[email protected]>

Update extensions/functions_arithmetic.yaml

a447a89

Co-authored-by: Weston Pace <[email protected]>

fix: corrected missing return

8dc08fe

EpsilonPrime reviewed Dec 18, 2023

Merge branch 'main' into main

a6ccabd

westonpace requested changes Dec 19, 2023

jordanvieler and others added 3 commits December 18, 2023 21:33

Update functions_arithmetic.yaml

327510e

Co-authored-by: Weston Pace <[email protected]>

Update functions_arithmetic.yaml

6377977

Co-authored-by: Weston Pace <[email protected]>

reverted modulo to modulus, fixed NULL quoting, added on_domain_error…

2484fc8

… to i16

westonpace approved these changes Dec 19, 2023

EpsilonPrime approved these changes Dec 19, 2023

EpsilonPrime merged commit aba1bc7 into substrait-io:main Dec 19, 2023
13 checks passed

EpsilonPrime mentioned this pull request Dec 19, 2023

Modulus function in functions_arithmetic.yaml is missing division-by-zero behavior #353

Closed

richtia mentioned this pull request Jan 18, 2024

[Func/Arith] Multiply, Division and Modulus substrait-io/bft#32

Merged
