Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: renamed modulus to modulo; updated modulo operator defintion #583

Merged
merged 14 commits into from
Dec 19, 2023

Conversation

jordanvieler
Copy link
Contributor

@jordanvieler jordanvieler commented Dec 13, 2023

BREAKING CHANGE: Renamed modulus to modulo.

Added options and documentation for the modulo operator as defined in math and comp sci.

Refs: #353

Renamed modulus to modulo as modulus is the length of a vector and this
function describes calculating the remainder after interger division.

Added options and documentation for the modulo operator as it is not
consistently defined across mathematics and results are implementation/patform dependent.

Refs: substrait-io#353
@CLAassistant
Copy link

CLAassistant commented Dec 13, 2023

CLA assistant check
All committers have signed the CLA.

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@jordanvieler jordanvieler changed the title fix: Renamed modulus to modulo. Updated modulo operator defintion fix: renamed modulus to modulo; updated modulo operator defintion Dec 13, 2023
@richtia richtia self-requested a review December 13, 2023 01:22
@westonpace
Copy link
Member

This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate behavior?

@richtia
Copy link
Member

richtia commented Dec 13, 2023

This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate behavior?

I was wondering the same. Not exactly query engines, but this shows what different programming languages do: https://en.wikipedia.org/wiki/Modulo#In_programming_languages

richtia
richtia previously approved these changes Dec 13, 2023
@richtia
Copy link
Member

richtia commented Dec 13, 2023

Since this is a breaking change, we'll need 2 SMC approvals

@jordanvieler
Copy link
Contributor Author

Per the Wikipedia link, I believe that ANSI SQL uses truncate division. So, presumably, any query engine compliant with the specification would use Truncate. However, the Python % operator utilizes Floored division, while the Rust % Operator utilizes Truncated division. Therefore, care would need to be taken to ensure that a Substrait producer written in Python and a Substrait consumer written in Rust consistently perform the Modulo operation.
This table from Division and Modulus for Computer Scientists by Daan Leijen illustrates the differences in results from Truncate (T), Floor(F), and Euclidean (E) division. q is the quotient, and r is the remainder.

modulo_table

@westonpace
Copy link
Member

Thanks for the example cases.

Rust, datafusion, sqlite, postgres, sql server, mysql, spark, and velox all use truncate.
Python, pandas, and cudf are the only examples I can find that use floor, but those are probably enough to justify it.

Let's update the PR to limit the choices to truncate and floor. Given there are no systems that use ceil/round/euclidian I think the extra information would be confusing to users.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want on_domain_error then we will need to add floating point implementations. Also see my change from my non-review comment about removing ceiling, round, and euclidian.

extensions/functions_arithmetic.yaml Show resolved Hide resolved
Comment on lines 329 to 330
on_domain_error:
values: [ NAN, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is for i8 and on_domain_error does not make sense here.

Suggested change
on_domain_error:
values: [ NAN, ERROR ]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think modulo is undefined for a divisor of 0 and args of +/- inf.

extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
Comment on lines 355 to 356
on_domain_error:
values: [ NAN, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
on_domain_error:
values: [ NAN, ERROR ]

extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
@jordanvieler
Copy link
Contributor Author

jordanvieler commented Dec 14, 2023

Thanks for the feedback. I just have a couple questions.

  1. How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?

  2. While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?

Because of how drastic the results can be when performing these different types of modulo, I think the distinctions are important for Data Science, Statistics, and Machine Learning applications. For example, a Data Scientist might specifically call for a Ceil Modulo to be performed from their python data frame library with real implications for the results.

@westonpace
Copy link
Member

How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?

Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR for integer operations and on_domain_error=NAN, ERROR for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR (to handle min_int/-1) and on_domain_error=SILENT,ERROR (for division by zero). We went with the simplest approach to match what was currently being offered by available engines.

The current PR's proposed on_domain_error will not work for integer implementations however because NAN is not possible.

This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).

While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?

Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.

@jordanvieler
Copy link
Contributor Author

jordanvieler commented Dec 14, 2023

Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR for integer operations and on_domain_error=NAN, ERROR for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR (to handle min_int/-1) and on_domain_error=SILENT,ERROR (for division by zero). We went with the simplest approach to match what was currently being offered by available engines.

The current PR's proposed on_domain_error will not work for integer implementations however because NAN is not possible.

This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).

Ok, I see what you are saying about NaN not being relevant for int types. Thank you for clarifying that. However, I think there is a difference between overflow and out-of-domain for modulo. Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right? I could also imagine an argument for MAX_INT being a return value in these cases.

Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.

I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.

@westonpace
Copy link
Member

Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right?

Sadly, it seems we are not so lucky.

Postgres, Datafusion, SQL server, velox raise an error
Sqlite, spark, and MySQL return null
Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)

Even ignoring the invalid case we still have to consider error and null options.

I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.

This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307.

If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.

@jordanvieler
Copy link
Contributor Author

jordanvieler commented Dec 14, 2023

Sadly, it seems we are not so lucky.

Postgres, Datafusion, SQL server, velox raise an error Sqlite, spark, and MySQL return null Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)

Even ignoring the invalid case we still have to consider error and null options.

Does this mean that a field to specify domain errors is warranted separately from the overflow error we have discussed?

This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307.

Per the referenced topic it seems like modulo is exactly the pedantic case that @jvanstraten mentions.

If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.

I completely understand. I am relatively new to contributing to open-source. My attempt to resolve the divide-by-zero behavior highlighted these in nuances modulo. Would it then make sense to bring this aspect of the discussion somewhere else?

@jordanvieler
Copy link
Contributor Author

For now how does this look?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep both overflow and on_domain_error but change NAN to NULL since we are dealing with integer kernels which cannot represent NAN.

extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
overflow:
values: [ SILENT, SATURATE, ERROR ]
on_domain_error:
values: [ NAN, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values: [ NAN, ERROR ]
values: [ NULL, ERROR ]

extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
@jordanvieler
Copy link
Contributor Author

Okay. Thank you!

values: [ TRUNCATE, FLOOR ]
overflow:
values: [ SILENT, SATURATE, ERROR ]
on_domain_error: values: [ NULL, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should be on two lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My Bad Thanks.

@jordanvieler
Copy link
Contributor Author

It seems like NULL is not valid for the YAML Linter.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be an annoying yaml / yaml-validator gotcha. NULL is being interpreted as the data type "null" and not as a string.

extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
overflow:
values: [ SILENT, SATURATE, ERROR ]
on_domain_error:
values: [ NULL, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values: [ NULL, ERROR ]
values: [ "NULL", ERROR ]

@westonpace
Copy link
Member

The one thing we haven't talked about is whether the name change justifies breaking backwards compatibility. We have a community meeting coming up on Wednesday and I think we should bring this up there.

@jordanvieler
Copy link
Contributor Author

Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.

@westonpace
Copy link
Member

Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.

If you're comfortable with the old name then let's keep it. There are systems out there which are relying on the old name and so the breaking change would be a bit inconvenient.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now. Thanks for all your work! @EpsilonPrime did you want to take another look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants