Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add functions for arithmetic, rounding, logarithmic, and string transformations #230

Closed
wants to merge 6 commits into from

Conversation

sanjibansg
Copy link
Contributor

This PR adds definitions of extension functions in Arithmetic, Rounding, Logarithmic & String Transformations.

description: "Negation of the value"
impls:
- args:
- options: [ SILENT, SATURATE, ERROR ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh... I suppose this exists because of the uneveness of twos-compliment but ugh...

name: "power"
description: "Take the power with the first value as the base and second as exponent"
impls:
- args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you come up with these signatures from? For example, why does i8^i8 return i8? It seems like we'll run out of space very quickly. I feel like many systems support something like power(i64, fp64) => fp64 and power(fp64,fp64) => fp64 and that is enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @sanjibansg was basing these on Arrow's implementation. Arrow's implementation currently does not promote:

>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> x = pa.array([1, 2, 3])
>>> pc.power(x, 320)
<pyarrow.lib.Int64Array object at 0x7f442c7fda00>
[
  1,
  0,
  -9149805402889408255
]

Postgres and MySQL promote integers to decimal. SQL Server does not promote (and raises an overflow error)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n do you have a strong preference which direction we go here? Should we have multiple power functions?

name: "sqrt"
description: "Square root of the value"
impls:
- args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason that smaller numbers use fp32 as an output? It seems like maybe they should all be fp64. Are using a particular system's definition here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Arrow also upcasts all to fp64.

description: Whether a value is not a number.
impls:
- args:
- value: any1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like the right argument type. Should we just have two of these: one for fp32 and one for fp64?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears Arrow supports all numeric types but for anything other than float or double the return is hard-coded to false. I'd be onboard with only supporting floating point types.

- args:
- value: any1
return: BOOLEAN
nullability: DECLARED_OUTPUT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this declared output? Shouldn't nan be null if the input is null?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings (null is not nan) but Arrow seems to agree with @jacques-n:

>>> pc.is_nan(None)
<pyarrow.BooleanScalar: None>

description: "Rounding to the floor of the value"
impls:
- args:
- options: [ SILENT, SATURATE, ERROR ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would I have overflow behaviors for floor?

- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the floor of an i8? For example, what is the floor of 7?

@@ -163,3 +164,23 @@ scalar_functions:
- value: "fixedchar<L1>"
- value: "varchar<L2>"
return: "BOOLEAN"
-
name: lower
description: Transforms the string to lower case characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some additional definition here of how lower case is defined. I assume there is a utf definition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow defers to utf8proc which is somewhat sparse on details. However, you are correct, there is a standard UTF-8 way of lowercasing, though it sometimes does the wrong thing semantically.

description: Transforms the string to lower case characters
impls:
- args:
- value: "varchar<L1>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add fixedchar?

- value: "string"
return: "string"
-
name: upper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as on lower.

@jacques-n
Copy link
Contributor

An additional note here: several of these functions should also be supported for decimal types.

@westonpace westonpace self-requested a review June 23, 2022 14:42
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address comments provided.

Comment on lines 181 to 186
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: fp32
return: fp32
- args:
- options: [ SILENT, SATURATE, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can floating point negation overflow?

name: "power"
description: "Take the power with the first value as the base and second as exponent"
impls:
- args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @sanjibansg was basing these on Arrow's implementation. Arrow's implementation currently does not promote:

>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> x = pa.array([1, 2, 3])
>>> pc.power(x, 320)
<pyarrow.lib.Int64Array object at 0x7f442c7fda00>
[
  1,
  0,
  -9149805402889408255
]

Postgres and MySQL promote integers to decimal. SQL Server does not promote (and raises an overflow error)

name: "sqrt"
description: "Square root of the value"
impls:
- args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Arrow also upcasts all to fp64.

return: fp64
-
name: "sqrt"
description: "Square root of the value"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the value is negative?

description: Whether a value is not a number.
impls:
- args:
- value: any1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears Arrow supports all numeric types but for anything other than float or double the return is hard-coded to false. I'd be onboard with only supporting floating point types.

- args:
- value: any1
return: BOOLEAN
nullability: DECLARED_OUTPUT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings (null is not nan) but Arrow seems to agree with @jacques-n:

>>> pc.is_nan(None)
<pyarrow.BooleanScalar: None>

---
scalar_functions:
-
name: "ln"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to document what these options apply to (and their meaning) somewhere. In arrow, arithmetic functions overflow by default. However, logrithmic functions go to -inf or NaN. I'm not sure if this latter behavior is "saturate" or "silent".

- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow does not support integer round explicitly. However, we will auto-cast an integer column to a double column and then call round on that. I think this has led to some confusion in the docs. I agree with @jacques-n that these kernels should not exist.

@@ -163,3 +164,23 @@ scalar_functions:
- value: "fixedchar<L1>"
- value: "varchar<L2>"
return: "BOOLEAN"
-
name: lower
description: Transforms the string to lower case characters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow defers to utf8proc which is somewhat sparse on details. However, you are correct, there is a standard UTF-8 way of lowercasing, though it sometimes does the wrong thing semantically.

@westonpace
Copy link
Member

@jacques-n curious on your opinion here. A number of these functions are floating point functions and in Arrow we have two variants. A checked version which returns an error and an unchecked version which returns NaN. For example, dealing with negative numbers in sqrt. Do we consider returning NaN to be SATURATE or SILENT? Or should we call it something else? Does both SATURATE and SILENT apply here?

@jvanstraten
Copy link
Contributor

No one asked me, but signalling NaNs by definition are error markers, so ignoring the insertion of one should hopefully be "silent", not "saturate". Also, if anything, all floating point operations that don't have domain issues saturate to +/- infinity on overflow or +/- 0 on underflow; put differently, overflow does not exist in the first place, because positive and negative infinities can be represented. So basically, I don't think trying to coerce the semantics of unsigned and two's complement integer arithmetic into IEEE 754 arithmetic is very sensible. What might be sensible though is to add the rounding options that IEEE 754 itself defines. I'd propose something like

- name: rounding
  options: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
  required: false
- name: on_domain_error
  options: [ NAN, ERROR ]
  required: false

if we want to be thorough, with the latter only defined for operations that are actually affected by domain errors.

@cpcloud
Copy link
Contributor

cpcloud commented Jul 5, 2022

@jacques-n curious on your opinion here. A number of these functions are floating point functions and in Arrow we have two variants. A checked version which returns an error and an unchecked version which returns NaN. For example, dealing with negative numbers in sqrt. Do we consider returning NaN to be SATURATE or SILENT? Or should we call it something else? Does both SATURATE and SILENT apply here?

I agree with @jvanstraten's assessment here. It's not really possible to say whether an operation should be saturating or silent given only the knowledge that it returns NaN. Even if we could, as Jeroen said saturation doesn't make a whole lot of sense for IEEE floating point numbers with valid domains.

@westonpace
Copy link
Member

+1 from me as well for that answer. Thanks @jvanstraten . I have one question regarding rounding. Is that needed for the case where the inputs/outputs are integers? Or is this some kind of floating point rounding rule for the case where the exact answer falls between two valid floating point numbers due to limits in precision?

@jvanstraten
Copy link
Contributor

Or is this some kind of floating point rounding rule for the case where the exact answer falls between two valid floating point numbers due to limits in precision?

It's that; this isn't about rounding to some nearby integer but about rounding to one of the two nearest possible representations.

I suppose the same options could be used for float -> int, though the CEILING and FLOOR options would need some additional specification for the boundary conditions (e.g. does 127.5 with CEILING round to i8 127 or does it trigger overflow behavior?). Note that i32/i64 -> fp32 and i64 to fp64 also need this rule because the mantissa isn't large enough to represent all possible values precisely, but i8/i16 -> fp32 and i8/i16/i32 -> fp64 never need to round.

Also, in case anyone is unsure:

  • TIE_TO_EVEN: choose the nearest possible representation. If the desired value is exactly halfway between the representations, tie toward the nearest even integer.
  • TIE_AWAY_FROM_ZERO: choose the nearest possible representation. If the desired value is exactly halfway between the representations, tie toward the one with the larger magnitude.
  • TRUNCATE: if the desired value is not exactly representable, choose the one that is closest to zero.
  • CEILING: if the desired value is not exactly representable, choose the one that is closest to positive infinity.
  • FLOOR: if the desired value is not exactly representable, choose the one that is closest to negative infinity.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

1 similar comment
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ianmcook
Copy link
Contributor

ianmcook commented Nov 28, 2022

This PR should be closed. It is superseded by other PRs including #245, #267, #322.

@sanjibansg
Copy link
Contributor Author

This PR should be closed. It is superseded by other PRs including #245, #267, #322.

Sure, agreed.

@sanjibansg sanjibansg closed this Nov 29, 2022
@sanjibansg sanjibansg deleted the functions branch November 29, 2022 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants