feat: add functions for arithmetic, rounding, logarithmic, and string transformations #230
description: "Negation of the value"
impls:
- args:
- options: [ SILENT, SATURATE, ERROR ]
Ugh... I suppose this exists because of the unevenness of two's complement but ugh...
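For anyone following along, here is a minimal sketch of why negation needs an overflow option at all: in two's complement, i8 covers -128..127, so -(-128) has no representation. The helper name `negate_i8` is hypothetical, and reading SILENT as "wrap around" is an assumption, since the thread has not yet pinned down what the options mean.

```python
I8_MIN, I8_MAX = -128, 127


def negate_i8(value: int, overflow: str = "ERROR") -> int:
    """Negate an i8 value (hypothetical helper, not part of any spec)."""
    if not I8_MIN <= value <= I8_MAX:
        raise ValueError(f"{value} is not a valid i8")
    result = -value
    if result <= I8_MAX:
        return result
    # Only value == -128 reaches this point: +128 does not fit in i8.
    if overflow == "SILENT":
        # Assumption: SILENT means two's-complement wraparound.
        return (result + 128) % 256 - 128
    if overflow == "SATURATE":
        return I8_MAX
    if overflow == "ERROR":
        raise OverflowError("i8 negation overflow")
    raise ValueError(f"unknown overflow option: {overflow}")
```

Under these assumptions, every input except -128 behaves identically for all three options, which is the unevenness mentioned above.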
name: "power"
description: "Take the power with the first value as the base and second as exponent"
impls:
- args:
Where did you come up with these signatures? For example, why does i8^i8 return i8? It seems like we'll run out of space very quickly. I feel like many systems support something like power(i64, fp64) => fp64 and power(fp64, fp64) => fp64, and that is enough.
I believe @sanjibansg was basing these on Arrow's implementation. Arrow's implementation currently does not promote:
>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> x = pa.array([1, 2, 3])
>>> pc.power(x, 320)
<pyarrow.lib.Int64Array object at 0x7f442c7fda00>
[
1,
0,
-9149805402889408255
]
Postgres and MySQL promote integers to decimal. SQL Server does not promote (and raises an overflow error)
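To make the three directions concrete, here is a rough sketch in plain Python. The function names are hypothetical, and modelling "does not promote" as wrapping the exact result to 64 bits is an assumption about the mechanism, not a description of any engine's code. Non-negative integer exponents are assumed.

```python
I64_MIN, I64_MAX = -2**63, 2**63 - 1


def wrap_i64(n: int) -> int:
    """Reduce an arbitrary integer to the signed 64-bit range by wrapping."""
    return (n + 2**63) % 2**64 - 2**63


def power_wrapping(base: int, exp: int) -> int:
    """No promotion, silent overflow (assumed model of the output above)."""
    return wrap_i64(base ** exp)


def power_checked(base: int, exp: int) -> int:
    """No promotion, but overflow is reported as an error."""
    result = base ** exp
    if not I64_MIN <= result <= I64_MAX:
        raise OverflowError("i64 power overflow")
    return result


def power_fp64(base: float, exp: float) -> float:
    """Promote to fp64, as in power(fp64, fp64) => fp64."""
    return float(base) ** float(exp)
```

With these definitions, `power_wrapping(2, 320)` wraps to 0, matching the middle element of the output above, while `power_checked(2, 320)` raises and `power_fp64(2, 320)` stays representable as a double.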
@jacques-n do you have a strong preference which direction we go here? Should we have multiple power functions?
name: "sqrt"
description: "Square root of the value"
impls:
- args:
Is there a particular reason that smaller numbers use fp32 as an output? It seems like maybe they should all be fp64. Are we using a particular system's definition here?
I agree. Arrow also upcasts all to fp64.
extensions/functions_comparison.yaml
Outdated
description: Whether a value is not a number.
impls:
- args:
- value: any1
This doesn't seem like the right argument type. Should we just have two of these: one for fp32 and one for fp64?
It appears Arrow supports all numeric types but for anything other than float or double the return is hard-coded to false. I'd be onboard with only supporting floating point types.
extensions/functions_comparison.yaml
Outdated
- args:
- value: any1
return: BOOLEAN
nullability: DECLARED_OUTPUT
Why is this declared output? Shouldn't nan be null if the input is null?
I don't have strong feelings (null is not nan) but Arrow seems to agree with @jacques-n:
>>> pc.is_nan(None)
<pyarrow.BooleanScalar: None>
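The null-propagating behavior can be sketched without Arrow as follows. This is only an illustration of the semantics being discussed, using Python's `None` as a stand-in for null; the function is hypothetical and restricted to floating point input, in line with the suggestion to support only fp32 and fp64.

```python
import math
from typing import Optional


def is_nan(value: Optional[float]) -> Optional[bool]:
    """Null in, null out; otherwise report whether the float is NaN."""
    if value is None:
        # Null is "unknown", which is distinct from NaN, so the answer is null.
        return None
    return math.isnan(value)
```

Here `is_nan(None)` yields `None` rather than `False`, which is the behavior a non-declared (input-mirroring) nullability would describe.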
extensions/functions_rounding.yaml
Outdated
description: "Rounding to the floor of the value"
impls:
- args:
- options: [ SILENT, SATURATE, ERROR ]
Why would I have overflow behaviors for floor?
extensions/functions_rounding.yaml
Outdated
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i8
What is the floor of an i8? For example, what is the floor of 7?
@@ -163,3 +164,23 @@ scalar_functions:
- value: "fixedchar<L1>"
- value: "varchar<L2>"
return: "BOOLEAN"
-
name: lower
description: Transforms the string to lower case characters
Let's add some additional definition here of how lower case is defined. I assume there is a utf definition?
Arrow defers to utf8proc which is somewhat sparse on details. However, you are correct, there is a standard UTF-8 way of lowercasing, though it sometimes does the wrong thing semantically.
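As a small illustration of why the definition matters, here is what Python's built-in Unicode case mapping does with a couple of awkward characters. Python is used only as a stand-in here; utf8proc and other engines may give different results, which is exactly the ambiguity a precise definition would remove. The helper name is hypothetical.

```python
def case_lengths(text: str) -> tuple[int, int, int]:
    """Return the code point counts of text, text.lower() and text.upper()."""
    return len(text), len(text.lower()), len(text.upper())


# U+0130 (capital I with dot above) lowercases to two code points: "i" + U+0307.
dotted_i = "\u0130"
print(repr(dotted_i.lower()), case_lengths(dotted_i))

# U+00DF (sharp s) uppercases to the two-letter sequence "SS".
sharp_s = "\u00df"
print(repr(sharp_s.upper()), case_lengths(sharp_s))
```

Both cases change the string's length, which is also relevant to the fixedchar and varchar<L1> signatures: the output length is not guaranteed to equal the input length.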
description: Transforms the string to lower case characters
impls:
- args:
- value: "varchar<L1>"
Shall we add fixedchar?
- value: "string"
return: "string"
-
name: upper
same comments as on lower.
An additional note here: several of these functions should also be supported for decimal types.
Please address comments provided.
extensions/functions_arithmetic.yaml
Outdated
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: fp32
return: fp32
- args:
- options: [ SILENT, SATURATE, ERROR ]
Can floating point negation overflow?
return: fp64
-
name: "sqrt"
description: "Square root of the value"
What happens if the value is negative?
---
scalar_functions:
-
name: "ln"
It would be good to document what these options apply to (and their meaning) somewhere. In Arrow, arithmetic functions overflow by default. However, logarithmic functions go to -inf or NaN. I'm not sure if this latter behavior is "saturate" or "silent".
extensions/functions_rounding.yaml
Outdated
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i8
Arrow does not support integer round explicitly. However, we will auto-cast an integer column to a double column and then call round on that. I think this has led to some confusion in the docs. I agree with @jacques-n that these kernels should not exist.
@jacques-n curious on your opinion here. A number of these functions are floating point functions and in Arrow we have two variants. A checked version which returns an error and an unchecked version which returns NaN. For example, dealing with negative numbers in
No one asked me, but signalling NaNs by definition are error markers, so ignoring the insertion of one should hopefully be "silent", not "saturate". Also, if anything, all floating point operations that don't have domain issues saturate to +/- infinity on overflow or +/- 0 on underflow; put differently, overflow does not exist in the first place, because positive and negative infinities can be represented. So basically, I don't think trying to coerce the semantics of unsigned and two's complement integer arithmetic into IEEE 754 arithmetic is very sensible. What might be sensible though is to add the rounding options that IEEE 754 itself defines. I'd propose something like
- name: rounding
  options: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
  required: false
- name: on_domain_error
  options: [ NAN, ERROR ]
  required: false
if we want to be thorough, with the latter only defined for operations that are actually affected by domain errors.
I agree with @jvanstraten's assessment here. It's not really possible to say whether an operation should be saturating or silent given only the knowledge that it returns NaN. Even if we could, as Jeroen said saturation doesn't make a whole lot of sense for IEEE floating point numbers with valid domains.
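For reference, a minimal sketch of how the proposed on_domain_error option could behave for sqrt. The function name, the NAN default, and the choice of exception are all assumptions made for illustration; none of them come from the PR.

```python
import math


def sqrt_fp64(value: float, on_domain_error: str = "NAN") -> float:
    """Square root with an explicit policy for negative inputs (hypothetical)."""
    if value < 0.0:
        # Domain error: real square roots of negative numbers do not exist.
        if on_domain_error == "NAN":
            return math.nan
        if on_domain_error == "ERROR":
            raise ValueError("sqrt of a negative value")
        raise ValueError(f"unknown on_domain_error option: {on_domain_error}")
    # NaN inputs fall through here and propagate as NaN.
    return math.sqrt(value)
```

Note that neither branch has anything to do with overflow: sqrt of any finite non-negative double is itself a finite double, so SILENT, SATURATE and ERROR would have nothing to act on.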
+1 from me as well for that answer. Thanks @jvanstraten. I have one question regarding rounding. Is that needed for the case where the inputs/outputs are integers? Or is this some kind of floating point rounding rule for the case where the exact answer falls between two valid floating point numbers due to limits in precision?
It's that; this isn't about rounding to some nearby integer but about rounding to one of the two nearest possible representations. I suppose the same options could be used for float -> int, though the Also, in case anyone is unsure:
This PR adds definitions of extension functions in Arithmetic, Rounding, Logarithmic & String Transformations.