feat: add functions for arithmetic, rounding, logarithmic, and string transformations #230

sanjibansg · 2022-06-22T08:39:17Z

This PR adds definitions of extension functions in Arithmetic, Rounding, Logarithmic & String Transformations.

jacques-n · 2022-06-22T23:41:00Z

extensions/functions_arithmetic.yaml

+    description: "Negation of the value"
+    impls:
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]


Ugh... I suppose this exists because of the unevenness of two's complement, but ugh...
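
To make the asymmetry concrete: an i8 covers -128..127, so negating -128 has no representable result. A minimal Python sketch follows. The name `negate_i8` and the meanings given to the three options (wrap, clamp, raise) are assumptions for illustration, since the PR does not define them.

```python
I8_MIN, I8_MAX = -128, 127


def negate_i8(value: int, overflow: str = "SILENT") -> int:
    """Negate an i8 value under an assumed overflow policy (hypothetical helper)."""
    if not I8_MIN <= value <= I8_MAX:
        raise ValueError("value is not an i8")
    result = -value
    if result <= I8_MAX:
        return result  # every input except -128 negates exactly
    if overflow == "SILENT":
        # Assumed meaning: wrap around as two's complement hardware would.
        return (result - I8_MIN) % 256 + I8_MIN
    if overflow == "SATURATE":
        # Assumed meaning: clamp to the nearest representable value.
        return I8_MAX
    if overflow == "ERROR":
        raise OverflowError("negation of -128 does not fit in i8")
    raise ValueError(f"unknown overflow option: {overflow}")


print(negate_i8(-128, "SILENT"))    # -128
print(negate_i8(-128, "SATURATE"))  # 127
```

Only one input out of 256 ever reaches the option handling, which is presumably why the option feels heavy for this function.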

jacques-n · 2022-06-22T23:42:54Z

extensions/functions_arithmetic.yaml

+    name: "power"
+    description: "Take the power with the first value as the base and second as exponent"
+    impls:
+      - args:


Where did you come up with these signatures from? For example, why does i8^i8 return i8? It seems like we'll run out of space very quickly. I feel like many systems support something like power(i64, fp64) => fp64 and power(fp64,fp64) => fp64 and that is enough.

I believe @sanjibansg was basing these on Arrow's implementation. Arrow's implementation currently does not promote:

>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> x = pa.array([1, 2, 3])
>>> pc.power(x, 320)
<pyarrow.lib.Int64Array object at 0x7f442c7fda00>
[
  1,
  0,
  -9149805402889408255
]

Postgres and MySQL promote integers to decimal. SQL Server does not promote (and raises an overflow error)

@jacques-n do you have a strong preference which direction we go here? Should we have multiple power functions?

jacques-n · 2022-06-22T23:44:01Z

extensions/functions_arithmetic.yaml

+    name: "sqrt"
+    description: "Square root of the value"
+    impls:
+      - args:


Is there a particular reason that smaller numbers use fp32 as an output? It seems like maybe they should all be fp64. Are you using a particular system's definition here?

jacques-n · 2022-06-22T23:44:39Z

extensions/functions_comparison.yaml

+    description: Whether a value is not a number.
+    impls:
+      - args:
+          - value: any1


This doesn't seem like the right argument type. Should we just have two of these: one for fp32 and one for fp64?

jacques-n · 2022-06-22T23:45:26Z

extensions/functions_comparison.yaml

+      - args:
+          - value: any1
+        return: BOOLEAN
+        nullability: DECLARED_OUTPUT


Why is this declared output? Shouldn't nan be null if the input is null?

jacques-n · 2022-06-22T23:46:55Z

extensions/functions_rounding.yaml

+    description: "Rounding to the floor of the value"
+    impls:
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]


Why would I have overflow behaviors for floor?

jacques-n · 2022-06-22T23:47:11Z

extensions/functions_rounding.yaml

+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: i8


What is the floor of an i8? For example, what is the floor of 7?
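
For reference, floor and ceiling are the identity on integers, so an integer kernel has nothing to round and nothing to overflow. A quick Python check, illustrative only and using the standard `math` module:

```python
import math

# floor and ceil leave every i8 value unchanged
i8_values = range(-128, 128)
floor_is_identity = all(math.floor(v) == v for v in i8_values)
ceil_is_identity = all(math.ceil(v) == v for v in i8_values)

print(math.floor(7))  # 7
print(floor_is_identity, ceil_is_identity)
```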

jacques-n · 2022-06-22T23:47:52Z

extensions/functions_string.yaml

@@ -163,3 +164,23 @@ scalar_functions:
          - value: "fixedchar<L1>"
          - value: "varchar<L2>"
        return: "BOOLEAN"
+  -
+    name: lower
+    description: Transforms the string to lower case characters


Let's add some additional definition here of how lower case is defined. I assume there is a utf definition?

jacques-n · 2022-06-22T23:48:02Z

extensions/functions_string.yaml

+    description: Transforms the string to lower case characters
+    impls:
+      - args:
+          - value: "varchar<L1>"


Shall we add fixedchar?

jacques-n · 2022-06-22T23:48:11Z

extensions/functions_string.yaml

+          - value: "string"
+        return: "string"
+  -
+    name: upper


same comments as on lower.

jacques-n · 2022-06-23T03:06:10Z

An additional note here: several of these functions should also be supported for decimal types.

jacques-n

Please address comments provided.

westonpace · 2022-06-27T21:32:26Z

extensions/functions_arithmetic.yaml

+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: fp32
+        return: fp32
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]


Can floating point negation overflow?
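
A quick check suggests it cannot: IEEE 754 negation only flips the sign bit, and the representable range is symmetric around zero. The sketch below uses Python floats (fp64) as a stand-in, not the PR's kernels.

```python
import math
import sys

largest = sys.float_info.max

# Negating the extremes stays finite and round-trips exactly.
neg_largest = -largest
print(math.isfinite(neg_largest))   # True
print(-neg_largest == largest)      # True

# Infinities and zero simply change sign.
print(-math.inf == float("-inf"))   # True
print(math.copysign(1.0, -0.0))     # -1.0
```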

westonpace · 2022-06-27T21:42:58Z

extensions/functions_arithmetic.yaml

+    name: "power"
+    description: "Take the power with the first value as the base and second as exponent"
+    impls:
+      - args:


I believe @sanjibansg was basing these on Arrow's implementation. Arrow's implementation currently does not promote:

>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> x = pa.array([1, 2, 3])
>>> pc.power(x, 320)
<pyarrow.lib.Int64Array object at 0x7f442c7fda00>
[
  1,
  0,
  -9149805402889408255
]

Postgres and MySQL promote integers to decimal. SQL Server does not promote (and raises an overflow error)
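
The first two wrapped values in the Arrow output can be reproduced without Arrow by reducing the exact result modulo 2^64. The sketch below uses only Python integers. `wrap_i64` is a hypothetical helper, and it assumes the non-promoting behavior is equivalent to two's complement wraparound.

```python
def wrap_i64(value: int) -> int:
    """Reduce an exact integer to the signed 64-bit range (hypothetical helper)."""
    return (value + 2**63) % 2**64 - 2**63


# Exact results of base ** 320 need far more than 64 bits for base >= 2.
exact = [base**320 for base in (1, 2, 3)]
wrapped = [wrap_i64(v) for v in exact]

print(exact[1].bit_length())  # 321
print(wrapped[:2])            # [1, 0]
```

A promoting implementation would instead return the exact value in a wider type, and an erroring one would reject the second and third inputs.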

westonpace · 2022-06-27T22:03:28Z

extensions/functions_arithmetic.yaml

+    name: "sqrt"
+    description: "Square root of the value"
+    impls:
+      - args:


I agree. Arrow also upcasts all to fp64.

westonpace · 2022-06-27T22:03:47Z

extensions/functions_arithmetic.yaml

+        return: fp64
+  -
+    name: "sqrt"
+    description: "Square root of the value"


What happens if the value is negative?

westonpace · 2022-06-27T22:05:37Z

extensions/functions_comparison.yaml

+    description: Whether a value is not a number.
+    impls:
+      - args:
+          - value: any1


It appears Arrow supports all numeric types but for anything other than float or double the return is hard-coded to false. I'd be onboard with only supporting floating point types.

westonpace · 2022-06-27T22:10:13Z

extensions/functions_comparison.yaml

+      - args:
+          - value: any1
+        return: BOOLEAN
+        nullability: DECLARED_OUTPUT


I don't have strong feelings (null is not nan) but Arrow seems to agree with @jacques-n:

>>> pc.is_nan(None)
<pyarrow.BooleanScalar: None>
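
In plain Python terms, the null-propagating behavior would look roughly like this. `is_nan` here is a hypothetical stand-in, with `None` playing the role of null; it is not the Arrow kernel.

```python
import math
from typing import Optional


def is_nan(value: Optional[float]) -> Optional[bool]:
    """Null-propagating NaN check (hypothetical helper; None stands in for null)."""
    if value is None:
        return None  # null in, null out
    return math.isnan(value)


print(is_nan(float("nan")))  # True
print(is_nan(1.5))           # False
print(is_nan(None))          # None
```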

westonpace · 2022-06-27T22:12:36Z

extensions/functions_logarithmic.yaml

+---
+scalar_functions:
+  -
+    name: "ln"


It would be good to document what these options apply to (and their meaning) somewhere. In Arrow, arithmetic functions overflow by default. However, logarithmic functions go to -inf or NaN. I'm not sure if this latter behavior is "saturate" or "silent".

westonpace · 2022-06-27T22:16:22Z

extensions/functions_rounding.yaml

+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: i8


Arrow does not support integer round explicitly. However, we will auto-cast an integer column to a double column and then call round on that. I think this has led to some confusion in the docs. I agree with @jacques-n that these kernels should not exist.

westonpace · 2022-06-27T22:27:25Z

extensions/functions_string.yaml

@@ -163,3 +164,23 @@ scalar_functions:
          - value: "fixedchar<L1>"
          - value: "varchar<L2>"
        return: "BOOLEAN"
+  -
+    name: lower
+    description: Transforms the string to lower case characters


Arrow defers to utf8proc which is somewhat sparse on details. However, you are correct, there is a standard UTF-8 way of lowercasing, though it sometimes does the wrong thing semantically.
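
A few cases where a simple per-character definition falls short, shown with Python's built-in Unicode case mapping rather than utf8proc (whose results may differ):

```python
# Python's str.lower()/str.upper() apply Unicode case mappings.
dotted_capital_i = "\u0130"          # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted_capital_i.lower()

print(len(lowered))                  # 2: "i" followed by a combining dot above
print("ß".upper())                   # SS: the length changes when upper-casing
print("Straße".upper().lower())      # strasse: the round trip is lossy
```

Length-changing mappings like these may matter for how the `fixedchar<L1>` and `varchar<L1>` signatures are specified.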

westonpace · 2022-06-29T14:35:32Z

@jacques-n curious on your opinion here. A number of these functions are floating point functions and in Arrow we have two variants. A checked version which returns an error and an unchecked version which returns NaN. For example, dealing with negative numbers in sqrt. Do we consider returning NaN to be SATURATE or SILENT? Or should we call it something else? Do both SATURATE and SILENT apply here?

jvanstraten · 2022-06-29T16:48:20Z

No one asked me, but signalling NaNs by definition are error markers, so ignoring the insertion of one should hopefully be "silent", not "saturate". Also, if anything, all floating point operations that don't have domain issues saturate to +/- infinity on overflow or +/- 0 on underflow; put differently, overflow does not exist in the first place, because positive and negative infinities can be represented. So basically, I don't think trying to coerce the semantics of unsigned and two's complement integer arithmetic into IEEE 754 arithmetic is very sensible. What might be sensible though is to add the rounding options that IEEE 754 itself defines. I'd propose something like

- name: rounding
  options: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
  required: false
- name: on_domain_error
  options: [ NAN, ERROR ]
  required: false

if we want to be thorough, with the latter only defined for operations that are actually affected by domain errors.
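
As a rough illustration of how the proposed `on_domain_error` option could behave for sqrt, here is a minimal Python sketch. The function name `sqrt_with_option` and the choice of exception type are assumptions; only the option name and its two values come from the proposal above.

```python
import math


def sqrt_with_option(value: float, on_domain_error: str = "NAN") -> float:
    """Square root with a selectable domain-error policy (hypothetical helper)."""
    if on_domain_error not in ("NAN", "ERROR"):
        raise ValueError(f"unknown on_domain_error option: {on_domain_error}")
    if value < 0.0:
        # Negative inputs are outside the domain of the real square root.
        if on_domain_error == "ERROR":
            raise ValueError("sqrt of a negative value")
        return math.nan
    # NaN, zero, positive values and +inf pass straight through.
    return math.sqrt(value)


print(sqrt_with_option(4.0))    # 2.0
print(sqrt_with_option(-1.0))   # nan
```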

cpcloud · 2022-07-05T14:19:49Z

@jacques-n curious on your opinion here. A number of these functions are floating point functions and in Arrow we have two variants. A checked version which returns an error and an unchecked version which returns NaN. For example, dealing with negative numbers in sqrt. Do we consider returning NaN to be SATURATE or SILENT? Or should we call it something else? Do both SATURATE and SILENT apply here?

I agree with @jvanstraten's assessment here. It's not really possible to say whether an operation should be saturating or silent given only the knowledge that it returns NaN. Even if we could, as Jeroen said saturation doesn't make a whole lot of sense for IEEE floating point numbers with valid domains.
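
The point about IEEE 754 can be seen directly with Python floats (fp64) under the default rounding mode: results beyond the range become infinities or zero rather than wrapping, and NaN arises from undefined operations. This is an illustration only, not part of the proposed spec.

```python
import math
import sys

largest = sys.float_info.max
smallest = 5e-324  # smallest positive subnormal fp64 value

overflowed = largest * 2.0      # too large for fp64
underflowed = smallest / 2.0    # too small for fp64

print(overflowed)               # inf
print(underflowed)              # 0.0
print(math.inf - math.inf)      # nan: a NaN that says nothing about overflow
```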

westonpace · 2022-07-05T14:59:40Z

+1 from me as well for that answer. Thanks @jvanstraten. I have one question regarding rounding. Is that needed for the case where the inputs/outputs are integers? Or is this some kind of floating point rounding rule for the case where the exact answer falls between two valid floating point numbers due to limits in precision?

jvanstraten · 2022-07-05T15:29:56Z

Or is this some kind of floating point rounding rule for the case where the exact answer falls between two valid floating point numbers due to limits in precision?

It's that; this isn't about rounding to some nearby integer but about rounding to one of the two nearest possible representations.

I suppose the same options could be used for float -> int, though the CEILING and FLOOR options would need some additional specification for the boundary conditions (e.g. does 127.5 with CEILING round to i8 127 or does it trigger overflow behavior?). Note that i32/i64 -> fp32 and i64 to fp64 also need this rule because the mantissa isn't large enough to represent all possible values precisely, but i8/i16 -> fp32 and i8/i16/i32 -> fp64 never need to round.
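
The integer-to-float cases can be checked with the standard library. In this sketch `to_fp32` is a hypothetical helper that round-trips a value through a 4-byte IEEE 754 float using `struct`, and Python's own `float` stands in for fp64.

```python
import struct


def to_fp32(value: int) -> float:
    """Round-trip through a 4-byte IEEE 754 float (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


# fp32 has a 24-bit significand, fp64 a 53-bit one.
print(to_fp32(2**24 + 1) == 2**24)    # True: i32 -> fp32 had to round
print(float(2**53 + 1) == 2**53)      # True: i64 -> fp64 had to round

# Every i16 value (and therefore every i8 value) survives fp32 exactly.
i16_exact = all(to_fp32(v) == v for v in range(-2**15, 2**15))
print(i16_exact)                      # True
```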

Also, in case anyone is unsure:

TIE_TO_EVEN: choose the nearest possible representation. If the desired value is exactly halfway between the representations, tie toward the nearest even integer.
TIE_AWAY_FROM_ZERO: choose the nearest possible representation. If the desired value is exactly halfway between the representations, tie toward the one with the larger magnitude.
TRUNCATE: if the desired value is not exactly representable, choose the one that is closest to zero.
CEILING: if the desired value is not exactly representable, choose the one that is closest to positive infinity.
FLOOR: if the desired value is not exactly representable, choose the one that is closest to negative infinity.
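
The five options map fairly naturally onto the rounding modes of Python's `decimal` module. The sketch below rounds to whole numbers rather than to adjacent floating point representations, so it only illustrates the tie-breaking and direction rules. The mapping table and the helper `round_to_integer` are assumptions, not part of the proposal.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Assumed mapping from the proposed option names to decimal's rounding modes.
ROUNDING = {
    "TIE_TO_EVEN": ROUND_HALF_EVEN,
    "TIE_AWAY_FROM_ZERO": ROUND_HALF_UP,
    "TRUNCATE": ROUND_DOWN,
    "CEILING": ROUND_CEILING,
    "FLOOR": ROUND_FLOOR,
}


def round_to_integer(text: str, option: str) -> int:
    """Round a decimal string to an integer under the named option."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=ROUNDING[option]))


for option in ROUNDING:
    print(option, [round_to_integer(t, option) for t in ("2.5", "-2.5", "2.1", "-2.1")])
```

With inputs 2.5 and -2.5, only the two tie-breaking options differ from each other; with 2.1 and -2.1, only the three directed options do.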

CLAassistant · 2022-10-06T23:48:08Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

ianmcook · 2022-11-28T14:45:34Z

This PR should be closed. It is superseded by other PRs including #245, #267, #322.

sanjibansg · 2022-11-29T07:09:12Z

This PR should be closed. It is superseded by other PRs including #245, #267, #322.

Sure, agreed.

feat: add functions for arithmetic, rounding, logarithmic, and string transformations

41fb45a

sanjibansg force-pushed the functions branch from 99dcfa8 to 41fb45a Compare June 22, 2022 08:41

jacques-n reviewed Jun 22, 2022

westonpace self-requested a review June 23, 2022 14:42

jacques-n requested changes Jun 25, 2022

westonpace reviewed Jun 27, 2022

sanjibansg added 5 commits June 29, 2022 10:20

feat: sqrt returns in fp64 only

e87a152

fix: ceil and floor on floating point only

5bcf8b1

feat: fixedchar in string transformations

31d8d72

fix: overflow on floating point negation

8735b4e

fix: is_nan with only floating points

0042b79

westonpace mentioned this pull request Jul 11, 2022

feat: add trigonometry functions #241

Merged

gforsyth mentioned this pull request Jul 14, 2022

feat: add functions for arithmetic, rounding, logarithmic, and string transformations #245

Merged

spevenhe mentioned this pull request Nov 9, 2022

[POAE7-2544] String LOWER/UPPER op support for Arrow format intel/BDTK#147

Merged

sanjibansg closed this Nov 29, 2022

sanjibansg deleted the functions branch November 29, 2022 07:09
