Int128 / Decimal128 design discussion #7178
I'm happy to see this move forward, here are some thoughts on the matter.
Indeed, and in fact, I believe the arrow spec is considering adding
That doesn't seem possible without first-class Arrow support for
I wasn't comfortable with this
Presumably, having
Another concern is inter-op and maybe "surprising" behaviour, e.g.:
Other than those two small-ish points, I think your assessment makes a lot of sense.
I'd lean towards that as well – start with decimal and see how it goes from there.
Not being familiar with all corners of the polars codebase, I'm not sure why it has to be that way – would appreciate any explanations! I see there's also
That's the question here really. Either:
Here's an example to ponder regarding precision and arithmetic ops:

```python
from decimal import Decimal
from pyarrow import compute, array, decimal128

arr = array([Decimal(x) for x in ['1.23', '-0.01']], decimal128(10, 5))
print(arr.type)

for op in ['add', 'subtract', 'multiply', 'divide']:
    print(op, '=>', getattr(compute, op)(arr, arr).type)
for op in ['negate', 'abs', 'sign', 'sqrt']:
    print(op, '=>', getattr(compute, op)(arr).type)

print('mul->mul =>', compute.multiply(compute.multiply(arr, arr), arr).type)
# error ArrowInvalid: Decimal precision out of range [1, 38]: 43
# print('mul->mul =>', compute.multiply(compute.multiply(compute.multiply(arr, arr), arr), arr).type)
```

which outputs
Arrow is definitely doing it "correctly" here, bumping the "maximum possible precision", but that also means that you can't square a series if it has a precision of 20 or greater, which would be pretty weird for a compute-first library like polars.
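For illustration, here is a minimal sketch of the promotion rule that appears to be at play (an assumption inferred from the error message above, not taken from Arrow's source): multiplication adds the scales and bumps precision to p1 + p2 + 1, which overflows the 38-digit maximum after only a few operations.

```rust
/// Assumed Arrow-style type promotion for decimal multiplication:
/// scale_out = s1 + s2, precision_out = p1 + p2 + 1 (capped at 38).
fn multiply_type(p1: u8, s1: u8, p2: u8, s2: u8) -> Result<(u8, u8), String> {
    let (prec, scale) = (p1 + p2 + 1, s1 + s2);
    if prec > 38 {
        return Err(format!("Decimal precision out of range [1, 38]: {prec}"));
    }
    Ok((prec, scale))
}

fn main() {
    // Start from decimal128(10, 5), as in the pyarrow example above.
    let mut t = (10u8, 5u8);
    for step in 1..=4 {
        match multiply_type(t.0, t.1, 10, 5) {
            Ok(next) => {
                println!("after {step} multiplications: decimal128({}, {})", next.0, next.1);
                t = next;
            }
            Err(e) => {
                // The third chained multiplication already needs precision 43.
                println!("after {step} multiplications: {e}");
                break;
            }
        }
    }
    // Squaring a decimal128(20, s) column would need precision 20 + 20 + 1 = 41 > 38.
}
```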
Having the distinction between physical and logical types can likely save us a lot of code. We don't plan a
Nope, I don't think we need it. Might be wrong because of problems I don't foresee atm.
I don't understand decimals well enough yet to have an opinion on this, other than: let's copy arrow for the time being.
I want
@ritchie46 Thanks for chiming in!
Gotcha. So we'll go without Int128 logical type then.
Well, sort of – only if scale is the same (if not, it will require multiplication to bring things to the same scale first). Anyways, yea – int128 physical and decimal128 logical.
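To make the point about matching scales concrete, here is a toy sketch with plain i128 values (illustrative only, not the actual polars kernels): two decimals can only be added or compared directly when their scales agree, otherwise one side is multiplied up to the common scale first.

```rust
/// Toy decimal value: an i128 mantissa plus a scale (number of digits after the point).
/// Purely illustrative; not polars' actual representation.
#[derive(Debug, Clone, Copy)]
struct Dec {
    value: i128,
    scale: u32,
}

/// Bring two decimals to a common (maximum) scale so they can be added or compared.
fn rescale(a: Dec, b: Dec) -> (i128, i128, u32) {
    let scale = a.scale.max(b.scale);
    let av = a.value * 10i128.pow(scale - a.scale);
    let bv = b.value * 10i128.pow(scale - b.scale);
    (av, bv, scale)
}

fn main() {
    let a = Dec { value: 12345, scale: 2 }; // 123.45
    let b = Dec { value: -123, scale: 3 }; // -0.123
    let (av, bv, scale) = rescale(a, b);
    println!("sum = {} at scale {}", av + bv, scale); // 123327 at scale 3, i.e. 123.327
    println!("a > b: {}", av > bv); // true
}
```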
Makes sense, let's keep it as
Now, the two remaining unresolved questions:
I don't have a good understanding of the usage of the `Unknown` variant and those parametrized datatypes filled out with `None`s, so any guidance here would be appreciated.
Let me try to ELI5 real quick, if you don't mind?
One middle-ground solution for precision and numeric ops might be like this:
Hope this wasn't too much, I tried to make it as brief as possible :)
Thanks a lot for the ELI5! That helps a lot. I agree with your proposals of
I think that we should error in this case. If I understand it correctly, the only reason to set precision is to restrict loss of precision. If so, you want this to explode in your face when you accidentally get loss of precision. Aside from that point, I agree with your proposals.
So do I understand it correctly then that
Hm, not sure I follow, or maybe I'm misreading it :) We will never have a "silent loss of precision". My suggestion re: precision widening was more like:
So my point was: I think a simpler way to get started is discarding precision on any arithmetic ops, otherwise we'll have to tinker with precision widening and invent some rules etc. Like – if you multiply series
Cheers.
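A small sketch of the "discard precision on arithmetic ops" idea being discussed (the names and rules below are illustrative assumptions, not the actual implementation): the result's scale is derived from the inputs, while its precision is simply reset to None until the user explicitly requests one.

```rust
/// Illustrative decimal dtype: scale is mandatory, precision is optional.
#[derive(Debug, Clone, Copy)]
struct DecimalType {
    scale: u32,
    precision: Option<u32>,
}

/// Addition keeps the larger scale; multiplication adds the scales.
/// In both cases the result's precision is dropped, per the suggestion above.
fn add_type(a: DecimalType, b: DecimalType) -> DecimalType {
    DecimalType { scale: a.scale.max(b.scale), precision: None }
}

fn mul_type(a: DecimalType, b: DecimalType) -> DecimalType {
    DecimalType { scale: a.scale + b.scale, precision: None }
}

fn main() {
    let a = DecimalType { scale: 5, precision: Some(10) };
    let b = DecimalType { scale: 2, precision: Some(7) };
    println!("{:?}", add_type(a, b)); // DecimalType { scale: 5, precision: None }
    println!("{:?}", mul_type(a, b)); // DecimalType { scale: 7, precision: None }
}
```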
Ah, makes sense. We are on the same page. 😊 I agree with your proposals. 👍
I think we don't need it indeed.
For inter-op, it would be nice to support a third option of using a specified precision (with an error if it's narrower than the narrowest required). This comes up when, for example, you maintain partitioned datasets using parquet files. Say you serialize the partition for day 1, you may infer precision/scale
This might not be a concern polars wants to handle, but at the end of the day, users need to deal with this one way or another. My current solution (using pandas for decimals) is to export to arrow, "widen" decimals, and then serialize to parquet using arrow; it'd be a lot simpler to just say
Perhaps this can be done prior to serialization as well, similar to the idea of having
In general I agree with your points. One thing to note though, while
I guess we could always come up with something more flexible like:

```python
df.write_parquet('foo.pq', decimal_precision=7)  # for all columns
df.write_parquet('foo.pq', decimal_precision={'a': 7, 'b': 8})  # specific precisions
df.write_parquet('foo.pq', decimal_precision={'a': None, 'b': 8})  # None = infer
df.write_parquet('foo.pq', decimal_precision={'a': None, 'b': 8, None: 9})  # default=9
```

(and this can all stay almost entirely on the Python side, so it can be easily tweaked as needed)
Yes, absolutely 👍 There are certainly many use-cases for setting / inferring precision across many columns. Not sure what stance polars has on flexibility / complexity of the programmer-facing API, but I personally would prefer more options, have the library default to safe options (widening to 38 in this case, for example), and produce errors in the face of data loss.
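A sketch of the "error if the requested precision is narrower than the narrowest required" behaviour mentioned above (pure-Rust illustration; any writer parameter wired to it would be hypothetical): the minimum precision a column needs is the digit count of its largest absolute mantissa.

```rust
/// Number of decimal digits needed to represent the mantissa (sign ignored).
fn digits(v: i128) -> u32 {
    let mut n = v.unsigned_abs();
    let mut d = 1;
    while n >= 10 {
        n /= 10;
        d += 1;
    }
    d
}

/// Validate a user-requested precision against the narrowest precision the data needs,
/// erroring instead of silently losing digits (illustrative, not an actual polars API).
fn check_precision(mantissas: &[i128], requested: u32) -> Result<u32, String> {
    let required = mantissas.iter().copied().map(digits).max().unwrap_or(1);
    if requested < required {
        Err(format!("requested precision {requested} is narrower than required {required}"))
    } else {
        Ok(requested)
    }
}

fn main() {
    // Mantissas of 123.45 and -0.123 stored at scale 3: 123450 and -123.
    let day1 = [123_450i128, -123];
    println!("{:?}", check_precision(&day1, 7)); // Ok(7)
    println!("{:?}", check_precision(&day1, 4)); // Err: the data needs at least 6 digits
}
```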
Ok, well I kind of got it to a point where you can build a decimal
There's a whole bunch of confusion (and I think bugs already) re: physical vs logical types. Decimals don't really fit current paradigms (e.g. see
On the good side:

```python
from decimal import Decimal

import pyarrow as pa
import polars as pl

d = [Decimal('123.45'), Decimal('-0.123')]
print(pl.Series(d))  # python -> polars

arr = pa.array(d, pa.decimal128(10, 5))
s = pl.Series(arr)  # arrow -> polars
print(s)  # formatting
print(s.dtype)  # to python dtype
print(s[0])  # to anyvalue -> to python
```

in my branch prints
which is already kind of a big deal I think :) If we're happy to start with a potentially buggy but sort of working solution that may need a heavy refactor - I can open a PR tomorrow for discussion – it might make sense not to merge it too fast until we're happy with the physical vs logical stuff and general types, before we start working on tests, arithmetics, conversions and all that. First things first: we need to get correct dtypes/anyvalues/conversions in all directions.
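To illustrate the physical-vs-logical split that keeps coming up in this thread, here is a toy sketch (the names are made up for illustration, not polars' actual enums): the logical Decimal dtype carries scale/precision metadata, while the physical storage is plain Int128, so precision and scale only affect how the stored integers are interpreted.

```rust
/// Toy logical and physical dtypes to illustrate the split discussed above
/// (illustrative names, not the actual polars definitions).
#[derive(Debug)]
enum LogicalType {
    Int64,
    Decimal { precision: Option<usize>, scale: usize },
}

#[derive(Debug)]
enum PhysicalType {
    Int64,
    Int128,
}

/// A Decimal column is stored as plain i128 mantissas; precision/scale are metadata only.
fn to_physical(dtype: &LogicalType) -> PhysicalType {
    match dtype {
        LogicalType::Int64 => PhysicalType::Int64,
        LogicalType::Decimal { .. } => PhysicalType::Int128,
    }
}

fn main() {
    let dec = LogicalType::Decimal { precision: Some(10), scale: 5 };
    let int = LogicalType::Int64;
    println!("{:?} -> {:?}", dec, to_physical(&dec));
    println!("{:?} -> {:?}", int, to_physical(&int));
}
```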
This is amazing progress! I would definitely recommend a PR at this point (maybe a draft one?); I can contribute: coding, testing or otherwise.
Great stuff! 👏 I am all for merging incremental steps in. If we release this as alpha functionality, the bugs will come in and we will get a better understanding of how things fail and how we should fix them.
Right, this is awkward indeed. Let's deal with this awkwardness for a little while. Maybe eventually we just have to decide to support
Closing via #7220
@ritchie46 @plaflamme
Not really an issue, just a separate place to discuss design decisions re: int128-decimals.
I have started some work roughly based on @plaflamme's branch (but not directly based on it, so please don't push too much in that branch, lol), and unless there are objections or unexpected roadblocks I hope to get it done in the near future.
Here are a few design questions though, some of which I have mentioned in the original PR:
- While we have `PrimitiveArray<i128>` and `i128: arrow2::NativeType`, Arrow itself does not have an `Int128` type; it can only be represented via a decimal with a scale of zero.
  - Do we want a `DataType::Int128` type? Would there be a use case for it (as opposed to `Decimal128` with scale=0)? Numpy doesn't support it, Python ints we can convert to int64 anyways, and Arrow doesn't have a direct high-level Int128 type, only the primitive one.
  - I'd rather replace the `Option<(usize, usize)>` in the decimal datatype with a `(usize, usize)` (or even just `usize` for the scale, see the next point below). If we ever need a separate raw int128 datatype, I believe it should be a separate thing on its own – it's convertible to a different Python type (`int` as opposed to `Decimal`), it uses different formatting etc; it's just the backing container that's the same, and the arithmetic logic is identical to scale=0 decimals.
- Do we need `precision`? This field comes all the way from SQL databases where you would have precision + scale and the backing storage type may have depended on precision (so it would choose the narrowest type to fit your data). In our case:
  - The backing storage is always `i128`, even if you use a single bit of data (for in-memory representation; for storage purposes it's different in parquet, see below).
  - `precision` doesn't change how the data is interpreted, it's exactly the same. All arithmetic operations are identical and depend only on `scale`.
  - You can't retrieve `precision` from Python's `decimal.Decimal` once it's been created, only the exponent (i.e., the scale), which sort of makes sense. The `Context`'s main job is to spawn new decimals only.
  - Given `Decimal(p1, s1) + Decimal(p2, s2)`, you can infer the scale (e.g. add them if it's multiplication or pick max if it's addition etc), but can you easily infer the correct precision before performing the operation? This may become pretty cumbersome.
  - We could keep `precision` as optional while `scale` is mandatory...
- Can the scale be negative? `arrow-rs` now supports it (although it didn't at some point), but as far as I can tell `arrow2` does not. Python's Decimal does. The easiest way out would be to over-restrict it and require `scale >= 0`.

TLDR: suggestion: replace `DataType::Decimal(Option<(usize, usize)>)` with `Decimal(usize)` (scale only), or `Decimal(usize, Option<usize>)` (see above). If we don't want to just use precision=38 by default, make it `Decimal(usize, usize)`.
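To make the TLDR concrete, here is a sketch of what the suggested variant could look like (hypothetical code; the argument order and names are assumptions, not the actual polars `DataType`): scale is mandatory, precision is optional and resolves to the decimal128 maximum of 38 when unspecified.

```rust
/// Sketch of the suggested decimal dtype: mandatory scale, optional precision.
/// Hypothetical; the argument order (scale, precision) is an assumption.
#[derive(Debug, Clone, Copy)]
enum DataTypeSketch {
    Decimal(usize, Option<usize>),
}

/// Arrow's decimal128 allows precisions of up to 38 digits.
const MAX_PRECISION: usize = 38;

impl DataTypeSketch {
    /// Resolve the effective precision, defaulting to 38 when unspecified.
    fn precision(&self) -> usize {
        match *self {
            DataTypeSketch::Decimal(_, precision) => precision.unwrap_or(MAX_PRECISION),
        }
    }
}

fn main() {
    let dt = DataTypeSketch::Decimal(5, None);
    println!("{:?} -> effective precision {}", dt, dt.precision());
}
```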