
feat(rust,python): initial working version of Decimal Series #7220

Merged – 20 commits, Mar 3, 2023

Conversation

@aldanor (Contributor) commented Feb 27, 2023

No description provided.

@aldanor (Contributor, Author) commented Feb 27, 2023

@plaflamme @ritchie46 Pushing my WIP branch in its current state; this obviously needs further work before we even get to arithmetic, ops and such. But first I think we should decide whether things are generally correct type-wise/dtype-wise etc., given that decimals are a bit weird compared to other types.

// there's also a standing bug with creating decimal series from AnyValues – just inferring the type from the first decimal doesn't work; you have to scan the entire sequence first to determine the scale

@aldanor force-pushed the feature/int128-decimal branch 3 times, most recently from 23b4e60 to 50e2dbe on February 27, 2023 at 11:55
@ritchie46 (Member):

> // there's also a standing bug with creating decimal series from AnyValues – just inferring the type from the first decimal doesn't work; you have to scan the entire sequence first to determine the scale

Maybe we could start with a negative scale (indicating that it still needs to be set) and then keep track of it while we scan the items. Another strategy is to always use a default scale if not set by the user. We do the same with datetime units.

@aldanor (Contributor, Author) commented Feb 27, 2023

> // there's also a standing bug with creating decimal series from AnyValues – just inferring the type from the first decimal doesn't work; you have to scan the entire sequence first to determine the scale
>
> Maybe we could start with a negative scale (indicating that it still needs to be set) and then keep track of it while we scan the items. Another strategy is to always use a default scale if not set by the user. We do the same with datetime units.

I think it's a bit different here from datetime units, at least if we restrict it to decimal.Decimal (and potentially Arrow scalar decimal128). Those always have a defined scale; e.g. you can have an input of

[
    Decimal('1.23'),    # scale = 2
    Decimal('-1000'),   # scale = 0
    Decimal('0.0001'),  # scale = 4
    Decimal('0'),       # scale = 0
]

In this case you have to use (prec = None, scale = 4), ending up with the following i128 values:

[1_2300, -1000_0000, 1, 0]
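The scale inference above can be sketched in plain Python with decimal.Decimal (this is only an illustration of the rule being described, not Polars code):

```python
from decimal import Decimal

values = [Decimal('1.23'), Decimal('-1000'), Decimal('0.0001'), Decimal('0')]

# the scale of a Decimal is its negated exponent, clamped at 0 for integral values
scales = [max(-v.as_tuple().exponent, 0) for v in values]  # [2, 0, 4, 0]
scale = max(scales)  # 4

# rescale every value to the common scale and keep the raw integer representation
raw = [int(v.scaleb(scale)) for v in values]  # [12300, -10000000, 1, 0]
```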

One exception would be if we allow mixing integers and Decimals in AnyValues; e.g., ideally this should produce the same output:

[Decimal('1.23'), -1000, Decimal('0.0001'), Decimal('0')]

The logic is slightly different if a decimal dtype has already been provided to us – then we have to check that (1) the resulting scale is >= the actual scale of each item, (2) there's no i128 overflow while rescaling, and (3) if precision is set, it is respected.
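Those three checks could look roughly like this in plain Python (a hedged sketch; to_raw and its error-handling style are hypothetical, not the actual Polars routine):

```python
from decimal import Decimal

I128_MIN, I128_MAX = -(2 ** 127), 2 ** 127 - 1

def to_raw(value, scale, precision=None):
    # (1) the target scale must be >= the actual scale of the item
    item_scale = max(-value.as_tuple().exponent, 0)
    if item_scale > scale:
        raise ValueError(f"item scale {item_scale} exceeds target scale {scale}")
    # (2) rescaling must not overflow i128
    raw = int(value.scaleb(scale))
    if not I128_MIN <= raw <= I128_MAX:
        raise OverflowError("rescaled value does not fit in i128")
    # (3) if precision is set, it must be respected (count of significant digits)
    if precision is not None and len(str(abs(raw))) > precision:
        raise ValueError(f"value exceeds precision {precision}")
    return raw
```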

@ritchie46 (Member):

> I think it's a bit different here from datetime units, at least if we restrict it to decimal.Decimal (and potentially Arrow scalar decimal128). Those always have a defined scale; e.g. you can have an input of

Yeap, I understand. What I mean is that upon inference we set the DataType to Decimal(?, ?), and once we have built the array we fill in the blanks. We do the same for Categoricals: the Categorical datatype stores the dictionary, which is only known once we have finished building it.

@ritchie46 (Member):

@aldanor shall we try to get an absolutely minimal version working and merge that in? I'd rather break this down into smaller steps and see what we encounter.

@aldanor (Contributor, Author) commented Feb 27, 2023

> @aldanor shall we try to get an absolutely minimal version working and merge that in? I'd rather break this down into smaller steps and see what we encounter.

I think only after writing at least some tests for the basic functionality that has been added here – formatting, parsing, converting to/from Python decimal, from Arrow, to/from AnyValues, etc. – I would be pretty uncomfortable merging it otherwise. I can jump on adding some tests here; if you have any particular suggestions for test cases, please feel free to post them.

// Btw: the failing Python tests on the CI are actually a sign that things sort of work as intended – they are the ones that checked that Decimal was converted to Float64, which is no longer the case.

@ritchie46 (Member):

> I think only after writing at least some tests for the basic functionality that has been added here – formatting, parsing, converting to/from Python decimal, from Arrow, to/from AnyValues, etc. – I would be pretty uncomfortable merging it otherwise. I can jump on adding some tests here; if you have any particular suggestions for test cases, please feel free to post them.

The Python tests are the preferred test locations. These are the ultimate integration tests and don't hurt compile times. If we want to test on the Rust side, the integration tests folder tests/it/ is preferred. We had very slow CI builds with inline tests; that's why we moved them to tests/it.

@aldanor (Contributor, Author) commented Feb 27, 2023

So, I guess the conversion logic must go into Series::from_any_values_and_dtype(). What's a bit weird is that it's unclear what the dtype being provided is and where it comes from (because for decimals, to come up with the dtype, you need to scan the series, so it makes sense to do it in Rust).

The dtype in Series::from_any_values_and_dtype() – must it be respected strictly? I.e., if it tells you dtype == d1, can I return dtype == d2? Or should I try hard to fit the values into d1 and fail if they don't fit?

So, where does this dtype even come from – is it mainly internal inference, or an external dtype=...?

I was thinking the logic should be something like this:

  • Scan through the AnyValues list once, computing the min and max scale (where integers get treated as scale=0).
  • If min_scale == max_scale == dtype.scale, all good, no conversions needed. Do another pass, convert all raw values to i128, and build the Series. For this last step: what if prec != None? Do we check values here to see if they fit within the precision range?
  • If min_scale != max_scale, some elements need to be rescaled. Same if either min_scale or max_scale != dtype.scale?... Create a new vector and rescale everything to max_scale if possible, then go back to the previous step.
  • If some AnyValue objects, hypothetically, have prec != None, what do we do? What if some have prec = None and some have prec != None? How does that play with dtype.prec?
  • I was thinking, should AnyValue even have precision? Python integers and decimals don't have precision. The only way to get it would be either from Arrow scalars (?) or by indexing pl.Series itself (but if you do that in Python, you get back a Python decimal and the precision is lost again, so is there a point?). Really leaning towards dropping precision from AnyValue so it would become just AnyValue::Decimal(value: i128, scale: usize) (any unforeseen catches?)
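The scan-then-convert steps above could be sketched as follows (plain Python; all names hypothetical, with integers treated as scale=0 and None passed through as null):

```python
from decimal import Decimal

def item_scale(v):
    # integers are treated as scale=0; Decimals report their own scale
    return 0 if isinstance(v, int) else max(-v.as_tuple().exponent, 0)

def any_values_to_decimal(values, target_scale=None):
    # pass 1: scan once to determine the common scale (or validate a given one)
    scales = [item_scale(v) for v in values if v is not None]
    scale = max(scales, default=0) if target_scale is None else target_scale
    if any(s > scale for s in scales):
        raise ValueError("an item's scale exceeds the target scale")
    # pass 2: rescale every item to the common scale as a raw integer
    raw = [None if v is None else int(Decimal(v).scaleb(scale)) for v in values]
    return raw, scale
```

With the mixed input from earlier, this yields ([12300, -10000000, 1, 0], 4); a real implementation could skip the second pass's work when no rescaling is needed.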

@plaflamme (Contributor) left a review comment:

My 2 cents. This looks great BTW 👍

Review threads: polars/polars-core/src/datatypes/dtype.rs, py-polars/src/conversion.rs
@aldanor (Contributor, Author) commented Feb 28, 2023

@ritchie46 @plaflamme Here's a new batch of updates (the whole thing is brittle as f, I'm surprised it actually works, but it does; lots of work ahead to cover various corner cases!)

  • DataType::Decimal now has an optional scale (heh, we've come full circle). It's a bit weird because for an instantiated Series (or DecimalChunked) it will never be None; however, it is required to pass the information downstream that "Decimal scale has to be inferred". I initially used placeholders like usize::MAX, but I'd prefer for it to crash loudly if something goes wrong.
  • Removed precision from AnyValue::Decimal.
  • All existing tests now pass (had to add special support for dataclasses).
  • Added a few very basic tests (instantiation of series/frames with a mixture of decimals, ints and Nones, and then recovering them back into Python).
  • Supertype merging logic and constructor type fallback logic are now implemented. This means an array of decimals + ints + Nones will be cast to decimals, regardless of order. Decimals + floats should yield floats, but that's untested.
  • Reworked the &[AnyValue] -> Decimal logic completely, with a double-pass approach (no allocations/ops if no conversions are needed, though).
  • Conversion to a Decimal series will now do a precision check (still not sure if we should check for 38 digits if precision=None).
  • Someone please take a detailed look at the any_values_to_decimal() conversion routine; that one wasn't so easy to figure out...
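On the precision question in that list: decimal128 can represent at most 38 significant digits, so a check where precision=None falls back to 38 might look like this (hypothetical helper, just to make the question concrete):

```python
def check_precision(raw_values, precision=None):
    # decimal128 holds at most 38 decimal digits; use that as the fallback cap
    max_digits = 38 if precision is None else precision
    for raw in raw_values:
        if raw is not None and len(str(abs(raw))) > max_digits:
            raise ValueError(f"{raw} does not fit in precision {max_digits}")
```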

@aldanor force-pushed the feature/int128-decimal branch 2 times, most recently from 104d66f to 57d5905 on February 28, 2023 at 18:32
@aldanor aldanor marked this pull request as ready for review February 28, 2023 18:40
@aldanor (Contributor, Author) commented Feb 28, 2023

Next steps, I think: write tests for all possible ways to build series/frames with decimals in them from Python/Arrow, and to convert out of them. It might make sense to write them even if they fail and mark them as @xfail, just to keep the 'todo list' within the codebase.

There are definitely gaps to fill and decisions to make (e.g., should pl.Series(['0.01'], dtype=pl.Decimal) yield a decimal series? It would be cool if it did. What's the supertype of Decimal and str? Not sure).
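If string inputs were supported, the natural route would be through decimal.Decimal, which infers the scale for free. A sketch under that assumption (parse_decimal is hypothetical; nothing like it exists in this PR):

```python
from decimal import Decimal, InvalidOperation

def parse_decimal(s):
    # parse a string into (raw_value, scale), e.g. '0.01' -> (1, 2)
    try:
        d = Decimal(s)
    except InvalidOperation:
        raise ValueError(f"not a valid decimal: {s!r}") from None
    scale = max(-d.as_tuple().exponent, 0)
    return int(d.scaleb(scale)), scale
```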

@aldanor changed the title from "[WIP] feat(rust,python): initial working version of Decimal Series" to "feat(rust,python): initial working version of Decimal Series" on Feb 28, 2023
The github-actions bot added the labels enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars) and rust (Related to Rust Polars) on Feb 28, 2023
@ritchie46 (Member):

> DataType::Decimal now has an optional scale (heh, we've come full circle). It's a bit weird because for an instantiated Series (or DecimalChunked) it will never be None; however, it is required to pass the information downstream that "Decimal scale has to be inferred". I initially used placeholders like usize::MAX, but I'd prefer for it to crash loudly if something goes wrong.

This is also the case for Categorical. The data types can hold information about a column that is only known after the column is created. 👍

> Removed precision from AnyValue::Decimal.

Yes, that should be in DataType. AnyValue should only hold varying data; the constant data per column should be in DataType.

Review threads: polars/polars-core/src/chunked_array/logical/decimal.rs, py-polars/polars/utils.py, py-polars/src/conversion.rs
@ritchie46 (Member) left a review comment:

Alright. Thanks a lot @aldanor! This is great functionality to have. I have left a few minor remarks and then it is good to go.

Review threads: py-polars/src/conversion.rs, py-polars/polars/datatypes.py
@aldanor (Contributor, Author) commented Mar 2, 2023

@ritchie46 fixed/rebased, thanks for pointing a few things out 🙂 I think even though there's tons of work ahead and it's in a barely-working state, we should merge it in, because otherwise constant rebasing will become pretty painful given the large number of files affected here...

@ritchie46 (Member):

Thanks a lot @aldanor. Is it ok if I ping you on future decimal-related issues?

@ritchie46 ritchie46 merged commit 0148801 into pola-rs:master Mar 3, 2023
@aldanor aldanor deleted the feature/int128-decimal branch March 3, 2023 09:12
@aldanor (Contributor, Author) commented Mar 3, 2023

> Thanks a lot @aldanor. Is it ok if I ping you on future decimal-related issues?

@ritchie46 👍

@marsupialtail:

Some of my workloads are now breaking because Polars infers things to be decimal type and then proceeds to complain that some operations are not supported on the decimal column. This makes me unable to use the latest 0.16.10; everything works for me on 0.16.8. FYI @ritchie46 @aldanor

@ritchie46 (Member):

Can you tell which operations are missing?

Labels: A-dtype-decimal (Area: decimal data type), enhancement, python (Related to Python Polars), rust (Related to Rust Polars)