
feat(rust,python): initial working version of Decimal Series #7220

Merged – 20 commits, Mar 3, 2023

Conversation

@aldanor (Contributor) commented Feb 27, 2023

No description provided.

@aldanor (Contributor, Author) commented Feb 27, 2023

@plaflamme @ritchie46 Pushing my WIP branch in its current state; this obviously needs further work before we even get to arithmetic, ops and such. But first I think we should decide whether things are generally correct type-wise/dtype-wise etc., given that decimals are a bit weird compared to other types.

// there's also a standing bug with creating decimal series from AnyValues – just inferring the type from the first decimal doesn't work; you have to scan the entire sequence first to determine the scale

@aldanor force-pushed the feature/int128-decimal branch 3 times, most recently from 23b4e60 to 50e2dbe on February 27, 2023 at 11:55
@ritchie46 (Member):

> // there's also a standing bug with creating decimal series from AnyValues – just inferring the type from the first decimal doesn't work; you have to scan the entire sequence first to determine the scale

Maybe we could start with a negative scale (indicating that it still needs to be set) and then keep track of it while we scan the items. Another strategy is to always use a default scale if not set by the user. We do the same with datetime units.

@aldanor (Contributor, Author) commented Feb 27, 2023

> // there's also a standing bug with creating decimal series from AnyValues – just inferring the type from the first decimal doesn't work; you have to scan the entire sequence first to determine the scale
>
> Maybe we could start with a negative scale (indicating that it still needs to be set) and then keep track of it while we scan the items. Another strategy is to always use a default scale if not set by the user. We do the same with datetime units.

I think it's a bit different here from datetime units, at least if we restrict it to decimal.Decimal (and potentially Arrow scalar decimal128). Those always have a defined scale; e.g. you can have an input of

[
    Decimal('1.23'),    # scale = 2
    Decimal('-1000'),   # scale = 0
    Decimal('0.0001'),  # scale = 4
    Decimal('0'),       # scale = 0
]

In this case you have to use (prec = None, scale = 4), ending up with the following i128 values:

[1_2300, -1000_0000, 1, 0]
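The scale inference above can be sketched in plain Python with decimal.Decimal (this is only an illustration of the rule being described, not Polars code):

```python
from decimal import Decimal

values = [Decimal('1.23'), Decimal('-1000'), Decimal('0.0001'), Decimal('0')]

# the scale of a Decimal is its negated exponent, clamped at 0 for integral values
scales = [max(-v.as_tuple().exponent, 0) for v in values]  # [2, 0, 4, 0]
scale = max(scales)  # 4

# rescale every value to the common scale and keep the raw integer representation
raw = [int(v.scaleb(scale)) for v in values]  # [12300, -10000000, 1, 0]
```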

One exception would be if we allow mixing integers and Decimals in AnyValues; e.g., ideally this should produce the same output:

[Decimal('1.23'), -1000, Decimal('0.0001'), Decimal('0')]

The logic is slightly different if a decimal dtype has already been provided to us – then we have to check that (1) the resulting scale is >= the actual scale of each item, (2) there's no i128 overflow while rescaling, and (3) if precision is set, it is respected.
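Those three checks could look roughly like this in plain Python (a hedged sketch; to_raw and its error-handling style are hypothetical, not the actual Polars routine):

```python
from decimal import Decimal

I128_MIN, I128_MAX = -(2 ** 127), 2 ** 127 - 1

def to_raw(value, scale, precision=None):
    # (1) the target scale must be >= the actual scale of the item
    item_scale = max(-value.as_tuple().exponent, 0)
    if item_scale > scale:
        raise ValueError(f"item scale {item_scale} exceeds target scale {scale}")
    # (2) rescaling must not overflow i128
    raw = int(value.scaleb(scale))
    if not I128_MIN <= raw <= I128_MAX:
        raise OverflowError("rescaled value does not fit in i128")
    # (3) if precision is set, it must be respected (count of significant digits)
    if precision is not None and len(str(abs(raw))) > precision:
        raise ValueError(f"value exceeds precision {precision}")
    return raw
```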

@ritchie46 (Member):

> I think it's a bit different here from datetime units, at least if we restrict it to decimal.Decimal (and potentially Arrow scalar decimal128). Those always have a defined scale; e.g. you can have an input of

Yeap, I understand. What I mean is that upon inference we set the DataType to Decimal(?, ?), and once we have built the array we fill in the blanks. We do the same for Categoricals: the Categorical datatype stores the dictionary, which is only known once we have finished building it.

@ritchie46 (Member):

@aldanor shall we try to get an absolutely minimal version working and merge that in? I'd rather break this down into smaller steps and see what we encounter.

@aldanor (Contributor, Author) commented Feb 27, 2023

> @aldanor shall we try to get an absolutely minimal version working and merge that in? I'd rather break this down into smaller steps and see what we encounter.

I think only after writing at least some tests for the basic functionality that has been added here – formatting, parsing, converting to/from Python decimal, from Arrow, to/from AnyValues, etc. – I would be pretty uncomfortable merging it otherwise. I can jump on adding some tests here; if you have any particular suggestions for test cases, please feel free to post them.

// Btw: the failing Python tests on the CI are actually a sign that things sort of work as intended – they are the ones that checked that Decimal was converted to Float64, which is no longer the case.

@ritchie46 (Member):

> I think only after writing at least some tests for the basic functionality that has been added here – formatting, parsing, converting to/from Python decimal, from Arrow, to/from AnyValues, etc. – I would be pretty uncomfortable merging it otherwise. I can jump on adding some tests here; if you have any particular suggestions for test cases, please feel free to post them.

The Python tests are the preferred test locations. These are the ultimate integration tests and don't hurt compile times. If we want to test on the Rust side, the integration tests folder tests/it/ is preferred. We had very slow CI builds with inline tests; that's why we moved them to tests/it.

@aldanor (Contributor, Author) commented Feb 27, 2023

So, I guess the conversion logic must go into Series::from_any_values_and_dtype(). What's a bit weird is that it's unclear what the dtype being provided is and where it comes from (because for decimals, to come up with the dtype, you need to scan the series, so it makes sense to do it in Rust).

The dtype in Series::from_any_values_and_dtype() – must it be respected strictly? I.e., if it tells you dtype == d1, can I return dtype == d2? Or should I try hard to fit the values into d1 and fail if they don't fit?

So, where does this dtype even come from – is it mainly internal inference, or an external dtype=...?

I was thinking the logic should be something like this:

  • Scan through the AnyValues list once, computing the min and max scale (where integers get treated as scale=0).
  • If min_scale == max_scale == dtype.scale, all good, no conversions needed. Do another pass, convert all raw values to i128, and build the Series. For this last step: what if prec != None? Do we check values here to see if they fit within the precision range?
  • If min_scale != max_scale, some elements need to be rescaled. Same if either min_scale or max_scale != dtype.scale?... Create a new vector and rescale everything to max_scale if possible, then go back to the previous step.
  • If some AnyValue objects, hypothetically, have prec != None, what do we do? What if some have prec = None and some have prec != None? How does that play with dtype.prec?
  • I was thinking, should AnyValue even have precision? Python integers and decimals don't have precision. The only way to get it would be either from Arrow scalars (?) or by indexing pl.Series itself (but if you do that in Python, you get back a Python decimal and the precision is lost again, so is there a point?). Really leaning towards dropping precision from AnyValue so it would become just AnyValue::Decimal(value: i128, scale: usize) (any unforeseen catches?)
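The scan-then-convert steps above could be sketched as follows (plain Python; all names hypothetical, with integers treated as scale=0 and None passed through as null):

```python
from decimal import Decimal

def item_scale(v):
    # integers are treated as scale=0; Decimals report their own scale
    return 0 if isinstance(v, int) else max(-v.as_tuple().exponent, 0)

def any_values_to_decimal(values, target_scale=None):
    # pass 1: scan once to determine the common scale (or validate a given one)
    scales = [item_scale(v) for v in values if v is not None]
    scale = max(scales, default=0) if target_scale is None else target_scale
    if any(s > scale for s in scales):
        raise ValueError("an item's scale exceeds the target scale")
    # pass 2: rescale every item to the common scale as a raw integer
    raw = [None if v is None else int(Decimal(v).scaleb(scale)) for v in values]
    return raw, scale
```

With the mixed input from earlier, this yields ([12300, -10000000, 1, 0], 4); a real implementation could skip the second pass's work when no rescaling is needed.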

@plaflamme (Contributor) left a review comment:

My 2 cents. This looks great BTW 👍

Review threads: polars/polars-core/src/datatypes/dtype.rs, py-polars/src/conversion.rs
@aldanor (Contributor, Author) commented Feb 28, 2023

@ritchie46 @plaflamme Here's a new batch of updates (the whole thing is brittle as f, I'm surprised it actually works, but it does; lots of work ahead to cover various corner cases!)

  • DataType::Decimal now has an optional scale (heh, we've come full circle). It's a bit weird because for an instantiated Series (or DecimalChunked) it will never be None; however, it is required to pass the information downstream that "Decimal scale has to be inferred". I initially used placeholders like usize::MAX, but I'd prefer for it to crash loudly if something goes wrong.
  • Removed precision from AnyValue::Decimal.
  • All existing tests now pass (had to add special support for dataclasses).
  • Added a few very basic tests (instantiation of series/frames with a mixture of decimals, ints and Nones, and then recovering them back into Python).
  • Supertype merging logic and constructor type fallback logic are now implemented. This means an array of decimals + ints + Nones will be cast to decimals, regardless of order. Decimals + floats should yield floats, but that's untested.
  • Reworked the &[AnyValue] -> Decimal logic completely, with a double-pass approach (no allocations/ops if no conversions are needed, though).
  • Conversion to a Decimal series will now do a precision check (still not sure if we should check for 38 digits if precision=None).
  • Someone please take a detailed look at the any_values_to_decimal() conversion routine; that one wasn't so easy to figure out...
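On the precision question in that list: decimal128 can represent at most 38 significant digits, so a check where precision=None falls back to 38 might look like this (hypothetical helper, just to make the question concrete):

```python
def check_precision(raw_values, precision=None):
    # decimal128 holds at most 38 decimal digits; use that as the fallback cap
    max_digits = 38 if precision is None else precision
    for raw in raw_values:
        if raw is not None and len(str(abs(raw))) > max_digits:
            raise ValueError(f"{raw} does not fit in precision {max_digits}")
```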

@aldanor force-pushed the feature/int128-decimal branch 2 times, most recently from 104d66f to 57d5905 on February 28, 2023 at 18:32
@aldanor aldanor marked this pull request as ready for review February 28, 2023 18:40
@aldanor (Contributor, Author) commented Feb 28, 2023

Next steps, I think: write tests for all possible ways to build series/frames with decimals in them from Python/Arrow, and to convert out of them. It might make sense to write them even if they fail and mark them as @xfail, just to keep the 'todo list' within the codebase.

There are definitely gaps to fill and decisions to make (e.g., should pl.Series(['0.01'], dtype=pl.Decimal) yield a decimal series? It would be cool if it did. What's the supertype of Decimal and str? Not sure).
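If string inputs were supported, the natural route would be through decimal.Decimal, which infers the scale for free. A sketch under that assumption (parse_decimal is hypothetical; nothing like it exists in this PR):

```python
from decimal import Decimal, InvalidOperation

def parse_decimal(s):
    # parse a string into (raw_value, scale), e.g. '0.01' -> (1, 2)
    try:
        d = Decimal(s)
    except InvalidOperation:
        raise ValueError(f"not a valid decimal: {s!r}") from None
    scale = max(-d.as_tuple().exponent, 0)
    return int(d.scaleb(scale)), scale
```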

@aldanor changed the title from "[WIP] feat(rust,python): initial working version of Decimal Series" to "feat(rust,python): initial working version of Decimal Series" on Feb 28, 2023
The github-actions bot added the labels enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars) and rust (Related to Rust Polars) on Feb 28, 2023
@ritchie46 (Member):

> DataType::Decimal now has an optional scale (heh, we've come full circle). It's a bit weird because for an instantiated Series (or DecimalChunked) it will never be None; however, it is required to pass the information downstream that "Decimal scale has to be inferred". I initially used placeholders like usize::MAX, but I'd prefer for it to crash loudly if something goes wrong.

This is also the case for Categorical. The data types can hold information about a column that is only known after the column is created. 👍

> Removed precision from AnyValue::Decimal.

Yes, that should be in DataType. AnyValue should only hold varying data; the constant data per column should be in DataType.

Review threads: polars/polars-core/src/chunked_array/logical/decimal.rs, py-polars/polars/utils.py, py-polars/src/conversion.rs
@ritchie46 (Member) left a review comment:

Alright. Thanks a lot @aldanor! This is great functionality to have. I have left a few minor remarks and then it is good to go.

Review threads: py-polars/src/conversion.rs, py-polars/polars/datatypes.py
@aldanor (Contributor, Author) commented Mar 2, 2023

@ritchie46 fixed/rebased, thanks for pointing a few things out 🙂 I think even though there's tons of work ahead and it's in a barely-working state, we should merge it in, because otherwise constant rebasing will become pretty painful given the large number of files affected here...

@ritchie46 (Member):

Thanks a lot @aldanor. Is it ok if I ping you on future decimal-related issues?

@ritchie46 ritchie46 merged commit 0148801 into pola-rs:master Mar 3, 2023
@aldanor aldanor deleted the feature/int128-decimal branch March 3, 2023 09:12
@aldanor (Contributor, Author) commented Mar 3, 2023

> Thanks a lot @aldanor. Is it ok if I ping you on future decimal-related issues?

@ritchie46 👍

@marsupialtail:

Some of my workloads are now breaking because Polars infers things to be decimal type and then proceeds to complain that some operations are not supported on the decimal column. This makes me unable to use the latest 0.16.10; everything works for me on 0.16.8. FYI @ritchie46 @aldanor

@ritchie46 (Member):

Can you tell which operations are missing?

Labels: A-dtype-decimal (Area: decimal data type), enhancement, python (Related to Python Polars), rust (Related to Rust Polars)