Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Format] Add Decimal32 and Decimal64 to Arrow #43956

Open
2 of 5 tasks
zeroshade opened this issue Sep 4, 2024 · 0 comments
Open
2 of 5 tasks

[Format] Add Decimal32 and Decimal64 to Arrow #43956

zeroshade opened this issue Sep 4, 2024 · 0 comments

Comments

@zeroshade
Copy link
Member

zeroshade commented Sep 4, 2024

Describe the enhancement requested

Instead of adding entirely new types, we should widen the existing Decimal128/256 types to allow bit widths of 32 and 64.

This will require:

  • Update specification and Schema.fbs documentation
  • Implement basic type, arrays, builders, scalars
  • Update compute / acero kernels with the new types
  • Update Parquet implementation to leverage the new bitwidths where available
  • Add test cases for Decimal32/Decimal64 to existing tests and integration test cases

Component(s)

Format

zeroshade added a commit to zeroshade/arrow that referenced this issue Sep 4, 2024
zeroshade added a commit to zeroshade/arrow that referenced this issue Sep 5, 2024
zeroshade added a commit to zeroshade/arrow that referenced this issue Sep 5, 2024
zeroshade added a commit to zeroshade/arrow that referenced this issue Sep 10, 2024
zeroshade added a commit that referenced this issue Sep 10, 2024
### Rationale for this change
Widening the Decimal128/256 type to allow for bitwidths of 32 and 64
allows for more interoperability with other libraries and utilities
which already support these types. This provides even more opportunities
for zero-copy interactions between things such as libcudf and various
databases.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

### What changes are included in this PR?
Updating the documentation in Schema.fbs to explicitly state that 32-bit
and 64-bit is now allowed for bitwidths of Decimal types. This is the
only area in the the spec that mentions the allowed decimal bitwidths.
* GitHub Issue: #43956

---------

Co-authored-by: Antoine Pitrou <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this issue Sep 14, 2024
…43976)

### Rationale for this change
Widening the Decimal128/256 type to allow for bitwidths of 32 and 64
allows for more interoperability with other libraries and utilities
which already support these types. This provides even more opportunities
for zero-copy interactions between things such as libcudf and various
databases.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

### What changes are included in this PR?
Updating the documentation in Schema.fbs to explicitly state that 32-bit
and 64-bit is now allowed for bitwidths of Decimal types. This is the
only area in the the spec that mentions the allowed decimal bitwidths.
* GitHub Issue: apache#43956

---------

Co-authored-by: Antoine Pitrou <[email protected]>
zeroshade added a commit to zeroshade/arrow that referenced this issue Sep 30, 2024
…43976)

Widening the Decimal128/256 type to allow for bitwidths of 32 and 64
allows for more interoperability with other libraries and utilities
which already support these types. This provides even more opportunities
for zero-copy interactions between things such as libcudf and various
databases.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Updating the documentation in Schema.fbs to explicitly state that 32-bit
and 64-bit is now allowed for bitwidths of Decimal types. This is the
only area in the the spec that mentions the allowed decimal bitwidths.
* GitHub Issue: apache#43956

---------

Co-authored-by: Antoine Pitrou <[email protected]>
zeroshade added a commit to zeroshade/arrow that referenced this issue Sep 30, 2024
zeroshade added a commit that referenced this issue Sep 30, 2024
…ns (#43957)

<!--
Thanks for opening a pull request!
If this is your first pull request you can find detailed information on
how
to contribute here:
* [New Contributor's
Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
* [Contributing
Overview](https://arrow.apache.org/docs/dev/developers/overview.html)


If this is not a [minor
PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes).
Could you open an issue for this pull request on GitHub?
https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the
[Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.)
of the Apache Arrow project.

Then could you also rename the pull request title in the following
format?

    GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

    MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

    PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

-->

### Rationale for this change
Widening the Decimal128/256 type to allow for bitwidths of 32 and 64
allows for more interoperability with other libraries and utilities
which already support these types. This provides even more opportunities
for zero-copy interactions between things such as libcudf and various
databases.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

### What changes are included in this PR?
This PR contains the basic C++ implementations for Decimal32/Decimal64
types, arrays, builders and scalars. It also includes the minimum
necessary to get everything compiling and tests passing without also
extending the acero kernels and parquet handling (both of which will be
handled in follow-up PRs).

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

### Are these changes tested?
Yes, tests were extended where applicable to add decimal32/decimal64
cases.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

### Are there any user-facing changes?
Currently if a user is using `decimal(precision, scale)` rather than
`decimal128(precision, scale)` they will get a `Decimal128Type` if the
precision is <= 38 (max precision for Decimal128) and `Decimal256Type`
if the precision is higher. Following the same pattern, this change
means that using `decimal(precision, scale)` instead of the specific
`decimal32`/`decimal64`/`decimal128`/`decimal256` functions results in
the following functionality:

- for precisions [1 : 9] => `Decimal32Type`
- for precisions [10 : 18] => `Decimal64Type`
- for precisions [19 : 38] => `Decimal128Type`
- for precisions [39 : 76] => `Decimal256Type`

While many of our tests currently make the assumption that `decimal`
with a low precision would be `Decimal128` and had to be updated, this
may cause an initial surprise if users are making the same assumptions.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please uncomment the
line below and explain which changes are breaking.
-->
<!-- **This PR includes breaking changes to public APIs.** -->

<!--
Please uncomment the line below (and provide explanation) if the changes
fix either (a) a security vulnerability, (b) a bug that caused incorrect
or invalid data to be produced, or (c) a bug that causes a crash (even
when the API contract is upheld). We use this to highlight fixes to
issues that may affect users without their knowledge. For this reason,
fixing bugs that cause errors don't count, since those are usually
obvious.
-->
<!-- **This PR contains a "Critical Fix".** -->
* GitHub Issue: #43956

---------

Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
zeroshade added a commit to zeroshade/arrow that referenced this issue Dec 12, 2024
zeroshade added a commit that referenced this issue Dec 13, 2024
<!--
Thanks for opening a pull request!
If this is your first pull request you can find detailed information on
how
to contribute here:
* [New Contributor's
Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
* [Contributing
Overview](https://arrow.apache.org/docs/dev/developers/overview.html)


If this is not a [minor
PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes).
Could you open an issue for this pull request on GitHub?
https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the
[Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.)
of the Apache Arrow project.

Then could you also rename the pull request title in the following
format?

    GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

    MINOR: [${COMPONENT}] ${SUMMARY}

-->

### Rationale for this change
Furthering the support for Decimal32/Decimal64 among Acero and casting
functionality. This is also necessary for #44882 to add support for
Decimal32/64 to PyArrow

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

### What changes are included in this PR?
Adding kernels for casting to and from Decimal32/Decimal64 between
numeric, floating point, string and other decimal types.

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

### Are these changes tested?
Yes, unit tests are added accordingly.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

* GitHub Issue: #43956
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant