-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide an Option for Default Integer and Floating Bitwidth #11272
Provide an Option for Default Integer and Floating Bitwidth #11272
Conversation
…fea/defatul_bit_width_python
…fea/defatul_bit_width_python
python/cudf/cudf/tests/conftest.py
Outdated
def default_32bit_integer(): | ||
cudf.set_option("default_integer_bitwidth", 32) | ||
yield | ||
cudf.set_option("default_integer_bitwidth", 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be None
rather than hardcoded to 64
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we actually want to be clear to user what the default bit width is? If you allow None
for default bitwidth, I assume that conveys that "use whatever inferred dtype anywhere the developer of your function decides to use", which doesn't sounds like a good user experience to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using None
is helpful for reducing API promises. We can document how None is interpreted, but I think it’s fine to use None here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, this makes three available values for the option: 64, 32 and None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdice - Michael and I were pairing on the remaining bits of this PR and we decided not to support None
as an option right now. As far as we can tell, None
makes an artificial distinction between Pandas' current behavior and defaulting to 64-bit data types (Pandas seems to always do that anyway).
That being said, if we do discover a case where e.g., Pandas infers data types differently in different places, we can revisit and change the default to None
(Pandas behaviour).
As for this fixture in particular, I think the following change would suffice:
def default_32bit_integer():
default_integer_bitwidth = cudf.get_option("default_integer_bitwidth")
cudf.set_option("default_integer_bitwidth", 32)
yield
cudf.set_option("default_integer_bitwidth", default_integer_bitwidth)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I wrote this comment we discovered a place where defaulting to 64-bit inference would break current code, but it's subtle:
On branch-22.08, the dtype of a Series constructed from a list of numpy scalars is inferred via NumPy:
>>> cudf.Series([np.int8(1), np.int16(1)])
0 1
1 1
dtype: int16
In this PR, the default data type will be used:
>>> cudf.set_option("default_integer_bitwidth", 32)
>>> cudf.Series([np.int8(1), np.int16(1)])
0 1
1 1
dtype: int32
To prevent breakages in places like this, we decided we should support None
after all :-)
Sorry for the noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! That’s a great catch. Goes to show just how complex this logic really is…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
option is added and tests are updated accordingly. (both default as 64 and 32 are tested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with a couple of minor suggestions. This is really great work!
This PR adds `cudf.options`, a global dictionary to store configurations. A set of helper functions to manage the registries are also included. See documentation included in the PR for detail. See demonstration use in: #11272 Closes #5311 Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Matthew Roeschke (https://github.com/mroeschke) - Bradley Dice (https://github.com/bdice) URL: #11193
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Michael and I discussed this and addressed my review comments in person, so I'm approving now pending his pushing all those final changes.
@gpucibot merge |
rerun tests |
This PR fixes a flaky test introduced by #11272, cudf joins by default does not guarantee return orders and may lead to occasional test regression. This PR adds `sort` argument to make sure result is deterministic. Note that `index.union` and `index.intersection` may also include random output ordering, but by default these methods sorts the result before returning so `sort` argument does not need to be modified. Authors: - Michael Wang (https://github.com/isVoid) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - https://github.com/brandon-b-miller - Nghia Truong (https://github.com/ttnghia)
These operators rely on a method that was renamed in #11272 and are also out of sync with the rest of the `RangeIndex` design now that the `__getattr__` overload has been removed (#10538). Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11868
This PR introduces a cudf option to allow user to control the default bitwidth for integer and floating types. The first iteration only plans to provide three options:
None
, 32bit and 64bit. When set asNone
, that means the result dtype will align with what pandas constructs. Otherwise, default to what user specifies."Default" implies that it should only affects places that requires type inference, that includes:
This PR is the first demonstration use of
cudf.option
, depending on #11193. Diff will reduce once it's merged.closes #11182 #10318