Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide an Option for Default Integer and Floating Bitwidth #11272

Merged
merged 64 commits into from
Aug 1, 2022

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Jul 15, 2022

This PR introduces a cudf option to allow user to control the default bitwidth for integer and floating types. The first iteration only plans to provide three options: None, 32bit and 64bit. When set as None, that means the result dtype will align with what pandas constructs. Otherwise, default to what user specifies.

"Default" implies that it should only affects places that requires type inference, that includes:

  • CSV/JSON readers when dtypes are not specified
  • cuDF constructors
  • Materializing a range index.

This PR is the first demonstration use of cudf.option, depending on #11193. Diff will reduce once it's merged.

closes #11182 #10318

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 15, 2022
def default_32bit_integer():
cudf.set_option("default_integer_bitwidth", 32)
yield
cudf.set_option("default_integer_bitwidth", 64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be None rather than hardcoded to 64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we actually want to be clear to user what the default bit width is? If you allow None for default bitwidth, I assume that conveys that "use whatever inferred dtype anywhere the developer of your function decides to use", which doesn't sounds like a good user experience to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using None is helpful for reducing API promises. We can document how None is interpreted, but I think it’s fine to use None here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, this makes three available values for the option: 64, 32 and None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdice - Michael and I were pairing on the remaining bits of this PR and we decided not to support None as an option right now. As far as we can tell, None makes an artificial distinction between Pandas' current behavior and defaulting to 64-bit data types (Pandas seems to always do that anyway).

That being said, if we do discover a case where e.g., Pandas infers data types differently in different places, we can revisit and change the default to None (Pandas behaviour).

As for this fixture in particular, I think the following change would suffice:

  def default_32bit_integer():
    default_integer_bitwidth = cudf.get_option("default_integer_bitwidth")
    cudf.set_option("default_integer_bitwidth", 32)
    yield
    cudf.set_option("default_integer_bitwidth", default_integer_bitwidth)

Copy link
Contributor

@shwina shwina Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I wrote this comment we discovered a place where defaulting to 64-bit inference would break current code, but it's subtle:

On branch-22.08, the dtype of a Series constructed from a list of numpy scalars is inferred via NumPy:

>>> cudf.Series([np.int8(1), np.int16(1)])
0    1
1    1
dtype: int16

In this PR, the default data type will be used:

>>> cudf.set_option("default_integer_bitwidth", 32)
>>> cudf.Series([np.int8(1), np.int16(1)])
0    1
1    1
dtype: int32

To prevent breakages in places like this, we decided we should support None after all :-)

Sorry for the noise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! That’s a great catch. Goes to show just how complex this logic really is…

Copy link
Contributor Author

@isVoid isVoid Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None option is added and tests are updated accordingly. (both default as 64 and 32 are tested)

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with a couple of minor suggestions. This is really great work!

rapids-bot bot pushed a commit that referenced this pull request Jul 28, 2022
This PR adds `cudf.options`, a global dictionary to store configurations. A set of helper functions to manage the registries are also included. See documentation included in the PR for detail.

See demonstration use in: #11272

Closes #5311

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)

URL: #11193
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael and I discussed this and addressed my review comments in person, so I'm approving now pending his pushing all those final changes.

@shwina
Copy link
Contributor

shwina commented Jul 29, 2022

@gpucibot merge

@shwina
Copy link
Contributor

shwina commented Aug 1, 2022

rerun tests

@rapids-bot rapids-bot bot merged commit 734cc1f into rapidsai:branch-22.08 Aug 1, 2022
ajschmidt8 pushed a commit that referenced this pull request Aug 3, 2022
This PR fixes a flaky test introduced by #11272, cudf joins by default does not guarantee return orders and may lead to occasional test regression. This PR adds `sort` argument to make sure result is deterministic.

Note that `index.union` and `index.intersection` may also include random output ordering, but by default these methods sorts the result before returning so `sort` argument does not need to be modified.

Authors:
   - Michael Wang (https://github.com/isVoid)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - https://github.com/brandon-b-miller
   - Nghia Truong (https://github.com/ttnghia)
@vyasr vyasr mentioned this pull request Oct 5, 2022
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Oct 6, 2022
These operators rely on a method that was renamed in #11272 and are also out of sync with the rest of the `RangeIndex` design now that the `__getattr__` overload has been removed (#10538).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11868
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Allow Configuring Default Bit Width Used for Json and Csv Readers
4 participants