Provide an Option for Default Integer and Floating Bitwidth #11272

isVoid · 2022-07-15T00:46:35Z

This PR introduces a cudf option that lets users control the default bitwidth for integer and floating-point types. The first iteration provides only three values: None, 32-bit and 64-bit. When the option is set to None, the result dtype aligns with what pandas constructs. Otherwise, it defaults to the bitwidth the user specifies.

"Default" implies that the option should only affect places that require type inference, which include:

CSV/JSON readers when dtypes are not specified
cuDF constructors
Materializing a range index.
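
As a rough illustration of the behavior described above, here is a minimal sketch in plain Python that works on dtype names only. The `OPTIONS` dictionary and the `resolve_default_dtype` helper are hypothetical stand-ins invented for this example; they are not the code in this PR and do not call cudf.

```python
import re

# Hypothetical stand-in for the option store. None means "keep the inferred dtype".
OPTIONS = {"default_integer_bitwidth": None, "default_float_bitwidth": None}

# Matches plain numeric dtype names such as "int64", "uint8" or "float32".
_NUMERIC_DTYPE = re.compile(r"^(int|uint|float)(\d+)$")


def resolve_default_dtype(inferred):
    """Return the dtype name to use for a dtype that was inferred, not requested."""
    match = _NUMERIC_DTYPE.match(inferred)
    if match is None:
        # Non-numeric dtypes (for example "object") are left untouched.
        return inferred
    kind = match.group(1)
    key = "default_float_bitwidth" if kind == "float" else "default_integer_bitwidth"
    bitwidth = OPTIONS[key]
    if bitwidth is None:
        # No default configured: keep whatever inference produced.
        return inferred
    return f"{kind}{bitwidth}"


OPTIONS["default_integer_bitwidth"] = 32
print(resolve_default_dtype("int64"))    # int32
print(resolve_default_dtype("uint64"))   # uint32
print(resolve_default_dtype("float64"))  # float64, the float option is still None
```

The sketch only applies to dtypes that come out of inference. A dtype passed explicitly by the caller would bypass a helper like this entirely.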

This PR is the first demonstration use of cudf.options and depends on #11193. The diff will shrink once that PR is merged.

closes #11182 #10318


shwina · 2022-07-27T20:10:54Z

python/cudf/cudf/tests/conftest.py

+def default_32bit_integer():
+    cudf.set_option("default_integer_bitwidth", 32)
+    yield
+    cudf.set_option("default_integer_bitwidth", 64)


Should this be None rather than hardcoded to 64?

Don't we actually want to be clear to the user what the default bit width is? If you allow None for the default bitwidth, I assume that conveys "use whatever inferred dtype the developer of your function decided to use, anywhere", which doesn't sound like a good user experience to me.

Using None is helpful for reducing API promises. We can document how None is interpreted, but I think it’s fine to use None here.

That's fair, this makes three available values for the option: 64, 32 and None.

@bdice - Michael and I were pairing on the remaining bits of this PR and we decided not to support None as an option right now. As far as we can tell, None makes an artificial distinction between Pandas' current behavior and defaulting to 64-bit data types (Pandas seems to always do that anyway).

That being said, if we do discover a case where e.g., Pandas infers data types differently in different places, we can revisit and change the default to None (Pandas behaviour).

As for this fixture in particular, I think the following change would suffice:

def default_32bit_integer():
    default_integer_bitwidth = cudf.get_option("default_integer_bitwidth")
    cudf.set_option("default_integer_bitwidth", 32)
    yield
    cudf.set_option("default_integer_bitwidth", default_integer_bitwidth)
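
The save-and-restore pattern in the fixture above can be tried without cudf or pytest. The following is a minimal sketch, assuming a plain dictionary as the option store; `_options`, `get_option`, `set_option` and `temporary_option` are hypothetical stand-ins written for this example, and the pattern is expressed as a context manager rather than a pytest fixture so that it runs on its own.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the option registry.
_options = {"default_integer_bitwidth": None}


def get_option(name):
    return _options[name]


def set_option(name, value):
    _options[name] = value


@contextmanager
def temporary_option(name, value):
    """Set an option for the duration of a block, then restore the old value."""
    previous = get_option(name)
    set_option(name, value)
    try:
        yield
    finally:
        # Runs even if the block raises, so later code sees the old value.
        set_option(name, previous)


with temporary_option("default_integer_bitwidth", 32):
    inside = get_option("default_integer_bitwidth")
after = get_option("default_integer_bitwidth")
print(inside, after)  # 32 None
```

Reading the previous value instead of hardcoding 64 means the helper keeps working if the default later changes, for example to None.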

Since I wrote this comment we discovered a place where defaulting to 64-bit inference would break current code, but it's subtle:

On branch-22.08, the dtype of a Series constructed from a list of numpy scalars is inferred via NumPy:

>>> cudf.Series([np.int8(1), np.int16(1)])
0    1
1    1
dtype: int16

In this PR, the default data type will be used:

>>> cudf.set_option("default_integer_bitwidth", 32)
>>> cudf.Series([np.int8(1), np.int16(1)])
0    1
1    1
dtype: int32

To prevent breakages in places like this, we decided we should support None after all :-)

Sorry for the noise.

Wow! That’s a great catch. Goes to show just how complex this logic really is…

The None option has been added and the tests are updated accordingly (defaults of both 64 and 32 are tested).

python/cudf/cudf/utils/dtypes.py

shwina

Approving with a couple of minor suggestions. This is really great work!

This PR adds `cudf.options`, a global dictionary to store configurations. A set of helper functions to manage the registries are also included. See documentation included in the PR for detail. See demonstration use in: #11272

Closes #5311

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Matthew Roeschke (https://github.com/mroeschke)
- Bradley Dice (https://github.com/bdice)

URL: #11193


vyasr

Michael and I discussed this and addressed my review comments in person, so I'm approving now pending his pushing all those final changes.

python/cudf/cudf/options.py

python/cudf/cudf/utils/dtypes.py

shwina · 2022-07-29T17:27:02Z

@gpucibot merge

shwina · 2022-08-01T13:39:38Z

rerun tests

This PR fixes a flaky test introduced by #11272. By default, cudf joins do not guarantee the order of the returned rows, which may lead to occasional test failures. This PR adds the `sort` argument to make sure the result is deterministic. Note that `index.union` and `index.intersection` may also produce nondeterministic output ordering, but by default these methods sort the result before returning, so the `sort` argument does not need to be modified.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- https://github.com/brandon-b-miller
- Nghia Truong (https://github.com/ttnghia)

These operators rely on a method that was renamed in #11272 and are also out of sync with the rest of the `RangeIndex` design now that the `__getattr__` overload has been removed (#10538).

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11868

isVoid added 20 commits July 1, 2022 16:13

Config functions and importing to top level namespace

7acc583

Add config tests

c25240c

add documentation

bfa10b6

Merge branch 'feature/config' into fea/defatul_bit_width_python

5e85c52

Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …

82e2c44

…fea/defatul_bit_width_python

csv downcast and add benchmark file

76e5ee6

add default int bitwidth config

6092486

Update to options and conform to pandas API

68878cb

documentation updates

e093eb6

More docs update

bbd9cd4

Merge branch 'feature/config' into fea/defatul_bit_width_python

854730e

Adds default int bitwidth option

1ca48c7

Adds handling to unsigned integers

d36a374

Adds test cases

7a03bce

factor out as helper

20966f2

Move fixture to shared conftest, fix utils bug

c93575e

Add json tests

7041b9d

copyright

f415b93

Add CSV float column tests

d68c622

Add cudf only json tests

26affb5

github-actions bot added the Python (Affects Python cuDF API) label Jul 15, 2022

isVoid added 9 commits July 15, 2022 18:09

make changes to column constructor

c141600

rename fixture

75e74f3

default scalar bitwidth

897b69a

Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …

23af144

…fea/defatul_bit_width_python

Materializing rangeindex into 32bit cols.

6e4286d

style

5caa010

remove csv benchmark

ea08f56

style

ddfc1ed

Docs update

90cbe33

Fix failed csv tests

e8a02a3

shwina reviewed Jul 27, 2022


python/cudf/cudf/utils/dtypes.py (outdated)

shwina approved these changes Jul 27, 2022


isVoid added 4 commits July 28, 2022 14:55

Add 'None' as default option

ea1f687

Modify tests to address lastest change

f829611

Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …

a1de9a3

…fea/defatul_bit_width_python

Add mixed numpy ints as test case

5c7c3d5

vyasr approved these changes Jul 29, 2022


isVoid added 4 commits July 28, 2022 17:13

Add default index constructor tests

a63b760

Fix bug using elif in float bitwith option test

283ca6e

Use canonical index constructor instead of getattr

d9d9799

style fix

26c2bfc

isVoid mentioned this pull request Jul 29, 2022

[FEA] Offer a User Configurable Option to Limit the Output Precision of Binary Ops #11167

Closed

Avoid shadowing dtype variable

5bb5b4e

shwina reviewed Jul 29, 2022


python/cudf/cudf/options.py

shwina reviewed Jul 29, 2022


python/cudf/cudf/utils/dtypes.py (outdated)

isVoid added 2 commits July 29, 2022 10:01

Function rename

0f938d9

style, docstring

9e08a1a

rapids-bot bot merged commit 734cc1f into rapidsai:branch-22.08 Aug 1, 2022

isVoid mentioned this pull request Aug 3, 2022

Make Index Join Tests on Default Precisions Deterministic #11451

Merged

3 tasks

vyasr mentioned this pull request Oct 5, 2022

Fix RangeIndex unary operators. #11868

Merged

3 tasks

GregoryKimball mentioned this pull request Nov 20, 2022

[FEA] Float32 and Int32 as default for variables. #10318

Closed


isVoid commented Jul 15, 2022 (edited)

shwina Jul 27, 2022

isVoid Jul 27, 2022

bdice Jul 27, 2022

isVoid Jul 27, 2022

shwina Jul 28, 2022

shwina Jul 28, 2022 (edited)

bdice Jul 28, 2022

isVoid Jul 28, 2022 (edited)

shwina left a comment

vyasr left a comment

shwina commented Jul 29, 2022

shwina commented Aug 1, 2022
