Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix warnings in test_csv.py. #10362

Merged
merged 3 commits into from
Mar 1, 2022

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Feb 25, 2022

This PR silences warnings in test_csv.py. (I am working through one test file at a time so we can enable -Werr in the future.)

The only warning in this file is related to integer overflow in pandas. Currently, the test data is as follows:

@pytest.mark.parametrize(
"pdf_dtype, gdf_dtype",
[(None, None), ("int", "hex"), ("int32", "hex32"), ("int64", "hex64")],
)
def test_csv_reader_hexadecimals(pdf_dtype, gdf_dtype):
lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF", "9512c20b"]
values = [int(hex_int, 16) for hex_int in lines]

First, I note that this "hex" dtype is not part of the pandas API. It is a cuDF addition (#1925, #2149).

Note that there are dtypes for int32 / hex32, and the test data contains both a negative value -0x1000 and a value 9512c20b. The negative value -0x1000 has a sensible interpretation if the results are meant to be signed, but then the value 9512c20b is out of range (the maximum signed 32-bit value would be 0x7FFFFFFF and the minimum signed 32-bit value would be 0x80000000, using the big-endian convention of the parser). Recognizing this, pandas throws a FutureWarning when parsing the data 9512c20b as int32, and unsafely wraps it to a negative value. This behavior will eventually be replaced by an OverflowError.

In the future, we may need to decide if cuDF should raise an OverflowError when exceeding 0x7FFFFFFF for consistency with pandas, or decide to use unsigned integers when parsing "hex" dtypes and compare to pandas' unsigned types in this test.

@bdice bdice added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 25, 2022
@bdice bdice self-assigned this Feb 25, 2022
@bdice bdice requested a review from a team as a code owner February 25, 2022 22:16
@codecov
Copy link

codecov bot commented Feb 25, 2022

Codecov Report

Merging #10362 (d7045b1) into branch-22.04 (a7d88cd) will increase coverage by 0.15%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10362      +/-   ##
================================================
+ Coverage         10.42%   10.58%   +0.15%     
================================================
  Files               119      125       +6     
  Lines             20603    21058     +455     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18830     +375     
Impacted Files Coverage Δ
...ython/custreamz/custreamz/tests/test_dataframes.py 99.39% <0.00%> (-0.01%) ⬇️
python/cudf/cudf/errors.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/ops.py 0.00% <0.00%> (ø)
python/cudf/cudf/datasets.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
... and 43 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64ee514...d7045b1. Read the comment docs.

@vyasr
Copy link
Contributor

vyasr commented Feb 28, 2022

@shwina @galipremsagar @vuule what do you think about @bdice's question here? Should we be changing cuIO's Python bindings here to match pandas? For now I think silencing the pandas warning as we do here is fine, but we should add a TODO comment or something indicating how we want to handle this when pandas starts throwing the error.

@vuule
Copy link
Contributor

vuule commented Feb 28, 2022

@shwina @galipremsagar @vuule what do you think about @bdice's question here? Should we be changing cuIO's Python bindings here to match pandas? For now I think silencing the pandas warning as we do here is fine, but we should add a TODO comment or something indicating how we want to handle this when pandas starts throwing the error.

IIRC, we generally don't want to promise overflow checking in libcudf for performance reasons. So I'm fine with not following Pandas behavior here.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a useful warning, as the test will be broken once Pandas behavior changes.
@bdice could you instead move the overflowing value to a separate test (that does not use Pandas)?

@bdice
Copy link
Contributor Author

bdice commented Feb 28, 2022

I read the conversation on #1925 again, and I now understand that the design intended to use signed values when parsing. That clarifies the intended behavior, so it's just a matter of separating this test as @vuule described. This design decision surprises me a bit (I would have expected 9512c20b to map to 2501034507 and not -1793932789).

@bdice bdice force-pushed the no-warnings-test_csv branch from 6ddf6e8 to e015b23 Compare February 28, 2022 20:45
@bdice bdice requested a review from vuule February 28, 2022 20:45
@bdice
Copy link
Contributor Author

bdice commented Feb 28, 2022

I added some tests that compare with NumPy and expand the tested range of overflow values.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype):
# This tests values which cause an overflow warning that will become an
# error in pandas. NumPy wraps the overflow silently up to the bounds of a
# signed int64.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it always 64?

Copy link
Contributor Author

@bdice bdice Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. By default, numpy treats types larger than int64/uint64 with object as the dtype and uses a Python int to back it. There are larger types, like np.int128, but they're not used by default. The wider the type, the less-wide the support, I guess.

Copy link
Contributor Author

@bdice bdice Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: I was thinking of floating types, not integral types. There are 128-bit float types (depending on the platform/build of NumPy) like np.longdouble but 128-bit integers are not in the NumPy API. However, 128-bit integers seem to be used internally and references can be found in a few places in the source.

References:

>>> np.finfo(np.longdouble)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)

@bdice
Copy link
Contributor Author

bdice commented Mar 1, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5d8ea19 into rapidsai:branch-22.04 Mar 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants