Fix warnings in test_csv.py. #10362

bdice · 2022-02-25T22:16:29Z

This PR silences warnings in test_csv.py. (I am working through one test file at a time so we can enable -Werr in the future.)

The only warning in this file is related to integer overflow in pandas. Currently, the test data is as follows:

cudf/python/cudf/cudf/tests/test_csv.py

Lines 1313 to 1319 in 21325e8

    
    @pytest.mark.parametrize(
        "pdf_dtype, gdf_dtype",
        [(None, None), ("int", "hex"), ("int32", "hex32"), ("int64", "hex64")],
    )
    def test_csv_reader_hexadecimals(pdf_dtype, gdf_dtype):
        lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF", "9512c20b"]
        values = [int(hex_int, 16) for hex_int in lines]

First, I note that this "hex" dtype is not part of the pandas API. It is a cuDF addition (#1925, #2149).

Note that there are dtypes for int32 / hex32, and the test data contains both a negative value -0x1000 and a value 9512c20b. The negative value -0x1000 has a sensible interpretation if the results are meant to be signed, but then the value 9512c20b is out of range: the maximum signed 32-bit value is 0x7FFFFFFF, and the minimum, -0x80000000, has the two's-complement bit pattern 0x80000000 (reading the digits most-significant first, as the parser does). Recognizing this, pandas emits a FutureWarning when parsing 9512c20b as int32 and unsafely wraps it to a negative value. This behavior will eventually be replaced by an OverflowError.
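The wrap-around described here can be reproduced with plain integer arithmetic. This is a minimal sketch, not cuDF or pandas code; the helper name `wrap_to_signed` is made up for illustration.

```python
def wrap_to_signed(value: int, bits: int) -> int:
    """Keep the low `bits` bits of value and reinterpret them as two's complement."""
    modulus = 1 << bits
    value %= modulus              # discard everything above the low `bits` bits
    if value >= modulus >> 1:     # top bit set, so the signed reading is negative
        value -= modulus
    return value


INT32_MAX = (1 << 31) - 1         # 0x7FFFFFFF

raw = int("9512c20b", 16)
print(raw)                                     # 2501034507
print(raw > INT32_MAX)                         # True
print(wrap_to_signed(raw, 32))                 # -1793932789
print(wrap_to_signed(int("-0x1000", 16), 32))  # -4096
print(wrap_to_signed(raw, 64))                 # 2501034507
```

The same literal is in range at 64 bits, so only the 32-bit interpretation wraps.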

In the future, we may need to decide if cuDF should raise an OverflowError when exceeding 0x7FFFFFFF for consistency with pandas, or decide to use unsigned integers when parsing "hex" dtypes and compare to pandas' unsigned types in this test.
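The two options can be compared side by side in pure Python. This is only a sketch of the semantics under discussion: `parse_hex_strict` and `parse_hex_unsigned` are hypothetical helpers, not functions in cuDF or pandas.

```python
def parse_hex_strict(text: str, bits: int = 32) -> int:
    """Signed parse that raises instead of wrapping (the future pandas behavior)."""
    value = int(text, 16)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise OverflowError(f"{text!r} does not fit in a signed {bits}-bit integer")
    return value


def parse_hex_unsigned(text: str, bits: int = 32) -> int:
    """Unsigned parse: accepts 0 .. 2**bits - 1 and rejects everything else."""
    value = int(text, 16)
    if not 0 <= value < (1 << bits):
        raise OverflowError(f"{text!r} does not fit in an unsigned {bits}-bit integer")
    return value


print(parse_hex_unsigned("9512c20b"))   # 2501034507
print(parse_hex_strict("-0x1000"))      # -4096

for parser, text in [(parse_hex_strict, "9512c20b"), (parse_hex_unsigned, "-0x1000")]:
    try:
        parser(text)
    except OverflowError as exc:
        print(f"{parser.__name__}: {exc}")
```

In this sketch neither policy accepts every value in the current test data: the strict signed parse rejects 9512c20b, and the unsigned parse rejects -0x1000.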

codecov · 2022-02-25T23:33:48Z

Codecov Report

Merging #10362 (d7045b1) into branch-22.04 (a7d88cd) will increase coverage by 0.15%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10362      +/-   ##
================================================
+ Coverage         10.42%   10.58%   +0.15%     
================================================
  Files               119      125       +6     
  Lines             20603    21058     +455     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18830     +375

Impacted Files                                         Coverage Δ
...ython/custreamz/custreamz/tests/test_dataframes.py  99.39% <0.00%> (-0.01%) ⬇️
python/cudf/cudf/errors.py                             0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py                             0.00% <0.00%> (ø)
python/cudf/cudf/_version.py                           0.00% <0.00%> (ø)
python/cudf/cudf/core/ops.py                           0.00% <0.00%> (ø)
python/cudf/cudf/datasets.py                           0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py                         0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py                         0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py                         0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py                        0.00% <0.00%> (ø)
... and 43 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 64ee514...d7045b1.

vyasr · 2022-02-28T19:02:19Z

@shwina @galipremsagar @vuule what do you think about @bdice's question here? Should we be changing cuIO's Python bindings here to match pandas? For now I think silencing the pandas warning as we do here is fine, but we should add a TODO comment or something indicating how we want to handle this when pandas starts throwing the error.

vuule · 2022-02-28T19:38:34Z

IIRC, we generally don't want to promise overflow checking in libcudf for performance reasons. So I'm fine with not following Pandas behavior here.

vuule

This is a useful warning, as the test will be broken once Pandas behavior changes.
@bdice could you instead move the overflowing value to a separate test (that does not use Pandas)?
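One way to decide which values belong in the pandas-compared test and which belong in a separate overflow test is to partition the existing data by range. This is a rough sketch using only the standard library; `fits_signed` is a hypothetical helper, and 32 bits is assumed as the narrowest width under test.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True when value is representable as a signed integer of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF", "9512c20b"]

# Values that fit in int32 can still be compared against pandas without warnings.
in_range = [s for s in lines if fits_signed(int(s, 16), 32)]
# Values that do not fit go to a separate test that avoids pandas.
overflowing = [s for s in lines if not fits_signed(int(s, 16), 32)]

print(in_range)      # ['0x0', '-0x1000', '0xfedcba', '0xABCDEF', '0xaBcDeF']
print(overflowing)   # ['9512c20b']
```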

bdice · 2022-02-28T20:07:58Z

I read the conversation on #1925 again, and I now understand that the design intended to use signed values when parsing. That clarifies the intended behavior, so it's just a matter of separating this test as @vuule described. This design decision surprises me a bit (I would have expected 9512c20b to map to 2501034507 and not -1793932789).

bdice · 2022-02-28T20:47:02Z

I added some tests that compare with NumPy and expand the tested range of overflow values.
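The kind of NumPy reference value such a comparison relies on can be sketched as below. It is an illustration, not the test added in this PR. It assumes the values are first stored at 64 bits and then narrowed with `astype`, which keeps the low bits without emitting a warning. Constructing a narrow array directly from out-of-range Python integers may behave differently depending on the NumPy version.

```python
import numpy as np

lines = ["0x0", "-0x1000", "0xfedcba", "9512c20b", "0x7fffffff"]
values = [int(s, 16) for s in lines]

# Every value fits in int64, so this construction is exact.
wide = np.array(values, dtype=np.int64)

# Narrowing to int32 keeps the low 32 bits; 0x9512c20b wraps to a negative value.
narrow = wide.astype(np.int32)

print(wide.tolist())    # [0, -4096, 16702650, 2501034507, 2147483647]
print(narrow.tolist())  # [0, -4096, 16702650, -1793932789, 2147483647]
```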

vuule

🔥

vuule · 2022-02-28T21:30:40Z

python/cudf/cudf/tests/test_csv.py

+def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype):
+    # This tests values which cause an overflow warning that will become an
+    # error in pandas. NumPy wraps the overflow silently up to the bounds of a
+    # signed int64.


is it always 64?

Yup. By default, NumPy stores integers too large for int64/uint64 in an array with object as the dtype and uses a Python int to back each value. There are larger types, like np.int128, but they're not used by default. The wider the type, the less-wide the support, I guess.

Correction: I was thinking of floating types, not integral types. There are 128-bit float types (depending on the platform/build of NumPy) like np.longdouble but 128-bit integers are not in the NumPy API. However, 128-bit integers seem to be used internally and references can be found in a few places in the source.

References:

ENH: int128, uint128 support? numpy/numpy#9992

https://github.com/numpy/numpy/search?q=int128&type=code

>>> np.finfo(np.longdouble)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
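The fallback to object dtype mentioned in this thread can be checked directly. This is a small sketch that relies on NumPy's default dtype inference for Python integers; the exact inferred dtypes are as observed in common NumPy builds.

```python
import numpy as np

# The widest signed integer dtype tops out at 2**63 - 1.
print(np.iinfo(np.int64).max)       # 9223372036854775807

# One past int64's maximum still fits in uint64.
print(np.array([2**63]).dtype)      # uint64

# One past uint64's maximum falls back to object dtype, backed by a Python int.
big = np.array([2**64])
print(big.dtype)                    # object
print(type(big[0]))                 # <class 'int'>
```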


bdice · 2022-03-01T17:07:50Z

@gpucibot merge

bdice added the Python, improvement, and non-breaking labels Feb 25, 2022

bdice self-assigned this Feb 25, 2022

bdice requested a review from a team as a code owner February 25, 2022 22:16

bdice requested review from trxcllnt and brandon-b-miller February 25, 2022 22:16

bdice mentioned this pull request Feb 25, 2022

[FEA] Remove FutureWarnings from Python tests #10363

Closed

17 tasks

vuule requested changes Feb 28, 2022

bdice added 2 commits February 28, 2022 14:09

Fix warning related to integer overflow in pandas.

02d1de2

Refactor tests of overflowing hexadecimal values.

e015b23

bdice force-pushed the no-warnings-test_csv branch from 6ddf6e8 to e015b23 February 28, 2022 20:45

bdice requested a review from vuule February 28, 2022 20:45

vuule approved these changes Feb 28, 2022

brandon-b-miller reviewed Mar 1, 2022

brandon-b-miller approved these changes Mar 1, 2022

Use string as dtype.

d7045b1

rapids-bot bot merged commit 5d8ea19 into rapidsai:branch-22.04 Mar 1, 2022
