Fix warnings in test_csv.py. (#10362)
This PR silences warnings in `test_csv.py`. (I am working through one test file at a time so we can enable `-Werr` in the future.)

The only warning in this file is related to integer overflow in pandas. Currently, the test data is as follows:
https://github.com/rapidsai/cudf/blob/21325e8348f33b28e434d08d687a28f251c38f67/python/cudf/cudf/tests/test_csv.py#L1313-L1319

First, I note that this "hex" dtype is not part of the pandas API. It is a cuDF addition (#1925, #2149).

Note that the test is parametrized over the `int32` / `hex32` dtype pair, and the test data contains both a negative value `-0x1000` and the value `9512c20b`. The negative value has a sensible interpretation only if the results are meant to be signed, but then `9512c20b` is out of range: the maximum signed 32-bit value is `0x7FFFFFFF`, and the minimum is `-0x80000000` (whose two's-complement bit pattern, written most-significant digit first as the parser reads it, is `0x80000000`). Recognizing this, pandas emits a `FutureWarning` when parsing `9512c20b` as `int32` and unsafely wraps it to a negative value. This behavior will eventually be replaced by an `OverflowError`.
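The range arithmetic in this paragraph can be checked with a minimal pure-Python sketch. The helper `wrap_to_signed` is a hypothetical name introduced here for illustration only. It mimics two's-complement wraparound and is not part of pandas or cuDF.

```python
# Hypothetical helper (not a pandas or cuDF API): reinterpret an integer as a
# signed two's-complement value of the given bit width.
def wrap_to_signed(value: int, bits: int = 32) -> int:
    modulus = 1 << bits
    value %= modulus  # keep only the low `bits` bits
    # Values with the top bit set represent negative numbers.
    return value - modulus if value >= modulus >> 1 else value


INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000

parsed = int("9512c20b", 16)      # 2501034507
overflows = parsed > INT32_MAX    # True: does not fit in a signed int32
wrapped = wrap_to_signed(parsed)  # -1793932789 after wraparound

# The negative literal from the test data fits comfortably in int32.
in_range = INT32_MIN <= int("-0x1000", 16) <= INT32_MAX  # True
```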

In the future, we may need to decide if cuDF should raise an `OverflowError` when exceeding `0x7FFFFFFF` for consistency with pandas, or decide to use unsigned integers when parsing "hex" dtypes and compare to pandas' unsigned types in this test.
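The two options could look roughly like the following sketch. `parse_hex` and its `bits` and `signed` parameters are hypothetical and written in plain Python. This is neither cuDF's reader nor pandas' parser, only an illustration of a strict signed range check versus an unsigned interpretation.

```python
def parse_hex(text: str, bits: int = 32, signed: bool = True) -> int:
    """Parse a hexadecimal string, raising OverflowError if it is out of range.

    Hypothetical illustration; not an API of cuDF or pandas.
    """
    value = int(text, 16)
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    if not low <= value <= high:
        kind = "signed" if signed else "unsigned"
        raise OverflowError(
            f"{text!r} does not fit in a {kind} {bits}-bit integer"
        )
    return value


# Option 1: strict signed parsing rejects values above 0x7FFFFFFF.
try:
    parse_hex("9512c20b", bits=32, signed=True)
    strict_signed_error = None
except OverflowError as exc:
    strict_signed_error = exc

# Option 2: unsigned parsing accepts the same value...
unsigned_value = parse_hex("9512c20b", bits=32, signed=False)

# ...but negative inputs such as -0x1000 then become the out-of-range case.
try:
    parse_hex("-0x1000", bits=32, signed=False)
    unsigned_negative_error = None
except OverflowError as exc:
    unsigned_negative_error = exc
```

Under these assumptions, neither option accepts the full original test data unchanged: the signed reading rejects `9512c20b`, while the unsigned reading rejects `-0x1000`.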

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/brandon-b-miller

URL: #10362
bdice authored Mar 1, 2022
1 parent 87a2ea4 commit 5d8ea19
Showing 1 changed file with 30 additions and 1 deletion.
31 changes: 30 additions & 1 deletion python/cudf/cudf/tests/test_csv.py
@@ -1315,7 +1315,7 @@ def test_csv_reader_aligned_byte_range(tmpdir):
     [(None, None), ("int", "hex"), ("int32", "hex32"), ("int64", "hex64")],
 )
 def test_csv_reader_hexadecimals(pdf_dtype, gdf_dtype):
-    lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF", "9512c20b"]
+    lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF"]
     values = [int(hex_int, 16) for hex_int in lines]

     buffer = "\n".join(lines)
@@ -1334,6 +1334,35 @@ def test_csv_reader_hexadecimals(pdf_dtype, gdf_dtype):
     assert_eq(pdf, gdf)


+@pytest.mark.parametrize(
+    "np_dtype, gdf_dtype",
+    [("int", "hex"), ("int32", "hex32"), ("int64", "hex64")],
+)
+def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype):
+    # This tests values which cause an overflow warning that will become an
+    # error in pandas. NumPy wraps the overflow silently up to the bounds of a
+    # signed int64.
+    lines = [
+        "0x0",
+        "-0x1000",
+        "0xfedcba",
+        "0xABCDEF",
+        "0xaBcDeF",
+        "0x9512c20b",
+        "0x7fffffff",
+        "0x7fffffffffffffff",
+        "-0x8000000000000000",
+    ]
+    values = [int(hex_int, 16) for hex_int in lines]
+    buffer = "\n".join(lines)
+
+    gdf = read_csv(StringIO(buffer), dtype=[gdf_dtype], names=["hex_int"])
+
+    expected = np.array(values, dtype=np_dtype)
+    actual = gdf["hex_int"].to_numpy()
+    np.testing.assert_array_equal(expected, actual)
+
+
 @pytest.mark.parametrize("quoting", [0, 1, 2, 3])
 def test_csv_reader_pd_consistent_quotes(quoting):
     names = ["text"]