Fix warnings in test_csv.py. (#10362)

This PR silences warnings in `test_csv.py`. (I am working through one test file at a time so we can enable `-Werr` in the future.)

The only warning in this file is related to integer overflow in pandas. Currently, the test data is as follows:

https://github.com/rapidsai/cudf/blob/21325e8348f33b28e434d08d687a28f251c38f67/python/cudf/cudf/tests/test_csv.py#L1313-L1319

First, I note that this "hex" dtype is not part of the pandas API. It is a cuDF addition (#1925, #2149). Note that the test is parametrized over the `int32` / `hex32` dtypes, and the test data contains both a negative value `-0x1000` and a value `9512c20b`. The negative value `-0x1000` has a sensible interpretation if the results are meant to be signed, but then the value `9512c20b` is out of range: the maximum signed 32-bit value is `0x7FFFFFFF`, and the minimum is `-0x80000000` (bit pattern `0x80000000` in two's complement).

Recognizing this, pandas emits a `FutureWarning` when parsing the data `9512c20b` as `int32`, and unsafely wraps it to a negative value. This behavior will eventually be replaced by an `OverflowError`.

In the future, we may need to decide whether cuDF should raise an `OverflowError` when a value exceeds `0x7FFFFFFF`, for consistency with pandas, or whether it should use unsigned integers when parsing "hex" dtypes and compare against pandas' unsigned types in this test.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/brandon-b-miller

URL: #10362
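
To make the range argument in the description concrete, here is a minimal standard-library sketch of the signed 32-bit bounds and of the two's-complement wrap that the description attributes to pandas. The helper names `fits_int32` and `wrap_to_signed` are hypothetical and are not part of cuDF, pandas, or this PR; the sketch only illustrates the arithmetic, not how either library implements its parser.

```python
# Illustrative only: these helpers are assumptions, not cuDF or pandas code.
INT32_MIN = -(2**31)     # -0x80000000
INT32_MAX = 2**31 - 1    #  0x7FFFFFFF


def fits_int32(value: int) -> bool:
    """Return True if `value` is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def wrap_to_signed(value: int, bits: int = 32) -> int:
    """Reinterpret the low `bits` bits of `value` as a two's-complement integer."""
    low = value & ((1 << bits) - 1)      # keep only the low `bits` bits
    if low >> (bits - 1):                # sign bit set: value is negative
        return low - (1 << bits)
    return low


# A subset of the hex strings used in the test data.
lines = ["0x0", "-0x1000", "0xfedcba", "0x9512c20b", "0x7fffffff"]
values = [int(hex_int, 16) for hex_int in lines]

# Strings whose value does not fit in a signed 32-bit integer.
out_of_range = [s for s, v in zip(lines, values) if not fits_int32(v)]

# What a silent 32-bit wrap would produce for each value.
wrapped = [wrap_to_signed(v) for v in values]

print(out_of_range)
print(wrapped)
```

Only `0x9512c20b` falls outside the signed range, and wrapping it gives a negative number, while `-0x1000` and the in-range values pass through unchanged. That is why the value was moved out of the pandas comparison and into a separate overflow test.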
rapidsai · Mar 1, 2022 · 5d8ea19
1 parent 87a2ea4
Showing 1 changed file with 30 additions and 1 deletion.
diff --git a/python/cudf/cudf/tests/test_csv.py b/python/cudf/cudf/tests/test_csv.py
@@ -1315,7 +1315,7 @@ def test_csv_reader_aligned_byte_range(tmpdir):
     [(None, None), ("int", "hex"), ("int32", "hex32"), ("int64", "hex64")],
 )
 def test_csv_reader_hexadecimals(pdf_dtype, gdf_dtype):
-    lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF", "9512c20b"]
+    lines = ["0x0", "-0x1000", "0xfedcba", "0xABCDEF", "0xaBcDeF"]
     values = [int(hex_int, 16) for hex_int in lines]
 
     buffer = "\n".join(lines)
@@ -1334,6 +1334,35 @@ def test_csv_reader_hexadecimals(pdf_dtype, gdf_dtype):
         assert_eq(pdf, gdf)
 
 
+@pytest.mark.parametrize(
+    "np_dtype, gdf_dtype",
+    [("int", "hex"), ("int32", "hex32"), ("int64", "hex64")],
+)
+def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype):
+    # This tests values which cause an overflow warning that will become an
+    # error in pandas. NumPy wraps the overflow silently up to the bounds of a
+    # signed int64.
+    lines = [
+        "0x0",
+        "-0x1000",
+        "0xfedcba",
+        "0xABCDEF",
+        "0xaBcDeF",
+        "0x9512c20b",
+        "0x7fffffff",
+        "0x7fffffffffffffff",
+        "-0x8000000000000000",
+    ]
+    values = [int(hex_int, 16) for hex_int in lines]
+    buffer = "\n".join(lines)
+
+    gdf = read_csv(StringIO(buffer), dtype=[gdf_dtype], names=["hex_int"])
+
+    expected = np.array(values, dtype=np_dtype)
+    actual = gdf["hex_int"].to_numpy()
+    np.testing.assert_array_equal(expected, actual)
+
+
 @pytest.mark.parametrize("quoting", [0, 1, 2, 3])
 def test_csv_reader_pd_consistent_quotes(quoting):
     names = ["text"]