Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ORC and JSON tests failures for pandas 2.2 #15062

Merged
merged 3 commits into from
Feb 21, 2024

Conversation

mroeschke
Copy link
Contributor

Description

test_order_nested_json_reader was refactored to use assert_eq instead of comparing via pyarrow. This was failing in pandas 2.2 due to pandas-dev/pandas#57429

test_orc_reader_trailing_nulls I believe was failing due to a change in how integers are compared with assert_series_equal: pandas-dev/pandas#55882. The "casting workaround" doesn't seem necessary in pandas 2.2 so just avoiding it all together

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 15, 2024
@mroeschke mroeschke requested a review from a team as a code owner February 15, 2024 01:38
@mroeschke mroeschke requested review from bdice and isVoid February 15, 2024 01:38
galipremsagar
galipremsagar previously approved these changes Feb 20, 2024
@galipremsagar galipremsagar dismissed their stale review February 20, 2024 13:59

have a question

tag == "dtype_mismatch", reason="int vs float mismatch"
)
)
assert_eq(expected, target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can pandas now handle comparing the equality of nested types/values? I remember the last time I checked pandas wasn't able to that properly and thus we resorted to use pyarrow for comparisons in case of nested types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just was able to triage the issue here, looks like pd.read_json has a regression where the dataframe is expected to have a RangeIndex but a materialized Index of int64 dtype is being returned in pandas-2.2. We should just change this test to the following:

diff --git a/python/cudf/cudf/tests/test_json.py b/python/cudf/cudf/tests/test_json.py
index ec980adc33..5a459e98d1 100644
--- a/python/cudf/cudf/tests/test_json.py
+++ b/python/cudf/cudf/tests/test_json.py
@@ -1179,6 +1179,9 @@ class TestNestedJsonReaderCommon:
 
     def test_order_nested_json_reader(self, tag, data):
         expected = pd.read_json(StringIO(data), lines=True)
+        if PANDAS_GE_200:
+            # TODO: Remove after bug fix: <Pandas-bug-URL>
+            expected = expected.reset_index(drop=True)
         target = cudf.read_json(StringIO(data), lines=True)
         if tag == "dtype_mismatch":
             with pytest.raises(AssertionError):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Yeah testing of nested data isn't well tested in pandas so probably better to use pyarrow to compare here. Will incorporate your change

tag == "dtype_mismatch", reason="int vs float mismatch"
)
)
assert_eq(expected, target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just was able to triage the issue here, looks like pd.read_json has a regression where the dataframe is expected to have a RangeIndex but a materialized Index of int64 dtype is being returned in pandas-2.2. We should just change this test to the following:

diff --git a/python/cudf/cudf/tests/test_json.py b/python/cudf/cudf/tests/test_json.py
index ec980adc33..5a459e98d1 100644
--- a/python/cudf/cudf/tests/test_json.py
+++ b/python/cudf/cudf/tests/test_json.py
@@ -1179,6 +1179,9 @@ class TestNestedJsonReaderCommon:
 
     def test_order_nested_json_reader(self, tag, data):
         expected = pd.read_json(StringIO(data), lines=True)
+        if PANDAS_GE_200:
+            # TODO: Remove after bug fix: <Pandas-bug-URL>
+            expected = expected.reset_index(drop=True)
         target = cudf.read_json(StringIO(data), lines=True)
         if tag == "dtype_mismatch":
             with pytest.raises(AssertionError):

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mroeschke !

@galipremsagar galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 20, 2024
@galipremsagar
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 8ea716b into rapidsai:branch-24.04 Feb 21, 2024
69 checks passed
@mroeschke mroeschke deleted the p22/io_tests branch February 21, 2024 00:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants