Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for additional metaclasses of proxies and use for ExcelWriter #15399

Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Mar 27, 2024

Description

The ExcelWriter supports the abstract os.PathLike interface, but we would also like that support to be reflected in the class's MRO. Doing so is slightly complicated because os.PathLike is an ABC, and as such has a different metaclass. Therefore, in order to add os.PathLike as a base class, we must also generate a suitable combined metaclass for our ExcelWriter wrapper.

This change ensures the isinstance(pd.ExcelWriter(...), os.PathLike) returns True when using cudf.pandas.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr added bug Something isn't working non-breaking Non-breaking change labels Mar 27, 2024
@vyasr vyasr self-assigned this Mar 27, 2024
@vyasr vyasr requested a review from a team as a code owner March 27, 2024 00:18
@vyasr vyasr requested review from wence- and mroeschke March 27, 2024 00:18
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 27, 2024
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Might be good to add a unit test that an isinstance check on this object with os.PathLike works

@vyasr
Copy link
Contributor Author

vyasr commented Mar 27, 2024

I uncovered another strange thing while writing a test here. Given this script

# test.py
import pandas as pd
def test_pathlike():
    print(pd.ExcelWriter("foo.xlsx").__class__)

Without the changes in this PR, I see the following behavior if I run that test inside the repo:

(rapids) coder _ ~/cudf $ pytest -p cudf.pandas -s test.py 
...
test.py <class 'pandas.io.excel._openpyxl.OpenpyxlWriter'>
...
(rapids) coder _ ~/cudf $ pytest -s test.py 
...
test.py <class 'pandas.io.excel._openpyxl.OpenpyxlWriter'>

On the other hand, if I run from outside the repo I see this

(rapids) coder _ ~ $ pytest -p cudf.pandas -s test.py
...
test.py <class 'pandas.io.excel._base.ExcelWriter'>
...
(rapids) coder _ ~ $ pytest -s test.py
...
test.py <class 'pandas.io.excel._openpyxl.OpenpyxlWriter'>
...

Note that when run outside the repo the class shows up differently. I don't know why that is, but it makes testing from inside the repo difficult because the issue is masked. I'll dig a little more next week, this is as far as I got today.

@shwina or @wence- if this makes quick sense to you feel free to jump in.

@vyasr
Copy link
Contributor Author

vyasr commented Apr 17, 2024

I need to come back to this and investigate further. @galipremsagar or @wence- if the above result makes sense to you let me know, but no need to spend much time investigating if it doesn't. I'll come back to this when I have a bit of time free.

@vyasr vyasr force-pushed the fix/cudf_pandas_excelwriter_hierarchy branch from fd9616f to b4719b7 Compare May 24, 2024 23:40
@vyasr vyasr requested review from a team as code owners May 24, 2024 23:40
@vyasr vyasr requested review from jameslamb, hyperbolic2346 and PointKernel and removed request for a team May 24, 2024 23:40
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue conda Java Affects Java cuDF API. labels May 24, 2024
@vyasr vyasr changed the base branch from branch-24.06 to branch-24.08 May 24, 2024 23:40
@vyasr vyasr removed request for a team May 24, 2024 23:40
@github-actions github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue conda Java Affects Java cuDF API. labels May 25, 2024
@vyasr
Copy link
Contributor Author

vyasr commented May 25, 2024

OK @mroeschke I've added a test here. The issue above is not related to this PR, and the behavior can be observed even without it. The issue is that the pandas ExcelWriter overrides __new__ to support returning instances of different classes depending on the engine. In this case, while we get the right behavior in cudf.pandas, we do not have the correct repr because we only have the top-level class name available to us and not the real class selected at runtime. As a result, the repr of our object looks wrong even though the real object is correct.

I think fixing this issue is out of scope for this PR, but it's something we may need to investigate further. In pandas >=2.0, with the new index class hierarchy is it still possible to instantiate an index with the base index constructor that doesn't isn't of type Index now that the numerical and string index types are gone? For example, categorical indexes I believe must be constructed with the pd.CategoricalIndex constructor now, right? If there are any exceptions, I would expect them to exhibit the same behavior.

@galipremsagar
Copy link
Contributor

OK @mroeschke I've added a test here. The issue above is not related to this PR, and the behavior can be observed even without it. The issue is that the pandas ExcelWriter overrides __new__ to support returning instances of different classes depending on the engine. In this case, while we get the right behavior in cudf.pandas, we do not have the correct repr because we only have the top-level class name available to us and not the real class selected at runtime. As a result, the repr of our object looks wrong even though the real object is correct.

I think fixing this issue is out of scope for this PR, but it's something we may need to investigate further. In pandas >=2.0, with the new index class hierarchy is it still possible to instantiate an index with the base index constructor that doesn't isn't of type Index now that the numerical and string index types are gone? For example, categorical indexes I believe must be constructed with the pd.CategoricalIndex constructor now, right? If there are any exceptions, I would expect them to exhibit the same behavior.

We have a similar problem for Timedelta and Timestamp classes that can return pd.NA when None is passed, the problem there was solved by having Timestamp_Timedelta__new__ implemented and then assigning it to __new__. I think this PR is doing more, like letting the metaclasses be combined.

@galipremsagar
Copy link
Contributor

I re-ran the CI to see the latest diff of the tests passing/failing.

@vyasr
Copy link
Contributor Author

vyasr commented May 30, 2024

It claims to be a big increase of 1.38% :D do you believe it? I don't.

@galipremsagar
Copy link
Contributor

It claims to be a big increase of 1.38% :D do you believe it? I don't.

It's most likely because the nightly run that happened yesterday doesn't have all the changes of this branch. But since there is no decrease, we can say this is a safe change 👍

@vyasr
Copy link
Contributor Author

vyasr commented Jun 3, 2024

/merge

@rapids-bot rapids-bot bot merged commit 382de32 into rapidsai:branch-24.08 Jun 3, 2024
71 checks passed
@vyasr vyasr deleted the fix/cudf_pandas_excelwriter_hierarchy branch June 3, 2024 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants