Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROPOSAL] Enforce and automate compatibility with Pandas methods #6135

Closed
shwina opened this issue Sep 1, 2020 · 8 comments
Closed

[PROPOSAL] Enforce and automate compatibility with Pandas methods #6135

shwina opened this issue Sep 1, 2020 · 8 comments
Labels
0 - Backlog In queue waiting for assignment proposal Change current process or code Python Affects Python cuDF API.

Comments

@shwina
Copy link
Contributor

shwina commented Sep 1, 2020

One of the goals of cuDF is to provide a Pandas-like API. This means that our classes and functions (including methods) should have the same names as those of Pandas. It also means that our functions should accept the same positional and keyword arguments as their Pandas counterparts.

Currently, we have functions with varying degrees of consistency with Pandas:

  1. Functions with the same arguments and behaviour as Pandas
  2. Functions with the same arguments, but different default values for one or more of them
  3. Functions supporting a subset of arguments supported by Pandas
  4. Functions supporting a superset of arguments supported by Pandas
  5. Other inconsistencies (e.g., perhaps arguments appear in a different order)

The problem

The way we currently enforce API consistency is "by hand", i.e., we write functions with the same signature as Pandas, and document unsupported arguments as such. A good example of this is quantile, where we support a subset of the arguments Pandas supports, and document the others as "non functional":

The manual approach is problematic for a few reasons:

  1. It requires boilerplate: the first bit of boilerplate is in the docstring. The second bit is in the code itself, where we check to see if any unsupported arguments are passed with non-default values. If so, we raise a NotImplementedError.
  2. By virtue of being manual, it is error-prone. For APIs that accept several arguments, it's easy to miss one, document it incorrectly, or have arguments in the wrong order.
  3. Changes to the Pandas API, in turn, require changes to our code if we want to maintain consistency.

Proposed solution

The proposal is to both enforce and automate consistency using a @mimic_signature decorator (still a WIP).

Here it is in action:

def foo(a, b=None, c=0):  # imagine `foo` is the "Pandas API"
    pass

@mimic_signature(foo)
def bar(a, c=1, d=[], **kwargs):  # `bar` is the equivalent "cuDF API"
    print("a = ", a)
    print("c = ", c)
    print("d = ", d)

First, the signature of bar is a merge of the signatures of foo and bar:

In [11]: bar?
Signature: bar(a, b=None, c=1, d=[])

bar is called exactly like foo:

>>> bar(1, b=2, c=3)
a =  1
c =  3
d =  []

>>> bar(1, 2, 3)
a =  1
c =  3
d =  []

But it retains its own default values for its args:

>>> bar(1)
a =  1
c =  1
d =  []

And it can be called with its own additional args:

>>> bar(1, c=3, d=4)
a =  1
c =  3
d =  4

Pros

  1. No boilerplate. We don't have to include unsupported arguments in our signature and documentation manually. We can also have the decorator do the work of raising NotImplementedError if unsupported arguments with non-default values are passed. Our own implementations can remain completely unaware of unsupported arguments.

  2. We are always guaranteed to be consistent with the Pandas API.

  3. Changes to the Pandas API (e.g., a new kwarg, or a change in the order of appearance of kwargs) don't require changes to our code. Our own API will change automatically.

Cons

The big disadvantage here is that we (cuDF developers) should be careful when using APIs decorated with @mimic_signature internally. It's important to remember that the signature in the function definition is not the same as the calling signature. Thus, we should always use keyword arguments explicitly. That is, we should prefer:

bar(1, c=3)

not:

bar(1, 3)

This applies only to cuDF developers as we're the only ones with visibility into the "internal" signature.

@shwina shwina added proposal Change current process or code code quality Python Affects Python cuDF API. labels Sep 1, 2020
@rgsl888prabhu
Copy link
Contributor

How can we overwrite docstrings in a case where we need to mention some shortcomings or unsupported types.

@shwina
Copy link
Contributor Author

shwina commented Sep 1, 2020

The docstring is not copied -- only the signature line. We still need to write docstrings and document the arguments we support and their behaviour.

We can add a footnote to docstrings that all undocumented members are provided for Pandas compatibility, or something similar.

@karthikeyann
Copy link
Contributor

is it possible to add extra arguments (apart from pandas API arguments) with @mimic_signature decorator?

@shwina
Copy link
Contributor Author

shwina commented Sep 2, 2020

Yes, any extra arguments that our own APIs have are automatically added to the resulting API's signature. In my example, bar has an extra argument d to represent this:

def foo(a, b=None, c=0):  # imagine `foo` is the "Pandas API"
    pass

@mimic_signature(foo)
def bar(a, c=1, d=[], **kwargs):  # `bar` is the equivalent "cuDF API"
    print("a = ", a)
    print("c = ", c)
    print("d = ", d)

rapids-bot bot pushed a commit that referenced this issue Dec 8, 2020
This PR removes `**kwargs` from the string/categorical accessors where unnecessary, and exposes keyword arguments like `inplace` to the user directly.

If we want to maintain parity with Pandas APIs for Dask/others using cuDF internally, we can consider using the approach described in #6135, which will automatically raise `NotimplementedError` when unsupported kwargs are passed.

Authors:
  - Ashwin Srinath <[email protected]>

Approvers:
  - GALI PREM SAGAR
  - Keith Kraus
  - Keith Kraus

URL: #6750
@github-actions
Copy link

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@shwina
Copy link
Contributor Author

shwina commented Feb 16, 2021

Still relevant but it's unclear if this is the route we want to go down. So far, we have been matching the Pandas API manually.

@github-actions github-actions bot removed the rotten label Feb 17, 2021
@kkraus14
Copy link
Collaborator

Lets start exploring our options for this in 0.20

@shwina shwina added the 0 - Backlog In queue waiting for assignment label Oct 24, 2022
@vyasr
Copy link
Contributor

vyasr commented May 10, 2024

The systematic approach for automating this compatibility is cudf.pandas testing, and in particular running the pandas test suite on cudf.pandas. We can investigate incompatibilities more systematically once #15724 is implemented.

@vyasr vyasr closed this as completed May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment proposal Change current process or code Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

5 participants