Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Expand support for cupy universal functions (ufuncs) #9083

Closed
beckernick opened this issue Aug 20, 2021 · 3 comments · Fixed by #10346
Closed

[FEA] Expand support for cupy universal functions (ufuncs) #9083

beckernick opened this issue Aug 20, 2021 · 3 comments · Fixed by #10346
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

Using numpy/cupy universal functions (ufuncs) directly on dataframes is a common usage pattern that we partially support today. #256 requested support for a comprehensive set of ufuncs. Many of these ufuncs are now supported on Series through a codepath that hits:

# Utils for using appropriate dispatch for array functions
def get_appropriate_dispatched_func(
cudf_submodule, cudf_ser_submodule, cupy_submodule, func, args, kwargs
):

For DataFrames, we dispatch only a small subset of functions enumerated in https://github.com/rapidsai/cudf/blob/be25a30ca20f3135f341c51b36cb075b376d5def/python/cudf/cudf/core/ops.py

It would be nice to expand our support for ufuncs on DataFrames (and Series as appropriate). As @vyasr noted offline, we may want to explore our approach to this dispatch as well.

When we don't support a ufunc, we get the following with direct usage of the cupy ufunc:

import cudf
import numpy as npdf = cudf.DataFrame({
    "a": [0,1,2],
    "b": [0,1,2]
})
​
cp.isinf(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_89993/1699247483.py in <module>
      7 })
      8 
----> 9 cp.isinf(df)

cupy/_core/_kernel.pyx in cupy._core._kernel.ufunc.__call__()

cupy/_core/_kernel.pyx in cupy._core._kernel._preprocess_args()

TypeError: Unsupported type <class 'cudf.core.dataframe.DataFrame'>

If we rely on the array function protocol to dispatch first from numpy to cupy, we get the following:

import cudf
import cupy as cpdf = cudf.DataFrame({
    "a": [0,1,2],
    "b": [0,1,2]
})
​
np.isinf(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_89993/695126611.py in <module>
      7 })
      8 
----> 9 np.isinf(df)

TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 'isinf'>, '__call__',    a  b
0  0  0
1  1  1
2  2  2): 'DataFrame'
@beckernick beckernick added feature request New feature or request Python Affects Python cuDF API. labels Aug 20, 2021
@vyasr
Copy link
Contributor

vyasr commented Aug 20, 2021

Yes, I think we want to unify our approach to __array_ufunc__ and __array_function__ a bit more and standardize across Series and DataFrame. In the best case, our current approach to dispatch is roughly the following cascade of attempts:

  1. Use a method in the cudf namespace with the desired name if one exists.
  2. Use a class method with this name if one exists.
  3. Dispatch to cupy if cupy has the method.

I say best case because I think DataFrame.__array_ufunc__ only does (1) as @beckernick pointed out above. This is another case where not having a single source of truth hurts us because it's not necessarily obvious which methods will get dispatched and how. I think unifying at least the lookup logic between these two classes would help alleviate one source of confusion since Series calls get_appropriate_dispatched_func while DataFrame does not. It would also help to get rid of (1) entirely IMO because that code path is essentially just a trivial wrapper around (2) anyway. In that case we may be able to get rid of our current ops.py module entirely (although I think there may be one other use case for it somewhere that I've forgotten); although pandas.core.ops does exist, it contains very different methods from cudf.core.ops so I would be OK removing ours for now if we could work around its uses.

@vyasr vyasr self-assigned this Aug 24, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@bdice
Copy link
Contributor

bdice commented Jan 4, 2022

Following up on this -- we should update the dispatch so that we do not define all the ufuncs (add, arccos, arcsin, ..., log, logical_and, ...) in the root namespace like cudf.add. These should be called from a submodule like cudf.core.ops.add. This will improve our API alignment with Pandas, which does not define ufuncs in the root namespace.

See also this list of available ufuncs in NumPy: https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs

See also: #9815 (comment)

@bdice bdice mentioned this issue Jan 4, 2022
9 tasks
rapids-bot bot pushed a commit that referenced this issue Feb 8, 2022
This PR is a first step in addressing #9083. It rewrites Series ufunc dispatch to use a simpler dispatch pattern that removes lookup in the overall cudf namespace, restricting only to Series methods or cupy functions. The changes in this PR also enable proper support for indexing so that ufuncs between two Series objects will work correctly and preserve indexes. Additionally, ufuncs will now support Series that contain nulls, which was not previously the case. Series methods that don't exist in `pandas.Series` that were previously only necessary to support ufunc dispatch have now been removed. Once this PR is merged, I will work on generalizing this approach to also support DataFrame objects, which should enable full support for ufuncs as well as allowing us to deprecate significant chunks of existing code.

For the cases that previously worked, this PR does have some performance implications. Any operator that dispatches to cupy is about 3x faster now (assuming no nulls, since the nullable case was not previously supported), with the exception of logical operators for which we previously defined functions in the Series namespace that do not have pandas analogs. I've made a note in the code that we could reintroduce internal versions of these just for ufunc dispatch if that slowdown becomes a bottleneck for users, but for now I would prefer to avoid any more special cases than we really need.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #10217
rapids-bot bot pushed a commit that referenced this issue Feb 15, 2022
This PR addresses the primary issue in #9083, enabling all numpy ufuncs for DataFrame objects. It builds on the work in #10217, generalizing that code path to support multiple columns and moving the method up to `IndexedFrame` to share the logic with `DataFrame`. The custom preprocessing of inputs before handing off to cupy that was implemented in #10217 has been replaced by reusing parts of the existing binop machinery for greater generality, which is especially important for DataFrame binops since they support a wider range of alternative operand types. The current internal refactor is intentionally minimal to leave the focus on the new ufunc features. I will make a follow-up to clean up the internal functions by adding a proper set of hooks into the binop and ufunc implementations so that we can share these implementations with Index types as well, at which point we will be able to remove the extraneous APIs discussed in #9083 (comment).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #10287
rapids-bot bot pushed a commit that referenced this issue Feb 25, 2022
This PR builds on #10217 and #10287 to bring full ufunc support for Index types, expanding well beyond the small set previously supported in the `cudf.core.ops` namespace. By using most of the machinery introduced for IndexedFrame in the prior two PRs we avoid duplicating much logic so that all ufunc dispatches flow through a relatively standard path of known methods prior to a common cupy dispatch. With this change we are also able to deprecate the various ufunc operations defined in cudf/core/ops.py that exist only for this purpose as well as a number of Frame methods that are not defined for the corresponding pandas types. Users of those APIs are recommended to calling the corresponding numpy/cupy ufuncs instead to leverage the new dispatch.

This PR also fixes a bug where index binary operations that output booleans would previously return instances of GenericIndex, whereas those pandas operations would return numpy arrays. cudf now returns cupy arrays in those cases.

Resolves #9083. Contributes to #9038.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10346
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants