[FEA] Standardize applymap support with pandas to enable Dask applymap #10169

Closed
beckernick opened this issue Jan 31, 2022 · 5 comments
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.

Comments

@beckernick
Member

beckernick commented Jan 31, 2022

Today, we support the applymap interface on Series but not DataFrames. Pandas supports applymap on DataFrames but not Series. In pandas, the interface applies a scalar function/UDF to every element in the DataFrame (elementwise UDF).

The reason for our Series.applymap implementation may be that, until recently, we did not support an elementwise apply interface directly. Now that we do, it's possible Series.applymap is redundant, as both interfaces explicitly provide users access to elementwise UDFs run via our udf pipeline. (Please feel free to correct me if I'm off base here).

For compatibility with Dask, we should explore aligning our interfaces with pandas. Right now, it's not possible to use applymap with Dask-cuDF, as the Dask.Series object does not have our Series interface and we don't implement the DataFrame interface.

We might consider:

  • Deprecating (and eventually removing) Series.applymap
  • Implementing DataFrame.applymap with our existing udf pipeline machinery (independently and sequentially processing each column)
  • Adding unit tests for dask_cudf DataFrames
  • Updating the Guide to UDFs as appropriate

```python
import cudf
import dask.dataframe as dd
import dask_cudf
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3, 4]})
gdf = cudf.from_pandas(df)


def func(x):
    # Scalar (elementwise) UDF
    return x + 10


# Dask CPU
ddf = dd.from_pandas(df, npartitions=2)
ddf.applymap(func).compute()  # succeeds
# ddf.a.applymap(func).compute()  # doesn't exist

# Dask GPU
ddf = dask_cudf.from_cudf(gdf, npartitions=2)
# ddf.applymap(func).compute()  # fails
# ddf.a.applymap(func).compute()  # fails
```

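
For illustration only, here is a minimal CPU-side sketch of the "independently and sequentially processing each column" idea from the list above, written against plain pandas rather than cuDF. The helper name `applymap_by_column` is hypothetical and is not part of cuDF, pandas, or Dask. It assumes unique column names and a UDF that takes and returns a scalar.

```python
import pandas as pd


def applymap_by_column(df, func):
    """Hypothetical helper: apply a scalar UDF to every element of `df`
    by calling Series.apply on each column in turn."""
    # Assumes column names are unique; each column is processed independently.
    return pd.DataFrame(
        {name: df[name].apply(func) for name in df.columns},
        index=df.index,
    )


def func(x):
    return x + 10


df = pd.DataFrame({"a": [0, 1, 2, 3, 4]})
result = applymap_by_column(df, func)
print(result["a"].tolist())  # [10, 11, 12, 13, 14]
```

A GPU version would presumably swap the per-column `Series.apply` for cuDF's elementwise apply, but that is an assumption here, not a description of the eventual implementation.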
@beckernick beckernick added feature request New feature or request Python Affects Python cuDF API. dask Dask issue labels Jan 31, 2022
@brandon-b-miller
Contributor

I can look into this, @beckernick.

@brandon-b-miller brandon-b-miller self-assigned this Jan 31, 2022
@github-actions

github-actions bot commented Mar 2, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 29, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 13, 2022
Naive implementation of `DataFrame.applymap` that just calls `apply` in a loop over columns.

This could theoretically be made much faster within our framework. The current approach requires at worst `N` compilations and `M` kernel launches, where `N` is the number of distinct dtypes in the data and `M` is the total number of columns. As an improvement, we could launch just one kernel that populates the entire output. That would still suffer from the compilation bottleneck, however, since the function must be compiled before an output dtype can be determined, and this must be done for each distinct dtype in the data.
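
As a rough illustration of the worst-case counts described above, the following pure-Python sketch tallies compilations and kernel launches for a given set of column dtypes. The function name `estimate_applymap_cost` and the dict-of-dtype-strings input are assumptions made for this example. It models only the counting argument, not cuDF's actual compilation cache or launch behavior.

```python
def estimate_applymap_cost(column_dtypes):
    """Hypothetical cost model for the column-wise approach.

    column_dtypes maps column name -> dtype string.
    Returns (compilations, kernel_launches) in the worst case:
    one compilation per distinct dtype, one launch per column.
    """
    compilations = len(set(column_dtypes.values()))
    kernel_launches = len(column_dtypes)
    return compilations, kernel_launches


# Three columns with two distinct dtypes.
dtypes = {"a": "int64", "b": "int64", "c": "float64"}
print(estimate_applymap_cost(dtypes))  # (2, 3)
```

Under this model, the single-kernel improvement mentioned above would bring the launch count down to one while leaving the compilation count unchanged.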

Part of #10169

Authors:
  - https://github.com/brandon-b-miller
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10542
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@brandon-b-miller
Contributor

This is in progress: Series.applymap is deprecated and we're moving to supersede it with apply. In addition, we now have DataFrame.applymap, which truly mirrors pandas. This will be closed when Series.applymap is removed.

@brandon-b-miller
Contributor

With #11031 merged, I think this is done :)


2 participants