Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce deprecations in 23.10 #13732

Merged
merged 2 commits into from
Jul 24, 2023
Merged

Conversation

galipremsagar
Copy link
Contributor

Description

This PR enforces previously deprecated code until 23.08 in 23.10. This PR removes strings_to_categorical parameter support in read_parquet.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function breaking Breaking change labels Jul 21, 2023
@galipremsagar galipremsagar self-assigned this Jul 21, 2023
@galipremsagar galipremsagar requested review from a team as code owners July 21, 2023 23:31
@galipremsagar galipremsagar requested review from mroeschke and brandon-b-miller and removed request for a team July 21, 2023 23:31
bdice
bdice previously approved these changes Jul 22, 2023
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge 4 - Needs Dask Reviewer and removed 4 - Needs cuDF (Python) Reviewer 5 - Ready to Merge Testing and reviews complete, ready to merge labels Jul 24, 2023
Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @galipremsagar - I'm on board with this deprecation/removal. However, I'd like to get some clarification on the new "best practice".

isinstance(meta_cudf._data[col], cudf.core.column.StringColumn)
and strings_to_cats
):
meta_cudf._data[col] = meta_cudf._data[col].astype("int32")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that some RecSys-based users have historically relied on strings_to_categorical=True to deal with the fact that their ML/DL model would often require them to convert string data to numerical data (but they needed to preserve information in the original string data on disk, and couldn't work with a fully-numerical Parquet file).

What is the new "best practice" for these users? Should we suggest something like df["A"] = df["A"].hash_values()? Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @EvenOldridge (in case you had further input here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?

@vuule would be the right one to know if there was any such thing happening.

What is the new "best practice" for these users? Should we suggest something like df["A"] = df["A"].hash_values()?

If someone was relying on this, yes hash_values would be the consistent alternative. hash_values by itself is consistent, but won't yield similar integers as to legacy strings_to_categorical=True behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?

We don't create a string column with this option - string data in the file is hashed in the kernel and we just return an int32 column. So, yeah, something like df["A"] = df["A"].hash_values() is likely to have higher memory use at times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't create a string column with this option - string data in the file is hashed in the kernel and we just return an int32 column. So, yeah, something like df["A"] = df["A"].hash_values() is likely to have higher memory use at times.

That's too bad, but certainly makes sense. I don't think anyone was actually using strings_to_categorical, but I get the sense that people will be starting to ask for exactly this functionality in the near future :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are practical use cases, I wouldn't mind the feature as a "read strings as int" conversion support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vuule, that's good to know. I'll raise an issue if/when I have a practical use case to point to (otherwise this is low priority).

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Dask Reviewer labels Jul 24, 2023
@galipremsagar
Copy link
Contributor Author

/merge

@bdice bdice dismissed their stale review July 24, 2023 20:19

Prem thinks there's a GH UI bug. Re-approving.

@bdice bdice self-requested a review July 24, 2023 20:19
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving.

@raydouglass raydouglass removed the request for review from brandon-b-miller July 24, 2023 20:24
@raydouglass raydouglass changed the title Enforce deprecations in 23.10 Enforce deprecations in 23.10 Jul 24, 2023
@raydouglass raydouglass changed the title Enforce deprecations in 23.10 Enforce deprecations in 23.10 Jul 24, 2023
@raydouglass raydouglass changed the title Enforce deprecations in 23.10 Enforce deprecations in 23.10 Jul 24, 2023
@raydouglass raydouglass changed the title Enforce deprecations in 23.10 Enforce deprecations in 23.10 Jul 24, 2023
@rapids-bot rapids-bot bot merged commit 2a590db into rapidsai:branch-23.10 Jul 24, 2023
@wence-
Copy link
Contributor

wence- commented Jul 25, 2023

I'll just note that by removing the cython wrapping of the parquet options builder path there's now no way to call the libcudf functionality from Python. This might mean that there's no longer a need for the libcudf code at all?

@galipremsagar
Copy link
Contributor Author

I'll just note that by removing the cython wrapping of the parquet options builder path there's now no way to call the libcudf functionality from Python. This might mean that there's no longer a need for the libcudf code at all?

Yup, @vuule would likely be removing it from libcudf code aswell.

@vyasr vyasr added dask Dask issue and removed dask-cudf labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change dask Dask issue improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants