Enforce deprecations in `23.10` #13732

galipremsagar · 2023-07-21T23:31:46Z

Description

This PR enforces, in 23.10, the deprecations that were in place through 23.08. It removes support for the `strings_to_categorical` parameter in `read_parquet`.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora

Thanks @galipremsagar - I'm on board with this deprecation/removal. However, I'd like to get some clarification on the new "best practice".

rjzamora · 2023-07-24T18:07:16Z

python/dask_cudf/dask_cudf/io/parquet.py

-                isinstance(meta_cudf._data[col], cudf.core.column.StringColumn)
-                and strings_to_cats
-            ):
-                meta_cudf._data[col] = meta_cudf._data[col].astype("int32")


I know that some RecSys-based users have historically relied on strings_to_categorical=True to deal with the fact that their ML/DL model would often require them to convert string data to numerical data (but they needed to preserve information in the original string data on disk, and couldn't work with a fully-numerical Parquet file).

What is the new "best practice" for these users? Should we suggest something like df["A"] = df["A"].hash_values()? Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?

cc @EvenOldridge (in case you had further input here)

> Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?

@vuule would be the right one to know if there was any such thing happening.

> What is the new "best practice" for these users? Should we suggest something like df["A"] = df["A"].hash_values()?

If someone was relying on this, then yes, `hash_values` would be the consistent alternative. `hash_values` is consistent with itself, but it won't yield the same integers as the legacy `strings_to_categorical=True` behavior.
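As a rough illustration of the hashing approach discussed here, the sketch below maps strings to signed 32-bit integers in plain Python. It uses `zlib.crc32` purely as a stand-in. That choice is an assumption for illustration: it is not the hash behind cuDF's `hash_values` or the legacy `strings_to_categorical=True` path, so the integers will differ from both. The helper name `strings_to_int32` is hypothetical.

```python
import zlib


def strings_to_int32(values):
    """Map each string to a deterministic signed 32-bit integer.

    zlib.crc32 is a stand-in chosen for illustration only; it is an
    assumption, not the hash that cuDF uses.
    """
    codes = []
    for value in values:
        if value is None:
            codes.append(None)  # keep nulls as nulls
            continue
        unsigned = zlib.crc32(value.encode("utf-8"))  # range 0 .. 2**32 - 1
        # Reinterpret the unsigned result as a signed int32.
        signed = unsigned - 2**32 if unsigned >= 2**31 else unsigned
        codes.append(signed)
    return codes


column = ["apple", "banana", "apple", None]
codes = strings_to_int32(column)
```

The same string always maps to the same integer, regardless of which other values are present. That makes a hash usable across files or partitions without a shared dictionary, at the cost of possible collisions.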

> Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?

We don't create a string column with this option - string data in the file is hashed in the kernel and we just return an int32 column. So, yeah, something like df["A"] = df["A"].hash_values() is likely to have higher memory use at times.
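To make the memory point concrete, here is a back-of-the-envelope estimate. It assumes a string column is stored as a character buffer plus one 32-bit offset per row (plus one trailing offset), and it ignores null masks and temporary buffers. These are simplifying assumptions rather than measurements of cuDF, and the function names are hypothetical.

```python
def string_column_bytes(num_rows, avg_len_bytes, offset_bytes=4):
    """Approximate size of a string column: characters plus offsets."""
    chars = num_rows * avg_len_bytes
    offsets = (num_rows + 1) * offset_bytes
    return chars + offsets


def int32_column_bytes(num_rows):
    """Size of a 32-bit integer column with one value per row."""
    return num_rows * 4


rows = 100_000_000
string_bytes = string_column_bytes(rows, avg_len_bytes=20)
int_bytes = int32_column_bytes(rows)

# Reading strings first and hashing afterwards holds both columns at once,
# whereas hashing during the read only ever materializes the integers.
peak_read_then_hash = string_bytes + int_bytes
peak_hash_in_reader = int_bytes
```

Under these assumptions, 100 million rows averaging 20 bytes each need about 2.4 GB as strings versus 0.4 GB as 32-bit integers, which is the gap the in-kernel hashing avoided.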

> We don't create a string column with this option - string data in the file is hashed in the kernel and we just return an int32 column. So, yeah, something like df["A"] = df["A"].hash_values() is likely to have higher memory use at times.

That's too bad, but certainly makes sense. I don't think anyone was actually using strings_to_categorical, but I get the sense that people will be starting to ask for exactly this functionality in the near future :/

If there are practical use cases, I wouldn't mind the feature as a "read strings as int" conversion support.

Thanks @vuule, that's good to know. I'll raise an issue if/when I have a practical use case to point to (otherwise this is low priority).

galipremsagar · 2023-07-24T20:16:34Z

/merge

Prem thinks there's a GH UI bug. Re-approving.

bdice

Re-approving.

wence- · 2023-07-25T12:46:57Z

I'll just note that by removing the cython wrapping of the parquet options builder path there's now no way to call the libcudf functionality from Python. This might mean that there's no longer a need for the libcudf code at all?

galipremsagar · 2023-07-25T15:04:54Z

> I'll just note that by removing the cython wrapping of the parquet options builder path there's now no way to call the libcudf functionality from Python. This might mean that there's no longer a need for the libcudf code at all?

Yup, @vuule will likely remove it from the libcudf code as well.

galipremsagar added 2 commits July 21, 2023 16:24

Drop existing deprecations

0f5a86d

Drop support in dask_cudf

221b258

galipremsagar added `Python` (Affects Python cuDF API), `4 - Needs cuDF (Python) Reviewer`, `improvement` (Improvement / enhancement to an existing function), and `breaking` (Breaking change) labels Jul 21, 2023

galipremsagar self-assigned this Jul 21, 2023

galipremsagar requested review from a team as code owners July 21, 2023 23:31

galipremsagar requested review from mroeschke and brandon-b-miller and removed request for a team July 21, 2023 23:31

bdice previously approved these changes Jul 22, 2023


mroeschke approved these changes Jul 24, 2023


galipremsagar added `5 - Ready to Merge` (Testing and reviews complete, ready to merge) and `4 - Needs Dask Reviewer` and removed `4 - Needs cuDF (Python) Reviewer` and `5 - Ready to Merge` labels Jul 24, 2023

rjzamora approved these changes Jul 24, 2023


galipremsagar added `5 - Ready to Merge` (Testing and reviews complete, ready to merge) and removed `4 - Needs Dask Reviewer` labels Jul 24, 2023

bdice self-requested a review July 24, 2023 20:19

bdice approved these changes Jul 24, 2023


raydouglass removed the request for review from brandon-b-miller July 24, 2023 20:24

raydouglass changed the title ~~Enforce deprecations in 23.10~~ Enforce deprecations in 23.10 Jul 24, 2023

rapids-bot bot merged commit 2a590db into rapidsai:branch-23.10 Jul 24, 2023

vyasr added `dask` (Dask issue) and removed `dask-cudf` labels Feb 23, 2024


galipremsagar commented Jul 21, 2023

rjzamora left a comment

rjzamora Jul 24, 2023

rjzamora Jul 24, 2023

galipremsagar Jul 24, 2023

vuule Jul 24, 2023

rjzamora Jul 25, 2023

vuule Jul 25, 2023

rjzamora Jul 26, 2023

galipremsagar commented Jul 24, 2023

bdice left a comment

wence- commented Jul 25, 2023

galipremsagar commented Jul 25, 2023
