Enforce deprecations in 23.10
#13732
Thanks @galipremsagar - I'm on board with this deprecation/removal. However, I'd like to get some clarification on the new "best practice".
isinstance(meta_cudf._data[col], cudf.core.column.StringColumn)
and strings_to_cats
):
meta_cudf._data[col] = meta_cudf._data[col].astype("int32")
I know that some RecSys-based users have historically relied on strings_to_categorical=True to deal with the fact that their ML/DL model would often require them to convert string data to numerical data (but they needed to preserve information in the original string data on disk, and couldn't work with a fully-numerical Parquet file).
What is the new "best practice" for these users? Should we suggest something like df["A"] = df["A"].hash_values()? Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?
cc @EvenOldridge (in case you had further input here)
Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?
@vuule would be the right one to know if there was any such thing happening.
What is the new "best practice" for these users? Should we suggest something like df["A"] = df["A"].hash_values()?
If someone was relying on this, yes, hash_values would be the consistent alternative. hash_values by itself is consistent, but won't yield the same integers as the legacy strings_to_categorical=True behavior.
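To make the "consistent, but different integers" point concrete, here is a minimal CPU-only sketch of hashing a string column to int32 values. It is not cudf code: `hash_to_int32` and `hash_column` are hypothetical helpers, and `zlib.crc32` is only a stand-in hash, so the integers will match neither `hash_values()` nor the legacy strings_to_categorical=True output. The sketch only shows the property that matters here: the same string always maps to the same integer.

```python
import zlib


def hash_to_int32(value: str) -> int:
    """Deterministically map a string to a signed 32-bit integer.

    Assumption: CRC32 is used purely as a stand-in hash for illustration.
    """
    unsigned = zlib.crc32(value.encode("utf-8"))  # in 0 .. 2**32 - 1
    # Reinterpret the unsigned 32-bit value as a signed int32.
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


def hash_column(values):
    """Hash every non-null string in a column, leaving nulls as None."""
    return [None if v is None else hash_to_int32(v) for v in values]


column = ["apple", "banana", "apple", None]
hashed = hash_column(column)
```

Repeated values ("apple" above) get the same integer on every run, which is what a model's encoding step needs. Unlike the removed reader option, though, this approach requires the full string column to be materialized before it is hashed.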
Was strings_to_categorical=True previously doing something clever to avoid reading the entire string column into GPU memory?
We don't create a string column with this option - string data in the file is hashed in the kernel and we just return an int32 column. So, yeah, something like df["A"] = df["A"].hash_values() is likely to have higher memory use at times.
We don't create a string column with this option - string data in the file is hashed in the kernel and we just return an int32 column. So, yeah, something like df["A"] = df["A"].hash_values() is likely to have higher memory use at times.
That's too bad, but certainly makes sense. I don't think anyone was actually using strings_to_categorical, but I get the sense that people will be starting to ask for exactly this functionality in the near future :/
If there are practical use cases, I wouldn't mind the feature as a "read strings as int" conversion support.
Thanks @vuule, that's good to know. I'll raise an issue if/when I have a practical use case to point to (otherwise this is low priority).
/merge
Prem thinks there's a GH UI bug. Re-approving.
I'll just note that by removing the Cython wrapping of the parquet options builder path there's now no way to call the libcudf functionality from Python. This might mean that there's no longer a need for the libcudf code at all?
Yup, @vuule would likely be removing it from the libcudf code as well.
Description
This PR enforces, in 23.10, the deprecations that were in place through 23.08. It removes strings_to_categorical parameter support in read_parquet.
Checklist