[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. #18992
Conversation
Hmm, I'm in favor of renaming the existing methods, but I don't think we should add methods that make it easy to OOM the driver.
Calling `.to_pandas()` is likely to instantly crash the driver for any reasonably sized dataset; how about we omit this?
Fine to keep the `from_()` methods though.
Can you also rename get_blocks to get_internal_block_refs()?
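For context, a minimal sketch of how the renamed accessor would be used, assuming it lands as discussed in this thread (illustrative only):

```python
import pandas as pd
import ray

ds = ray.data.from_pandas([pd.DataFrame({"a": [1, 2, 3]})])

# The ref-centric accessor hands back ObjectRefs to the underlying blocks
# rather than materialized data; only ray.get() them if they fit in driver memory.
block_refs = ds.get_internal_block_refs()
blocks = ray.get(block_refs)
```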
Yeah, this is a tough one. I know that this can be a big foot-gun, and I know that we're aiming to provide streaming APIs (e.g. give me 5 rows) for introspecting Datasets, but users are going to want a way to poke at small datasets that fit in memory when first starting out, and not having a plain `.to_pandas()` would hurt that. We could just direct people to use the ref-based APIs instead.
@ericl I'll remove it for now, we can revisit it if users ask for it.
Force-pushed from a52988a to ffa9c6c
Force-pushed from ffa9c6c to 7000e76
How about adding `to_pandas(limit=1000)` by default then? That should return a single coalesced pandas DataFrame, which would be the most convenient, and also print a warning if the DataFrame was truncated to the limit.
@ericl That's a great idea! Best of both worlds. 😄
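A minimal, library-agnostic sketch of the limit-and-warn behavior being agreed on here; plain pandas DataFrames stand in for Dataset blocks, and the function name and default limit are illustrative:

```python
import warnings
from typing import Iterable

import pandas as pd


def to_pandas_with_limit(blocks: Iterable[pd.DataFrame], limit: int = 1000) -> pd.DataFrame:
    """Coalesce blocks into one DataFrame, truncating at `limit` rows with a warning."""
    collected = []
    rows = 0
    for block in blocks:
        remaining = limit - rows
        if len(block) > remaining:
            # Take only what fits, warn, and stop pulling further blocks.
            collected.append(block.iloc[:remaining])
            warnings.warn(
                f"Dataset was truncated to the first {limit} rows; "
                "pass a larger limit to fetch more."
            )
            break
        collected.append(block)
        rows += len(block)
    return pd.concat(collected, ignore_index=True) if collected else pd.DataFrame()
```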
Force-pushed from e7ffdc6 to b33a0e4
looks good to me. delegating to @ericl to accept.
Force-pushed from 8124fdf to 3ff7931
Windows test failures are unrelated.
@@ -55,7 +55,7 @@ If you already have a Parquet dataset with columns containing serialized tensors
 # Write the dataset to Parquet. The tensor column will be written as an
 # array of opaque byte blobs.
-ds = ray.data.from_pandas([ray.put(df)])
+ds = ray.data.from_pandas([df])
How about we take `*args` for `from_pandas()`? I feel the common case is passing just a single df if not passing distributed refs, so accepting a list is a bit weird.
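A small sketch of the argument handling that suggestion implies; `_normalize_from_pandas_args` is a hypothetical helper name, shown only to illustrate accepting both call styles:

```python
from typing import List, Union

import pandas as pd


def _normalize_from_pandas_args(
    *dfs: Union[pd.DataFrame, List[pd.DataFrame]]
) -> List[pd.DataFrame]:
    """Flatten both call styles into a list of DataFrames."""
    # Legacy style: from_pandas([df1, df2]) passes a single list positionally.
    if len(dfs) == 1 and isinstance(dfs[0], list):
        return list(dfs[0])
    # New style: from_pandas(df1, df2) or from_pandas(df).
    return list(dfs)
```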
 ``modin.distributed.dataframe.pandas.partitions.from_partitions()``.

 This is only supported for datasets convertible to Arrow records.
 This function induces a copy of the data. For zero-copy access to the
-underlying data, consider using ``.to_arrow()`` or ``.get_blocks()``.
+underlying data, consider using ``.to_arrow()`` or ``.get_internal_block_refs()``.
Thx for updating internal comments.
 @PublicAPI(stability="beta")
-def from_pandas(dfs: List[ObjectRef["pandas.DataFrame"]]) -> Dataset[ArrowRow]:
+def from_pandas(dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
     """Create a dataset from a set of Pandas dataframes.
-def from_pandas(dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
+def from_pandas(*dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
Looks good, just one more suggestion.
Users have been confused by `ray.data.from_pandas()` and `ray.data.from_arrow()` taking a list of object references to tables instead of just a list of tables, and have likewise been confused by `Dataset.to_pandas()` and `Dataset.to_arrow()` returning object references instead of tables. This PR renames these ref-centric APIs to have a `_refs` suffix to make their signatures clearer, and adds new APIs that take/return the raw tables under the current ref-centric API names. This PR also marks the former as developer APIs to make it clear that the ref-centric APIs aren't end-user APIs.
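Roughly, the intended usage split looks like this (illustrative; exact names and signatures in released versions may differ):

```python
import pandas as pd
import ray

df = pd.DataFrame({"a": [1, 2, 3]})

# Raw APIs: take/return plain tables.
ds = ray.data.from_pandas([df])
out_df = ds.to_pandas()

# Ref-centric APIs (developer-facing): take/return Ray object references.
ds2 = ray.data.from_pandas_refs([ray.put(df)])
ref_dfs = ds2.to_pandas_refs()
dfs = ray.get(ref_dfs)
```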
Related issue number

Closes #18978
Checks

- I've run `scripts/format.sh` to lint the changes in this PR.