[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. #18992

Conversation

@clarkzinzow (Contributor) commented Sep 30, 2021

Users have been confused by ray.data.from_pandas() and ray.data.from_arrow() taking a list of object references to tables instead of just a list of tables, and have likewise been confused by Dataset.to_pandas() and Dataset.to_arrow() returning object references instead of tables. This PR renames these ref-centric APIs with a _refs suffix to make their signatures clearer, and adds new APIs under the current names that take/return the raw tables. It also marks the _refs variants as developer APIs to make clear that they aren't intended as end-user APIs.
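
For illustration, a minimal usage sketch of the split being described, assuming the ref-centric variants are named from_pandas_refs()/to_pandas_refs() per the _refs suffix convention (exact signatures may differ from the merged code):

```python
import pandas as pd
import ray

df = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})

# End-user APIs: take and return plain Pandas objects directly.
ds = ray.data.from_pandas([df])          # no ray.put() wrapping needed
out_df = ds.to_pandas()                  # a single pandas.DataFrame

# Ref-centric developer APIs, renamed with a ``_refs`` suffix.
ds_from_refs = ray.data.from_pandas_refs([ray.put(df)])
refs = ds_from_refs.to_pandas_refs()     # a list of ObjectRef[pandas.DataFrame]
dfs = ray.get(refs)
```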

Related issue number

Closes #18978

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl (Contributor) left a comment:

Hmm I'm in favor of renaming the existing methods, but I don't think we should add methods that make it easy to OOM the driver.

Calling .to_pandas() is likely to instantly cause a crash for any reasonably sized dataset; how about we omit this?

Fine to keep the from_() methods though.

@ericl (Contributor) left a comment:

Can you also rename get_blocks to get_internal_block_refs()?

@ericl added the @author-action-required label (the PR author is responsible for the next step) Sep 30, 2021
@clarkzinzow (Contributor, Author) commented Sep 30, 2021

> Hmm I'm in favor of renaming the existing methods, but I don't think we should add methods that make it easy to OOM the driver.
>
> Calling .to_pandas() is likely to instantly cause a crash for any reasonably sized dataset; how about we omit this?

Yeah, this is a tough one. I know this can be a big foot-gun, and I know we're aiming to provide streaming APIs (e.g. give me 5 rows) for introspecting Datasets, but users are going to want a way to poke at small datasets that fit in memory when first starting out, and not having a plain .to_pandas() API that returns a DataFrame will be seen as a big omission. Also, having a .to_pandas() API for poking at small datasets was explicitly called for by both the CUJ and a few users.

We could just direct people to use ray.get(ds.to_pandas_refs()), but then we're recommending that end-users take extra hops to get what they want, and we're telling end-users to use a developer API. Is that exploration UX hit worth it?
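
Roughly, the two exploration paths being weighed here look like this (a sketch assuming a small dataset whose blocks fit on the driver; the dataset and the pd.concat step are illustrative, not the final API behavior):

```python
import pandas as pd
import ray

ds = ray.data.range(100)  # hypothetical small dataset for illustration

# Developer-API path: extra hops through object refs, then manual coalescing.
block_dfs = ray.get(ds.to_pandas_refs())     # one DataFrame per block
df = pd.concat(block_dfs, ignore_index=True)

# Proposed end-user path: one call, one DataFrame.
df = ds.to_pandas()
```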

@clarkzinzow (Contributor, Author) commented:

@ericl I'll remove it for now; we can revisit it if users ask for it.

@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from a52988a to ffa9c6c on September 30, 2021 13:36
@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from ffa9c6c to 7000e76 on September 30, 2021 15:19
@ericl (Contributor) commented Sep 30, 2021

How about adding to_pandas(limit=1000) by default then? That would return a single coalesced pandas DataFrame, which is the most convenient, and also print a warning if the DataFrame was truncated to the limit.
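
A rough sketch of the behavior being proposed, not the actual implementation; the use of count() and to_pandas_refs() here is illustrative:

```python
import warnings

import pandas as pd
import ray


def to_pandas(self, limit: int = 1000) -> "pd.DataFrame":
    # Sketch: return one coalesced DataFrame, truncated to ``limit`` rows,
    # and warn if the dataset was larger than the limit.
    if self.count() > limit:
        warnings.warn(
            f"Only returning the first {limit} rows; pass a larger `limit` or "
            "use .to_pandas_refs() to access the full, distributed data."
        )
    pieces, remaining = [], limit
    for ref in self.to_pandas_refs():
        block = ray.get(ref).head(remaining)
        pieces.append(block)
        remaining -= len(block)
        if remaining <= 0:
            break
    if not pieces:
        return pd.DataFrame()
    return pd.concat(pieces, ignore_index=True)
```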

@clarkzinzow (Contributor, Author) commented:

@ericl That's a great idea! Best of both worlds. 😄

@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from e7ffdc6 to b33a0e4 on September 30, 2021 23:27
@scv119 (Contributor) commented Oct 1, 2021

Looks good to me. Delegating to @ericl to accept.

@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from 8124fdf to 3ff7931 on October 1, 2021 14:00
@clarkzinzow (Contributor, Author) commented:

Windows test failures are unrelated.

@clarkzinzow added the tests-ok label (test failures certified unrelated) and removed the @author-action-required label Oct 1, 2021
@clarkzinzow requested a review from @ericl October 1, 2021
@ericl previously requested changes Oct 1, 2021
```diff
@@ -55,7 +55,7 @@ If you already have a Parquet dataset with columns containing serialized tensors

 # Write the dataset to Parquet. The tensor column will be written as an
 # array of opaque byte blobs.
-ds = ray.data.from_pandas([ray.put(df)])
+ds = ray.data.from_pandas([df])
```
@ericl (Contributor) commented:

How about we take *args for from_pandas()? I feel the common case is passing just a single df if not passing distributed refs, so accepting a list is a bit weird.

@clarkzinzow (Contributor, Author) replied:

@ericl Agreed, I'm primarily trying to not step on this PR's toes!

```diff
 ``modin.distributed.dataframe.pandas.partitions.from_partitions()``.

 This is only supported for datasets convertible to Arrow records.
 This function induces a copy of the data. For zero-copy access to the
-underlying data, consider using ``.to_arrow()`` or ``.get_blocks()``.
+underlying data, consider using ``.to_arrow()`` or
+``.get_internal_block_refs()``.
```

Thx for updating internal comments.



```diff
 @PublicAPI(stability="beta")
-def from_pandas(dfs: List[ObjectRef["pandas.DataFrame"]]) -> Dataset[ArrowRow]:
+def from_pandas(dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
     """Create a dataset from a set of Pandas dataframes.
```

Suggested change:
```diff
-def from_pandas(dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
+def from_pandas(*dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
```
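
If that variadic signature were adopted, the common single-DataFrame case would read naturally while multiple DataFrames could still be passed without list-wrapping (a hypothetical usage sketch; df, df1, and df2 are placeholders):

```python
import pandas as pd
import ray

df = pd.DataFrame({"a": [1, 2, 3, 4]})
df1, df2 = df.iloc[:2], df.iloc[2:]

ds_single = ray.data.from_pandas(df)        # single DataFrame, no [df] wrapping
ds_multi = ray.data.from_pandas(df1, df2)   # several DataFrames as varargs
```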

@ericl added the @author-action-required label and removed the tests-ok label Oct 1, 2021
@ericl (Contributor) commented Oct 1, 2021

Looks good, just one more suggestion.

@ericl added the do-not-merge label Oct 1, 2021
@clarkzinzow removed the @author-action-required and do-not-merge labels Oct 1, 2021
@clarkzinzow requested a review from @ericl October 1, 2021

Successfully merging this pull request may close these issues.

[Datasets] Improve Pandas/Arrow integration APIs