[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. #18992

Conversation

@clarkzinzow (Contributor) commented Sep 30, 2021

Users have been confused by ray.data.from_pandas() and ray.data.from_arrow() taking a list of object references to tables instead of just a list of tables, and have likewise been confused by Dataset.to_pandas() and Dataset.to_arrow() returning object references instead of tables. This PR renames these ref-centric APIs with a _refs suffix to make their signatures clearer, and adds new APIs under the current names that take/return the raw tables. It also marks the _refs variants as developer APIs to make clear that they aren't intended as end-user APIs.
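
For illustration, a minimal usage sketch of the split being described, assuming the ref-centric variants are named from_pandas_refs()/to_pandas_refs() per the _refs suffix convention (exact signatures may differ from the merged code):

```python
import pandas as pd
import ray

df = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})

# End-user APIs: take and return plain Pandas objects directly.
ds = ray.data.from_pandas([df])          # no ray.put() wrapping needed
out_df = ds.to_pandas()                  # a single pandas.DataFrame

# Ref-centric developer APIs, renamed with a ``_refs`` suffix.
ds_from_refs = ray.data.from_pandas_refs([ray.put(df)])
refs = ds_from_refs.to_pandas_refs()     # a list of ObjectRef[pandas.DataFrame]
dfs = ray.get(refs)
```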

Related issue number

Closes #18978

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl (Contributor) left a comment:

Hmm I'm in favor of renaming the existing methods, but I don't think we should add methods that make it easy to OOM the driver.

Calling .to_pandas() is likely to instantly cause a crash for any reasonably sized dataset; how about we omit this?

Fine to keep the from_() methods though.

@ericl (Contributor) left a comment:

Can you also rename get_blocks to get_internal_block_refs()?

@ericl added the @author-action-required label (the PR author is responsible for the next step) Sep 30, 2021
@clarkzinzow (Contributor, Author) commented Sep 30, 2021

> Hmm I'm in favor of renaming the existing methods, but I don't think we should add methods that make it easy to OOM the driver.
>
> Calling .to_pandas() is likely to instantly cause a crash for any reasonably sized dataset; how about we omit this?

Yeah, this is a tough one. I know this can be a big foot-gun, and I know we're aiming to provide streaming APIs (e.g. give me 5 rows) for introspecting Datasets, but users are going to want a way to poke at small datasets that fit in memory when first starting out, and not having a plain .to_pandas() API that returns a DataFrame will be seen as a big omission. Also, having a .to_pandas() API for poking at small datasets was explicitly called for by both the CUJ and a few users.

We could just direct people to use ray.get(ds.to_pandas_refs()), but then we're recommending that end-users take extra hops to get what they want, and we're telling end-users to use a developer API. Is that exploration UX hit worth it?
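
Roughly, the two exploration paths being weighed here look like this (a sketch assuming a small dataset whose blocks fit on the driver; the dataset and the pd.concat step are illustrative, not the final API behavior):

```python
import pandas as pd
import ray

ds = ray.data.range(100)  # hypothetical small dataset for illustration

# Developer-API path: extra hops through object refs, then manual coalescing.
block_dfs = ray.get(ds.to_pandas_refs())     # one DataFrame per block
df = pd.concat(block_dfs, ignore_index=True)

# Proposed end-user path: one call, one DataFrame.
df = ds.to_pandas()
```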

@clarkzinzow (Contributor, Author) commented:

@ericl I'll remove it for now; we can revisit it if users ask for it.

@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from a52988a to ffa9c6c on September 30, 2021 13:36
@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from ffa9c6c to 7000e76 on September 30, 2021 15:19
@ericl (Contributor) commented Sep 30, 2021

How about adding to_pandas(limit=1000) by default then? That would return a single coalesced pandas DataFrame, which is the most convenient, and also print a warning if the DataFrame was truncated to the limit.
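
A rough sketch of the behavior being proposed, not the actual implementation; the use of count() and to_pandas_refs() here is illustrative:

```python
import warnings

import pandas as pd
import ray


def to_pandas(self, limit: int = 1000) -> "pd.DataFrame":
    # Sketch: return one coalesced DataFrame, truncated to ``limit`` rows,
    # and warn if the dataset was larger than the limit.
    if self.count() > limit:
        warnings.warn(
            f"Only returning the first {limit} rows; pass a larger `limit` or "
            "use .to_pandas_refs() to access the full, distributed data."
        )
    pieces, remaining = [], limit
    for ref in self.to_pandas_refs():
        block = ray.get(ref).head(remaining)
        pieces.append(block)
        remaining -= len(block)
        if remaining <= 0:
            break
    if not pieces:
        return pd.DataFrame()
    return pd.concat(pieces, ignore_index=True)
```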

@clarkzinzow (Contributor, Author) commented:

@ericl That's a great idea! Best of both worlds. 😄

@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from e7ffdc6 to b33a0e4 on September 30, 2021 23:27
@scv119 (Contributor) commented Oct 1, 2021

Looks good to me. Delegating to @ericl to accept.

@clarkzinzow force-pushed the datasets/fix/pandas-arrow-integration-api branch from 8124fdf to 3ff7931 on October 1, 2021 14:00
@clarkzinzow (Contributor, Author) commented:

Windows test failures are unrelated.

@clarkzinzow added the tests-ok label (test failures certified unrelated) and removed the @author-action-required label Oct 1, 2021
@clarkzinzow requested a review from @ericl October 1, 2021
@ericl previously requested changes Oct 1, 2021
```diff
@@ -55,7 +55,7 @@ If you already have a Parquet dataset with columns containing serialized tensors

 # Write the dataset to Parquet. The tensor column will be written as an
 # array of opaque byte blobs.
-ds = ray.data.from_pandas([ray.put(df)])
+ds = ray.data.from_pandas([df])
```
@ericl (Contributor) commented:

How about we take *args for from_pandas()? I feel the common case is passing just a single df if not passing distributed refs, so accepting a list is a bit weird.

@clarkzinzow (Contributor, Author) replied:

@ericl Agreed, I'm primarily trying to not step on this PR's toes!

```diff
 ``modin.distributed.dataframe.pandas.partitions.from_partitions()``.

 This is only supported for datasets convertible to Arrow records.
 This function induces a copy of the data. For zero-copy access to the
-underlying data, consider using ``.to_arrow()`` or ``.get_blocks()``.
+underlying data, consider using ``.to_arrow()`` or
+``.get_internal_block_refs()``.
```

Thx for updating internal comments.



```diff
 @PublicAPI(stability="beta")
-def from_pandas(dfs: List[ObjectRef["pandas.DataFrame"]]) -> Dataset[ArrowRow]:
+def from_pandas(dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
     """Create a dataset from a set of Pandas dataframes.
```

Suggested change:
```diff
-def from_pandas(dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
+def from_pandas(*dfs: List["pandas.DataFrame"]) -> Dataset[ArrowRow]:
```
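
If that variadic signature were adopted, the common single-DataFrame case would read naturally while multiple DataFrames could still be passed without list-wrapping (a hypothetical usage sketch; df, df1, and df2 are placeholders):

```python
import pandas as pd
import ray

df = pd.DataFrame({"a": [1, 2, 3, 4]})
df1, df2 = df.iloc[:2], df.iloc[2:]

ds_single = ray.data.from_pandas(df)        # single DataFrame, no [df] wrapping
ds_multi = ray.data.from_pandas(df1, df2)   # several DataFrames as varargs
```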

@ericl added the @author-action-required label and removed the tests-ok label Oct 1, 2021
@ericl (Contributor) commented Oct 1, 2021

Looks good, just one more suggestion.

@ericl added the do-not-merge label Oct 1, 2021
@clarkzinzow removed the @author-action-required and do-not-merge labels Oct 1, 2021
@clarkzinzow requested a review from @ericl October 1, 2021

Successfully merging this pull request may close these issues.

[Datasets] Improve Pandas/Arrow integration APIs