FEAT-#6492: Add from_map feature to create dataframe (#7215)

Conversation
Signed-off-by: Igoshev, Iaroslav <[email protected]>
@@ -258,3 +261,66 @@ def func(df, **kw):  # pragma: no cover
        UnidistWrapper.materialize(
            [part.list_of_blocks[0] for row in result for part in row]
        )

    @classmethod
    def from_map(cls, func, iterable, *args, **kwargs):
Is it possible to use already implemented functions with num_splits=1?
I don't quite get what you would like to use instead. Please elaborate. We are adding a new from_map by analogy with the other IO functions.
I suppose we can't reuse anything from the existing functionality, as every method of a Modin Dataframe assumes there is already a dataframe with partitions to apply a function to.
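To illustrate the point above, here is a minimal, engine-agnostic sketch of building a dataframe from scratch, one piece per element of the iterable. The function name and the eager pd.concat are hypothetical simplifications: Modin keeps the resulting partitions distributed rather than concatenating them.

```python
import pandas as pd

def from_map_sketch(func, iterable, *args, **kwargs):
    """Hypothetical sketch: call `func` on each element of `iterable`,
    producing one pandas 'partition' per element. There is no existing
    dataframe to apply a function to, which is why axis-wise apply
    machinery cannot be reused here."""
    partitions = []
    for obj in iterable:
        result = func(obj, *args, **kwargs)
        # Wrap in a DataFrame if the user's func didn't return one.
        if not isinstance(result, pd.DataFrame):
            result = pd.DataFrame(result)
        partitions.append(result)
    # Real Modin would keep these distributed; we concatenate for illustration.
    return pd.concat(partitions, ignore_index=True)

df = from_map_sketch(lambda x: {"v": [x * 2]}, range(3))
```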
@@ -1109,6 +1109,36 @@ def from_dask(dask_obj) -> DataFrame:
    return ModinObjects.DataFrame(query_compiler=FactoryDispatcher.from_dask(dask_obj))


def from_map(func, iterable, *args, **kwargs) -> DataFrame:
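A hedged usage sketch of the signature above, written against plain pandas so it is self-contained (read_one is a hypothetical per-element producer; the real from_map would distribute these calls across partitions and return a Modin DataFrame instead of a concatenated pandas one):

```python
import pandas as pd

def read_one(name):
    # Hypothetical producer: with from_map, `func` is applied to each
    # element of `iterable` independently.
    return pd.DataFrame({"source": [name], "rows": [1]})

# Pandas-only equivalent of: df = from_map(read_one, ["a.csv", "b.csv"])
frames = [read_one(p) for p in ["a.csv", "b.csv"]]
df = pd.concat(frames, ignore_index=True)
```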
Documentation needs to be updated, I suppose.
We don't have docs for such methods as from_pandas, from_ray, from_dask, etc. Do you think we should update the docs on this matter in one go, in a separate issue?
OK
@YarShev, are you going to do this before the release?
Yes, that would be great - #7256.
[
    [
        cls.frame_partition_cls(
            deploy_map_func.remote(func, obj, *args, **kwargs)
I suggest using RayWrapper.deploy here.
And corresponding wrappers for other engines.
RayWrapper.deploy deploys a function that can return any object, but here we intentionally wrap the result in a pandas DataFrame if the user hasn't done so. I would leave the changes as is. What do you think?
To reduce the likelihood of error, we need to either have all launch options in one place or use only one method. There is a tendency for launching functions to become more complicated due to additional parameters. A good example is resources=RayTaskCustomResources.get(), which is currently not taken into account here. We can move this function to engine_wrapper.py and call it inside RayWrapper.deploy using an additional parameter.
Re-used *.deploy.
partitions = np.array(
    [
        [
            cls.frame_partition_cls(
                DaskWrapper.deploy(
                    func,
                    f_args=(obj,) + args,
                    f_kwargs=kwargs,
                    return_pandas_df=True,
                )
            )
        ]
        for obj in iterable
    ]
)
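The return_pandas_df=True argument indicates the engine wrapper itself normalizes the function's result. A minimal sketch of that normalization, with a hypothetical synchronous stand-in for deploy (a real engine wrapper would submit the function remotely rather than call it inline):

```python
import pandas as pd

def deploy_sketch(func, f_args=(), f_kwargs=None, return_pandas_df=False):
    """Hypothetical stand-in for an engine wrapper's deploy():
    run `func` and, when requested, coerce its result to a pandas
    DataFrame so every partition holds a uniform type."""
    result = func(*f_args, **(f_kwargs or {}))
    if return_pandas_df and not isinstance(result, pd.DataFrame):
        result = pd.DataFrame(result)
    return result

out = deploy_sketch(lambda x: {"a": [x]}, f_args=(5,), return_pandas_df=True)
```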
Based on the information required to perform this task, it seems that a more appropriate level at which to define the function would be the partition manager, for example somewhere around:

def create_partition_from_metadata(cls, **metadata):
I would leave it here. Imagine a case when iterable is a list of files.
> Imagine a case when iterable is a list of files.

We'll be abstracting from the parameters just like we're doing now, so I don't see any difference.
@pytest.mark.skipif(
    condition=Engine.get() not in ("Ray", "Dask", "Unidist"),
Would it be more correct to limit it not by engines, but by the storage format: pandas?
PandasOnPython wouldn't work. Let's leave it as is.
> PandasOnPython wouldn't work.

As far as I can see, there are no restrictions on its operation. We just need to add essentially the same code as for the other engines.
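A sketch of what such an eager, single-process variant could look like, assuming it mirrors the Ray/Dask versions but runs func immediately (all names here are hypothetical, not Modin's actual PandasOnPython code; np.empty is used because building an object array of DataFrames via np.array can misbehave):

```python
import numpy as np
import pandas as pd

def from_map_python_sketch(func, iterable, *args, **kwargs):
    # Same shape as the distributed versions: a column vector of
    # partitions, one per element of `iterable`, but executed eagerly.
    frames = [pd.DataFrame(func(obj, *args, **kwargs)) for obj in iterable]
    partitions = np.empty((len(frames), 1), dtype=object)
    for i, frame in enumerate(frames):
        partitions[i, 0] = frame
    return partitions

parts = from_map_python_sketch(lambda x: {"v": [x]}, [1, 2])
```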
What do these changes do?

- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- docs/development/architecture.rst is up-to-date