FEAT-#6492: Add from_map feature to create dataframe #7215
Changes from 4 commits
1955c72
139d3b5
1f4fde4
28b0805
9fa2891
7325e4f
@@ -15,7 +15,9 @@

import io

import numpy as np
import pandas
import unidist
from pandas.io.common import get_handle, stringify_path

from modin.core.execution.unidist.common import SignalActor, UnidistWrapper

@@ -62,6 +64,7 @@ class PandasOnUnidistIO(UnidistIO):
    """Factory providing methods for performing I/O operations using pandas as storage format on unidist as engine."""

    frame_cls = PandasOnUnidistDataframe
    frame_partition_cls = PandasOnUnidistDataframePartition
    query_compiler_cls = PandasQueryCompiler
    build_args = dict(
        frame_partition_cls=PandasOnUnidistDataframePartition,

@@ -258,3 +261,66 @@ def func(df, **kw):  # pragma: no cover
        UnidistWrapper.materialize(
            [part.list_of_blocks[0] for row in result for part in row]
        )

    @classmethod
    def from_map(cls, func, iterable, *args, **kwargs):

Review comment: Is it possible to use already implemented functions with
Review comment: I don't quite get what you would like to use instead. Please elaborate. We are adding a new
Review comment: I suppose we can't use anything from the existing functionality, as every method of a Modin Dataframe assumes there is a dataframe with partitions to apply a function to.

        """
        Create a Modin `query_compiler` from a map function.

        This method will construct a Modin `query_compiler` split by row partitions.
        The number of row partitions matches the number of elements in the iterable object.

        Parameters
        ----------
        func : callable
            Function to map across the iterable object.
        iterable : Iterable
            An iterable object.
        *args : tuple
            Positional arguments to pass in `func`.
        **kwargs : dict
            Keyword arguments to pass in `func`.

        Returns
        -------
        BaseQueryCompiler
            QueryCompiler containing data returned by map function.
        """
        func = cls.frame_cls._partition_mgr_cls.preprocess_func(func)
        partitions = np.array(
            [
                [
                    cls.frame_partition_cls(
                        deploy_map_func.remote(func, obj, *args, **kwargs)
                    )
                ]
                for obj in iterable
            ]
        )
        return cls.query_compiler_cls(cls.frame_cls(partitions))


@unidist.remote
def deploy_map_func(func, obj, *args, **kwargs):  # pragma: no cover
    """
    Deploy a func to apply to an object.

    Parameters
    ----------
    func : callable
        Function to map across the iterable object.
    obj : object
        An object to apply a function to.
    *args : tuple
        Positional arguments to pass in `func`.
    **kwargs : dict
        Keyword arguments to pass in `func`.

    Returns
    -------
    pandas.DataFrame
    """
    result = func(obj, *args, **kwargs)
    if not isinstance(result, pandas.DataFrame):
        result = pandas.DataFrame(result)
    return result
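The partition layout that `from_map` builds in this file (one row partition per element of the iterable, arranged as an N x 1 grid) can be modelled roughly with the standard library alone. This is only a sketch under stated assumptions: `concurrent.futures` stands in for unidist's `.remote` calls, a plain list of records stands in for a pandas DataFrame, and `build_row_partitions` and `make_record` are hypothetical names that do not exist in Modin.

```python
from concurrent.futures import ThreadPoolExecutor


def deploy_map_func(func, obj, *args, **kwargs):
    """Apply func to one element and normalize the result to a list.

    The list is a stand-in for the pandas.DataFrame wrapping done in the PR.
    """
    result = func(obj, *args, **kwargs)
    if not isinstance(result, list):
        result = [result]
    return result


def build_row_partitions(executor, func, iterable, *args, **kwargs):
    """Return an N x 1 grid of futures: one single-column row per element."""
    return [
        [executor.submit(deploy_map_func, func, obj, *args, **kwargs)]
        for obj in iterable
    ]


def make_record(x, scale=1):
    """Hypothetical user function mapped across the iterable."""
    return {"x": x, "scaled": x * scale}


with ThreadPoolExecutor(max_workers=2) as executor:
    grid = build_row_partitions(executor, make_record, range(3), scale=10)
    n_rows, n_cols = len(grid), len(grid[0])
    # Block until every task finishes, in row-major order.
    materialized = [part.result() for row in grid for part in row]
```

The grid has as many rows as the iterable has elements and exactly one column, which mirrors the shape of the `np.array` of partitions passed to `frame_cls` in the diff.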
@@ -1109,6 +1109,36 @@ def from_dask(dask_obj) -> DataFrame:
    return ModinObjects.DataFrame(query_compiler=FactoryDispatcher.from_dask(dask_obj))


def from_map(func, iterable, *args, **kwargs) -> DataFrame:

Review comment: Documentation needs to be updated, I suppose.
Review comment: We don't have docs for such methods as from_pandas, from_ray, from_dask, etc. Do you think we should update the docs on this matter in a separate issue in one go?
Review comment: OK
Review comment: @YarShev are you going to do this before release?
Review comment: Yes, that would be great - #7256.

    """
    Create a Modin DataFrame from map function applied to an iterable object.

    This method will construct a Modin DataFrame split by row partitions.
    The number of row partitions matches the number of elements in the iterable object.

    Parameters
    ----------
    func : callable
        Function to map across the iterable object.
    iterable : Iterable
        An iterable object.
    *args : tuple
        Positional arguments to pass in `func`.
    **kwargs : dict
        Keyword arguments to pass in `func`.

    Returns
    -------
    DataFrame
        A new Modin DataFrame object.
    """
    from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher

    return ModinObjects.DataFrame(
        query_compiler=FactoryDispatcher.from_map(func, iterable, *args, **kwargs)
    )


def to_pandas(modin_obj: SupportsPublicToPandas) -> DataFrame | Series:
    """
    Convert a Modin DataFrame/Series to a pandas DataFrame/Series.

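The public `from_map` wrapper in this hunk forwards its positional and keyword arguments to the dispatcher, which in turn passes them on to `func` for every element. A minimal sketch of that forwarding pattern follows. `FakeDispatcher` and `scale` are hypothetical stand-ins introduced only for illustration, and results are collected in a plain list rather than a Modin DataFrame.

```python
class FakeDispatcher:
    """Hypothetical stand-in for FactoryDispatcher (not the Modin class)."""

    @classmethod
    def from_map(cls, func, iterable, *args, **kwargs):
        # Apply func to each element, passing the extra arguments through.
        return [func(obj, *args, **kwargs) for obj in iterable]


def from_map(func, iterable, *args, **kwargs):
    """Forward positional arguments with * and keyword arguments with **."""
    return FakeDispatcher.from_map(func, iterable, *args, **kwargs)


def scale(x, offset, factor=1):
    """Hypothetical user function taking one positional and one keyword extra."""
    return (x + offset) * factor


results = from_map(scale, [1, 2, 3], 10, factor=2)
```

Here `10` reaches `scale` as `offset` and `factor=2` arrives as a keyword, so each layer only has to unpack `*args` and `**kwargs` again when calling the next one.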
Review comment: I suggest using RayWrapper.deploy here.
Review comment: And corresponding wrappers for other engines.
Review comment: RayWrapper.deploy deploys a function that can return any object, but here we intentionally wrap the result in a pandas DataFrame if the user hasn't done so. I would leave the changes as is. What do you think?
Review comment: To reduce the likelihood of error, we need to either have all launch options in one place or use only one method. Launching functions tends to become more difficult because of additional parameters. A good example is resources=RayTaskCustomResources.get(), which is currently not taken into account here. We can move this function to engine_wrapper.py and call it inside RayWrapper.deploy using an additional parameter.
Review comment: Re-used *.deploy.
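The idea discussed in this thread, a single `deploy` entry point that optionally normalizes the task's result through an additional parameter, might look roughly like the sketch below. Everything here is an assumption for illustration: `EngineWrapper`, `_deploy_func` and the `result_type` parameter are hypothetical names, `concurrent.futures` replaces the real engine, and `list` plays the role that `pandas.DataFrame` plays in the PR. It is not the signature of `RayWrapper.deploy`.

```python
from concurrent.futures import ThreadPoolExecutor


def _deploy_func(func, args, kwargs, result_type):
    """Run func on the worker and optionally coerce its result."""
    result = func(*args, **kwargs)
    if result_type is not None and not isinstance(result, result_type):
        result = result_type(result)
    return result


class EngineWrapper:
    """Hypothetical engine wrapper with one place to launch tasks."""

    _executor = ThreadPoolExecutor(max_workers=2)

    @classmethod
    def deploy(cls, func, f_args=None, f_kwargs=None, result_type=None):
        args = () if f_args is None else tuple(f_args)
        kwargs = {} if f_kwargs is None else dict(f_kwargs)
        # A real engine would also apply its launch options (for example
        # custom resources) here, so that every caller picks them up.
        return cls._executor.submit(_deploy_func, func, args, kwargs, result_type)


# With the extra parameter the result is coerced; without it, it is left as is.
wrapped = EngineWrapper.deploy(range, f_args=(3,), result_type=list).result()
raw = EngineWrapper.deploy(range, f_args=(3,)).result()
```

Keeping the coercion behind an optional parameter lets existing callers of `deploy` keep receiving arbitrary objects, while a `from_map`-style caller can request a uniform result type without a separate remote function.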