Equivalent of dask.dataframe.from_map? #6492

Closed
zmbc opened this issue Aug 21, 2023 · 3 comments · Fixed by #7215
Labels
External: Pull requests and issues from people who do not regularly contribute to modin
new feature/request 💬: Requests and pull requests for new features
P2: Minor bugs or low-priority feature requests

Comments

@zmbc
Contributor

zmbc commented Aug 21, 2023

dask.dataframe has an experimental function called from_map. This allows the user to define a custom function for loading/creating partitions of a dataframe. Conceptually, I think this is similar to Registering Custom Functions, except that it allows a function to create a DataFrame, rather than operate on an existing DataFrame.

This seems like it would be very useful to have in Modin; I don't see any straightforward way to achieve this currently. Am I missing something? I know that it would need to have some additional complexity that Dask doesn't have to worry about, in order to be general across storage formats and allow 2D partitioning, but these don't seem like insurmountable issues.

Any reactions to the utility and/or feasibility of this feature?

@zmbc zmbc added new feature/request 💬 Requests and pull requests for new features Triage 🩹 Issues that need triage labels Aug 21, 2023
@anmyachev
Collaborator

@zmbc I think it's quite similar to what the from_partitions function does.

def from_partitions(
    partitions: list,
    axis: Optional[int],
    index: Optional[Axes] = None,
    columns: Optional[Axes] = None,
    row_lengths: Optional[list] = None,
    column_widths: Optional[list] = None,
) -> DataFrame:

Example:

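import pandas as pd
import ray
from modin.distributed.dataframe.pandas import from_partitions

ray.init(num_cpus=4)

# Put each pandas DataFrame into the Ray object store; from_partitions takes the resulting references.
partitions = [ray.put(pd.DataFrame([number])) for number in [0, 1, 2]]

# axis=0 means the list holds row partitions.
modin_df = from_partitions(partitions, axis=0)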

@zmbc
Contributor Author

zmbc commented Aug 22, 2023

Thanks! That's helpful, but it does require me to interact directly with the underlying engine (Ray in your example). That's kind of a pain when I want to write code that will work with any Modin engine.
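
To make the request concrete, the sketch below shows the shape of an engine-agnostic helper: the caller passes a function and its inputs and never touches Ray or any other engine. Everything here is an assumption for illustration. from_map_sketch is a hypothetical name, not a Modin API, and it uses a local thread pool and plain pandas in place of a distributed engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

import pandas as pd


def from_map_sketch(
    func: Callable[[int], pd.DataFrame], inputs: Iterable[int]
) -> pd.DataFrame:
    """Hypothetical helper (not a Modin API): one partition per input."""
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, so partitions stay ordered.
        partitions = list(pool.map(func, inputs))
    # Stack the partitions row-wise. A real implementation would keep them
    # distributed instead of concatenating them locally.
    return pd.concat(partitions, ignore_index=True)


def make_partition(number: int) -> pd.DataFrame:
    # Same toy partition as in the from_partitions example above.
    return pd.DataFrame([number])


df = from_map_sketch(make_partition, [0, 1, 2])
```

The executor stands in for whichever engine is configured; the calling code would not change if that engine were swapped.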

@anmyachev
Collaborator

> Thanks! That's helpful, but it does require me to interact directly with the underlying engine (Ray in your example). That's kind of a pain when I want to write code that will work with any Modin engine.

Indeed, that exposes too many low-level details to the end user. This functionality looks useful, and we can try to implement it. Thanks for the idea, @zmbc!

cc @modin-project/modin-core in case you have any more thoughts about that.

@anmyachev anmyachev added P2 Minor bugs or low-priority feature requests and removed Triage 🩹 Issues that need triage labels Aug 23, 2023
@vnlitvinov vnlitvinov added the External Pull requests and issues from people who do not regularly contribute to modin label Sep 20, 2023
YarShev added a commit to YarShev/modin that referenced this issue Apr 24, 2024
anmyachev pushed a commit that referenced this issue Apr 30, 2024
3 participants