Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does the Fold operator allow a function to change the shape of partitions? #6498

Closed
zmbc opened this issue Aug 22, 2023 · 11 comments · Fixed by #7257
Closed

Does the Fold operator allow a function to change the shape of partitions? #6498

zmbc opened this issue Aug 22, 2023 · 11 comments · Fixed by #7257
Labels
External Pull requests and issues from people who do not regularly contribute to modin Internals Internal modin functionality new feature/request 💬 Requests and pull requests for new features P2 Minor bugs or low-priority feature requests question ❓ Questions about Modin

Comments

@zmbc
Copy link
Contributor

zmbc commented Aug 22, 2023

On the Operators Module Description page, the Map operator has a note that "map function should not change the shape of the partitions."

The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:

import pandas
import modin.pandas as pd

modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))

from modin.core.storage_formats import PandasQueryCompiler
from modin.core.dataframe.algebra import Fold

PandasQueryCompiler.filter_func = Fold.register(lambda df: df[df.index % 2 == 0])


def filter_modin_dataframe(df):
    return df.__constructor__(
        query_compiler=df._query_compiler.filter_func(
            fold_axis=1,
        )
    )


pd.DataFrame.filter_dataframe = filter_modin_dataframe

filtered_df = modin_df.filter_dataframe()

print(filtered_df)

fails with IndexError: positional indexers are out-of-bounds.

Is this intentional? If so, what would be the recommended way to do a fold-like operation, applying a function that requires knowledge of an entire axis, and creating a DataFrame from the resulting partitions?

Note that in this example I am only changing the shape along the axis I am folding on, so it's impossible for this to cause an illogical outcome such as some rows having more columns than others. I don't see a reason why the opposite shouldn't be allowed as well, as long as it doesn't result in such an illogical outcome.

@zmbc zmbc added question ❓ Questions about Modin Triage 🩹 Issues that need triage labels Aug 22, 2023
@zmbc
Copy link
Contributor Author

zmbc commented Aug 22, 2023

By the way, the equivalent functionality in Dask Dataframe is .map_partitions, which because Dask has 1D partitions corresponds to a Modin Fold with fold_axis=1. It says in the Dask docs that "the index and divisions are assumed to remain unchanged" but it doesn't actually stop you from changing them.

Here is an example of how the code above can work with Dask:

if __name__ == "__main__":
    from distributed import Client

    client = Client(n_workers=2, threads_per_worker=1)

    import pandas
    import dask.dataframe as dd

    dask_df = dd.from_pandas(
        pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}), npartitions=2
    )

    filtered_df = dask_df.map_partitions(lambda df: df[df.index % 2 == 0])

    print(filtered_df.compute())

@zmbc
Copy link
Contributor Author

zmbc commented Aug 22, 2023

I found this workaround, though it uses private, undocumented attributes:

import pandas
import modin.pandas as pd

modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))

from modin.core.storage_formats import PandasQueryCompiler

filtered_df = pd.DataFrame(
    query_compiler=PandasQueryCompiler(
        modin_df._query_compiler._modin_frame.apply_full_axis(
            axis=1, func=lambda df: df[df.index % 2 == 0]
        )
    )
)

print(filtered_df)

Note: I am of course aware that none of this necessary to get only the even-numbered rows of a dataframe, that is just a stand-in for an arbitrary computation that maps DataFrame -> DataFrame with a different shape.

@anmyachev
Copy link
Collaborator

The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:

Hi @zmbc! Good observation.

This doesn't work because we are unconditionally copying the index metadata, relying on the shape of the result being unchanged. If they are not copied, then your example with the Fold operator starts working.

return self.__constructor__(
new_partitions,
self.copy_index_cache(),
self.copy_columns_cache(),
self._row_lengths_cache,
self._column_widths_cache,
)

I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied. @dchigarev what do you think?

@anmyachev anmyachev removed the Triage 🩹 Issues that need triage label Aug 23, 2023
@zmbc
Copy link
Contributor Author

zmbc commented Aug 23, 2023

Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change?

I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases.

@anmyachev
Copy link
Collaborator

Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change?

Looks like that.

I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases.

Agree.

@anmyachev anmyachev added new feature/request 💬 Requests and pull requests for new features Internals Internal modin functionality P2 Minor bugs or low-priority feature requests labels Aug 24, 2023
@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 15, 2024
@YarShev
Copy link
Collaborator

YarShev commented Apr 24, 2024

It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module?

@anmyachev
Copy link
Collaborator

It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module?

Could you write in more detail? It seems to me that it is possible to implement operators Map and Fold more flexible, as @zmbc wrote about.

@YarShev
Copy link
Collaborator

YarShev commented Apr 25, 2024

PandasQueryCompiler.filter_func = Fold.register(lambda df: df[df.index % 2 == 0]) is a normal filter operator. We have filter method at the Modin Dataframe layer but we don't have it in algebra module so we can add it there. As to Map and Fold operators, I wouldn't like to change their definition since they have already been documented well in different articles/papers.

@zmbc
Copy link
Contributor Author

zmbc commented May 6, 2024

@YarShev the df[df.index % 2 == 0] was just a toy example. In my actual use case, I want to do an operation that filters and changes values and adds/drops columns and changes dtypes.

@anmyachev
Copy link
Collaborator

As to Map and Fold operators, I wouldn't like to change their definition since they have already been documented well in different articles/papers.

@YarShev regarding Fold operator, at the algebra level we did not mention anywhere that it should preserve the shape of a dataframe (I didn’t find this in Devin’s articles either). Therefore, by allowing it to change shape, we give users flexibility and do not contradict what we have already written.

I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied.

I still think this approach might be acceptable here.

@YarShev
Copy link
Collaborator

YarShev commented May 12, 2024

We have a note in the docstring: "The data shape is not changed (length and width of the table).". However, I agree with you and we can expand Fold operator to be more flexible.

YarShev added a commit to YarShev/modin that referenced this issue May 13, 2024
YarShev added a commit to YarShev/modin that referenced this issue May 13, 2024
YarShev added a commit to YarShev/modin that referenced this issue May 14, 2024
anmyachev added a commit that referenced this issue May 14, 2024
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Co-authored-by: Anatoly Myachev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
External Pull requests and issues from people who do not regularly contribute to modin Internals Internal modin functionality new feature/request 💬 Requests and pull requests for new features P2 Minor bugs or low-priority feature requests question ❓ Questions about Modin
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants