-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Does the Fold operator allow a function to change the shape of partitions? #6498
Comments
By the way, the equivalent functionality in Dask Dataframe is Here is an example of how the code above can work with Dask: if __name__ == "__main__":
from distributed import Client
client = Client(n_workers=2, threads_per_worker=1)
import pandas
import dask.dataframe as dd
dask_df = dd.from_pandas(
pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}), npartitions=2
)
filtered_df = dask_df.map_partitions(lambda df: df[df.index % 2 == 0])
print(filtered_df.compute()) |
I found this workaround, though it uses private, undocumented attributes: import pandas
import modin.pandas as pd
modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))
from modin.core.storage_formats import PandasQueryCompiler
filtered_df = pd.DataFrame(
query_compiler=PandasQueryCompiler(
modin_df._query_compiler._modin_frame.apply_full_axis(
axis=1, func=lambda df: df[df.index % 2 == 0]
)
)
)
print(filtered_df) Note: I am of course aware that none of this necessary to get only the even-numbered rows of a dataframe, that is just a stand-in for an arbitrary computation that maps DataFrame -> DataFrame with a different shape. |
Hi @zmbc! Good observation. This doesn't work because we are unconditionally copying the index metadata, relying on the shape of the result being unchanged. If they are not copied, then your example with the Fold operator starts working. modin/modin/core/dataframe/pandas/dataframe/dataframe.py Lines 2118 to 2124 in 29d9da0
I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied. @dchigarev what do you think? |
Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change? I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases. |
Looks like that.
Agree. |
It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module? |
Could you write in more detail? It seems to me that it is possible to implement operators |
|
@YarShev the |
@YarShev regarding
I still think this approach might be acceptable here. |
We have a note in the docstring: "The data shape is not changed (length and width of the table).". However, I agree with you and we can expand Fold operator to be more flexible. |
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Signed-off-by: Igoshev, Iaroslav <[email protected]>
Signed-off-by: Igoshev, Iaroslav <[email protected]> Co-authored-by: Anatoly Myachev <[email protected]>
On the Operators Module Description page, the Map operator has a note that "map function should not change the shape of the partitions."
The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:
fails with
IndexError: positional indexers are out-of-bounds
.Is this intentional? If so, what would be the recommended way to do a fold-like operation, applying a function that requires knowledge of an entire axis, and creating a DataFrame from the resulting partitions?
Note that in this example I am only changing the shape along the axis I am folding on, so it's impossible for this to cause an illogical outcome such as some rows having more columns than others. I don't see a reason why the opposite shouldn't be allowed as well, as long as it doesn't result in such an illogical outcome.
The text was updated successfully, but these errors were encountered: