-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#3043: allow pandas.DataFrame
to be passed into merge
and join
functions
#4324
Conversation
…ge function Signed-off-by: Anatoly Myachev <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #4324 +/- ##
==========================================
+ Coverage 86.91% 90.01% +3.09%
==========================================
Files 214 214
Lines 17367 17998 +631
==========================================
+ Hits 15095 16200 +1105
+ Misses 2272 1798 -474
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
pandas.DataFrame
to be passed into merge
and join
functions
@devin-petersohn @YarShev in Can we add this feature only for these functions, leaving some inconsistency for the rest of the binary operations? What do you think? This is beneficial at least as long as this conversion exists, but I hope that in the future we will get rid of it. However, if we add this feature now, users will rely on it and we will have to support it further. FYI: this case is being considered because it speeds up one of our workloads. |
return self.join( | ||
right, how=how, lsuffix=suffixes[0], rsuffix=suffixes[1], sort=sort | ||
) | ||
|
||
return self.__constructor__( | ||
query_compiler=self._query_compiler.merge( | ||
right._query_compiler, | ||
right._query_compiler | ||
if not isinstance(right, pandas.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not isinstance(right, pandas.DataFrame) | |
if isinstance(right, (pd.DataFrame, pd.Series)) |
@@ -1258,14 +1258,18 @@ def join( | |||
if on is not None: | |||
return self.__constructor__( | |||
query_compiler=self._query_compiler.join( | |||
other._query_compiler, | |||
other._query_compiler | |||
if not isinstance(other, pandas.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not isinstance(other, pandas.DataFrame) | |
if isinstance(other, (pd.DataFrame, pd.Series)) |
@@ -448,7 +448,8 @@ def merge(self, right, **kwargs): | |||
sort = kwargs.get("sort", False) | |||
|
|||
if how in ["left", "inner"] and left_index is False and right_index is False: | |||
right = right.to_pandas() | |||
if not isinstance(right, pandas.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- A non-negative condition is easier to read
- Checking that
right
is not a pandas.DataFrame does not ensure that it has a.to_pandas
method
if not isinstance(right, pandas.DataFrame): | |
if isinstance(right, BaseQueryCompiler): |
@@ -495,7 +496,8 @@ def join(self, right, **kwargs): | |||
sort = kwargs.get("sort", False) | |||
|
|||
if how in ["left", "inner"]: | |||
right = right.to_pandas() | |||
if not isinstance(right, pandas.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not isinstance(right, pandas.DataFrame): | |
if isinstance(right, BaseQueryCompiler): |
@@ -1387,19 +1391,25 @@ def merge( | |||
raise ValueError("Cannot merge a Series without a name") | |||
else: | |||
right = right.to_frame() | |||
if not isinstance(right, DataFrame): | |||
if not isinstance(right, (DataFrame, pandas.DataFrame)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we unconditionally converted right
to pd.DataFrame above, why do we want to check for pandas.DataFrame here?
if not isinstance(right, DataFrame): | ||
raise TypeError( | ||
f"Can only merge Series or DataFrame objects, a {type(right)} was passed" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the above check is not enough?
@@ -179,6 +180,21 @@ def test_merge(): | |||
pd.merge("Non-valid type", modin_df2) | |||
|
|||
|
|||
@pytest.mark.parametrize("data", test_data_values, ids=test_data_keys) | |||
def test_merge_pandas(data): | |||
modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has to be imported from modin.pandas.test.utils
modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data) | |
modin_df, pandas_df = create_test_dfs(data) |
|
||
def test_join_pandas(): | ||
data = test_data["int_data"] | ||
modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modin_df, pandas_df = pd.DataFrame(data), pandas.DataFrame(data) | |
modin_df, pandas_df = create_test_dfs(data) |
@anmyachev what's the status of this PR? |
I think @devin-petersohn was specifically against mixing Modin and pandas in that way, even though it's a "low hanging fruit" optimization in this case. I feel that this should be better addressed by implementing #4605 and allowing the end user to easily "force" Modin to wrap certain pandas dataframe in this specific "small df" QueryCompiler. |
@pyrito for now, this most likely will not be merged, @vnlitvinov is right. By the way, I have an interesting case, which may not be very relevant to this task, but still. This call works for pandas, but fails for modin because modin tries to initialize ray on the worker (if I remember correctly). In this case, the following change helps: import modin.pandas as pd
def agg_func(x):
return int(pd.Series.count(x) > 0)
res = some_df.pivot_table(..., agg_func=agg_func) |
@anmyachev this fails since the function is running on a ray worker - maybe we can wrap deployed functions with a function that just imports pandas as pd, so if the user tries to use pd inside an apply function (for example) it's scope will have pandas instead of Modin, and it'll work? |
The following example doesn't work. def agg_func(x):
return int(pandas.Series.count(x) > 0)
def wrapper(x):
import pandas as pd
return agg_func(x)
res = some_df.pivot_table(..., agg_func=wrapper) if I understand correctly, |
@anmyachev We could modify the scope and replace modin.pandas with pandas by modifying the globals dict (although I'm not sure if we want to get into this level of meta-programming?) |
We need to first decide on an approach. |
@RehanSD this approach is not completely right, you should create new dict for I would guess that:
|
Signed-off-by: Anatoly Myachev [email protected]
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date