-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#6301: Simplify usage of algebra operators to define custom functions #6302
base: master
Are you sure you want to change the base?
Changes from 7 commits
990ec17
ac9732e
5899084
e9f5dcd
ee354fd
884224e
96089da
d65387e
3f3cc5b
52eeecc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -297,7 +297,7 @@ def register( | |
""" | ||
|
||
def caller( | ||
query_compiler, other, broadcast=False, *args, dtypes=None, **kwargs | ||
query_compiler, other, *args, broadcast=False, dtypes=None, **kwargs | ||
): | ||
""" | ||
Apply binary `func` to passed operands. | ||
|
@@ -413,3 +413,47 @@ def caller( | |
) | ||
|
||
return caller | ||
|
||
@classmethod | ||
def apply( | ||
cls, left, right, func, axis=0, func_args=None, func_kwargs=None, **kwargs | ||
): | ||
r""" | ||
Apply a binary function row-wise or column-wise. | ||
|
||
Parameters | ||
---------- | ||
left : modin.pandas.DataFrame or modin.pandas.Series | ||
Left operand. | ||
right : modin.pandas.DataFrame or modin.pandas.Series | ||
Right operand. | ||
Comment on lines
+427
to
+430
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This layer depends on the upper layer in every apply. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The lower layer still takes the object(s) from the API layer, namely, modin.pandas.DataFrame/Series. Then, there is a |
||
func : callable(pandas.DataFrame, pandas.DataFrame, \*args, axis, \*\*kwargs) -> pandas.DataFrame | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the signature is wrong here, as the implementation explicitly passes Query Compilers as arguments... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nope, follow the track of the
|
||
A binary function to apply `left` and `right`. | ||
axis : int, default: 0 | ||
Whether to apply the function across rows (``axis=0``) or across columns (``axis=1``). | ||
func_args : tuple, optional | ||
Positional arguments to pass to the funcs. | ||
func_kwargs : dict, optional | ||
Keyword arguments to pass to the funcs. | ||
**kwargs : dict | ||
Additional arguments to pass to the ``cls.register()``. | ||
|
||
Returns | ||
------- | ||
The same type as `df`. | ||
""" | ||
from modin.pandas import Series | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really like that inner layers depend on upper layers. I do not see any benefit of introducing these changes rather than it will simplify registriation of a UDF for users, which doesn't happen very often from my point of view. I would like to bring more attention to these changes to decide whether we want these changes to be merged or not. cc @modin-project/modin-core There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with @YarShev. I'd rather avoid the dependency on a higher layer. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe we can make the # modin/pandas/utils.py
def apply_operator(df : modin.pandas.DataFrame, operator_cls : type[Operator], func, *args, **kwargs):
res_qc = operator_cls.register(func)(df._query_compiler, *args, **kwargs)
return type(df)(query_compiler=res_qc)
# the usage then would be:
from modin.pandas.utils import apply_operator
from modin.core.dataframe.algebra import Reduce
res_df = apply_operator(df, Reduce, func=reduce_func, axis=1) One of the problem I see here is that managing operators-dependent behavior can be quite a pain here, as we can no longer use OOP mechanisms to align with operator-specific logic using inheritance and overriding: # modin/pandas/utils.py
def _apply_reduce_op(...):
...
def _apply_groupby_op(...):
...
...
_operators_dict = {
Reduce: _apply_reduce_op,
GroupbyReduce: _apply_groupby_op,
...
}
def apply_operator(df : modin.pandas.DataFrame, operator_cls : type[Operator], func, *args, **kwargs):
return _operators_dict[operator_cls](df, func, *args, **kwargs) Do you have any other suggestion on how to improve this approach (or maybe you have another approach in mind)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dchigarev to me the current approach for using the operators doesn't look very verbose to me. will this PR make it much easier to use the operators anywhere? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The PR should make the usage of operators much easier for end-users of modin who would like to define their own distributed functions using modin's operators. While optimizing customer workloads for modin we sometimes see places that would perform much better if rewritten from pandas API using modin's operators, however the present API the operators provide causes a lot of complex code written around that customers struggle to understand. That inspired us to create a simple method/function that makes operator's usage as simple as calling one single function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
So I made two versions of how we can eliminate this dependency.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mvashishtha @YarShev thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dchigarev I wonder if you could provide an example of user-defined Modin operator (ideally a real case, even if simplified and anonymized) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What we usually use the user-defined operators for is to emulate lazy execution for those types of transformations that can't be written into a one or two pandas API calls (usually such transformations are performed in a for-loop). As an example, consider that we have a dataframe with multiple string columns, and then we want to combine those columns into a single one using the specified separator. Surprisingly, but the fastest way to do so using vanilla pandas is simply writing a for-loop:
pandas codeimport pandas as pd
import numpy as np
from timeit import default_timer as timer
NCOLS = 16
NROWS = 5_000_000
df = pd.DataFrame({f"col{i}": [f"col{i}-{j}" for j in range(NROWS)] for i in range(NCOLS)})
t1 = timer()
res = df.iloc[:, 0]
for col in df.columns[1:]:
res += "_" + df[col]
print(f"for-loop: {timer() - t1}")
t1 = timer()
res = df.apply(lambda row: "_".join(row), axis=1)
print(f"df.apply(join): {timer() - t1}")
t1 = timer()
res = df.iloc[:, 0].str.cat(df.iloc[:, 1:], sep="_")
print(f"df.str.cat(): {timer() - t1}") Then when adapting this code to modin, it appears that the for-loop approach works very slow due to a lot of kernels being submitted to Ray and so causing it to overwhelm (each iteration of the for-loop will result into 3 separate kernels: 1.
modin codeimport modin.pandas as pd
import modin.config as cfg
import numpy as np
from timeit import default_timer as timer
cfg.BenchmarkMode.put(True)
# start all the workers
pd.DataFrame([np.arange(cfg.MinPartitionSize.get()) for _ in range(cfg.NPartitions.get() ** 2)]).to_numpy()
NCOLS = 16
NROWS = 5_000_000
df = pd.DataFrame({f"col{i}": [f"col{i}-{j}" for j in range(NROWS)] for i in range(NCOLS)})
from modin.core.dataframe.algebra import Reduce
def reduction(df):
res = df.iloc[:, 0]
for col in df.columns[1:]:
res += "_" + df[col]
return res
t1 = timer()
res = Reduce.apply(df, reduction, axis=1)
print(f"reduction operator: {timer() - t1}")
from modin.experimental.batch import PandasQueryPipeline
t1 = timer()
pipeline = PandasQueryPipeline(df)
pipeline.add_query(reduction, is_output=True)
res = pipeline.compute_batch()
print(f"batch pipeline API: {timer() - t1}")
t1 = timer()
res = df.iloc[:, 0]
for col in df.columns[1:]:
res += "_" + df[col]
print(f"for-loop: {timer() - t1}")
t1 = timer()
res = df.apply(lambda row: "_".join(row), axis=1)
print(f"df.apply(join): {timer() - t1}")
t1 = timer()
res = df.iloc[:, 0].str.cat(df.iloc[:, 1:], sep="_")
print(f"df.str.cat(): {timer() - t1}") (as I was writing this comment, I found out about the batch API in modin that is supposed to serve exactly the same purpose of "emulating" the lazy execution. However, it seems that it doesn't provide a way to specify the scheme on how the kernels actually should be submitted (map, row-wise, column-wise, ...) and also have some slight overhead when comparing with the pure user-defined operator's approach) |
||
|
||
operator = cls.register(func, **kwargs) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder what is the purpose of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unfortunately, that's the only way of how we can get the |
||
|
||
func_args = tuple() if func_args is None else func_args | ||
func_kwargs = dict() if func_kwargs is None else func_kwargs | ||
qc_result = operator( | ||
left._query_compiler, | ||
right._query_compiler, | ||
broadcast=isinstance(right, Series), | ||
*func_args, | ||
axis=axis, | ||
**func_kwargs, | ||
dchigarev marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
return type(left)(query_compiler=qc_result) | ||
dchigarev marked this conversation as resolved.
Show resolved
Hide resolved
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the order so the function's positional arguments won't conflict with the keyword
broadcast
arg.