Support user-defined aggregation and mapping functions #1960
So, I've been thinking about how to best approach this problem, and I think we could combine decorators with type annotations for better results. For example, an unannotated function receives its argument as a dt.Frame:

@udf
def myfn1(x):
    return ...

This is equivalent to

@udf
def myfn1(x: dt.Frame) -> dt.Frame:
    return ...

However, if you specify that the argument is an int, then the function will be applied element-wise to the column:

@udf
def plus_one(x: int):
    return x + 1

DT[:, plus_one(f.x)]

In this case we will also verify that column f.x is of integer type.

Similarly, if you declare that the argument is a List, the function will receive the entire column as a Python list:

@udf
def argmin(x: List):
    return min(range(len(x)), key=lambda i: x[i])
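Under this proposal a List-annotated udf would behave like datatable's built-in reducers. The snippet below is a small sketch of that: the decorated calls appear only as comments because the udf decorator does not exist yet, and the runnable part computes the same result with today's public API.

import datatable as dt

DT = dt.Frame(grp=[1, 1, 2], x=[3.0, 1.0, 5.0])

# Proposed (hypothetical) usage, mirroring built-in reducers such as dt.min:
#   DT[:, argmin(f.x)]             -> index of the minimum over the whole column
#   DT[:, argmin(f.x), by(f.grp)]  -> one index per group
# Today the equivalent result has to be computed by pulling the column out:
col = DT[:, "x"].to_list()[0]
print(min(range(len(col)), key=lambda i: col[i]))   # -> 1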
Your function may also work with a pandas DataFrame, or a Series, or a numpy array:

@udf
def myfn_using_pandas(x: pd.Series):
    return ...

All of these user-defined functions can, of course, take multiple arguments, or even var-args:

@udf
def rowsum(*x: float, skipna=True):
    if skipna:
        return sum(v for v in x if v is not None)
    elif any(v is None for v in x):
        return None
    else:
        return sum(x)

Implementation

The udf class itself could look something like this:

class udf:
    def __init__(self, fn):
        self._fn = fn
        # signature may contain types unknown to datatable, but in that case
        # those arguments cannot be bound to dt expressions
        self._sig = _resolve_signature(fn)

    def __call__(self, *args, **kwds):
        return bound_udf(self, args, kwds)

The DT[i,j] evaluator would then see a bound_udf object among its j expressions, resolve its arguments against the frame's columns, convert them into the types requested by the annotations, and invoke the wrapped function. Finally, the results of all these evaluations will be row-bound together into a single column or frame. Thus, a udf is allowed to return the same variety of arguments as it received: int, float, str, list, np.array, pd.DataFrame, or a dt.Frame.
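To make the mechanics above more concrete, here is a minimal, self-contained prototype written against datatable's public API only. Everything in it is an illustrative assumption rather than the author's actual design: the simplified udf and bound_udf classes, the apply_udfs helper standing in for the DT[i,j] evaluator, and the use of plain column-name strings instead of f-expressions.

import inspect
from typing import List

import datatable as dt


class udf:
    """Wrap a user function and remember its signature (stands in for _resolve_signature)."""
    def __init__(self, fn):
        self._fn = fn
        self._sig = inspect.signature(fn)

    def __call__(self, *cols):
        return bound_udf(self, cols)


class bound_udf:
    """A user function bound to particular columns, evaluated later against a Frame."""
    def __init__(self, ufn, cols):
        self.ufn = ufn
        self.cols = cols    # column names (the real proposal would bind f-expressions)

    def evaluate(self, frame):
        fn = self.ufn._fn
        params = list(self.ufn._sig.parameters.values())
        ann = params[0].annotation if params else inspect.Parameter.empty
        data = [frame[:, c].to_list()[0] for c in self.cols]
        if ann in (int, float, str):
            # scalar annotation: apply element-wise (a mapping)
            return dt.Frame([fn(*row) for row in zip(*data)])
        if ann in (List, list):
            # list annotation: pass whole columns at once (an aggregation)
            return dt.Frame([fn(*data)])
        # default / dt.Frame annotation: hand over the selected columns as a Frame
        return fn(frame[:, list(self.cols)])


def apply_udfs(frame, *exprs):
    """Evaluate each bound udf against `frame` and cbind the resulting columns."""
    return dt.cbind([e.evaluate(frame) for e in exprs])


@udf
def plus_one(x: int):
    return x + 1

@udf
def argmin(x: List):
    return min(range(len(x)), key=lambda i: x[i])


DT = dt.Frame(x=[5, 2, 9])
print(apply_udfs(DT, plus_one("x")))   # element-wise mapping: 6, 3, 10
print(apply_udfs(DT, argmin("x")))     # aggregation: 1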
Looks really good. I especially like the usage of type hints to avoid having to specify the function type, such as mapping or aggregation.
A feature which is often used in pandas is apply (or aggregate, or transform), which basically allows doing a mapping, an aggregation, or even a partial reduction operation. PySpark introduced a way to define user-defined operations for groupby and select operations:

SCALAR: df.transform(...)
GROUPED_MAP: df.apply(...)
GROUPED_AGG: df.aggregate(...)
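For reference, this is roughly how those three modes look in PySpark's pre-3.0 pandas_udf API (a generic illustration of that API, not code taken from this thread):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # element-wise mapping over a pandas Series
    return v + 1

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # frame-in, frame-out transformation of each group (a pandas DataFrame)
    return pdf.assign(v=pdf.v - pdf.v.mean())

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    # one scalar per group
    return v.mean()

df.select(plus_one(df.v)).show()               # SCALAR
df.groupby("id").apply(subtract_mean).show()   # GROUPED_MAP
df.groupby("id").agg(mean_udf(df.v)).show()    # GROUPED_AGG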
I would love to see something like this in Datatable. Maybe it would be possible to have a udf decorator such as the one sketched below.
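Purely as an illustration of how those three modes might carry over (the udf decorator, its interaction with by(), and all function names here are hypothetical assumptions, not an existing datatable API; a no-op stand-in decorator is defined so the snippet at least runs):

from typing import List
import datatable as dt

def udf(fn):
    # stand-in for the proposed decorator; does nothing yet
    return fn

@udf
def plus_one(x: int):                 # scalar annotation -> element-wise mapping (~ SCALAR)
    return x + 1

@udf
def demean(x: dt.Frame) -> dt.Frame:  # Frame annotation -> frame-in, frame-out (~ GROUPED_MAP)
    col = x.to_list()[0]
    m = sum(col) / len(col)
    return dt.Frame([v - m for v in col])

@udf
def col_mean(x: List):                # List annotation -> aggregation (~ GROUPED_AGG)
    return sum(x) / len(x)

# Hypothetical usage once the proposal is implemented:
#   DT[:, plus_one(f.v)]              # like df.transform(...)
#   DT[:, demean(f.v), by(f.id)]      # like df.apply(...)
#   DT[:, col_mean(f.v), by(f.id)]    # like df.aggregate(...)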