Easier Dataframe API for map
#11546
Comments
take |
@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it? The core concept is providing an expression function and wrapping Maybe we can rename |
My concern is that it might be slower because of additional |
I see. I think I can do some benchmarks for them. Because I have some concerns mentioned in #11452 (comment) for changing the |
I think in this case we should adjust the coercion rule with |
Ideally MapFunc should have the arguments that have minimum transformation and computation cost for creating MapArray, so we can get the most efficient implementation.
|
I followed #11526 to create another implementation for
I ran the benchmark many times, and each time I got similar results. Referring to the result, I think we can just use the original design here. What do you think? By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal. 😢
|
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_udf(),
        vec![keys, values],
    ))
}
pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_one_udf(),
        vec![keys, values],
    ))
}
It seems they both compute
pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let mut args = keys;
    args.extend(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_one_udf(),
        args,
    ))
}
|
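To make the difference between the two call shapes above concrete, here is a minimal, self-contained sketch. It does not use DataFusion: `ToyExpr`, `toy_make_array`, `map_wrapped`, `map_flat`, and `arg_count` are hypothetical stand-ins invented for illustration, and the sketch only models the shape of the argument list, not any runtime cost.

```rust
// Toy stand-in for an expression tree. This is NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Literal(String),
    Call { name: &'static str, args: Vec<ToyExpr> },
}

// Hypothetical helper to build a literal node.
fn lit(s: &str) -> ToyExpr {
    ToyExpr::Literal(s.to_string())
}

// Hypothetical helper playing the role that `make_array` plays in the thread.
fn toy_make_array(items: Vec<ToyExpr>) -> ToyExpr {
    ToyExpr::Call { name: "make_array", args: items }
}

// Shape 1: keys and values are each wrapped in an array call,
// so the map call always receives exactly two arguments.
fn map_wrapped(keys: Vec<ToyExpr>, values: Vec<ToyExpr>) -> ToyExpr {
    ToyExpr::Call {
        name: "map",
        args: vec![toy_make_array(keys), toy_make_array(values)],
    }
}

// Shape 2: keys and values are concatenated into one flat argument list,
// keys first, values second.
fn map_flat(keys: Vec<ToyExpr>, values: Vec<ToyExpr>) -> ToyExpr {
    let mut args = keys;
    args.extend(values);
    ToyExpr::Call { name: "map_flat", args }
}

// Number of direct arguments of a call node (0 for a literal).
fn arg_count(expr: &ToyExpr) -> usize {
    match expr {
        ToyExpr::Call { args, .. } => args.len(),
        ToyExpr::Literal(_) => 0,
    }
}
```

With two keys and two values, `map_wrapped` yields a call with two array arguments, while `map_flat` yields a call with four scalar arguments; the receiving function then has to recover the key/value boundary itself.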
I'm not sure whether it is expected, I guess because |
It is not the case for running |
Oops... Sorry about that. I forgot to remove this. I will provide another benchmark result. Many thanks. |
Here is the benchmark result after removing
I think the result is really bad but I tried to understand why datafusion/datafusion/functions-array/src/make_array.rs Lines 102 to 104 in 5da7ab3
I will try to use this way to modify the two version and give another benchmark. |
Ok, I think it's getting worse.
I also tried to remove
Just pass an args vector to Actually, I found that In conclusion, the original design (using |
In theory, I didn't expect this but I don't understand why. We can move on with |
functions-nested? For array, struct, map |
It looks good, but I think we can have another PR for it. It's also related to changing the name of |
Dataframe API for map expects us to pass args with make_array, i.e.
I think we could have an easier one without make_array.
To achieve this, we may need to change the arguments of MapFunc from two arrays to Vec<Expr>, in which the first half are keys and the other half are values.
Originally posted by @jayzhan211 in #11452 (comment)
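The "first half keys, second half values" convention proposed here could be sketched as below. This is only an illustration under stated assumptions: `split_keys_values` is a hypothetical helper, it is generic over the element type rather than tied to DataFusion's `Expr`, and rejecting odd-length input is a choice made for this sketch, not something the issue specifies.

```rust
// Hypothetical helper: split a flat argument list into (keys, values),
// assuming the first half holds keys and the second half holds values.
fn split_keys_values<T>(mut args: Vec<T>) -> Result<(Vec<T>, Vec<T>), String> {
    // An odd number of arguments cannot be paired up.
    if args.len() % 2 != 0 {
        return Err(format!(
            "expected an even number of arguments, got {}",
            args.len()
        ));
    }
    // `split_off(mid)` leaves `[0, mid)` in `args` and returns `[mid, len)`.
    let values = args.split_off(args.len() / 2);
    Ok((args, values))
}
```

For example, `split_keys_values(vec!["a", "b", "1", "2"])` separates the keys `a`, `b` from the values `1`, `2`, while a three-element input returns an error.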
The Dataframe API is something used for building Expr.
Most of its functions are written with a macro if they share a similar pattern; others are individual functions, like count_distinct.
The idea of map is similar to
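The split between macro-generated and hand-written expression builders described above can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `DemoExpr`, the `demo_unary_fns!` macro, and the `demo_*` functions are invented for illustration and are not DataFusion's actual macros or functions.

```rust
// Toy expression type. This is NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum DemoExpr {
    Column(String),
    Call { name: String, args: Vec<DemoExpr> },
}

// Builders that share one pattern (a single argument, a call node named
// after the function) can be generated by a macro.
macro_rules! demo_unary_fns {
    ($($fname:ident),* $(,)?) => {
        $(
            fn $fname(arg: DemoExpr) -> DemoExpr {
                DemoExpr::Call {
                    name: stringify!($fname).to_string(),
                    args: vec![arg],
                }
            }
        )*
    };
}

demo_unary_fns!(demo_abs, demo_sqrt);

// A builder whose signature does not fit the shared pattern is written
// by hand instead; here it takes two lists rather than one argument.
fn demo_map(keys: Vec<DemoExpr>, values: Vec<DemoExpr>) -> DemoExpr {
    let mut args = keys;
    args.extend(values);
    DemoExpr::Call { name: "demo_map".to_string(), args }
}
```

The macro covers the uniform cases cheaply, while a function such as `demo_map` stays individual because its parameters differ from the common shape.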