Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easier Dataframe API for map #11546

Closed
Tracked by #11429
jayzhan211 opened this issue Jul 19, 2024 · 17 comments · Fixed by #11560
Closed
Tracked by #11429

Easier Dataframe API for map #11546

jayzhan211 opened this issue Jul 19, 2024 · 17 comments · Fixed by #11560
Assignees

Comments

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 19, 2024

Dataframe API for map expects us to pass args with make_array

i.e.

map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])])

I think we could have easier one with without make_array

map(vec![lit("a"), lit("b")], vec![lit("1"), lit("2")]])

To achieve this we may need to change the arguments of MapFunc from two array to Vec<Expr>, which the first half are keys, another half are values.

Originally posted by @jayzhan211 in #11452 (comment)

Dataframe API is somthing used for building Expr

Most of them are written in macro if they have similar pattern, others are individual function, like count_distinct

pub fn count_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        count_udaf(),
        vec![expr],
        true,
        None,
        None,
        None,
    ))
}

The idea of map is similar to

pub fn map(keys:Vec<Expr>,values:Vec<Expr>) -> Expr {
    let args: Vec<Expr> = concat keys and values
    Expr::ScalarFunction(datafusion_expr::expr::ScalarFunction::new_udf(
        map_udf(),
        vec![args],

    ))
}
@goldmedal
Copy link
Contributor

take

@goldmedal
Copy link
Contributor

@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it?

The core concept is providing an expression function and wrapping make_array for the arguments. I think we don't need to change the argument format of MapFunc. However, we need to move map.rs to functions-array because it needs make_array.

Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.

@jayzhan211
Copy link
Contributor Author

@jayzhan211 I have drafted a PR #11560 for it. The design differs from your proposal, but it makes sense to me. Could you take a look at it?

The core concept is providing an expression function and wrapping make_array for the arguments. I think we don't need to change the argument format of MapFunc. However, we need to move map.rs to functions-array because it needs make_array.

Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.

My concern is that it might be slower because of additional make_array() computation.

@goldmedal
Copy link
Contributor

My concern is that it might be slower because of additional make_array() computation.

I see. I think I can do some benchmarks for them.

Because I have some concerns mentioned in #11452 (comment) for changing the MapFunc, I prefer to keep the original design.

@jayzhan211
Copy link
Contributor Author

I have some concerns for it. If we make MapFunc to accept one array, it would be used like
SELECT map([1,2,3,'a','b','c'])
After planning, the input array would be ['1','2','3','a','b','c'] because of the type coercion for array elements. I think the behavior is wrong. If we change the signature of MapFunc, we might need to have another implementation to solve it.

I think in this case we should adjust the coercion rule with coerce_types for MapFunc

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 20, 2024

Ideally MapFunc should have the arguments that have minimum transformation and computation cost for creating MapArray, so we can get the most efficient implementation.

  1. k1, v1, k2, v2... is not ideal because we need to arrange it to k1, k2, ... v1, v2 ..., that is why initial MakeMap is beaten by MapFunc.
  2. current implementation has the additional cost of make_array transformation, I guess we could do better without it.

@goldmedal
Copy link
Contributor

My concern is that it might be slower because of additional make_array() computation.

I followed #11526 to create another implementation for map_func, called map_one_func temporarily. I did some benchmarks and found that there are no obvious differences between the performance of the map and map_one functions. I also pushed the updated commits in #11560. You could check it.
The benchmark results: (map is the original design and map_one is the new design)

Gnuplot not found, using plotters backend
map_1000                time:   [9.5816 ms 9.6505 ms 9.7221 ms]
                        change: [-3.3395% -1.7212% -0.4746%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

map_one_1000            time:   [9.7250 ms 9.7995 ms 9.8764 ms]
                        change: [-3.4189% -0.6339% +1.3546%] (p = 0.68 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

I ran the benchmark many times, and each time I got similar results. Referring to the result, I think we can just use the original design here. What do you think?

By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal. 😢

 jax: ~/git/datafusion/datafusion/core (feature/11546-map-df-api) $ cargo bench --bench map_query_sql
   Compiling datafusion-functions v40.0.0 (/Users/jax/git/datafusion/datafusion/functions)
   Compiling datafusion-functions-array v40.0.0 (/Users/jax/git/datafusion/datafusion/functions-array)
   Compiling datafusion v40.0.0 (/Users/jax/git/datafusion/datafusion/core)
    Finished `bench` profile [optimized] target(s) in 8m 53s

@jayzhan211
Copy link
Contributor Author

pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_udf(),
        vec![keys, values],
    ))
}

pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let keys = make_array(keys);
    let values = make_array(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_one_udf(),
        vec![keys, values],
    ))
}

It seems they both compute make_array, but I think we can avoid make_array at all.

pub fn map_from_array(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let mut args = keys;
    args.extend(values);
    Expr::ScalarFunction(ScalarFunction::new_udf(
        map_one_udf(),
        args,
    ))
}

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 20, 2024

By the way, I found that the compile time to run the benchmark in the core is very long. It takes about 9 minutes. I'm not sure if that's normal.

I'm not sure whether it is expected, I guess because core crate has almost all the dependencies therefore many crates to re-complie. I think we could file an issue to track this

@jayzhan211
Copy link
Contributor Author

I'm not sure whether it is expected, I guess because core crate has almost all the dependencies therefore many crates to re-complie.

It it not the case for running cargo build in ~/arrow-datafusion 😕

@goldmedal
Copy link
Contributor

    let mut args = keys;
    args.extend(values);

Oops... Sorry about that. I forgot to remove this. I will provide another benchmark result. Many thanks.

@goldmedal
Copy link
Contributor

Here is the benchmark result after removing make_array ( I also pushed a new commit to the draft PR):

map_1000                time:   [10.105 ms 10.168 ms 10.233 ms]
                        change: [+0.0989% +1.5780% +2.8979%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

map_one_1000            time:   [44.081 ms 45.278 ms 46.808 ms]
                        change: [+1.8229% +4.9320% +8.3942%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

I think the result is really bad but I tried to understand why make_array is efficient. I noticed it uses make_scalar_function to handle the ColumnarValue. I guess it could be more efficient than ScalarValue::into_array.

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
make_scalar_function(make_array_inner)(args)
}

I will try to use this way to modify the two version and give another benchmark.

@goldmedal
Copy link
Contributor

Ok, I think it's getting worse.

Gnuplot not found, using plotters backend
map_1000                time:   [9.1289 ms 9.1897 ms 9.2537 ms]
                        change: [-0.7915% +0.1565% +1.1967%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking map_one_1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, or reduce sample count to 70.
map_one_1000            time:   [62.787 ms 63.239 ms 63.732 ms]
                        change: [-2.6609% -1.0620% +0.3899%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)

I also tried to remove

    let mut args = keys;
    args.extend(values);

Just pass an args vector to map_from_array, but it's still slower. I pushed this version to a different branch: https://github.com/goldmedal/datafusion/blob/feature/11546-map-df-api-v4/datafusion/functions/src/core/map.rs
If you're interested, you can check it out.

Actually, I found that make_scalar_function uses ColumnarValue::values_to_arrays, so I need to use make_array_inner to aggregate the primitive arrays.

In conclusion, the original design (using make_array) is the fastest.

@jayzhan211

This comment was marked as outdated.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 21, 2024

In theory, I didn't expect this but I don't understand why. We can move on with make_array approach first.

@jayzhan211
Copy link
Contributor Author

Maybe we can rename functions-array to functions-collection? I think there are many implementations that could be shared if we want to support more collection types and usage.

functions-nested? For array, struct, map

@goldmedal
Copy link
Contributor

functions-nested? For array, struct, map

It looks good, but I think we can have another PR for it. It's also related to changing the name of array_expressions feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants