
DataFusion solution [WIP] #182

Closed · wants to merge 15 commits

Conversation

Dandandan

@Dandandan Dandandan commented Jan 17, 2021

WIP PR to add DataFusion (#107) as a solution.

@jangorecki is there any documentation regarding the required output? As this is a Rust solution, it cannot easily reuse the prepared code.

Contributor

@jangorecki jangorecki left a comment


Thank you for submitting this PR. It is really valuable to have a working example of compiled code in this benchmark, as all currently listed solutions use REPL interfaces.
I left one comment. I could take it over from here, as the most difficult part has already been done, although it may take me a little while because I will be on another project in the coming week. Anyway, if it is possible for you to translate the mentioned write_log from Python to Rust, that would be helpful.

Comment on lines 37 to 39:

```rust
let df = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM t GROUP BY id1")?;

let _results = df.collect().await?;
```
Contributor

@jangorecki jangorecki Jan 17, 2021


I would chain these two lines into a single one, and the variable can be named `ans` for consistency with the other scripts. The query for each question needs to be run twice. After each query we do an extra computation on `ans` (and measure its timing as well) to ensure the actual query was not lazy, by forcing computation on `ans`. Those two timings for each query need to be written to the timings csv file (mem usage can be ignored), which should be handled by a helper function like this one:

```python
def write_log(task, data, in_rows, question, out_rows, out_cols, solution, version, git, fun, run, time_sec, mem_gb, cache, chk, chk_time_sec, on_disk):
```

After both runs of a query have finished, we need to print the head-3 and tail-3 of `ans`.
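A Rust port of this helper might look roughly like the sketch below. The field order follows the Python signature quoted above, but the concrete types, the `format_log_row` helper name, and the CSV layout are my assumptions; the real benchmark harness may record extra fields (e.g. a timestamp).

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical Rust counterpart of the benchmark's Python `write_log` helper.
// Field order mirrors the Python signature; types are illustrative guesses.
#[allow(clippy::too_many_arguments)]
fn format_log_row(
    task: &str, data: &str, in_rows: u64, question: &str,
    out_rows: u64, out_cols: u64, solution: &str, version: &str,
    git: &str, fun: &str, run: u32, time_sec: f64, mem_gb: f64,
    cache: bool, chk: &str, chk_time_sec: f64, on_disk: bool,
) -> String {
    format!("{task},{data},{in_rows},{question},{out_rows},{out_cols},{solution},{version},{git},{fun},{run},{time_sec},{mem_gb},{cache},{chk},{chk_time_sec},{on_disk}")
}

// Append one row to the timings csv, creating the file if needed.
fn write_log(path: &str, row: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{row}")
}

fn main() -> std::io::Result<()> {
    // All values below are made up for illustration only.
    let row = format_log_row(
        "groupby", "G1_1e7_1e2_0_0", 10_000_000, "q1", 100, 2,
        "datafusion", "4.0.0", "", "sum v1 by id1", 1, 0.34, 0.0,
        true, "1234.5", 0.01, false,
    );
    println!("{row}");
    let path = std::env::temp_dir().join("datafusion_time.csv");
    write_log(path.to_str().unwrap(), &row)
}
```

Each query run would call this once per timing, so two rows per question per data size end up in the csv.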


Hi @jangorecki, I'm picking up work on this. Do you still need a Rust version of `write_log`?

@Dandandan
Author

Dandandan commented Jan 18, 2021

Thanks for the feedback. I replaced the variables with `ans` and combined the two lines into one. I am not sure I will find enough time for `write_log` this week.

@jangorecki if you work on it, here are some tips:

  • The solution should be run with `cargo run --release`; plain `cargo run` builds and runs in debug mode, which makes the program very slow.
  • The `lto = true` line in `Cargo.toml` speeds up the binary but slows down the build, so it is better to remove it while developing.
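For reference, the setting mentioned above lives in the release profile of `Cargo.toml`; a minimal sketch (the surrounding profile contents here are illustrative):

```toml
# Cargo.toml release profile, used by `cargo run --release`
[profile.release]
lto = true   # link-time optimization: faster binary, noticeably slower builds
```

Commenting out the `lto` line during development keeps rebuild times short without affecting benchmark runs, as long as it is restored for measured runs.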

@jangorecki
Contributor

jangorecki commented Jan 18, 2021

Any idea whether DataFusion supports the queries required for the advanced groupby questions, q6-q10?

@Dandandan
Author

Dandandan commented Jan 18, 2021

I added query 7 and query 10. The others, I think, need features that are not yet implemented (median, window functions, etc.). Query 10 is ridiculously slow though; that will improve a bit once a PR has been merged, but it will probably still be slow after that. The benchmarks showed that we have some more work to do! I also have an open PR that will improve performance on the easier queries; I think DataFusion might already be close to ClickHouse / data.table or even cuDF for group-by queries 1 and 4.

The inner join queries should all work too, I think (I might add them later; it is easy, as I can just reuse the ClickHouse queries). There is a known bug for left joins that gives wrong output.

alamb pushed a commit to apache/arrow that referenced this pull request Jan 20, 2021
…sue with high number of groups

Currently, we loop over the hashmap for every key.

However, as we receive a batch, if we have a lot of groups in the group by expression (or receive sorted data, etc.), we could create a lot of empty batches and call `update_batch` for each of the keys already in the hashmap.

In this PR we keep track of which keys we received in the batch and only update the accumulators with those keys instead of all accumulators.

On the db-benchmark h2oai/db-benchmark#182 this is the difference (mainly q3 and q5; the others seem to be noise). It doesn't completely solve the problem, but it already reduces it quite a bit.

This PR:
```
q1 took 340 ms
q2 took 1768 ms
q3 took 10975 ms
q4 took 337 ms
q5 took 13529 ms
```
Master:
```
q1 took 330 ms
q2 took 1648 ms
q3 took 16408 ms
q4 took 335 ms
q5 took 21074 ms
```

Closes #9234 from Dandandan/hash_agg_speed2

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
kszucs pushed a commit to apache/arrow that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
@matthewmturner

@Dandandan anything I can do to help finalize the work on this?

@Dandandan
Author

> @Dandandan anything I can do to help finalize the work on this?

Yeah, sure, help appreciated!

I think what's missing is:

  • Integrating with the standard flow (writing the timings file in the standardized way)
  • Adding the join queries as well

@matthewmturner

matthewmturner commented Dec 24, 2021

@Dandandan ok! I'll check it out. Is there any additional info you could provide on what the standard flow is?

@matthewmturner

@Dandandan do you have a preference for how I push my updates here?

I started the work at https://github.com/matthewmturner/db-benchmark/tree/datafusion/datafusion as a fork of what you were doing.

I've updated how the tables are created and added the join queries. I still need to review them in more detail / make sure they're correct, see if I can add any of the missing group-by queries, and I assume we'll want to test the larger datasets as well, but let me know if you have any thoughts.

After the above I'll start looking into the flow more.

Right now these are the results I get when running the benchmarks:

group by
```
q1 took 56 ms
q2 took 289 ms
q3 took 1305 ms
q4 took 69 ms
q5 took 1158 ms
q7 took 1198 ms
q10 took 24691 ms
```

join
```
q1 took 261 ms
q2 took 367 ms
q3 took 334 ms
q4 took 507 ms
q5 took 1936 ms
```

@Dandandan
Author

> @Dandandan do you have a preference for how I push my updates here?
> […]

Thank you 💯. Maybe you could open a new PR with the combined changes so we can continue it there?

@matthewmturner

@Dandandan Sure, sounds good!

@Dandandan
Author

Follow up PR:

#240
