DataFusion solution [WIP] #182
Conversation
Thank you for submitting this PR. It is really valuable to have a working example of compiled code in this benchmark, as all currently listed solutions use REPL interfaces.
I left one comment. I could take it over from here, as the most difficult part has already been done, although it may take me a little while because I will be on another project in the coming week. Anyway, if you could translate the mentioned write_log from Python to Rust, that would be helpful.
datafusion/src/main.rs
```rust
let df = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM t GROUP BY id1")?;
let _results = df.collect().await?;
```
I would chain these two lines into a single line, and the variable can be named ans for consistency with the other scripts. The query for each question needs to be run twice. After each query we do an extra computation on ans (and measure its timing as well) to ensure the actual query was not lazy, by forcing computation on ans. Those two timings for each query need to be written to the CSV timings file (mem usage can be ignored), which should be handled by a helper function, like this one:
db-benchmark/_helpers/helpers.py, line 8 at 070d3b7:

```python
def write_log(task, data, in_rows, question, out_rows, out_cols, solution, version, git, fun, run, time_sec, mem_gb, cache, chk, chk_time_sec, on_disk):
```
After both runs of a query have finished, we need to print the head-3 and tail-3 of ans.
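A minimal sketch of that flow in Rust, mirroring the Python helper above. The trimmed write_log signature, the checksum logic, the CSV file name, and the ExecutionContext API are illustrative assumptions, not the benchmark's actual harness:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

use datafusion::arrow::array::Float64Array;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::*;

// Hypothetical Rust port of _helpers/helpers.py::write_log, trimmed to the
// fields exercised below; appends one CSV row to the timings file
// (mem usage is ignored, per the review above).
fn write_log(
    task: &str, question: &str, out_rows: u64, fun: &str, run: u32,
    time_sec: f64, chk: f64, chk_time_sec: f64,
) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open("time.csv")?;
    writeln!(f, "{},{},{},{},{},{},{},{}",
             task, question, out_rows, fun, run, time_sec, chk, chk_time_sec)
}

// Forces computation on `ans` by summing the aggregate column
// (assumed to be Float64 at index 1; nulls ignored for simplicity).
fn checksum(ans: &[RecordBatch]) -> f64 {
    ans.iter()
        .filter_map(|b| b.column(1).as_any().downcast_ref::<Float64Array>())
        .map(|a| a.values().iter().sum::<f64>())
        .sum()
}

// Runs q1 twice, timing the query and the checksum separately, using the
// same ctx.sql(..)?.collect().await? pattern as the snippet above.
async fn run_q1(ctx: &mut ExecutionContext) -> Result<(), Box<dyn std::error::Error>> {
    for run in 1..=2u32 {
        let start = Instant::now();
        let ans = ctx
            .sql("SELECT id1, SUM(v1) AS v1 FROM t GROUP BY id1")?
            .collect()
            .await?;
        let time_sec = start.elapsed().as_secs_f64();

        let start = Instant::now();
        let chk = checksum(&ans);
        let chk_time_sec = start.elapsed().as_secs_f64();

        let out_rows: u64 = ans.iter().map(|b| b.num_rows() as u64).sum();
        write_log("groupby", "q1", out_rows, "sql", run, time_sec, chk, chk_time_sec)?;
    }
    // Printing head-3 / tail-3 of the answer would follow here.
    Ok(())
}
```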
Hi @jangorecki, I'm picking up work on this. Do you still need a Rust version of write_log?
Thanks for the feedback, I replaced the variables. @jangorecki, if you work on it, here are some tips:
Any idea if DataFusion supports the queries that are required for the advanced groupby questions, q6-q10?
I added query 7 and query 10. The others, I think, need features that are yet to be implemented (median, window functions, etc.). Query 10 is ridiculously slow though; that will improve a bit once a PR has been merged, but it will probably still be slow after that. The benchmarks showed that we have some more work to do! I also have an open PR that will improve performance on the easier queries; I think DataFusion might already be close to clickhouse / data.table or even CuDF for group by queries 1 and 4. The inner join queries should all work too, I think (I might add them later; it is easy as I can just reuse the clickhouse queries). There is a known bug for left joins which gives wrong output.
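For reference, a sketch of how one of the supported advanced questions can be phrased through the same SQL interface used in main.rs; the q7 definition (max v1 minus min v2 per id3) and the column names are assumed from the benchmark's groupby task:

```rust
// Hypothetical sketch of advanced groupby q7, reusing the ctx.sql(..)
// pattern from main.rs; the question definition is an assumption.
let ans = ctx
    .sql("SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM t GROUP BY id3")?
    .collect()
    .await?;
```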
Referenced commit:

…sue with high number of groups

Currently, we loop through the hashmap for every key. However, as we receive a batch, if we have a lot of groups in the group by expression (or receive sorted data, etc.) then we could create a lot of empty batches and call `update_batch` for each of the keys already in the hashmap. In this PR we keep track of which keys we received in the batch and only update the accumulators with the same keys, instead of all accumulators.

On the db-benchmark h2oai/db-benchmark#182 this is the difference (mainly q3 and q5, the others seem to be noise). It doesn't completely solve the problem, but it already reduces it quite a bit.

This PR:

```
q1 took 340 ms
q2 took 1768 ms
q3 took 10975 ms
q4 took 337 ms
q5 took 13529 ms
```

Master:

```
q1 took 330 ms
q2 took 1648 ms
q3 took 16408 ms
q4 took 335 ms
q5 took 21074 ms
```

Closes #9234 from Dandandan/hash_agg_speed2

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
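In outline, the change amounts to the following (a simplified sketch with stand-in types and a SUM accumulator; not the actual DataFusion code):

```rust
use std::collections::HashMap;

// Sketch: bucket the incoming batch's rows by group key first, then update
// each accumulator once per key that actually appears in the batch, rather
// than calling update_batch for every key already in the hash table.
fn process_batch(
    groups: &mut HashMap<String, f64>,   // group key -> running SUM accumulator
    batch: &[(String, f64)],             // stand-in for an Arrow RecordBatch
) {
    let mut per_key: HashMap<&str, Vec<f64>> = HashMap::new();
    for (key, value) in batch {
        per_key.entry(key.as_str()).or_default().push(*value);
    }
    // per_key.len() accumulator updates per batch instead of groups.len().
    for (key, values) in per_key {
        *groups.entry(key.to_string()).or_insert(0.0) += values.iter().sum::<f64>();
    }
}
```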
@Dandandan anything I can do to help finalize the work on this?
Yeah sure, help appreciated! I think what's missing is:
@Dandandan ok! Will check it out. Is there any additional info you could provide on what the standard flow is?
@Dandandan do you have a preference for how I push my updates here? I started the work at https://github.com/matthewmturner/db-benchmark/tree/datafusion/datafusion as a fork of what you were doing. I've updated how the tables are created and added the join queries. I still need to review them in more detail / make sure they're correct, see if I can add any of the missing group by queries, and I assume we'll want to test the larger datasets as well, but let me know if you have any thoughts. After the above I'll start looking into the flow more. Right now these are the results I get when running the benchmarks:
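Not the fork's actual code, but the table setup and one of the join queries might look roughly like this; the file paths follow the benchmark's join-data naming and the id1 join key is an assumption:

```rust
use datafusion::prelude::*;

// Hypothetical sketch: register the benchmark CSVs and run an inner join,
// reusing a clickhouse-style query as suggested earlier in the thread.
// Paths, table names, and the join key are assumptions; adjust
// CsvReadOptions to the files' actual layout.
let mut ctx = ExecutionContext::new();
ctx.register_csv("x", "data/J1_1e7_NA_0_0.csv", CsvReadOptions::new())?;
ctx.register_csv("small", "data/J1_1e7_1e1_0_0.csv", CsvReadOptions::new())?;
let ans = ctx
    .sql("SELECT x.id1, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1")?
    .collect()
    .await?;
```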
Thank you 💯. Maybe you could open a new PR with the combined changes so we can continue it there?
@Dandandan Sure, sounds good!
Follow-up PR:
WIP PR to add DataFusion (#107) as a solution.
@jangorecki is there any documentation regarding the required output? As this is a Rust solution, it cannot easily reuse the prepared code.