
DataFusion solution [WIP] #182

Closed · wants to merge 15 commits

Conversation

Dandandan

@Dandandan Dandandan commented Jan 17, 2021

WIP PR to add DataFusion (#107) as a solution.

@jangorecki is there any documentation regarding the required output? As this is a Rust solution, it cannot easily reuse the prepared code.

Contributor

@jangorecki jangorecki left a comment


Thank you for submitting this PR. It is really valuable to have a working example of compiled code in this benchmark, as all currently listed solutions use REPL interfaces.
I left one comment. I could take it over from here, as the most difficult part has already been done, although it may take me a little while because I will be on another project in the coming week. Anyway, if it is possible for you to translate the mentioned write_log from Python to Rust, that would be helpful.

Comment on lines 37 to 39:

```rust
let df = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM t GROUP BY id1")?;

let _results = df.collect().await?;
```
Contributor

@jangorecki jangorecki Jan 17, 2021


I would chain these two lines into a single one, and the variable can be named `ans` for consistency with the other scripts. The query for each question needs to be run twice. After each query we do an extra computation on `ans` (and measure its timing as well) to ensure the actual query was not lazy, by forcing computation on `ans`. Those two timings for each query need to be written to the timings csv file (mem usage can be ignored), which should be handled by a helper function like this one:

```python
def write_log(task, data, in_rows, question, out_rows, out_cols, solution, version, git, fun, run, time_sec, mem_gb, cache, chk, chk_time_sec, on_disk):
```

After both runs of a query have finished, we need to print the head-3 and tail-3 of `ans`.
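A Rust port of this helper might look roughly like the sketch below. The field order follows the Python signature quoted above, but the concrete types, the `format_log_row` helper name, and the CSV layout are my assumptions; the real benchmark harness may record extra fields (e.g. a timestamp).

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical Rust counterpart of the benchmark's Python `write_log` helper.
// Field order mirrors the Python signature; types are illustrative guesses.
#[allow(clippy::too_many_arguments)]
fn format_log_row(
    task: &str, data: &str, in_rows: u64, question: &str,
    out_rows: u64, out_cols: u64, solution: &str, version: &str,
    git: &str, fun: &str, run: u32, time_sec: f64, mem_gb: f64,
    cache: bool, chk: &str, chk_time_sec: f64, on_disk: bool,
) -> String {
    format!("{task},{data},{in_rows},{question},{out_rows},{out_cols},{solution},{version},{git},{fun},{run},{time_sec},{mem_gb},{cache},{chk},{chk_time_sec},{on_disk}")
}

// Append one row to the timings csv, creating the file if needed.
fn write_log(path: &str, row: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{row}")
}

fn main() -> std::io::Result<()> {
    // All values below are made up for illustration only.
    let row = format_log_row(
        "groupby", "G1_1e7_1e2_0_0", 10_000_000, "q1", 100, 2,
        "datafusion", "4.0.0", "", "sum v1 by id1", 1, 0.34, 0.0,
        true, "1234.5", 0.01, false,
    );
    println!("{row}");
    let path = std::env::temp_dir().join("datafusion_time.csv");
    write_log(path.to_str().unwrap(), &row)
}
```

Each query run would call this once per timing, so two rows per question per data size end up in the csv.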


Hi @jangorecki, I'm picking up work on this. Do you still need a Rust version of `write_log`?

@Dandandan
Author

Dandandan commented Jan 18, 2021

Thanks for the feedback. I replaced the variables with `ans` and combined the two lines into one. I am not sure I will find enough time for `write_log` this week.

@jangorecki if you work on it, here are some tips:

  • The solution should be run with `cargo run --release`; plain `cargo run` builds and runs in debug mode, which makes the program very slow.
  • The `lto = true` line in `Cargo.toml` speeds up the binary but slows down the build, so it is better to remove it while developing.
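For reference, the setting mentioned above lives in the release profile of `Cargo.toml`; a minimal sketch (the surrounding profile contents here are illustrative):

```toml
# Cargo.toml release profile, used by `cargo run --release`
[profile.release]
lto = true   # link-time optimization: faster binary, noticeably slower builds
```

Commenting out the `lto` line during development keeps rebuild times short without affecting benchmark runs, as long as it is restored for measured runs.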

@jangorecki
Contributor

jangorecki commented Jan 18, 2021

Any idea whether DataFusion supports the queries required for the advanced groupby questions, q6-q10?

@Dandandan
Author

Dandandan commented Jan 18, 2021

I added query 7 and query 10. The others, I think, need features that are not yet implemented (median, window functions, etc.). Query 10 is ridiculously slow though; that will improve a bit once a PR has been merged, but it will probably still be slow after that. The benchmarks showed that we have some more work to do! I also have an open PR that will improve performance on the easier queries; I think DataFusion might already be close to ClickHouse / data.table or even cuDF for group-by queries 1 and 4.

The inner join queries should all work too, I think (I might add them later; it is easy, as I can just reuse the ClickHouse queries). There is a known bug for left joins that gives wrong output.

alamb pushed a commit to apache/arrow that referenced this pull request Jan 20, 2021
…sue with high number of groups

Currently, we loop over the hashmap for every key.

However, as we receive a batch, if we have a lot of groups in the group by expression (or receive sorted data, etc.), we could create a lot of empty batches and call `update_batch` for each of the keys already in the hashmap.

In this PR we keep track of which keys we received in the batch and only update the accumulators with those keys instead of all accumulators.

On the db-benchmark h2oai/db-benchmark#182 this is the difference (mainly q3 and q5; the others seem to be noise). It doesn't completely solve the problem, but it already reduces it quite a bit.

This PR:
```
q1 took 340 ms
q2 took 1768 ms
q3 took 10975 ms
q4 took 337 ms
q5 took 13529 ms
```
Master:
```
q1 took 330 ms
q2 took 1648 ms
q3 took 16408 ms
q4 took 335 ms
q5 took 21074 ms
```

Closes #9234 from Dandandan/hash_agg_speed2

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
kszucs pushed a commit to apache/arrow that referenced this pull request Jan 25, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
@matthewmturner

@Dandandan anything I can do to help finalize the work on this?

@Dandandan
Author

> @Dandandan anything I can do to help finalize the work on this?

Yeah, sure, help appreciated!

I think what's missing is:

  • Integrating with the standard flow (writing the timings file in the standardized way)
  • Adding the join queries as well

@matthewmturner

matthewmturner commented Dec 24, 2021

@Dandandan ok! I'll check it out. Is there any additional info you could provide on what the standard flow is?

@matthewmturner

@Dandandan do you have a preference for how I push my updates here?

I started the work at https://github.com/matthewmturner/db-benchmark/tree/datafusion/datafusion as a fork of what you were doing.

I've updated how the tables are created and added the join queries. I still need to review them in more detail / make sure they're correct, see if I can add any of the missing group-by queries, and I assume we'll want to test the larger datasets as well, but let me know if you have any thoughts.

After the above I'll start looking into the flow more.

Right now these are the results I get when running the benchmarks:

group by
```
q1 took 56 ms
q2 took 289 ms
q3 took 1305 ms
q4 took 69 ms
q5 took 1158 ms
q7 took 1198 ms
q10 took 24691 ms
```

join
```
q1 took 261 ms
q2 took 367 ms
q3 took 334 ms
q4 took 507 ms
q5 took 1936 ms
```

@Dandandan
Author

> @Dandandan do you have a preference for how I push my updates here?
> […]

Thank you 💯. Maybe you could open a new PR with the combined changes so we can continue it there?

@matthewmturner

@Dandandan Sure, sounds good!

@Dandandan
Author

Follow up PR:

#240
