-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a sort optimization for Between Join_Condition #5303
Comments
Indirectly related #7767 |
Radosław Waśko reports a new STANDUP for yesterday (2023-10-30): Progress: Analyzing possibilities for more optimal Between join algorithms. Adding some additional benchmarks. Work on implementing a simple SortJoin. It should be finished by 2023-11-06. Next Day: Next day I will be working on the same task. Implement the simple SortJoin. Integrate it with HashJoin. Check benchmarks. Tune some edge cases in benchmarks. |
Radosław Waśko reports a new STANDUP for today (2023-10-31): Progress: Implemented SortJoin using TreeSet, and also sort+binsearch. Both are currently failing tests - more work is needed. Meetings, reviews, discussions on legal review tool. It should be finished by 2023-11-06. Next Day: Next day I will be working on the same task. More focused work time is needed. Try to fix the TreeSet approach that seems simpler. See if fixing sort+binsearch is viable; see how they compare performance wise. Integrate the chosen technique with IndexJoin to get a compound join algorithm that can handle mixed equals+between conditions. First refactor JoinStrategy a bit to make this 'stacking' easier (move some params from |
Radosław Waśko reports a new STANDUP for yesterday (2023-11-02): Progress: Fixed the implementation to pass the tests. Running benchmarks and comparing performance. Prepared the PR. Created followup and related tickets stemming from that. Discussions on Type stuff. It should be finished by 2023-11-06. Next Day: Next day I will be working on the #8213 task. Fix the issue with |
Radosław Waśko reports a new STANDUP for yesterday (2023-11-06): Progress: Finishing touches to the Between optimization PR, got it ready to merge. Going through other of my pending PRs and ensuring they can proceed to be merged. Starting to look into next task (ambiguous from conversion definition). It should be finished by 2023-11-06. Next Day: Next day I will be working on the #7853 task. Figure out where to add tests. Work on returning some more clear error. |
…ension (#8212) - Closes #5303 - Refactors `JoinStrategy` allowing us to 'stack' join strategies on top of each other (to some extent) - currently a `HashJoin` can be followed by another join strategy (currently `SortJoin`) - Adds benchmarks for join - Due to limitations of the sorting approach this will still not be as fast as possible for cases where there is more than 1 `Between` condition in a single query - trying to demonstrate that in benchmarks. - We can replace sorting by d-dimensional [RangeTrees](https://en.wikipedia.org/wiki/Range_tree) to get `O((n + m) log^d n + k)` performance (where `n` and `m` are sizes of joined tables, `d` is the amount of `Between` conditions used in the query and `k` is the result set size). - Follow up ticket for consideration later: #8216 - Closes #8215 - After all, it turned out that `TreeSet` was problematic (because of not enough flexibility with duplicate key handling), so the simplest solution was to immediately implement this sub-task. - Closes #8204 - Unrelated, but I ran into this here: adds type checks to other arguments of `set`. - Before, putting in a Column as `new_name` (i.e. mistakenly messing up the order of arguments), lead to a hard to understand `Method `if_then_else` of type Column could not be found.`, instead now it would file with type error 'expected Text got Column`.
This task is automatically imported from the old Task Issue Board and it was originally created by Radosław Waśko.
Original issue is here.
TODO: figure out more details how it should work
In short, currently we rely on a full scan for Between but should be able to get better results by sorting one table by the key.
This needs to co-operate with the index-based join - sorting the hashmap buckets.
It also needs to support multiple between levels - probably by relying on lexicographically ordering between the levels
The text was updated successfully, but these errors were encountered: