Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a sort optimization for Between Join_Condition #5303

Closed
wdanilo opened this issue Feb 5, 2023 · 5 comments · Fixed by #8212
Closed

Implement a sort optimization for Between Join_Condition #5303

wdanilo opened this issue Feb 5, 2023 · 5 comments · Fixed by #8212
Assignees
Labels
-libs Libraries: New libraries to be implemented l-join p-low Low priority x-new-feature Type: new feature request

Comments

@wdanilo
Copy link
Member

wdanilo commented Feb 5, 2023

This task is automatically imported from the old Task Issue Board and it was originally created by Radosław Waśko.
Original issue is here.


TODO: figure out more details how it should work

In short, currently we rely on a full scan for Between but should be able to get better results by sorting one table by the key.

This needs to co-operate with the index-based join - sorting the hashmap buckets.
It also needs to support multiple between levels - probably by relying on lexicographically ordering between the levels

@radeusgd
Copy link
Member

radeusgd commented Oct 4, 2023

Indirectly related #7767

@enso-bot
Copy link

enso-bot bot commented Oct 31, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-10-30):

Progress: Analyzing possibilities for more optimal Between join algorithms. Adding some additional benchmarks. Work on implementing a simple SortJoin. It should be finished by 2023-11-06.

Next Day: Next day I will be working on the same task. Implement the simple SortJoin. Integrate it with HashJoin. Check benchmarks. Tune some edge cases in benchmarks.

@enso-bot
Copy link

enso-bot bot commented Oct 31, 2023

Radosław Waśko reports a new STANDUP for today (2023-10-31):

Progress: Implemented SortJoin using TreeSet, and also sort+binsearch. Both are currently failing tests - more work is needed. Meetings, reviews, discussions on legal review tool. It should be finished by 2023-11-06.

Next Day: Next day I will be working on the same task. More focused work time is needed. Try to fix the TreeSet approach that seems simpler. See if fixing sort+binsearch is viable; see how they compare performance wise. Integrate the chosen technique with IndexJoin to get a compound join algorithm that can handle mixed equals+between conditions. First refactor JoinStrategy a bit to make this 'stacking' easier (move some params from join callback into constructors)

@enso-bot
Copy link

enso-bot bot commented Nov 3, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-11-02):

Progress: Fixed the implementation to pass the tests. Running benchmarks and comparing performance. Prepared the PR. Created followup and related tickets stemming from that. Discussions on Type stuff. It should be finished by 2023-11-06.

Next Day: Next day I will be working on the #8213 task. Fix the issue with order_by.

@enso-bot
Copy link

enso-bot bot commented Nov 7, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-11-06):

Progress: Finishing touches to the Between optimization PR, got it ready to merge. Going through other of my pending PRs and ensuring they can proceed to be merged. Starting to look into next task (ambiguous from conversion definition). It should be finished by 2023-11-06.

Next Day: Next day I will be working on the #7853 task. Figure out where to add tests. Work on returning some more clear error.

@mergify mergify bot closed this as completed in #8212 Nov 8, 2023
mergify bot pushed a commit that referenced this issue Nov 8, 2023
…ension (#8212)

- Closes #5303
- Refactors `JoinStrategy` allowing us to 'stack' join strategies on top of each other (to some extent) - currently a `HashJoin` can be followed by another join strategy (currently `SortJoin`)
- Adds benchmarks for join
- Due to limitations of the sorting approach this will still not be as fast as possible for cases where there is more than 1 `Between` condition in a single query - trying to demonstrate that in benchmarks.
- We can replace sorting by d-dimensional [RangeTrees](https://en.wikipedia.org/wiki/Range_tree) to get `O((n + m) log^d n + k)` performance (where `n` and `m` are sizes of joined tables, `d` is the amount of `Between` conditions used in the query and `k` is the result set size).
- Follow up ticket for consideration later:
#8216
- Closes #8215
- After all, it turned out that `TreeSet` was problematic (because of not enough flexibility with duplicate key handling), so the simplest solution was to immediately implement this sub-task.
- Closes #8204
- Unrelated, but I ran into this here: adds type checks to other arguments of `set`.
- Before, putting in a Column as `new_name` (i.e. mistakenly messing up the order of arguments), lead to a hard to understand `Method `if_then_else` of type Column could not be found.`, instead now it would file with type error 'expected Text got Column`.
@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-libs Libraries: New libraries to be implemented l-join p-low Low priority x-new-feature Type: new feature request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants