Implement a sort optimization for Between Join_Condition #5303

wdanilo · 2023-02-05T22:57:40Z

This task is automatically imported from the old Task Issue Board and it was originally created by Radosław Waśko.
Original issue is here.

TODO: figure out in more detail how it should work.

In short, currently we rely on a full scan for Between but should be able to get better results by sorting one table by the key.

This needs to co-operate with the index-based join - sorting the hashmap buckets.
It also needs to support multiple Between levels, probably by relying on lexicographic ordering across the levels.
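The idea of replacing the full scan with a sort can be sketched roughly as follows. This is only an illustration in Python, not the actual Enso implementation: the function names, the representation of rows as plain lists, and the choice of which table gets sorted are all assumptions, and the bounds are assumed to be inclusive.

```python
from bisect import bisect_left, bisect_right


def between_join_full_scan(left_keys, right_ranges):
    """Baseline: compare every left key with every (lower, upper) range."""
    return [
        (i, j)
        for j, (lower, upper) in enumerate(right_ranges)
        for i, key in enumerate(left_keys)
        if lower <= key <= upper
    ]


def between_join_sorted(left_keys, right_ranges):
    """Sort the left keys once, then binary-search each range.

    Returns (left_index, right_index) pairs with lower <= key <= upper.
    Bounds are assumed to be inclusive.
    """
    # Remember the original row index of every key after sorting.
    order = sorted(range(len(left_keys)), key=lambda i: left_keys[i])
    sorted_keys = [left_keys[i] for i in order]

    result = []
    for j, (lower, upper) in enumerate(right_ranges):
        lo = bisect_left(sorted_keys, lower)    # first key >= lower
        hi = bisect_right(sorted_keys, upper)   # one past the last key <= upper
        for pos in range(lo, hi):
            result.append((order[pos], j))
    return result
```

With `n` keys and `m` ranges, the sorted variant costs about `O((n + m) log n + k)` for a result of size `k`, compared with `O(n * m)` for the full scan. Duplicate keys need no special handling here because the sorted list keeps every row.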

radeusgd · 2023-10-04T10:05:58Z

Indirectly related #7767

enso-bot · 2023-10-31T10:03:15Z

Radosław Waśko reports a new STANDUP for yesterday (2023-10-30):

Progress: Analyzing possibilities for more optimal Between join algorithms. Adding some additional benchmarks. Work on implementing a simple SortJoin. It should be finished by 2023-11-06.

Next Day: Next day I will be working on the same task. Implement the simple SortJoin. Integrate it with HashJoin. Check benchmarks. Tune some edge cases in benchmarks.

enso-bot · 2023-10-31T19:38:51Z

Radosław Waśko reports a new STANDUP for today (2023-10-31):

Progress: Implemented SortJoin using TreeSet, and also sort+binsearch. Both are currently failing tests - more work is needed. Meetings, reviews, discussions on legal review tool. It should be finished by 2023-11-06.

Next Day: Next day I will be working on the same task. More focused work time is needed. Try to fix the TreeSet approach, which seems simpler. See if fixing sort+binsearch is viable; see how they compare performance-wise. Integrate the chosen technique with IndexJoin to get a compound join algorithm that can handle mixed equals+between conditions. First refactor JoinStrategy a bit to make this 'stacking' easier (move some params from the join callback into constructors).

enso-bot · 2023-11-03T07:31:15Z

Radosław Waśko reports a new STANDUP for yesterday (2023-11-02):

Progress: Fixed the implementation to pass the tests. Running benchmarks and comparing performance. Prepared the PR. Created followup and related tickets stemming from that. Discussions on Type stuff. It should be finished by 2023-11-06.

Next Day: Next day I will be working on the #8213 task. Fix the issue with order_by.

enso-bot · 2023-11-07T10:23:23Z

Radosław Waśko reports a new STANDUP for yesterday (2023-11-06):

Progress: Finishing touches to the Between optimization PR, got it ready to merge. Going through my other pending PRs and ensuring they can proceed to be merged. Starting to look into the next task (ambiguous from conversion definition). It should be finished by 2023-11-06.

Next Day: Next day I will be working on the #7853 task. Figure out where to add tests. Work on returning some more clear error.

…ension (#8212)

- Closes #5303
- Refactors `JoinStrategy`, allowing us to 'stack' join strategies on top of each other (to some extent): currently a `HashJoin` can be followed by another join strategy (currently `SortJoin`).
- Adds benchmarks for join.
- Due to limitations of the sorting approach, this will still not be as fast as possible for cases where there is more than one `Between` condition in a single query; the benchmarks try to demonstrate that.
  - We can replace sorting with d-dimensional [RangeTrees](https://en.wikipedia.org/wiki/Range_tree) to get `O((n + m) log^d n + k)` performance (where `n` and `m` are the sizes of the joined tables, `d` is the number of `Between` conditions used in the query and `k` is the result set size).
  - Follow-up ticket for consideration later: #8216
- Closes #8215
  - After all, it turned out that `TreeSet` was problematic (because of not enough flexibility with duplicate key handling), so the simplest solution was to immediately implement this sub-task.
- Closes #8204
  - Unrelated, but I ran into this here: adds type checks to other arguments of `set`.
  - Before, passing a Column as `new_name` (i.e. mistakenly mixing up the order of arguments) led to a hard-to-understand ``Method `if_then_else` of type Column could not be found.``; now it fails with the type error `expected Text got Column`.
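The 'stacking' of a hash-based step followed by a sort-based step might look roughly like the sketch below. It is written in Python purely for illustration and does not mirror the real `JoinStrategy`, `HashJoin` or `SortJoin` classes: the function name, the tuple-based row layout and the inclusive bounds are assumptions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def hash_then_sort_join(left_rows, right_rows):
    """Join on one equality condition plus one Between condition.

    left_rows:  list of (eq_key, value)
    right_rows: list of (eq_key, lower, upper)
    Returns (left_index, right_index) pairs where the eq_keys match and
    lower <= value <= upper (inclusive bounds are an assumption).
    """
    # Step 1 (hash): group left rows into buckets by the equality key.
    buckets = defaultdict(list)
    for i, (eq_key, value) in enumerate(left_rows):
        buckets[eq_key].append((value, i))

    # Step 2 (sort): sort every bucket by the Between key.
    index = {}
    for eq_key, entries in buckets.items():
        entries.sort()
        keys = [value for value, _ in entries]
        rows = [row for _, row in entries]
        index[eq_key] = (keys, rows)

    # Step 3 (probe): find the bucket, then binary-search the range in it.
    result = []
    for j, (eq_key, lower, upper) in enumerate(right_rows):
        if eq_key not in index:
            continue
        keys, rows = index[eq_key]
        lo = bisect_left(keys, lower)
        hi = bisect_right(keys, upper)
        result.extend((rows[pos], j) for pos in range(lo, hi))
    return result
```

Only one Between condition can use the sorted order in this sketch; any further Between conditions would have to be checked row by row on the candidates, which is the limitation that the range tree idea in the description is meant to address.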

wdanilo assigned radeusgd Feb 6, 2023

wdanilo removed the Assignee: Radosław Waśko label Feb 6, 2023

wdanilo added this to Issues Board Feb 6, 2023

wdanilo removed the State: unscheduled label Feb 6, 2023

jdunkerley removed this from Issues Board Feb 6, 2023

jdunkerley unassigned radeusgd Feb 6, 2023

github-project-automation bot added this to Issues Board Feb 6, 2023

github-project-automation bot moved this to ❓New in Issues Board Feb 6, 2023

jdunkerley removed this from Issues Board Feb 7, 2023

jdunkerley assigned radeusgd Oct 4, 2023

jdunkerley added this to Issues Board Oct 4, 2023

github-project-automation bot moved this to ❓New in Issues Board Oct 4, 2023

jdunkerley moved this from ❓New to 📤 Backlog in Issues Board Oct 4, 2023

enso-bot bot mentioned this issue Oct 26, 2023

Implement Table.lookup_and_replace for the Database backend #7981

Closed

radeusgd moved this from 📤 Backlog to 🔧 Implementation in Issues Board Oct 26, 2023

enso-bot bot mentioned this issue Oct 27, 2023

Limit max_rows that are downloaded in Table.read by default, and warn if more rows are available #8159

Merged

5 tasks

radeusgd mentioned this issue Nov 2, 2023

Improve performance of Join_Condition.Between by sorting on one dimension #8212

Merged

5 tasks

radeusgd moved this from 🔧 Implementation to 👁️ Code review in Issues Board Nov 2, 2023

radeusgd mentioned this issue Nov 2, 2023

Improve performance of Table.join when multiple Between conditions are used #8216

Open

3 tasks

enso-bot bot mentioned this issue Nov 6, 2023

Table.order_by should not warn about Floating_Point_Equality when sorting on Float columns #8213

Closed

3 tasks

mergify bot closed this as completed in #8212 Nov 8, 2023

github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Nov 8, 2023
