sql: make lookup-join deduplication efficient for range lookups #85950
Labels
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-sql-queries
SQL Queries Team
Is your feature request related to a problem? Please describe.
Currently, de-duplication of the spans used for lookup is performed in the lookup joiner logic using a go map keyed on the span start and end keys. This works well for lookup joins that only use equalities (e.g. point-lookups), since in this case spans that overlap are exactly equivalent and can therefore be deduplicated by the map. For range lookups, however, spans may only partially overlap - for example, the start keys of two spans may be the same but the end keys different. Currently, we would not perform any de-duplication in this case and would instead perform two separate lookups, which may involve a lot of redundant work.
Describe the solution you'd like
First of all, de-duplication should be moved from the
joinReader
to theStreamer
- see #82155.For point lookups, it makes sense to use a map as before, since this is simple and imposes little overhead. However, for range lookups, a sort-merge strategy probably makes the most sense in order to eliminate redundant work. This work would only be performed once per input batch in order to de-duplicate. For matching and emitting looked-up rows to input rows, we would still use a simple multi-map slice like
spanIDHelper.spanIDToInputRowIndices
as before.Additional context
#66002 and #85597 added support for range-based lookup joins, but lookup joins that don't use equality conditions (only inequalities) are still only planned in limited cases in order to avoid performance regressions due to the lack of de-duplication. Fixing this issue will allow us to plan lookup joins in more cases.
Jira issue: CRDB-18498
The text was updated successfully, but these errors were encountered: