
Support pass-through queries for the jdbc based connectors #9163

Closed
grantatspothero opened this issue Sep 8, 2021 · 6 comments · Fixed by #12325
Labels
enhancement New feature or request

Comments

@grantatspothero
Contributor

The postgres/redshift connectors do not support pushdown of arbitrary operations (see docs). This makes certain types of queries extremely inefficient, since all the data is pulled out of Postgres/Redshift and processed in memory in Trino.

Since supporting pushdown of arbitrary operations is not wanted or feasible, being able to write a pass-through query in Trino that runs inside Postgres/Redshift would be extremely helpful for optimizing queries where the current pushdown optimizations are not good enough.

There is some precedent for this in Trino: @bitsondatadev noted that the elasticsearch connector currently supports a form of pass-through queries: https://trino.io/docs/current/connector/elasticsearch.html#pass-through-queries

Examples of problems we have today:

  • if you join two Redshift tables through Trino, Trino will sometimes pull all the data out of Redshift and do the join in memory instead of pushing the join down into Redshift.
  • if you need to use a db-specific function (like datetime functions), the db functions are not exposed by Trino, so you have to use Trino functions, which requires pulling the entire table out of the source db instead of pushing down the operation.

How pass-through queries could help solve the above problems (with hypothetical syntax):

with tmp_table as (`redshift`
    select *
    from schema.a
    -- join happening in redshift
    inner join schema.b
    on a.id = b.id
    -- timestamp filtering happening in redshift
    where date_part_year(b.event_date) = 2021
)
-- only final join executed in trino
select *
from tmp_table
inner join hive.schema.table as t
on tmp_table.id = t.id
@grantatspothero grantatspothero changed the title Support pass-through queries for the jdbc based connector Support pass-through queries for the jdbc based connectors Sep 8, 2021
@hashhar hashhar added the enhancement New feature or request label Sep 9, 2021
@hashhar
Member

hashhar commented Sep 9, 2021

Since supporting pushdown of arbitrary operations is not wanted or feasible

This isn't true. There is work in progress in this direction - there is some ongoing work in #7994.

if you need to use a db-specific function (like datetime functions), the db functions are not exposed by Trino, so you have to use Trino functions, which requires pulling the entire table out of the source db instead of pushing down the operation.

Agreed about this. This will most likely always require pass-through, though, because functions don't have the same semantics across databases. E.g., if you do a COUNT(col) in Postgres you'd get the count of non-null values, but Pinot, for example, would return the total row count. This makes it very easy to run into silent-incorrectness cases.

An alternative we have been thinking about is to let connectors rewrite Trino function calls into equivalent expressions for the remote system. So, for example, COUNT(col) can be rewritten to SUM(CASE WHEN col IS NOT NULL THEN 1 ELSE 0 END) by a connector to preserve semantics. You can implement this for aggregation functions today, but not for arbitrary functions (that will become possible after #7994).
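As a concrete sketch of why such a rewrite is needed (the table and column names here are hypothetical):

```sql
-- Suppose t has three rows: col = 1, col = 2, col = NULL.

-- Postgres semantics: COUNT(col) skips NULLs and returns 2.
SELECT COUNT(col) FROM t;

-- A remote system whose COUNT counts every row would return 3
-- for the same query. A connector can preserve Postgres semantics
-- by pushing down the rewritten form, which also returns 2:
SELECT SUM(CASE WHEN col IS NOT NULL THEN 1 ELSE 0 END) FROM t;
```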

db functions are not exposed by trino so you have to use trino functions

This is not an easy problem to solve. The query parser needs to parse the query in order to analyse it, plan it, and optimize it. Since the grammar differs across DBs, it would require implementing a SQL translation layer, which is itself a very large project.


One of the biggest problems with pass-through queries is that they're a footgun if used incorrectly. As long as you're thinking about only a single target system it's OK: the semantics are consistent, and you'd get consistent (according to remote database semantics) output. But as soon as you have multiple such catalogs in play it becomes very complicated and easy to get silently incorrect results.

Due to this reason IMO the effort is better spent on the roadmap items in #18.

@grantatspothero
Contributor Author

This is not an easy problem to solve. The query parser needs to parse the query in order to analyse it, plan it, and optimize it. Since the grammar differs across DBs, it would require implementing a SQL translation layer, which is itself a very large project.

Totally understand why implementing a translation layer is a huge project, but why is the naive approach of passing the query through to the underlying system as an opaque string not feasible?

But as soon as you have multiple such catalogs in play it becomes very complicated and easy to get silently incorrect results.

Is that a bad thing? If you have two Postgres connectors, each running a different version of Postgres, and you send a pass-through query to each, the expectation is that you will get slightly different results depending on which Postgres db you are talking to.

Pass-through queries solve real problems that currently have no good workarounds. While I agree there is potential to misuse them, does the risk of misuse outweigh the benefits? IMO, no.

@electrum
Member

electrum commented Sep 9, 2021

Would it work to create a view in Redshift and select from it in Trino? The entire view query will be executed in Redshift.
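For illustration, the workaround looks like this (schema and view names are hypothetical):

```sql
-- In Redshift, define the view once:
CREATE VIEW reporting.events_2021 AS
SELECT a.id, b.event_date
FROM schema.a
INNER JOIN schema.b ON a.id = b.id
WHERE date_part(year, b.event_date) = 2021;

-- In Trino, select through the Redshift catalog; the whole
-- view body executes inside Redshift:
SELECT *
FROM redshift.reporting.events_2021;
```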

@grantatspothero
Contributor Author

@electrum yes that is a workaround (as well as materializing a view in redshift). It is what we do today.

But when using additional tooling on top of Trino, like dbt, having to model one-off queries as views gets bloated pretty quickly (see this thread for the motivation: https://trinodb.slack.com/archives/CFQAMGRQE/p1631029170025400?thread_ts=1630955221.022000&cid=CFQAMGRQE)

@findepi
Member

findepi commented Feb 25, 2022

the current plan is to address this with #1839, cc @kasiafi @martint

@kasiafi
Member

kasiafi commented May 20, 2022

@grantatspothero we now have support for polymorphic table functions (see https://trino.io/docs/current/release/release-381.html).
Query pass-through is a great use case for PTFs. We're currently working on query pass-through for the JDBC-based connectors: #12325
For the user, it will be as easy as calling a function in Trino with the native query text as the argument.
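Sketched against the earlier Redshift example, the table-function call would look roughly like this (the catalog name `redshift` and the exact argument name are illustrative, not confirmed by this thread):

```sql
-- The native query string is passed through verbatim and
-- executed entirely inside Redshift:
SELECT *
FROM TABLE(
    redshift.system.query(
        query => 'SELECT a.id, b.event_date
                  FROM schema.a
                  INNER JOIN schema.b ON a.id = b.id
                  WHERE date_part_year(b.event_date) = 2021'
    )
);
```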

@findepi findepi linked a pull request May 20, 2022 that will close this issue