
Support pass-through queries for the jdbc based connectors #9163

Closed
grantatspothero opened this issue Sep 8, 2021 · 6 comments · Fixed by #12325
Labels
enhancement New feature or request

Comments

@grantatspothero
Contributor

The postgres/redshift connectors do not support pushdown of arbitrary operations (see docs). This makes certain types of queries extremely inefficient, since all the data is pulled out of Postgres/Redshift and processed in memory in Trino.

Since supporting pushdown of arbitrary operations is not wanted or feasible, being able to write a pass-through query in Trino that runs inside Postgres/Redshift would be extremely helpful for optimizing queries where the current pushdown optimizations are not good enough.

There is some precedent for this in Trino: @bitsondatadev noted that the elasticsearch connector currently supports a form of pass-through queries: https://trino.io/docs/current/connector/elasticsearch.html#pass-through-queries

Examples of problems we have today:

  • if you join two Redshift tables through Trino, Trino will sometimes pull all the data out of Redshift and do the join in memory instead of pushing the join down into Redshift.
  • if you need to use a db-specific function (like datetime functions), the db functions are not exposed by Trino, so you have to use Trino functions, which requires pulling the entire table out of the source db instead of pushing down the operation.

How pass-through queries could help solve the above problems (with hypothetical syntax):

with tmp_table as (`redshift`
    select *
    from schema.a
    -- join happening in redshift
    inner join schema.b
    on a.id = b.id
    -- timestamp filtering happening in redshift
    where date_part_year(b.event_date) = 2021
)
-- only final join executed in trino
select *
from tmp_table
inner join hive.schema.table as t
on tmp_table.id = t.id
@grantatspothero grantatspothero changed the title Support pass-through queries for the jdbc based connector Support pass-through queries for the jdbc based connectors Sep 8, 2021
@hashhar hashhar added the enhancement New feature or request label Sep 9, 2021
@hashhar
Member

hashhar commented Sep 9, 2021

Since supporting pushdown of arbitrary operations is not wanted or feasible

This isn't true. There is work in progress in this direction - there is some ongoing work in #7994.

if you need to use a db-specific function (like datetime functions), the db functions are not exposed by Trino, so you have to use Trino functions, which requires pulling the entire table out of the source db instead of pushing down the operation.

Agreed about this. This will most likely always require pass-through, though, because functions don't have the same semantics across databases. E.g., if you do a COUNT(col) in Postgres you'd get the count of non-null values, but Pinot, for example, would return the total row count. This makes it very easy to run into silent-incorrectness cases.

An alternative we have been thinking about is to let connectors rewrite Trino function calls into equivalent expressions for the remote system. So, for example, COUNT(col) can be rewritten to SUM(CASE WHEN col IS NOT NULL THEN 1 ELSE 0 END) by a connector to preserve semantics. You can implement this for aggregation functions today, but not for arbitrary functions (that will become possible after #7994).
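As a concrete sketch of why such a rewrite is needed (the table and column names here are hypothetical):

```sql
-- Suppose t has three rows: col = 1, col = 2, col = NULL.

-- Postgres semantics: COUNT(col) skips NULLs and returns 2.
SELECT COUNT(col) FROM t;

-- A remote system whose COUNT counts every row would return 3
-- for the same query. A connector can preserve Postgres semantics
-- by pushing down the rewritten form, which also returns 2:
SELECT SUM(CASE WHEN col IS NOT NULL THEN 1 ELSE 0 END) FROM t;
```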

db functions are not exposed by trino so you have to use trino functions

This is not an easy problem to solve. The query parser needs to parse the query in order to analyse it, plan it, and optimize it. Since the grammar differs across DBs, it would require implementing a SQL translation layer, which is itself a very large project.


One of the biggest problems with pass-through queries is that they're a footgun if used incorrectly. As long as you're thinking about only a single target system it's OK: the semantics are consistent, and you'd get consistent (according to remote database semantics) output. But as soon as you have multiple such catalogs in play it becomes very complicated and easy to get silently incorrect results.

Due to this reason IMO the effort is better spent on the roadmap items in #18.

@grantatspothero
Contributor Author

This is not an easy problem to solve. The query parser needs to parse the query in order to analyse it, plan it, and optimize it. Since the grammar differs across DBs, it would require implementing a SQL translation layer, which is itself a very large project.

Totally understand why implementing a translation layer is a huge project, but why is the naive approach of passing the query through to the underlying system as an opaque string not feasible?

But as soon as you have multiple such catalogs in play it becomes very complicated and easy to get silently incorrect results.

Is that a bad thing? If you have two Postgres connectors, each running a different version of Postgres, and you send a pass-through query to each, the expectation is that you will get slightly different results depending on which Postgres db you are talking to.

Pass-through queries solve real problems that currently have no good workarounds. While I agree there is potential to misuse them, does the risk of misuse outweigh the benefits? IMO, no.

@electrum
Member

electrum commented Sep 9, 2021

Would it work to create a view in Redshift and select from it in Trino? The entire view query will be executed in Redshift.
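For illustration, the workaround looks like this (schema and view names are hypothetical):

```sql
-- In Redshift, define the view once:
CREATE VIEW reporting.events_2021 AS
SELECT a.id, b.event_date
FROM schema.a
INNER JOIN schema.b ON a.id = b.id
WHERE date_part(year, b.event_date) = 2021;

-- In Trino, select through the Redshift catalog; the whole
-- view body executes inside Redshift:
SELECT *
FROM redshift.reporting.events_2021;
```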

@grantatspothero
Contributor Author

@electrum yes that is a workaround (as well as materializing a view in redshift). It is what we do today.

But when using additional tooling on top of Trino, like dbt, having to model one-off queries as views gets bloated pretty quickly (see this thread for the motivation: https://trinodb.slack.com/archives/CFQAMGRQE/p1631029170025400?thread_ts=1630955221.022000&cid=CFQAMGRQE)

@findepi
Member

findepi commented Feb 25, 2022

the current plan is to address this with #1839, cc @kasiafi @martint

@kasiafi
Member

kasiafi commented May 20, 2022

@grantatspothero we now have support for polymorphic table functions (see https://trino.io/docs/current/release/release-381.html).
Query pass-through is a great use case for PTFs. We're currently working on query pass-through for the JDBC-based connectors: #12325
For the user, it will be as easy as calling a function in Trino with the native query text as the argument.
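Sketched against the earlier Redshift example, the table-function call would look roughly like this (the catalog name `redshift` and the exact argument name are illustrative, not confirmed by this thread):

```sql
-- The native query string is passed through verbatim and
-- executed entirely inside Redshift:
SELECT *
FROM TABLE(
    redshift.system.query(
        query => 'SELECT a.id, b.event_date
                  FROM schema.a
                  INNER JOIN schema.b ON a.id = b.id
                  WHERE date_part_year(b.event_date) = 2021'
    )
);
```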

@findepi findepi linked a pull request May 20, 2022 that will close this issue