Trino-cli has no option to accept targetResultSize query parameter #22303

Closed
anilsomisetty opened this issue Jun 6, 2024 · 16 comments

@anilsomisetty
Contributor

anilsomisetty commented Jun 6, 2024

The query result GET request to the coordinator supports a targetResultSize query parameter, which controls how much data the coordinator can send to the client per GET request.
I see no way to configure it from the Trino client.

@anilsomisetty anilsomisetty self-assigned this Jun 6, 2024
@mosabua
Member

mosabua commented Jun 19, 2024

What is the use case you are trying to solve by exposing this parameter? The protocol is already conversational, so there is no real limit on the overall amount of data returned.

@anilsomisetty
Contributor Author

anilsomisetty commented Jun 19, 2024

Hi @mosabua

Currently, in the code referenced below, I see that the query GET request used to fetch the result set defines a targetResultSize query parameter with a default of 16MB, i.e. 16MB of the result set is sent from the coordinator to the client per GET request.

Another thing to note is that this parameter is never set, i.e. it is always null, because there is no way to set it from the client. However, I do see a condition that, if the parameter is set, takes the minimum of that value and the maximum target size of 128MB.

If this parameter could be set through the client, queries that fetch a large result set from the coordinator would need fewer GET requests, which would reduce wait time and let queries complete faster.

ExecuteStatementResource.java
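For illustration, the behavior described above amounts to roughly the following sketch; the constants mirror the 16MB/128MB values mentioned here, and the names are illustrative rather than the actual ExecuteStatementResource code:

```java
// A minimal sketch of the clamping described above; names are illustrative,
// not the actual Trino source.
final class TargetResultSizeExample
{
    static final long DEFAULT_BYTES = 16L * 1024 * 1024;  // 16MB default per GET request
    static final long MAX_BYTES = 128L * 1024 * 1024;     // 128MB cap on the coordinator

    private TargetResultSizeExample() {}

    // requestedBytes models the ?targetResultSize= query parameter. It is effectively
    // always null today because no client sends it, so the 16MB default always wins.
    static long resolve(Long requestedBytes)
    {
        if (requestedBytes == null) {
            return DEFAULT_BYTES;
        }
        // When the parameter is provided, it is capped at the maximum target size.
        return Math.min(requestedBytes, MAX_BYTES);
    }
}
```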

Please let me know your thoughts.

@mosabua
Member

mosabua commented Jun 19, 2024

This might be something we can do in our current work on improving client protocol performance with Project Swift.

#22271

Note that we don't know whether increasing the size actually improves performance. Did you test this, and can you report any findings?

Also .. ideally users would not have to configure such things just to get a faster connection.

@anilsomisetty
Contributor Author

What I observed in the code is that targetResultSize is defined as a query parameter but is never used or set by the Trino client while running a query, so I was curious why that was not done.

I ran a query that returns approximately 6GB of data (7.7 million rows) on a Trino cluster from DbVisualizer using the trino-jdbc client driver jar. These are my findings (see the sketch after the list for how the parameter was applied in the test):

  1. With targetResultSize at the default of 16MB, it took 1 hour to fetch the result.
  2. With targetResultSize at the maximum of 128MB, it took 4 minutes to fetch the result.
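For reference, the stock clients do not expose this parameter, so testing it means appending the query parameter by hand to the statement polling URI returned by the coordinator. A rough, hypothetical sketch of that (placeholder URI and plain java.net.http, not the actual JDBC driver change):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical illustration only: Trino clients do not let you set targetResultSize,
// so this sketch appends the query parameter by hand to a nextUri returned by the
// coordinator. The URI below is a placeholder, and "128MB" assumes the DataSize format.
public class TargetResultSizeProbe
{
    public static void main(String[] args) throws Exception
    {
        URI nextUri = URI.create("http://coordinator:8080/v1/statement/...");  // placeholder nextUri
        URI withTargetSize = URI.create(nextUri + "?targetResultSize=128MB");  // assumed parameter format

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(withTargetSize)
                .header("X-Trino-User", "test")  // user header expected by the coordinator
                .GET()
                .build();

        // Each poll returns one batch of rows plus a nextUri to follow for the next batch.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```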

@mosabua
Member

mosabua commented Jun 24, 2024

Wow .. very interesting finding. We should take this into account for Project Swift @wendigo @dain @electrum @martint

@wendigo
Contributor

wendigo commented Jun 24, 2024

@mosabua I can't agree with these numbers. I've benchmarked changing this value, and the improvement diminishes for values over 32MB. The buffering and compression happen on the coordinator, so setting this value high puts additional pressure on the coordinator. That doesn't scale well.

@anilsomisetty
Contributor Author

The Trino cluster where I ran the query has 6 nodes: 5 workers and 1 coordinator (which does not act as a worker). Each node has 1TB of memory and 128 CPUs.

@wendigo
Contributor

wendigo commented Jul 17, 2024

We don't want to expose this parameter to the clients. With the spooled client protocol extension (#22662) it won't be effective either way.

@wendigo wendigo closed this as completed Jul 17, 2024
@pichlerpa

Interesting. I just added this parameter to the Power BI connector for testing, and I did notice a significant performance change on my local machine, where 16MB was the fastest for loading 500k rows into Power BI.

TargetResultSize 1MB (default) --> time taken: 00:02:37

TargetResultSize 16MB --> time taken: 00:01:46

TargetResultSize 128MB --> time taken: 00:01:58

@wendigo
Contributor

wendigo commented Jan 6, 2025

@pichlerpa this puts additional pressure on the coordinator, since the last response is cached so that the client can retry the last call. That's why it's not a recommended approach.

@pichlerpa

@wendigo I understand, but I think it's great to at least have the option. In the case of Power BI, this can be very useful to speed up imports, which often happen just once per night or a few times during the day.

@wendigo
Contributor

wendigo commented Jan 6, 2025

@pichlerpa That's why we've added a spooled protocol that allows you to use segments of a bigger size. In our benchmarks the new JDBC driver is 2-4x faster.
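For anyone who wants to try this from JDBC, a sketch of opting into the spooled encoding is below. The `encoding` connection property and the `json+zstd` value are assumptions based on recent driver releases, and the coordinator must also have spooling enabled, so check the client documentation for your version:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Sketch of requesting spooled (segment-based) results from the Trino JDBC driver.
// The "encoding" property and the "json+zstd" value are assumptions from recent
// releases; the coordinator must be configured with the spooling protocol enabled.
public class SpooledJdbcExample
{
    public static void main(String[] args) throws Exception
    {
        Properties properties = new Properties();
        properties.setProperty("user", "example");
        properties.setProperty("encoding", "json+zstd"); // assumed property selecting spooled segments

        String url = "jdbc:trino://coordinator:8080/tpch/sf1"; // placeholder coordinator and catalog
        try (Connection connection = DriverManager.getConnection(url, properties);
                Statement statement = connection.createStatement();
                ResultSet resultSet = statement.executeQuery("SELECT count(*) FROM orders")) {
            while (resultSet.next()) {
                System.out.println(resultSet.getLong(1));
            }
        }
    }
}
```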

@pichlerpa

@wendigo OK, yes, it seems to work like Databricks' cloud fetch mechanism. But what if I don't have an object store available? Isn't this a relatively simple option to improve performance, as long as the coordinator isn't too busy at those times, so that OOM errors are avoided? I don't really see the problem: if you know the system and its behavior, why not? Isn't it better than endless round trips to collect query results?

@wendigo
Contributor

wendigo commented Jan 6, 2025

@pichlerpa we are considering adding a spooling manager that works with local worker storage.

@pichlerpa

@wendigo is there any documentation or reference available for how to implement the spooling protocol in a new connector/client not relying on JDBC? Thanks for your feedback BTW!

@wendigo
Contributor

wendigo commented Jan 7, 2025

@pichlerpa #22662
