Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tune JDBC fetch-size automatically based on column count #16644

Merged
merged 1 commit into from
Mar 22, 2023

Conversation

findepi
Copy link
Member

@findepi findepi commented Mar 21, 2023

PostgreSQL, Redshift and Oracle connectors had hard-coded fetch-size value of 1000. The value was found not to be optimal when server is far (high latency) or when number of columns selected is low. This commit improves in the latter case by picking fetch size automatically based on number of columns projected. After the change, the fetch size will be automatically picked in the range 1000 to 100,000.

Fixes #16153
Alternative to #16269

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea is nice however I don't know whether larger fetch sizes cause issues for wide tables, have you observed something?

I would've expected a different heurestic: the wider table or more rows we pull the higher the fetch count.

One concern I have with this is that this will impact memory estimation since the fetch size is no longer a constant value.

throws SQLException
{
PreparedStatement statement = connection.prepareStatement(sql);
statement.setFetchSize(1000);
// This is a heuristic, not exact science. A better formula can perhaps be found with measurements.
// Column count is not known for non-SELECT queries. Not setting fetch size for these.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not setting fetch size can mean a fetch size of 1 in some drivers. IMO in case we don't know column size we should default to older value of 1k.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this originally, but note that column count is not known only for queries like DELETE (not SELECT queries).
All the queries reading data know their projected column count.
Thus I had a choice: do the change defensively, as if I didn't know when column count may be missing. Or write the code "the way it would be written today".

@findepi
Copy link
Member Author

findepi commented Mar 21, 2023

the idea is nice however I don't know whether larger fetch sizes cause issues for wide tables, have you observed something?

I would've expected a different heurestic: the wider table or more rows we pull the higher the fetch count.

I would expect this can cause memory pressure issues.

Note that

  • i didn't run any measurements on my own. This is just a heuristic, just like the previous value of 1k was just a heuristic.
  • the change doesn't discern data types. Ideally we could make the choice based on the width of columns pulled from statistics.

One concern I have with this is that this will impact memory estimation since the fetch size is no longer a constant value.

in what context do we do memory estimation, taking into account rows prefetched by the JDBC driver?
in any case, the estimations are still possible, since there is a min and max value for the fetch size (1k and 100k respectively).

@hashhar
Copy link
Member

hashhar commented Mar 21, 2023

in what context do we do memory estimation

We don't do it today. Now that I re-read my comment actually even today the prefetch is not constant. For wider tables we prefetch more compared to narrow tables. So actually your change is probably better in this regard.

This looks like a good starting point. We can iterate over time if someone finds issues here. Do you think it'd be useful to have a killswitch for sometime?

@findepi
Copy link
Member Author

findepi commented Mar 21, 2023

This looks like a good starting point. We can iterate over time if someone finds issues here.

yes, that's the idea

Do you think it'd be useful to have a killswitch for sometime?

Sure, i can add one

@findepi findepi force-pushed the findepi/auto-fetch-size-v2 branch from ae7226a to 2f01b25 Compare March 21, 2023 13:40
@findepi
Copy link
Member Author

findepi commented Mar 21, 2023

CI #15430, #16652

@findepi findepi force-pushed the findepi/auto-fetch-size-v2 branch from 2f01b25 to 6cc400e Compare March 21, 2023 21:11
@findepi findepi requested a review from hashhar March 21, 2023 21:11
// Column count is not known for non-SELECT queries. Not setting fetch size for these.
else if (columnCount.isPresent()) {
statement.setFetchSize(min(100_000 / columnCount.get(), 1_000));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, we can avoid overriding the fetch size when the size is specified with defaultRowFetchSize connection property in addition to this change? It will allow users configure the value in their side. We can retrieve the value with PgConnection#getDefaultFetchSize.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like this is orthogonal, i.e. the existing code was setting the fetch size unconditionally, and this one sets fetch size unconditionally, just with a different value.

However, i don't think we should go into that direction at all.
if our intent is to let users configure the fetch size, we should have an explicit toggle rather than inspect fetch size provided in the JDBC URL string. Note however that giving users' control is good, but having our code be smarter is even better, and those things are at odds. i DO think we should desist desire to throw toggle at a problem.

@chenjian2664
Copy link
Contributor

I think the idea is good as we consider more factors while fetching the values.
But the question became how can we get the "correct" formula, eg, https://github.com/trinodb/trino/pull/16644/files/6cc400e7dd00946dc9270a9f680032843f7d1191#diff-e7f2966e1e2e070e8643b5563c7ba62f7dc0652f76fbf446ef389b75ef4883fbR423 why set 100_000, again is it allow control by user?
Why don't allow set by user like #16269 , and this is what already done while writing the values, see https://github.com/trinodb/trino/blob/master/plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/JdbcPageSink.java#L128

private ArrayMapping arrayMapping = ArrayMapping.DISABLED;
private boolean includeSystemTables;
private boolean enableStringPushdownWithCollate;

@Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this instead live in JdbcMetadataConfig?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it's applicable to only some connectors (postgresql, oracle, redshift)

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments from me and Yuya

@findepi
Copy link
Member Author

findepi commented Mar 22, 2023

Why don't allow set by user like #16269 ,

see #16269 (comment)

tl;dr: few users would benefit from a config toggle; many users may benefit from auto-adjusted value
i aim at the bigger benefit

But the question became how can we get the "correct" formula

@chenjian2664 i suppose in next iteration someone will do some experiments and propose a better formula.

@findepi findepi force-pushed the findepi/auto-fetch-size-v2 branch from 6cc400e to bebfcd3 Compare March 22, 2023 10:24
@findepi
Copy link
Member Author

findepi commented Mar 22, 2023

There is a conflict with #16379 (just merged)
also, the formula used min(..., 1000) but it was supposed to be max(..., 1000). Will change & rebase.

@findepi findepi force-pushed the findepi/auto-fetch-size-v2 branch from bebfcd3 to 5b3bcca Compare March 22, 2023 10:28
@findepi
Copy link
Member Author

findepi commented Mar 22, 2023

Conflicts with #16616 (just merged), will rebase.

PostgreSQL, Redshift and Oracle connectors had hard-coded fetch-size
value of 1000. The value was found not to be optimal when server is far
(high latency) or when number of columns selected is low. This commit
improves in the latter case by picking fetch size automatically based on
number of columns projected. After the change, the fetch size will be
automatically picked in the range 1000 to 100,000.
@findepi findepi force-pushed the findepi/auto-fetch-size-v2 branch from 5b3bcca to 039eeb9 Compare March 22, 2023 10:43
@findepi
Copy link
Member Author

findepi commented Mar 22, 2023

CI ignite #16671

@findepi
Copy link
Member Author

findepi commented Mar 22, 2023

CI #16652 (again)

@findepi findepi merged commit 9c4ef0e into master Mar 22, 2023
@findepi findepi deleted the findepi/auto-fetch-size-v2 branch March 22, 2023 16:49
@github-actions github-actions bot added this to the 411 milestone Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Change FetchSize from 1k to 10k
4 participants