[BEAM-12044] JdbcIO read: always force autocommit to false #14338

Closed
wants to merge 1 commit into from

@turb turb commented Mar 25, 2021

According to PostgreSQL JDBC documentation, it is required to disable autocommit in order to allow cursor streaming.

In JdbcIO:1548 we have poolableConnectionFactory.setDefaultAutoCommit(false);, but that only applies to poolable data sources. Any other usage will eventually lead to an OOM, because the JDBC driver silently falls back to loading the entire result set into memory.

The commit disables autocommit more systematically.
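
For context, here is a minimal sketch in plain JDBC of what a cursor-friendly read looks like. The class and method names are hypothetical and this is not the JdbcIO code; it only reflects the conditions listed in the PostgreSQL JDBC documentation (autocommit off, forward-only result set, non-zero fetch size).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Illustrative sketch only; the names here are hypothetical and not part of JdbcIO. */
final class CursorStreamingSketch {
  private CursorStreamingSketch() {}

  /**
   * Prepares a query so that a driver such as PostgreSQL's can fetch rows in
   * batches through a cursor instead of loading the whole result set at once.
   */
  static PreparedStatement prepareStreamingQuery(
      Connection connection, String sql, int fetchSize) throws SQLException {
    // The PostgreSQL driver only uses a cursor when autocommit is disabled.
    connection.setAutoCommit(false);
    // Cursor-based fetching also needs a forward-only result set.
    PreparedStatement statement =
        connection.prepareStatement(
            sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    // A non-zero fetch size is a hint for how many rows to pull per round trip.
    statement.setFetchSize(fetchSize);
    return statement;
  }
}
```

If any of these conditions is not met, the driver is free to ignore the fetch size, which is the silent fallback described above.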

cc @lukecwik @kennknowles @aaltay

@aaltay aaltay requested a review from pabloem March 25, 2021 16:29
@@ -903,6 +903,8 @@ public void processElement(ProcessContext context) throws Exception {
if (connection == null) {
connection = dataSource.getConnection();
}
// PostgreSQL requires autocommit to be disabled to enable cursor streaming
Contributor

drive-by: could you include the Postgres documentation link in this comment?

Contributor Author

Do you mean the comment in the source code or the comment of the commit?

Contributor

In the source code. If someone wonders why this is here and why Postgres requires this (and whether other DBs might also require it for the same reason), it's better if they can see it right there in the code rather than have to dig into Git history.

I would perhaps propose the following phrasing:

Some databases require autocommit to be disabled to enable cursor streaming.
E.g. PostgreSQL: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor

Contributor Author

Done meanwhile ;)

Contributor

@aromanenko-dev aromanenko-dev Mar 25, 2021

I'd suggest setting it to false only if it was not explicitly set to true in the connection properties configuration. Otherwise, it won't be possible to override it, which can be confusing for users (since we already provide a way to do that with DataSourceConfiguration).

Contributor Author

@turb turb Mar 25, 2021

So we would store the config parameter passed in withDataSourceConfiguration? What would be the path to retrieve this parameter from it?

Contributor

Well, I guess it won't be so easy to check without additional work (please, correct me if I'm wrong).

My main concern is the following: if we provide a way to configure the IO, we shouldn't silently override it without taking the user's config options into account when they were set explicitly.
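
To make the idea concrete, here is a rough sketch of "explicit user setting wins, otherwise default to false". It assumes the user's choice is available as a java.util.Properties entry; the property key and class name are hypothetical, and how JdbcIO would actually expose the value is exactly the open question in this thread.

```java
import java.util.Properties;

/** Illustrative sketch only; the key and class name are hypothetical. */
final class AutoCommitPolicySketch {
  /** Hypothetical property key; not a documented JdbcIO or driver option. */
  static final String AUTO_COMMIT_KEY = "autoCommit";

  private AutoCommitPolicySketch() {}

  /**
   * Returns the autocommit value to apply to a read connection: the user's
   * explicit choice if one was given, otherwise false so results can be streamed.
   */
  static boolean resolveAutoCommit(Properties connectionProperties) {
    String explicit = connectionProperties.getProperty(AUTO_COMMIT_KEY);
    if (explicit == null) {
      // Nothing was configured, so fall back to the streaming-friendly default.
      return false;
    }
    return Boolean.parseBoolean(explicit);
  }
}
```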

Contributor Author

If I can get an idea of how the parameter should be specified, it will be additional work, yes, but I can spend some reasonable time on it.

However, I am inclined to invoke the YAGNI principle: once this parameter is actually needed (and I can't see why someone would need autocommit while reading), the way to parameterize it and take it into account in JdbcIO will become clearer.

Contributor

Ok, since we agree in general, let's keep this behaviour, but please add a log message saying that autoCommit was forced to false and, as I mentioned before, create the PR from a feature branch. Thanks!
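
Something along these lines would do, as a sketch. It uses java.util.logging only to stay self-contained (the real code would use whatever logger the class already has), and the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.logging.Logger;

/** Illustrative sketch only; the names here are hypothetical and not part of JdbcIO. */
final class ForceAutoCommitOffSketch {
  private static final Logger LOG =
      Logger.getLogger(ForceAutoCommitOffSketch.class.getName());

  private ForceAutoCommitOffSketch() {}

  /**
   * Disables autocommit if it is currently enabled and logs that it did so.
   *
   * @return true if the connection's setting was changed
   */
  static boolean forceAutoCommitOff(Connection connection) throws SQLException {
    if (!connection.getAutoCommit()) {
      // Already disabled, so there is nothing to override or report.
      return false;
    }
    LOG.info(
        "Autocommit was enabled on the read connection; forcing it to false "
            + "so the driver can stream results with a cursor.");
    connection.setAutoCommit(false);
    return true;
  }
}
```

Logging only when the value actually changes keeps the message from appearing for pooled connections that already default to autocommit off.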

Contributor

jkff commented Mar 25, 2021

Actually this change might be mildly problematic for some users:

Imagine a user who was loading a moderately large dataset from Postgres, but the dataset was fitting in memory.
Now with this change the dataset will be streamed incrementally in batches of fetchSize, which will use less memory but more database roundtrips, so it might perform worse. It might be OK though, because the default fetch size is pretty large (50K). The problem will happen only if the user was setting a small fetch size, which previously had no effect. In that case, they'll need to set a larger fetchSize.

Now that I think of it, this probably is not a problem, but I thought it was worth mentioning.
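
A back-of-the-envelope sketch of the roundtrip count, for anyone who wants numbers. The helper is hypothetical and only counts the batches needed to carry the rows; a real driver may issue an extra fetch to detect the end of the result set.

```java
/** Illustrative sketch only; a rough estimate, not a model of any specific driver. */
final class FetchSizeRoundTripsSketch {
  private FetchSizeRoundTripsSketch() {}

  /** Returns the number of batches needed to transfer rowCount rows, rounded up. */
  static long roundTrips(long rowCount, int fetchSize) {
    if (rowCount < 0 || fetchSize <= 0) {
      throw new IllegalArgumentException("rowCount must be >= 0 and fetchSize > 0");
    }
    // Ceiling division without floating point.
    return (rowCount + fetchSize - 1) / fetchSize;
  }
}
```

For a table of 1,000,000 rows, roundTrips(1_000_000, 50_000) gives 20, while roundTrips(1_000_000, 100) gives 10,000, which is the case where a previously ignored small fetch size would start to matter.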

Contributor

@aromanenko-dev aromanenko-dev left a comment

@turb Could you create a PR from a feature branch? Right now it's based on your master.

Contributor Author

turb commented Mar 25, 2021

@turb Could you create a PR from a feature branch? Right now it's based on your master.

Sure, as soon as we reach a conclusion in the discussion above, I'll create it.

Contributor Author

turb commented Mar 26, 2021

Superseded by #14349

@turb turb closed this Mar 26, 2021
3 participants