[BEAM-12044] JdbcIO read: always force autocommit to false #14338
@@ -903,6 +903,8 @@ public void processElement(ProcessContext context) throws Exception {
if (connection == null) {
connection = dataSource.getConnection();
}
// PostgreSQL requires autocommit to be disabled to enable cursor streaming
drive-by: could you include the Postgres documentation link in this comment?
Do you mean the comment in the source code or the comment of the commit?
In the source code. If someone wonders why this is here and why Postgres requires this (and whether other DBs might also require it for the same reason), it's better if they can see it right there in the code rather than have to dig into Git history.
I would perhaps propose the following phrasing:
Some databases require autocommit to be disabled to enable cursor streaming.
E.g. PostgreSQL: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
Done meanwhile ;)
I'd suggest setting it to false only if it was not explicitly set to true by the connection properties configuration. Otherwise it won't be possible to override, and that can be confusing for users (since we already provide a way to do that with DataSourceConfiguration).
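One possible shape for this suggestion, as a minimal sketch only: derive the effective autocommit value from the user's connection properties and fall back to false when nothing was set. The `autoCommit` property key and the `AutoCommitPolicy` class are hypothetical names used for illustration; they are not part of JdbcIO or of any JDBC driver, and the thread below notes that retrieving such a setting from the existing configuration may not be straightforward.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of JdbcIO. */
final class AutoCommitPolicy {
  // Assumed property key, chosen for this sketch only.
  static final String AUTO_COMMIT_KEY = "autoCommit";

  private AutoCommitPolicy() {}

  /**
   * Returns true only when the user explicitly set the (assumed) property to "true".
   * An unset property falls back to false, which is what cursor streaming needs.
   */
  static boolean effectiveAutoCommit(Properties userProperties) {
    String explicit = userProperties.getProperty(AUTO_COMMIT_KEY);
    if (explicit == null) {
      return false; // nothing configured: keep the streaming-friendly default
    }
    return Boolean.parseBoolean(explicit.trim());
  }
}
```

With a policy like this, the read path would call `connection.setAutoCommit(...)` with the returned value instead of a hard-coded false.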
So we would store the config parameter passed in withDataSourceConfiguration? What would be the path to retrieve this parameter from it?
Well, I guess it won't be so easy to check without additional work (please correct me if I'm wrong).
My main concern is the following: if we provide a way to configure the IO, we shouldn't silently override it without taking into account the user's config options when they were set explicitly.
If I had an idea of how the parameter would be specified, it would be additional work, yes, but I could spend a reasonable amount of time on it.
However, I am inclined to invoke the YAGNI principle: once this parameter is needed (and I can't see why someone would need autocommit while reading), the way to parameterize it and to take it into account in JdbcIO will become clearer.
Ok, since we have an agreement in general, let's keep this behaviour, but please add a log message that autoCommit was forced to false and, as I mentioned before, create the PR from a feature branch. Thanks!
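A log message of the kind requested here could look like the following sketch. It assumes the PR's behaviour of forcing autocommit to false on the read connection. `AutoCommitEnforcer` and `forceAutoCommitOff` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained; this is not the code that was merged.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not the code merged in JdbcIO. */
final class AutoCommitEnforcer {
  private static final Logger LOG = Logger.getLogger(AutoCommitEnforcer.class.getName());

  private AutoCommitEnforcer() {}

  /**
   * Disables autocommit if it is currently enabled and logs that it did so.
   *
   * @return true if the setting was changed, false if autocommit was already off
   */
  static boolean forceAutoCommitOff(Connection connection) throws SQLException {
    if (!connection.getAutoCommit()) {
      return false; // already disabled, nothing to log
    }
    LOG.info("Autocommit was enabled on the read connection; forcing it to false "
        + "so that the driver can stream results with a cursor");
    connection.setAutoCommit(false);
    return true;
  }
}
```

Logging only when the value actually changes keeps the message meaningful for users whose data source already disables autocommit.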
Actually, this change might be mildly problematic for some users: imagine a user who was loading a moderately large dataset from Postgres that still fit in memory. Now that I think of it, this is probably not a problem, but I thought it worth mentioning.
@turb Could you create a PR from a feature branch? Right now it's based on your master.
Sure, as soon as we come to a conclusion in the discussion above, I'll create it.
Superseded by #14349
According to the PostgreSQL JDBC documentation, autocommit must be disabled in order to allow cursor streaming.
In JdbcIO:1548 we have poolableConnectionFactory.setDefaultAutoCommit(false); however, this only applies to the poolable data source. Any other usage will eventually lead to an OOM, since the JDBC driver silently falls back to loading the full query result into memory.
This commit disables autocommit more systematically.
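To make the requirement concrete, here is a minimal sketch of preparing a read so that the PostgreSQL driver can use a cursor: autocommit off, a forward-only result set, and a positive fetch size. `CursorStreamingSketch` and `prepareStreamingQuery` are hypothetical names, and the fetch size is an arbitrary caller-supplied value; this illustrates the driver-side conditions and is not the JdbcIO implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical helper for illustration; not the JdbcIO implementation. */
final class CursorStreamingSketch {
  private CursorStreamingSketch() {}

  /**
   * Prepares a query so that rows can be fetched in batches rather than all at once.
   * The caller owns the returned statement and is responsible for closing it.
   */
  static PreparedStatement prepareStreamingQuery(
      Connection connection, String query, int fetchSize) throws SQLException {
    // Without this, the PostgreSQL driver ignores the fetch size and loads every row.
    connection.setAutoCommit(false);
    PreparedStatement statement =
        connection.prepareStatement(
            query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    // A positive fetch size asks the driver to pull rows in batches of this size.
    statement.setFetchSize(fetchSize);
    return statement;
  }
}
```

Because the call to `setAutoCommit(false)` happens on the connection itself, it applies whether or not the connection came from a pooled data source, which is the gap this PR describes.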
cc @lukecwik @kennknowles @aaltay