Always use the executemany method when inserting rows in DbApiHook as it's way much faster #38715
…as it's much faster than inserting each row separately and committing every once in a while in between
…onstructor of Hook as this is not a standard supported option by all ODBC drivers
…does the same by default, no need for a specialized method and thus delegate to insert_rows method
…rt_rows_with_commit_every
Also deprecated the bulk_insert_rows method in the Teradata hook, as it does almost the same as what I did in insert_rows; for now it logs a deprecation warning message and delegates the call to the insert_rows method.
…d and changed some rows values to string for the TestTeradataHook
Would love other reviews, but it looks good.
NIT: why change the types in the tests to string? Will that work for all common.sql users? NIT2: we should let users know via a deprecation warning if they are still passing executemany as a parameter.
Why not just always use |
To elaborate a bit: that's indeed what this pull request does, it always uses executemany. In the previous PR you could choose between the original implementation and the faster executemany one, so you had to pass the executemany parameter to get the faster implementation. That parameter is now obsolete, because executemany is always used. The TeradataHook also had an additional bulk_insert_rows method that used the faster executemany implementation; it now delegates to insert_rows, since both follow the same principle. The result is less code that is cleaner and easier to maintain.
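As an illustration of the delegation described above, here is a minimal sketch. It is not the actual TeradataHook code: `SketchHook`, its signatures, and the default `commit_every` values are hypothetical, and it uses `warnings.warn` where the PR says it logs a deprecation message.

```python
import warnings


class SketchHook:
    """Hypothetical stand-in for a DB hook; not the real TeradataHook."""

    def __init__(self):
        # Records calls so the delegation can be observed without a database.
        self.calls = []

    def insert_rows(self, table, rows, target_fields=None, commit_every=1000):
        # In the real hook this is where executemany would be used.
        self.calls.append((table, list(rows), target_fields, commit_every))

    def bulk_insert_rows(self, table, rows, target_fields=None, commit_every=5000):
        # Deprecated entry point: warn, then delegate to insert_rows.
        warnings.warn(
            "bulk_insert_rows is deprecated; use insert_rows instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.insert_rows(table, rows, target_fields, commit_every)
```

Existing callers of the old method keep working, while new code has a single code path to maintain.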
Co-authored-by: Tzu-ping Chung <[email protected]>
…tocommit_connection in DbApiHook
Already applied those changes.
Wondering why I have tests failing with this error message:
|
I re-ran it. Seems like an intermittent error caused by broken Docker on the GitHub runner. Happens.
|
Also wondering why these tests are failing, as they are unrelated to my changes (I think):
|
I'd say it's a side effect of another test that probably interacts with caplog in a bad way. Looking at the log entry, it was likely introduced by #38910. Just a watch-out, @dstandish: it seems the tenacity retry added there can have side effects, since it logs a warning on retries. That's pretty strange and I am not sure how it can leak into other tests, but it is likely due to the asyncio nature of the tests combined with xdist execution. I am afraid this one will be somewhat intermittent (it did not happen in the last run, it seems).
In my previous pull request I added the executemany parameter to the insert_rows method so you could choose which strategy to apply when inserting rows, as the executemany method is much faster than the original implementation. You can see this in the picture above, where we compared performance when inserting thousands of records in bulk: the penultimate run, in red, had already taken 13 minutes without finishing (so we killed it), while the last one, using the executemany strategy, completed in just a few minutes. So I decided to create this new pull request and always apply the faster executemany strategy. The operator using the hook had no way to change that property anyway, and nothing was foreseen to configure it in the connection either. Also, why keep both strategies if one is better than the other? Dropping the slower one makes the code less complex and easier to read.
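To illustrate the strategy described above, here is a minimal sketch of chunked inserts with executemany and a commit after each chunk. It uses the standard-library sqlite3 driver as a stand-in; the helper names `chunked` and `insert_rows`, the `?` placeholder style, and the table are assumptions for illustration and are not the DbApiHook implementation.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_rows(conn, table, rows, commit_every=1000):
    """Insert rows with one executemany call per chunk, committing after each chunk."""
    cur = conn.cursor()
    total = 0
    for chunk in chunked(rows, commit_every):
        # One placeholder per column; sqlite3 uses the "?" (qmark) style.
        placeholders = ", ".join("?" for _ in chunk[0])
        # The table name is trusted here; real code should validate it.
        cur.executemany(f"INSERT INTO {table} VALUES ({placeholders})", chunk)
        conn.commit()
        total += len(chunk)
    cur.close()
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
inserted = insert_rows(conn, "t", [(i, f"row-{i}") for i in range(2500)])
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

How much faster executemany is than per-row execute calls depends on the driver and the database, so the timings quoted in this PR should not be expected from sqlite3.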