
Poor performance when using pandas.read_sql #222

Closed
rpkilby opened this issue Apr 2, 2021 · 3 comments · Fixed by #223

Comments


rpkilby commented Apr 2, 2021

Hi @laughingman7743, I've noticed that pyathena+pandas.read_sql can exhibit poor performance when the query statement exceeds 255 characters in length, and that this can be fixed by instead calling pandas.read_sql_query directly.

The issue is that while read_sql is just a wrapper around read_sql_query and read_sql_table, it decides which method to call via a has_table check on the query statement, which ultimately calls get_columns. It's this inspection call that is problematic: in our case it adds a consistent ~1 minute delay to every query. There are a few compounding factors:
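To make the dispatch concrete, here is a minimal sketch (not pandas' actual source, just the shape of its behavior as described above): read_sql first probes the database to see whether the string names an existing table, and only falls through to read_sql_query after that probe completes.

```python
# Minimal sketch of pandas.read_sql's dispatch, as described in this issue.
# `has_table` stands in for the metadata probe (the get_columns-style
# inspection query that is expensive against a remote Athena database).

def read_sql_sketch(sql, has_table):
    """Return which underlying pandas function would handle `sql`."""
    if has_table(sql):           # the costly round trip happens here
        return ("read_sql_table", sql)
    return ("read_sql_query", sql)

# A long SELECT is not a table name, so the probe says no -- but only
# *after* the round trip (and any retries) have been paid for.
dispatch, _ = read_sql_sketch("SELECT a, b FROM t", has_table=lambda s: False)
```

Calling read_sql_query directly skips the probe entirely, which is why it avoids the delay.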

  • Our dev Athena db is running in a geographically distant region, so the base roundtrip time for each query is ~5 seconds.
  • This table inspection query takes ~5 seconds to fail (so total query time is 10-11 seconds).
  • Our query is over 255 characters (it selects many columns), which causes the inspection query to fail outright instead of returning an empty result set. Specifically, a constraint enforcing the 255-character maximum is what fails here.
  • Because the inspection query errors out, it triggers pyathena's built-in retry policy.
  • The retry policy defaults to 5 attempts, and at 10-11 s per failure this adds up to a roughly one-minute delay before pandas gets around to executing the original query.
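The factors above multiply together; a quick back-of-the-envelope check of the reported one-minute delay:

```python
# Each failed inspection attempt costs ~10-11 s (round trip + failure),
# and the default retry policy tries 5 times before giving up.
attempts = 5
seconds_per_attempt = 11  # upper end of the observed 10-11 s per failure
total_delay = attempts * seconds_per_attempt  # ~55 s before the real query runs
```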

I'm not really sure what the correct fix is here. read_sql probably shouldn't be doing the table check; that said, the poor performance is a product of our database setup combined with pyathena's retry policy.

My workaround is to just call read_sql_query directly, but as a relatively novice pandas user I wasn't aware of what was happening, since it's read_sql that is usually recommended in tutorials and sample code. Thanks.
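For reference, the workaround looks like this. The example below uses an in-memory sqlite3 connection purely as a stand-in so it is self-contained; with Athena you would pass a pyathena connection as `con` instead (the table and column names here are invented for illustration):

```python
import sqlite3
import pandas as pd

# Stand-in for a real connection; with pyathena you would use
# pyathena.connect(...) here instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics (name TEXT, value INTEGER)")
con.execute("INSERT INTO metrics VALUES ('latency_ms', 42)")

# read_sql_query never performs the has_table metadata probe,
# so there are no extra round trips before the query executes.
df = pd.read_sql_query("SELECT name, value FROM metrics", con)
```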

laughingman7743 added a commit that referenced this issue Apr 4, 2021
@laughingman7743 laughingman7743 linked a pull request Apr 4, 2021 that will close this issue
laughingman7743 (Owner)

@rpkilby Thanks!

@user-2608

I am encountering problems when using pandas.read_sql or pandas.read_sql_query; both fail with the error below.

error:
AttributeError: 'str' object has no attribute '_execute_on_connection'
The above exception was the direct cause of the following exception:
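One common cause of this particular AttributeError (an assumption here, since the full stack trace isn't shown) is passing a raw SQL string to a SQLAlchemy 2.x connection, which no longer accepts bare strings; wrapping the statement in sqlalchemy.text() avoids it. A minimal sketch with an in-memory SQLite engine standing in for the real database:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # in-memory stand-in for the real DB

with engine.connect() as conn:
    # text() instead of a bare str -- required by SQLAlchemy 2.x,
    # and also accepted by 1.4.
    result = conn.execute(text("SELECT 1 AS one"))
    row = result.fetchone()
```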

laughingman7743 (Owner)

@chauhan-26 Please open a new issue if possible; a more detailed stack trace would help pinpoint the cause.
