
Poor performance when using pandas.read_sql #222

Closed
rpkilby opened this issue Apr 2, 2021 · 3 comments · Fixed by #223

Comments


rpkilby commented Apr 2, 2021

Hi @laughingman7743, I've noticed that pyathena+pandas.read_sql can exhibit poor performance when the query statement exceeds 255 characters in length, and that this can be fixed by instead calling pandas.read_sql_query directly.

The issue is that while read_sql is just a wrapper around read_sql_query and read_sql_table, it decides which method to call via a has_table check on the query statement, which ultimately calls get_columns. It's this inspection call that is problematic: in our case it adds a consistent ~1 minute delay to every query. There are a few compounding factors:
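To make the dispatch concrete, here is a minimal sketch (not pandas' actual source, just the shape of its behavior as described above): read_sql first probes the database to see whether the string names an existing table, and only falls through to read_sql_query after that probe completes.

```python
# Minimal sketch of pandas.read_sql's dispatch, as described in this issue.
# `has_table` stands in for the metadata probe (the get_columns-style
# inspection query that is expensive against a remote Athena database).

def read_sql_sketch(sql, has_table):
    """Return which underlying pandas function would handle `sql`."""
    if has_table(sql):           # the costly round trip happens here
        return ("read_sql_table", sql)
    return ("read_sql_query", sql)

# A long SELECT is not a table name, so the probe says no -- but only
# *after* the round trip (and any retries) have been paid for.
dispatch, _ = read_sql_sketch("SELECT a, b FROM t", has_table=lambda s: False)
```

Calling read_sql_query directly skips the probe entirely, which is why it avoids the delay.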

  • Our dev Athena db is running in a geographically distant region, so the base roundtrip time for each query is ~5 seconds.
  • This table inspection query takes ~5 seconds to fail (so total query time is 10-11 seconds).
  • Our query is over 255 characters (it selects many columns), which causes the inspection query to fail outright instead of returning an empty result set. Specifically, a constraint enforcing the 255-character maximum is what fails here.
  • Because the inspection query errors out, it triggers pyathena's built-in retry policy.
  • The retry policy defaults to 5 attempts, and at 10-11 s per failure this adds up to a roughly one-minute delay before pandas gets around to executing the original query.
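The factors above multiply together; a quick back-of-the-envelope check of the reported one-minute delay:

```python
# Each failed inspection attempt costs ~10-11 s (round trip + failure),
# and the default retry policy tries 5 times before giving up.
attempts = 5
seconds_per_attempt = 11  # upper end of the observed 10-11 s per failure
total_delay = attempts * seconds_per_attempt  # ~55 s before the real query runs
```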

I'm not really sure what the correct fix is here. read_sql probably shouldn't be doing the table check; that said, the poor performance is a product of our database setup combined with pyathena's retry policy.

My workaround is to just call read_sql_query directly, but as a relatively novice pandas user I wasn't aware of what was happening, since it's read_sql that is usually recommended in tutorials and sample code. Thanks.
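For reference, the workaround looks like this. The example below uses an in-memory sqlite3 connection purely as a stand-in so it is self-contained; with Athena you would pass a pyathena connection as `con` instead (the table and column names here are invented for illustration):

```python
import sqlite3
import pandas as pd

# Stand-in for a real connection; with pyathena you would use
# pyathena.connect(...) here instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics (name TEXT, value INTEGER)")
con.execute("INSERT INTO metrics VALUES ('latency_ms', 42)")

# read_sql_query never performs the has_table metadata probe,
# so there are no extra round trips before the query executes.
df = pd.read_sql_query("SELECT name, value FROM metrics", con)
```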

laughingman7743 added a commit that referenced this issue Apr 4, 2021
@laughingman7743 laughingman7743 linked a pull request Apr 4, 2021 that will close this issue
laughingman7743 (Owner)

@rpkilby Thanks!

@user-2608

I am encountering problems when using pandas.read_sql or pandas.read_sql_query; both fail with the error below.

error:
AttributeError: 'str' object has no attribute '_execute_on_connection'
The above exception was the direct cause of the following exception:
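One common cause of this particular AttributeError (an assumption here, since the full stack trace isn't shown) is passing a raw SQL string to a SQLAlchemy 2.x connection, which no longer accepts bare strings; wrapping the statement in sqlalchemy.text() avoids it. A minimal sketch with an in-memory SQLite engine standing in for the real database:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # in-memory stand-in for the real DB

with engine.connect() as conn:
    # text() instead of a bare str -- required by SQLAlchemy 2.x,
    # and also accepted by 1.4.
    result = conn.execute(text("SELECT 1 AS one"))
    row = result.fetchone()
```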

laughingman7743 (Owner)

@chauhan-26 Please open a new issue if possible; a more detailed stack trace would help pinpoint the cause.
