Use Turbodbc/Arrow for read_sql_table #17790
This would be great! This could potentially be a very small wrapper around a Turbodbc -> Arrow -> pandas conversion, if each of those paths already exists? (Turbodbc doesn't have the same level of support as SQLAlchemy, though, so we'd need to keep SQLAlchemy an option.)
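For illustration, here is a minimal sketch of what such a wrapper could build on, using turbodbc's documented Arrow support (`fetchallarrow()`); the DSN and table name below are placeholders:

```python
import turbodbc

# Connect through an existing ODBC data source; "my_dsn" and "my_table"
# are placeholders for illustration only.
connection = turbodbc.connect(dsn="my_dsn")
cursor = connection.cursor()
cursor.execute("SELECT * FROM my_table")

# turbodbc can return the result set as a pyarrow.Table ...
arrow_table = cursor.fetchallarrow()

# ... and Arrow handles the conversion to pandas, including type mapping.
df = arrow_table.to_pandas()
```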
I actually just played with this last weekend, trying to get some idea of the potential speed-up (as I am very enthusiastic about the potential!).
So it certainly would not be that hard to add something like that (in a more expanded form) as an optional engine. The nice thing is that Arrow can take care of interpreting the data types (database types <-> pandas types), which is what we otherwise use SQLAlchemy for. There were some downsides I noticed during my playing around, though they are not necessarily reasons not to add this, as long as it is optional.
Side note: our current SQL I/O functions do use SQLAlchemy, and can indeed be quite slow, but as far as I know this is not because of SQLAlchemy itself (we only use SQLAlchemy core, not the ORM, which adds little overhead), but mainly because of slow database drivers (or config settings).
In case it's interesting, I would like to build a native sqlite3->Arrow reader, which should yield a pretty major speedup for the sqlite use case. Same for libpq. ODBC turns out to be a pretty heavy middleware layer -- vendors who have ODBC as the only way to push data in and out have optimized it a lot, whereas postgres/sqlite have not (a downside of OSS at times, I guess).
I never would have dared to open an issue like this ;-). I agree that the feature needs to be optional, seeing that ODBC is a hassle to set up initially and ODBC driver performance varies heavily from implementation to implementation. Enforcing turbodbc would be unacceptable.
@phofl is this one of the places where a pyarrow-related keyword is going to solve things? |
I guess, but SQLAlchemy is slow in general when reading data, so it should also be an improvement when converting to NumPy dtypes afterwards. Maybe more like the `engine` keyword in `read_csv` and `read_json`?
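For reference, the existing pattern being alluded to: `read_csv` already accepts an `engine` keyword, with `"pyarrow"` as one of its values. A small example, assuming pandas 2.x and a local `data.csv`:

```python
import pandas as pd

# read_csv already has an engine= keyword; the "pyarrow" engine parses the
# file with Arrow, and dtype_backend="pyarrow" keeps Arrow-backed dtypes.
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
```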
If this issue is strictly about Arrow (de-)serialization, then it should be covered by #53869. Of course, that PR is strictly about ADBC. If we still wanted Turbodbc, I think the same development patterns would apply.
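For context on the ADBC route mentioned above: pandas 2.2+ accepts ADBC connections in `read_sql`, keeping the data in Arrow end to end. A rough sketch, assuming the `adbc-driver-sqlite` package and an existing `example.db` containing a `my_table` table:

```python
import pandas as pd
from adbc_driver_sqlite import dbapi

# An ADBC connection can be passed straight to read_sql (pandas >= 2.2),
# avoiding SQLAlchemy entirely; the database path and table are placeholders.
with dbapi.connect("file:example.db") as conn:
    df = pd.read_sql("SELECT * FROM my_table", conn, dtype_backend="pyarrow")
```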
Currently I believe that the read_sql functions pull data from databases with SQLAlchemy, which can be somewhat slow. The Turbodbc library seems to provide an alternative solution that is still ODBC compatible and might operate at higher speed. We might consider adding an `engine=` keyword to the `read_sql` functions to allow them to use alternate libraries like this. In this case I would hope that we could pull from the database into Arrow memory and from there to Pandas more efficiently.

The documentation for turbodbc already shows how to do this using their API. There might be some value to integrating this into the Pandas API directly. From my perspective as a Dask developer I would like to use Turbodbc but would prefer that Pandas did the actual wrapping.
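To make the proposal concrete, usage might look something like the sketch below; note that neither an `engine=` keyword on `read_sql_table` nor a `"turbodbc"` value exists in pandas, so this is purely hypothetical:

```python
import pandas as pd

# Hypothetical API only: an engine= keyword on read_sql_table does not
# exist in pandas; this just illustrates the shape of the proposal.
df = pd.read_sql_table(
    "my_table",          # table name (placeholder)
    "dsn=my_dsn",        # ODBC connection info (placeholder)
    engine="turbodbc",   # would route through Turbodbc -> Arrow -> pandas
)
```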
I spoke with @jreback about this in person. @MathMagique @xhochy @wesm may also be interested. My apologies if this has already been discussed elsewhere (I was surprised that I couldn't find anything).
http://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#apache-arrow-support
https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/