Use Turbodbc/Arrow for read_sql_table #17790

Open
mrocklin opened this issue Oct 4, 2017 · 7 comments
Labels
Enhancement, IO SQL (to_sql, read_sql, read_sql_query)

Comments

@mrocklin
Contributor

mrocklin commented Oct 4, 2017

Currently I believe that the read_sql functions pull data from databases with sqlalchemy, which can be somewhat slow. The Turbodbc library seems to provide an alternate solution that is still ODBC compatible that might operate at higher speed. We might consider adding an engine= keyword to the read_sql functions to allow them to use alternate libraries like this. In this case I would hope that we could pull from the database into Arrow memory and from there to Pandas more efficiently.
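
A hypothetical sketch of that interface (the engine= keyword, its "turbodbc" value, the table name, and the connection string are all illustrative; nothing like this exists in pandas today):

import pandas as pd

# The engine keyword is the hypothetical part of this sketch; the table
# name and connection string are placeholders.
con = "postgresql://localhost/testdb"
df = pd.read_sql_table("trades", con, engine="turbodbc")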

The documentation for turbodbc already shows how to do this using their API. There might be some value to integrating this into the Pandas API directly. From my perspective as a Dask developer I would like to use Turbodbc but would prefer that Pandas did the actual wrapping.

I spoke with @jreback about this in person. @MathMagique @xhochy @wesm may also be interested. My apologies if this has already been discussed elsewhere (I was surprised that I couldn't find anything).

http://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#apache-arrow-support
https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/

@TomAugspurger TomAugspurger added the IO SQL to_sql, read_sql, read_sql_query label Oct 4, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Oct 4, 2017
@max-sixty
Contributor

This would be great! It could potentially be a very small wrapper around a Turbodbc -> Arrow -> pandas pipeline, if each of those paths already exists?

(That said, Turbodbc doesn't have the same level of support as SQLAlchemy, so we'd need to keep SQLAlchemy as an option.)

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 4, 2017

I actually just played with this last weekend, trying to get some idea of potential speed-up (as I am very enthusiastic about the potential!).
A very basic implementation is just:

def read_sql_turbo(query, conn):
    # conn is an open turbodbc connection
    cursor = conn.cursor()
    cursor.execute(query)
    # fetchallarrow() returns the full result set as a pyarrow Table,
    # which to_pandas() then converts into a DataFrame
    res = cursor.fetchallarrow().to_pandas()
    cursor.close()
    return res

So it certainly would not be hard to add something (more fleshed out) like that as an optional engine. The nice thing is that Arrow can take care of interpreting the data types (database types <-> pandas types), which is what we otherwise use sqlalchemy for.
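
For reference, exercising that helper could look like this (assuming turbodbc is installed and an ODBC data source named "my_dsn" is configured; both the DSN and the table are placeholders):

import turbodbc

# Open a connection through an ODBC data source
conn = turbodbc.connect(dsn="my_dsn")
df = read_sql_turbo("SELECT * FROM trades", conn)
conn.close()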

The only downsides I noticed during my playing around (which are not necessarily reasons not to add this, as long as it is optional):

Side note: our current SQL I/O functions are indeed using sqlalchemy, and can indeed be quite slow, but as far as I know this is not due to using sqlalchemy itself (we only use sqlalchemy core, not the ORM, which adds little overhead), but mainly due to slow actual database drivers (or config settings).

@wesm
Member

wesm commented Oct 5, 2017

In case it's interesting, I would like to build a native sqlite3->arrow reader, which should yield a pretty major speedup for the sqlite use case. Same for libpq. ODBC turns out to be pretty heavy middleware: vendors who have ODBC as the only way to push data in and out have optimized things a lot, whereas postgres/sqlite have not (a downside of OSS at times, I guess).

@MathMagique

I never would have dared to open an issue like this ;-). I agree that the feature needs to be optional, seeing that ODBC is a hassle to set up initially and ODBC driver performance varies heavily from implementation to implementation. Enforcing turbodbc would be unacceptable.

@jbrockmendel
Member

@phofl is this one of the places where a pyarrow-related keyword is going to solve things?

@phofl
Member

phofl commented Feb 12, 2023

I guess, but SQLAlchemy is slow in general when reading data, so this should also be an improvement even when converting to NumPy dtypes afterwards. Maybe more like engine= in read_csv and read_json?
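
For comparison, this is what that keyword already looks like in read_csv (engine="pyarrow" is an existing option there; the file path is a placeholder):

import pandas as pd

# The pyarrow engine parses the file with Apache Arrow instead of the
# default C parser
df = pd.read_csv("data.csv", engine="pyarrow")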

@WillAyd
Member

WillAyd commented Nov 6, 2023

If this issue is strictly about Arrow (de-)serialization, then it should be covered by #53869.

Of course, that PR is strictly about ADBC. If we still wanted Turbodbc, I think the same development patterns would apply.
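
For context, the ADBC path from that PR looks roughly like this (a sketch assuming a pandas version with ADBC support and the adbc-driver-sqlite package; exact requirements may differ):

import pandas as pd
from adbc_driver_sqlite import dbapi

# read_sql can consume an ADBC DB-API connection directly, keeping the
# result set in Arrow on the way in; this uses an in-memory database
with dbapi.connect() as conn:
    df = pd.read_sql("SELECT 1 AS x", conn)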
