Use Turbodbc/Arrow for read_sql_table #17790
This would be great! This could potentially be a very small wrapper around a Turbodbc -> Arrow -> pandas conversion, if each of those paths already exists? (Turbodbc doesn't have the same level of support as SQLAlchemy, though, so we'd need to keep SQLAlchemy an option.)
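For illustration, here is a minimal sketch of what such a wrapper could build on, using turbodbc's documented Arrow support (`fetchallarrow()`); the DSN and table name below are placeholders:

```python
import turbodbc

# Connect through an existing ODBC data source; "my_dsn" and "my_table"
# are placeholders for illustration only.
connection = turbodbc.connect(dsn="my_dsn")
cursor = connection.cursor()
cursor.execute("SELECT * FROM my_table")

# turbodbc can return the result set as a pyarrow.Table ...
arrow_table = cursor.fetchallarrow()

# ... and Arrow handles the conversion to pandas, including type mapping.
df = arrow_table.to_pandas()
```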
I actually just played with this last weekend, trying to get some idea of the potential speed-up (as I am very enthusiastic about the potential!).
So it certainly would not be that hard to add something like that (in a more expanded form) as an optional engine. The nice thing is that Arrow can take care of interpreting the data types (database types <-> pandas types), which is what we otherwise use SQLAlchemy for. There were some downsides I noticed during my playing around, though they are not necessarily reasons not to add this, as long as it is optional.
Side note: our current SQL I/O functions do use SQLAlchemy, and can indeed be quite slow, but as far as I know this is not because of SQLAlchemy itself (we only use SQLAlchemy core, not the ORM, which adds little overhead), but mainly because of slow database drivers (or config settings).
In case it's interesting, I would like to build a native sqlite3->Arrow reader, which should yield a pretty major speedup for the sqlite use case. Same for libpq. ODBC turns out to be a pretty heavy middleware layer -- vendors who have ODBC as the only way to push data in and out have optimized it a lot, whereas postgres/sqlite have not (a downside of OSS at times, I guess).
I never would have dared to open an issue like this ;-). I agree that the feature needs to be optional, seeing that ODBC is a hassle to set up initially and ODBC driver performance varies heavily from implementation to implementation. Enforcing turbodbc would be unacceptable.
@phofl is this one of the places where a pyarrow-related keyword is going to solve things? |
I guess, but SQLAlchemy is slow in general when reading data, so it should also be an improvement when converting to NumPy dtypes afterwards. Maybe more like the `engine` keyword in `read_csv` and `read_json`?
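For reference, the existing pattern being alluded to: `read_csv` already accepts an `engine` keyword, with `"pyarrow"` as one of its values. A small example, assuming pandas 2.x and a local `data.csv`:

```python
import pandas as pd

# read_csv already has an engine= keyword; the "pyarrow" engine parses the
# file with Arrow, and dtype_backend="pyarrow" keeps Arrow-backed dtypes.
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
```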
If this issue is strictly about Arrow (de-)serialization, then it should be covered by #53869. Of course, that PR is strictly about ADBC. If we still wanted Turbodbc, I think the same development patterns would apply.
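For context on the ADBC route mentioned above: pandas 2.2+ accepts ADBC connections in `read_sql`, keeping the data in Arrow end to end. A rough sketch, assuming the `adbc-driver-sqlite` package and an existing `example.db` containing a `my_table` table:

```python
import pandas as pd
from adbc_driver_sqlite import dbapi

# An ADBC connection can be passed straight to read_sql (pandas >= 2.2),
# avoiding SQLAlchemy entirely; the database path and table are placeholders.
with dbapi.connect("file:example.db") as conn:
    df = pd.read_sql("SELECT * FROM my_table", conn, dtype_backend="pyarrow")
```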
Currently I believe that the read_sql functions pull data from databases with SQLAlchemy, which can be somewhat slow. The Turbodbc library seems to provide an alternative solution that is still ODBC compatible and might operate at higher speed. We might consider adding an `engine=` keyword to the `read_sql` functions to allow them to use alternate libraries like this. In this case I would hope that we could pull from the database into Arrow memory and from there to Pandas more efficiently.

The documentation for turbodbc already shows how to do this using their API. There might be some value to integrating this into the Pandas API directly. From my perspective as a Dask developer I would like to use Turbodbc but would prefer that Pandas did the actual wrapping.
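To make the proposal concrete, usage might look something like the sketch below; note that neither an `engine=` keyword on `read_sql_table` nor a `"turbodbc"` value exists in pandas, so this is purely hypothetical:

```python
import pandas as pd

# Hypothetical API only: an engine= keyword on read_sql_table does not
# exist in pandas; this just illustrates the shape of the proposal.
df = pd.read_sql_table(
    "my_table",          # table name (placeholder)
    "dsn=my_dsn",        # ODBC connection info (placeholder)
    engine="turbodbc",   # would route through Turbodbc -> Arrow -> pandas
)
```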
I spoke with @jreback about this in person. @MathMagique @xhochy @wesm may also be interested. My apologies if this has already been discussed elsewhere (I was surprised that I couldn't find anything).
http://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#apache-arrow-support
https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/