SQLAlchemy in io.sql to manage different SQL dialects #2717
Comments
Quite interesting. Maybe take a look at issue #1662, which deals with SQL connection improvements. I would like to contribute to these pandas features. I'll try to write something about it this week.
I started work on this idea, very much work in progress; the branch is here: https://github.com/mangecoeur/pandas/tree/sqlalchemy-integration
I also ran across #191 -- apparently this idea has been broached before. Any progress?
I've committed a couple more changes, notably started work on an autoload_frame which uses sqlalchemy to figure out the contents of a table and turn it into a dataframe. I still need to figure out how to handle type conversions, as well as write some tests.
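As a rough sketch of that idea (the function name autoload_frame comes from the comment above; the reflection and select calls below use current SQLAlchemy APIs and are an illustration, not the code from that branch):

```python
import pandas as pd
import sqlalchemy as sa

def autoload_frame(table_name, engine):
    # Reflect the table definition from the database so no per-dialect
    # schema handling is needed (illustrative sketch only).
    table = sa.Table(table_name, sa.MetaData(), autoload_with=engine)

    with engine.connect() as conn:
        result = conn.execute(sa.select(table))
        rows = result.fetchall()

    # Column names come from the reflected table; converting the values
    # to appropriate dtypes is left to pandas here.
    return pd.DataFrame(rows, columns=[c.name for c in table.columns])
```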
We've built parts of an ETL tool on top of SQLAlchemy. When an extract is pointed at a database flavor that doesn't support bulk copy (read: Oracle), we simply use `for row in table.select()`. We've decided to move away from this, because of the overhead SQLAlchemy introduces. Plan on spending 2-4x the CPU cycles on top of your database driver to load the same number of rows. I landed on this thread as part of my hopes that pandas could do better. :) In any case, unless this feature is always intended to load fairly small tables into dataframes, I'd recommend against going the route of SQLAlchemy as part of a library that is, in most other aspects, quite fast. SQLAlchemy's power is really its OO query builder and ORM framework. Too much cruft for something like this.
@derrley I've experienced this with SQLAlchemy too.
@derrley I have difficulty seeing how you would provide compatibility for all the DBs that SQLAlchemy supports without introducing the same amount of overhead, and adding the burden of maintaining the compatibility layer. Perhaps a better strategy would be to work with the SQLAlchemy guys to see how to optimize the kind of operations that Pandas needs to be fast.
@derrley the row fetching overhead of SQLAlchemy's core ResultProxy/RowProxy is nothing like 2x-4x the CPU cycles of plain DBAPI, unless you have integrated type-processing functions like in-Python unicode conversion or something like that. Within row fetching, most of what's more than negligible is ported to C functions. There may be specific aspects of your experience that were slowing it down - do you have any benchmarks illustrating your results?
@derrley here is an actual test against MySQL, comparing the SQLAlchemy Core result proxy with C extensions installed to the raw MySQLdb driver. To execute a query with 50K rows, fetch all the rows, and fetch a single column from each row takes 44 calls / .032 sec on MySQLdb raw and 82 calls / .057 sec with SQLA Core. So sure, SQLA introduces overhead, but it is not very much - by the time you implement your own logic on top of the raw MySQLdb cursor, you'd be pretty much at the same place or worse: https://gist.github.com/zzzeek/8346896
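The linked gist is the actual benchmark; a rough sketch of how such a comparison can be profiled looks like this (connection details, table, and column names are placeholders):

```python
import cProfile

import MySQLdb
from sqlalchemy import create_engine, text

def raw_mysqldb():
    # Plain DBAPI path: execute, fetch every row, read one column per row.
    conn = MySQLdb.connect(user="scott", passwd="tiger", db="test")
    cursor = conn.cursor()
    cursor.execute("SELECT id, data FROM datatable")
    for row in cursor.fetchall():
        row[0]
    conn.close()

def sqla_core():
    # SQLAlchemy Core path over the same driver and query.
    engine = create_engine("mysql+mysqldb://scott:tiger@localhost/test")
    with engine.connect() as conn:
        for row in conn.execute(text("SELECT id, data FROM datatable")).fetchall():
            row[0]

cProfile.run("raw_mysqldb()")
cProfile.run("sqla_core()")
```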
@derrley also as far as Oracle, the SQLAlchemy cx_oracle dialect goes through (documented) effort in order to fix some issues with the driver, most notably being able to return numerics with full precision, rather than receiving floating points. There is overhead to this process, which is detailed here: http://docs.sqlalchemy.org/en/rel_0_9/dialects/oracle.html#precision-numerics . If this process is specific to the performance issues you've been seeing, this feature can be turned off by specifying
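Presumably the option referred to is the coerce_to_decimal engine flag described in the linked precision-numerics docs (treated here as an assumption, since the name itself is cut off above); a sketch of disabling it, with a placeholder connection URL:

```python
from sqlalchemy import create_engine

# Turn off cx_Oracle's precision-numeric handling at the dialect level
# (flag name per the linked docs; URL is a placeholder).
engine = create_engine(
    "oracle+cx_oracle://scott:tiger@localhost/xe",
    coerce_to_decimal=False,
)
```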
Appreciate the suggestions. Just tried both the coerce trick and the cdecimal trick, and neither prevents talking directly to cx_Oracle from being 3-4x faster, depending on the table. :/
@derrley if you can provide self-contained test scripts with sample tables/data I can isolate the cause of a 400% slowdown.
let me run my above script against an Oracle database here first just to make sure nothing funny is going on...
nope, nothing unusual, script + output is at https://gist.github.com/zzzeek/8479592

SQLAlchemy Core: 100058 function calls in 0.302 CPU seconds

so that's around 1.2 times slower. Feel free to show me your code and also make sure you're running the C extensions.
ah, let's try again - SQLA's output type handler leaked into that, one moment
OK, so in both cases it's the coercion to unicode adding the majority of overhead. https://gist.github.com/zzzeek/8479592 is now updated to run both tests without any coercion - in the SQLAlchemy case we are using an event to "undo" the cursor.outputtypehandler used to coerce to unicode. I will look today into current cx_oracle releases to see if cx_oracle has decided to coerce to unicode for us yet (this is required of it in Python 3), and if so I will add version detection for this feature; otherwise, I will add a flag to turn it off with a documentation note.

With unicode coercion turned off, we again have similar results:

SQLA core: 56 function calls in 0.113 CPU seconds

This is again about 1.3 times slower. Feel free to apply this event to your application:
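The snippet itself did not survive the copy; a sketch of the kind of connect event being described, assuming a connection-level outputtypehandler (not necessarily the exact code from the gist):

```python
from sqlalchemy import create_engine, event

engine = create_engine("oracle+cx_oracle://scott:tiger@localhost/xe")

@event.listens_for(engine, "connect")
def disable_output_type_handler(dbapi_connection, connection_record):
    # Clear the outputtypehandler installed by the cx_Oracle dialect so
    # that no numeric/unicode coercion happens on fetched values.
    dbapi_connection.outputtypehandler = None
```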
that will disable all numeric/unicode type conversion within the cx_oracle driver.
I hope it's clear that when using SQLAlchemy, one needn't "plan on spending 2-4x the CPU cycles on top of your database driver to load the same number of rows." I've demonstrated that in the specific case of cx_oracle, we have converters in place to accommodate cx_oracle's default behavior of returning inaccurate decimal data and encoded bytestrings, as SQLAlchemy prefers to return the correct result first versus the fastest. Normalizing behavior across DBAPIs is one of SQLAlchemy's primary features, and in the case of cx_oracle it requires us to do more work than for a driver like psycopg2. These converters can, however, be disabled, and I will add further documentation and potential features regarding being able to customize this.
It's sufficiently tangled up in our ETL tool (and the data I'm extracting is private). I can probably reproduce it with fixture data over a weekend some time. The SQLAlchemy interface is much nicer to use, so I'd love it if this didn't produce the slowdown (or if I was discovered to be a moron).

ubuntu@test-slave-jenkins-i-25e5480b:~$ python
select * from product_component_version yields

The "fast" hack is:
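The actual snippet is missing from the comment; a plausible shape for the "fast" path, going straight to the cx_Oracle driver (table name, credentials, and DSN are placeholders, not the original code):

```python
import cx_Oracle

# Raw DBAPI fetch: no per-row wrapping or type coercion beyond what the
# driver itself does.
conn = cx_Oracle.connect("scott", "tiger", "localhost/xe")
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table")
rows = cursor.fetchall()
cursor.close()
conn.close()
```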
The original SQLA code was:
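That code is also missing; based on the earlier description of iterating with for row in table.select(), it was presumably something along these lines (illustrative, with placeholder names):

```python
import sqlalchemy as sa

engine = sa.create_engine("oracle+cx_oracle://scott:tiger@localhost/xe")
table = sa.Table("some_table", sa.MetaData(), autoload_with=engine)

with engine.connect() as conn:
    # Iterate the result through SQLAlchemy Core, one row object per row.
    for row in conn.execute(table.select()):
        pass
```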
for your code above, use … Also, if the overhead issue is on the result-fetching side, you should stick with connection.execute() - then use result.cursor to get at the raw DBAPI cursor.
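A sketch of that suggestion, using documented SQLAlchemy calls (URL and query are placeholders): execute the statement through SQLAlchemy as usual, then fetch from the underlying DBAPI cursor.

```python
from sqlalchemy import create_engine, text

engine = create_engine("oracle+cx_oracle://scott:tiger@localhost/xe")

with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM some_table"))
    # result.cursor exposes the underlying cx_Oracle cursor, so the rows
    # come back without SQLAlchemy's per-row processing.
    rows = result.cursor.fetchall()
```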
I've made a change to the Oracle dialect in http://www.sqlalchemy.org/trac/ticket/2911 such that we no longer use cx_oracle's "outputtypehandler" to coerce to unicode; SQLAlchemy's own converters have minimal overhead, while cx_Oracle's within Py2K seem to have full-blown Python function overhead (but oddly not when run under Py3K). So a result set with cx_oracle will, in 0.9.2, no longer have any string conversion overhead for plain strings, and minimal overhead for Python unicode. I've enhanced the C extensions to better provide for DBAPIs like cx_Oracle that sometimes return unicode and sometimes str.
Closing this, as SQLAlchemy integration in io.sql is now merged: #5950
Currently, read_frame and write_frame in sql are specific to sqlite/mysql dialects (see #4163).
Rather than adding all possible dialects to pandas, another option is to detect whether sqlalchemy is installed and prefer to use its DB support.
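A minimal sketch of that detection approach (the dispatch shown here is illustrative, not the eventual pandas implementation):

```python
import pandas as pd

try:
    import sqlalchemy
    HAVE_SQLALCHEMY = True
except ImportError:
    HAVE_SQLALCHEMY = False

def read_frame(sql, con):
    """Read a SQL query into a DataFrame, preferring SQLAlchemy when present."""
    if HAVE_SQLALCHEMY and isinstance(con, sqlalchemy.engine.Engine):
        # SQLAlchemy path: works for any dialect SQLAlchemy supports.
        with con.connect() as conn:
            result = conn.execute(sqlalchemy.text(sql))
            return pd.DataFrame(result.fetchall(), columns=list(result.keys()))
    # Fallback: assume a plain DBAPI connection, as the existing
    # sqlite/mysql-specific code does.
    cursor = con.cursor()
    cursor.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)
```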