
API: user api for new sql functionality #6300

Closed
jorisvandenbossche opened this issue Feb 7, 2014 · 47 comments
Labels
API Design IO SQL to_sql, read_sql, read_sql_query

@jorisvandenbossche
Member

Starting a new issue to discuss this. See merged PR #5950 and follow-up issue #6292.

Summary of situation:

Current (0.13):

  • top-level read_sql, read_frame, write_frame and DataFrame.to_sql
  • in sql module: uquery, tquery, has_table (+ read_sql)

New after #5950:

  • top-level read_sql and DataFrame.to_sql (but not yet used in the docs)
  • in sql module: read_sql, read_table, to_sql, has_table, execute
  • read_frame, write_frame, uquery, tquery are deprecated

Points to discuss

@jreback
Contributor

jreback commented Feb 7, 2014

I don't recall if we test deprecations in general (e.g. call the function in an assert_produces_warning wrapper)
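For illustration, a minimal sketch of such a test (assuming the deprecated functions emit a FutureWarning; uquery is one of the aliases deprecated in #5950):

import sqlite3
import pandas.io.sql as sql
import pandas.util.testing as tm

conn = sqlite3.connect(":memory:")

# calling the deprecated alias should raise the deprecation warning
with tm.assert_produces_warning(FutureWarning):
    sql.uquery("SELECT 1", conn)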

@jorisvandenbossche
Member Author

This is, I think, easy to agree on, to be consistent with the other io modules:

  • DataFrame.to_sql(...) instead of sql.to_sql(df, ...)
  • pd.read_sql instead of sql.read_sql
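In code, the consistent spelling would look like this (a minimal sketch; the table name is a placeholder):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"a": [1, 2, 3]})

# method on the frame, instead of sql.to_sql(df, ...)
df.to_sql("my_table", conn)

# top-level function, instead of pandas.io.sql.read_sql(...)
result = pd.read_sql("SELECT * FROM my_table", conn)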

But

  • what to do with read_table? (which, as the docs are currently laid out, is sort of the primary function: http://pandas-docs.github.io/pandas-docs-travis/io.html#io-sql)
    • rename and also put top-level, e.g. pd.read_sqltable
    • just keep as is in sql (but this feels odd because, as I said above, this is somewhat the main function of the sql module).
    • rename read_table to toplevel pd.read_sql (the main function), and the current read_sql to something like pd.read_sqlquery or sql.read_query or sql.query(), sql.get_query. But I assume this is not possible due to backwards compatibility.
    • merge read_sql and read_table to one function (but see comments here ENH: sql support via SQLAlchemy, with legacy fallback #5950 (comment))

@hayd @mangecoeur @jreback @danielballan @y-p Opinions?

@jreback
Contributor

jreback commented Feb 11, 2014

it seems that you can use read_table if you can reflect the table definition;
otherwise you have to use read_sql and then figure out types

well it seems that pandas should then try to use read_table if possible (e.g. try to reflect the table) then fall back to read_sql?

why is this a problem or not possible?

it is really confusing to have 2 functions which basically do the same thing depending on an implementation detail

@mangecoeur
Contributor

@jreback I addressed some of the reasons here, and some more reasons below
#5950 (comment)

I don't see this as confusing at all, and the functionality is distinct - you have low-level SQL querying and higher-level "table selection" apis.
You can always reflect a table if you have its name, so there's no need for a read_sql fallback in that sense.

On the other hand, if you issue a read_sql, you will be using an SQL query that may not return values stored in a db TABLE - e.g. you might do something like
"SELECT COUNT(tbl1.foo), SUM(tbl2.bar) FROM tbl1, tbl2 WHERE tbl1.otherID = tbl2.ID GROUP BY tbl2.groupable_name"
And none of the returned values will be data stored in the DB; instead they are calculated on the fly. You can't reflect type information beforehand, so you can't rely on the same type conversion.

If you used read_table in the case where you only wanted one table (and we'll see below why that's not straightforward) you would suddenly find you had different type-casting rules, because instead of the rules that apply to read_sql (limited type coercion) you get the ones that apply to read_table. You should never create that sort of unexpected behaviour.

In any case, how would you know if you were only selecting one table? You would have to parse the SQL string to find out that "SELECT * FROM table_name" (and its functionally equivalent alternatives, of which there are a few) meant you want everything from table_name. That means you need to introduce an SQL parser and deal with the differences between SQL dialects, which completely defeats the point of using SQLAlchemy.

Next, there are keyword arguments which have meaning for read_table but not for read_sql, and vice versa. E.g. "flavor" is meaningless for read_table, "columns" is likewise for read_sql. If you tried to merge the functions, these options would have a different effect depending on the value of the input arg - that to me is really bad API design; it's as if the accelerator pedal of your car turned into the brake pedal depending on whether you are wearing a hat. It's just nuts!

Having 2 functions which represent different levels of abstraction from raw SQL completely eliminates these problems. It leaves lots of room to add interesting functionality to read_table and to take advantage of the higher level of abstraction offered by SQLAlchemy's expression language.

All we need to do is figure out nice API names. If a top-level function is required (though IMHO it's not that important) I think read_sql_table or read_db_table as aliases of sql.read_table would be fine. If it weren't for backward-compatibility issues I would prefer even read_table to be called read_sql and the old read_sql to be query_sql or similar.

@jreback
Contributor

jreback commented Feb 11, 2014

@mangecoeur ok...I hear ya...so to summarize: read_sql returns a table (that may or may not be defined, but pandas will try extra hard to infer dtypes), while the currently named read_table will essentially perform arbitrary sql and not do anything special with the result. ok

The big question here is whether to expose the currently named read_table at the top level (I think it's ok)....but with what name.

Since we are breaking.....why don't we use the names you suggest

I actually think the top-level names ARE important as most users will simply use those by default. In general you don't get too many diving into the modules (nor should they necessarily). If these are both useful, then they can both be top-level.

read_sql_table and read_sql_query; can simply map read_sql to read_sql_table (with a deprecation) so things will 'work'.
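Such a shim could look roughly like this (a hypothetical sketch; the warning class and message are assumptions, and read_sql_table is the proposed new function):

import warnings

def read_sql(sql, con, **kwargs):
    # deprecated alias: forward to the new, explicit name
    warnings.warn("read_sql is deprecated, use read_sql_table instead",
                  FutureWarning, stacklevel=2)
    return read_sql_table(sql, con, **kwargs)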

@hayd
Contributor

hayd commented Feb 11, 2014

I had the same thought: separate into read_sql_query and read_sql_table (I think the latter will be used more).

@jreback Isn't it the other way around? read_sql_table takes a table_name and infers dtypes based on the dtypes of the sql table. read_sql_query takes arbitrary sql and has a go at inferring?

I still think there is a case for read_sql to do read_sql_table(sql) if has_table(sql) else read_sql_query(sql)... I don't think we should shy away from lots of (optional) args, we certainly don't elsewhere!

Definitely the functionality of read_sql_table should be toplevel.

@jreback
Contributor

jreback commented Feb 11, 2014

@hayd oh...I see I did put them 'backwards'. Ok if we go with read_sql_query and read_sql_table then map read_sql to read_sql_table (maybe with a deprecation on read_sql directly)....everyone ok with that?

seems minimum pain

@jorisvandenbossche
Member Author

@mangecoeur About the top-level function: that's the design choice made earlier in pandas, and I think it is important to be consistent on this: the main read_.. function in the top-level namespace.

Indeed a good summary: there are two cases for reading sql data:

  • full table (with currently read_table) -> infer types from table definition via sqlalchemy types
  • custom sql query (with currently read_sql) -> try to infer types just from result set

And due to the potential differences between the results of both methods, it is indeed important to clearly identify those two cases in the docs (whether as two cases in one function, or two cases in two functions).
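To make the two cases concrete, a minimal sketch using the current names (the engine URL and table name are placeholders):

import pandas.io.sql as sql
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

# case 1: whole table, dtypes taken from the table definition
df1 = sql.read_table("my_table", engine)

# case 2: arbitrary query, dtypes inferred from the result set
df2 = sql.read_sql("SELECT a, COUNT(*) AS n FROM my_table GROUP BY a", engine)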

For me it is ok to go with read_sql_query and read_sql_table.
But @jreback, what do you mean by mapping read_sql to read_sql_table? Isn't that just the backwards-compatibility problem, that read_sql already exists and has different behaviour? But if this deprecation is OK, then we can actually just use read_sql instead of read_sql_table as the name for the function that reads whole tables (unless we find read_sql_table better as it is more explicit).

@jreback
Contributor

jreback commented Feb 11, 2014

I think that I am referring to these top-level functions reversed. so ignore me!

@jreback jreback added this to the 0.14.0 milestone Feb 12, 2014
@jorisvandenbossche
Member Author

Something else, @mangecoeur: is it also the intention that users could use one of the objects directly? (meaning: does this have added value, so that it would be useful to mention it in the docs)

Because now, in the docs there is a mention of PandasSQLWithEngine (just above this header: http://pandas-docs.github.io/pandas-docs-travis/io.html#querying), but I think this is certainly outdated and maybe not intended to be in the docs? (possibly a leftover from the rebasing complexity; there were some merge conflicts in the docs that I manually, and maybe incorrectly, resolved).

@mangecoeur
Contributor

@jorisvandenbossche well spotted, that should be updated to read PandasSQLAlchemy, and maybe also to include PandasSQLTable. But the OO API probably still needs work anyway. It's very useful if you need finer control for certain use cases (I use it for debugging to check the create table statements before doing an insert, for example)

@jreback
Contributor

jreback commented Feb 14, 2014

@mangecoeur you might want to put a small explanation at the top of the sql.py file for these 'internal' classes. clearly they will not be externally exposed, but just like anything else, I am sure someone will find a use for them!

@mangecoeur
Contributor

@jreback Not sure what you mean by "externally exposed". They should remain accessible through pandas.io.sql for anyone who needs them (that includes me). The whole module probably needs more docstrings in general, the classes in particular, but they might change a little still so I guess I would wait until most of the bugs are ironed out.

@jreback
Contributor

jreback commented Feb 14, 2014

@mangecoeur I just mean they aren't in the public API (which they are not now)

@hayd
Contributor

hayd commented Feb 14, 2014

@jreback The original intention was for one to be in public api. Just like you create a HDF5Store and query it, you'd create a PandasSQL (perhaps should rename to SQLStore or something?) and then do read_sql and write_sql against this.

IMO you want to have one constructor (in the global API) which you can pass either a conn or engine to, since these can be reused.
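In usage, that could look something like this (purely hypothetical; SQLStore does not exist and the method names are made up for illustration):

# hypothetical API, analogous to HDFStore
store = SQLStore(engine_or_conn)          # reusable connection/engine holder
df = store.read_sql("SELECT * FROM foo")  # query against the store
store.write_sql(df, "foo_copy")           # write back through the same store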

@jorisvandenbossche
Member Author

See also the discussion in a previous issue on this topic, around #4163 (comment)

@jreback
Contributor

jreback commented Feb 14, 2014

@hayd ok...that makes sense then...sure....I like SQLStore 👍

@mangecoeur
Contributor

@jreback would that be SQLStore to refer to what is now the PandasSqlEngine (for access to tables and queries of tables in a DB) or to the PandasSqlTable (that handles mapping between a single DataFrame and db table)?

@jreback
Contributor

jreback commented Feb 20, 2014

@mangecoeur I think that would be PandasSqlEngine as it holds access to possibly multiple tables.

caveat, I haven't really looked at this in detail. Idea being this would be the first thing a user would use if they, say, want to open a db then interactively work on it (as opposed to using the read_sql functionality, which is analogous to read_hdf).

@jorisvandenbossche
Member Author

OK, we have to come to a decision here. Options:

  • option 1:
    • keep read_sql as is and rename read_table to read_sql_table
    • drawback: 'main' function (the read_table functionality) is not read_sql
  • option 2:
    • rename read_sql to read_sql_query and read_table to read_sql_table. read_sql will eventually be deprecated.
    • drawback: we don't use the shortest/nicest read_sql name
  • option 3:
    • rename read_table to read_sql, and the original read_sql to read_sql_query (or sql.read_query), and in the beginning read_sql will still accept a sql query statement for backwards compatibility.
    • drawback: this is actually merging the functionality of both (for the backwards compatibility), and see the objections of @mangecoeur to this above.
  • option 4:
    • or we have a sql.read_table and sql.read_query, and then only a top level read_sql which delegates to sql.read_table/sql.read_query depending on the input.
    • drawback: different names at the top level and in the sql module are confusing (as users will see both in code depending on how they are imported). + drawback of option 3

Other options? Or combinations?
What do you prefer?

@jreback @mangecoeur @hayd @jtratner @danielballan @cpcloud

@hayd
Contributor

hayd commented Apr 9, 2014

There's a tweak to option 4 that I rate: have a top-level read_sql which does both, and which is still called read_sql in io.sql. We split the two functions (neither of which are top level) so that read_sql is defined as:

def read_sql(sql, con):  # first arg more honestly called sql_or_table_name
    # dispatch on whether the string names an existing table
    return (read_sql_table(sql, con) if has_table(sql, con)
            else read_sql_query(sql, con))

It could take an optional kwarg if you wanted to be more explicit (like we do with the regex kwarg in select):

read_sql(table='foo')
read_sql(query='SELECT * FROM foo')

@jreback
Contributor

jreback commented Apr 9, 2014

I think a modified option 2 might be the best bet:

pd.read_sql_table and pd.read_sql_query, simple, clear what they are doing, and you don't need to have any sql.* based functions.

then

pd.read_sql is simply an alias for (or maybe calls) pd.read_sql_table. This is by far the most common operation (right?) that people will want to do, e.g. I have a table, read it in.

maybe we come back in a later version to allow something along the lines of what @hayd is suggesting (where you can specify a query directly in read_sql).

I don't think we need to deprecate read_sql at all in this case.

so maybe call this option 2a.

@danielballan
Contributor

I like 2a. I like what @hayd is suggesting for read_sql. I agree, @jreback, that reading a table in its entirety is the more common operation, but making read_sql an alias for read_sql_table would break lots of old code.

If we implement 2a and do what @hayd suggests, all at once, nothing breaks.

@jorisvandenbossche
Member Author

@jreback problem with your idea is that read_sql is at this moment an alias for the future read_sql_query, so when we also want to be able to use read_sql for read_sql_table, then you have what @hayd suggested.

So, we have a pd.read_sql_table and a pd.read_sql_query, but

  • for backwards compatibility, pd.read_sql will in every case have to pass a query to read_sql_query, so the question is do we also pass a table name to read_sql_table (as @hayd suggests for read_sql, and is what @danielballan proposes)
  • if we do that last, then we have three top-level functions (pd.read_sql/read_sql_table/read_sql_query). Not too much (in light of the proliferation of read_.. functions)?
  • What do we recommend/use in the docs? Consistently using read_sql_table/read_sql_query and only mention that you can also use read_sql for both functions as a convenience alias? Or use read_sql in the docs?

@jreback
Contributor

jreback commented Apr 10, 2014

is it always unambiguous what read_sql should do?

meaning that it can easily figure out whether to send to read_sql_query or read_sql_table?

@jorisvandenbossche
Member Author

ok

  • execute/get_schema: we keep
  • tquery/uquery: agreed, we deprecate (and for tquery, we can also just point to execute(..).fetchall() I think)
  • table_exists/has_table: a reason to change it to has_table could be that sqlalchemy uses has_table (as a method of the engine), so it is actually a bit superfluous in pandas: engine.has_table(..) vs pd.io.sql.has_table(...., engine)
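For the tquery case, the replacement would be along these lines (a sketch with an in-memory sqlite connection):

import sqlite3
import pandas.io.sql as sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")

# instead of the deprecated sql.tquery("SELECT a FROM t", conn):
rows = sql.execute("SELECT a FROM t", conn).fetchall()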

@jorisvandenbossche
Member Author

The above has been implemented in the meantime in different PRs (at the moment I left both has_table and table_exists as aliases).
I have only two remaining questions:

Now there should be a last round about the OO interface. But I am not really familiar with that (not using it, so I don't have a strong opinion on what it should look like) and I don't have time for that this week, so I propose we leave this for the next release? Unless someone else wants to tackle this.

@jreback
Contributor

jreback commented May 13, 2014

parse_dates only matters if dates are kept as strings, yes? Going forward they will not be, so I am not sure how useful embedding parsing is (as you can always parse strings with to_datetime).
Actually, parse_dates should only be necessary if you are not reading meta-data?

@jorisvandenbossche
Member Author

Yes, but datetime support is not that fantastic in all databases, so that could be a reason to keep your dates as strings in a database (e.g. in sqlite you have to store them as strings). And even if there is a datetime type, I don't think everyone in the wild uses it.

Actually, I would quite like to be able to use parse_dates={'DateCol': '%Y-%m-%d %H:%M:%S'} in read_csv :-)
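For reference, the dict form in read_sql_query looks roughly like this (a sketch; the table and column names are placeholders, and it assumes the dates are stored as strings):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# map column name -> strftime format (or kwargs for to_datetime)
df = pd.read_sql_query(
    "SELECT * FROM events",
    conn,
    parse_dates={"DateCol": "%Y-%m-%d %H:%M:%S"},
)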

@jreback
Contributor

jreback commented May 13, 2014

yes, that would not be hard for read_csv (I think there is an issue for that). ok then
you probably need to have this as an option when you are writing as well then?

@jorisvandenbossche
Member Author

@danielballan @mangecoeur any comments on the meta kwarg in read_sql_table? Otherwise I would personally remove it.

@danielballan
Contributor

I've never used it and I don't completely understand its purpose in sqlalchemy, so I don't have an informed opinion.

@jorisvandenbossche
Member Author

well, that's my problem too .. :-)
As I understand it, meta reflects all tables, and a description (columns, types, etc.) of those tables in your database. With the standard read_sql function, this is recreated every time (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L781), which could be costly if you have a large database or have to do this a lot of times.

But I personally think that this would be a reason to use the OO API (the PandasSQLAlchemy object); we should keep the basic function as simple as possible, and the meta keyword there is not needed.
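For context, the reflection that gets redone on every call is roughly this (a sketch; the URL and table name are placeholders):

from sqlalchemy import create_engine, MetaData

engine = create_engine("postgresql://user:pass@host/db")  # placeholder URL

# reflect once and reuse -- the kind of state the OO API could hold
meta = MetaData(bind=engine)
meta.reflect(only=["my_table"])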

@danielballan
Contributor

That seems sensible to me. Without the input of an experienced user of "meta," future generations can't fault us for removing it. :-) I say go for it.

@jorisvandenbossche
Member Author

@mangecoeur I suppose you put it in; if you have any objections to removing it, please speak up (and we can always easily add it back in the future if it is wanted)

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.14.1, 0.14.0 May 16, 2014
@jorisvandenbossche
Member Author

Bumped this to 0.14.1 for the remaining OO API part

@jorisvandenbossche
Member Author

Removed it in jorisvandenbossche@c5c78d5 in #7120

@mangecoeur
Contributor

@jorisvandenbossche @danielballan wait a sec! I sent my feedback via email but it doesn't seem to have shown up here. The reason for being able to specify meta is that it lets you specify the "schema" argument, for use with DBs that support schemas, such as Postgres: http://docs.sqlalchemy.org/en/rel_0_8/core/metadata.html#sqlalchemy.schema.MetaData

We should make sure this is still possible, though, since it's an advanced use case, it doesn't need to be in the functional API; it should remain in the OO api. I can also imagine scenarios where you might want to manually mess with the Metadata object before passing it.
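For example, roughly like this (a sketch; the URL and schema name are placeholders):

from sqlalchemy import create_engine, MetaData

engine = create_engine("postgresql://user:pass@host/db")  # placeholder URL

# a MetaData bound to a non-default Postgres schema; passing this through
# the OO api keeps schema support without a kwarg in the functional api
meta = MetaData(bind=engine, schema="analytics")
meta.reflect()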

@jorisvandenbossche
Member Author

@mangecoeur Thanks for the feedback! I only removed it in the functional api, so it is still in the OO api.
I think that, if you have to manually change the Metadata object, then you should use the OO api, to keep the functional api as simple as possible.

Now, we still have to clean up the OO api a bit.

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue May 16, 2014
- remove meta kwarg from read_sql_table (see discussion in pandas-dev#6300)
- remove flavor kwarg from read_sql (not necessary + not there in 0.13, so would have been API change)
- update docstring of to_sql in generic with latest changes
- enhance docstring of get_schema
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.15.0, 0.14.1 Jul 1, 2014
@jorisvandenbossche
Member Author

The last! round is the OO API, but I created a new issue for that (this was becoming a bit too lengthy), so closing this. Further discussion in #7960.
