
API: user api for new sql functionality #6300

Closed
jorisvandenbossche opened this issue Feb 7, 2014 · 47 comments
Labels
API Design IO SQL to_sql, read_sql, read_sql_query

@jorisvandenbossche
Member

Starting a new issue to discuss this. See merged PR #5950 and follow-up issue #6292.

Summary of situation:

Current (0.13):

  • top-level read_sql, read_frame, write_frame and DataFrame.to_sql
  • in sql module: uquery, tquery, has_table (+ read_sql)

New after #5950:

  • top-level read_sql and DataFrame.to_sql (but not yet used in the docs)
  • in sql module: read_sql, read_table, to_sql, has_table, execute
  • read_frame, write_frame, uquery, tquery are deprecated

Points to discuss

@jreback
Contributor

jreback commented Feb 7, 2014

I don't recall if we test deprecations in general (e.g. call the function in an assert_produces_warning wrapper)
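For illustration, a minimal sketch of such a test (assuming the deprecated functions emit a FutureWarning; uquery is one of the aliases deprecated in #5950):

import sqlite3
import pandas.io.sql as sql
import pandas.util.testing as tm

conn = sqlite3.connect(":memory:")

# calling the deprecated alias should raise the deprecation warning
with tm.assert_produces_warning(FutureWarning):
    sql.uquery("SELECT 1", conn)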

@jorisvandenbossche
Member Author

This is, I think, easy to agree on, to be consistent with the other io modules:

  • DataFrame.to_sql(...) instead of sql.to_sql(df, ...)
  • pd.read_sql instead of sql.read_sql
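In code, the consistent spelling would look like this (a minimal sketch; the table name is a placeholder):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"a": [1, 2, 3]})

# method on the frame, instead of sql.to_sql(df, ...)
df.to_sql("my_table", conn)

# top-level function, instead of pandas.io.sql.read_sql(...)
result = pd.read_sql("SELECT * FROM my_table", conn)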

But

  • what to do with read_table? (which, as the docs are currently laid out, is sort of the primary function: http://pandas-docs.github.io/pandas-docs-travis/io.html#io-sql)
    • rename and also put top-level, e.g. pd.read_sqltable
    • just keep as is in sql (but this feels odd because, as I said above, this is somewhat the main function of the sql module).
    • rename read_table to toplevel pd.read_sql (the main function), and the current read_sql to something like pd.read_sqlquery or sql.read_query or sql.query(), sql.get_query. But I assume this is not possible due to backwards compatibility.
    • merge read_sql and read_table to one function (but see comments here ENH: sql support via SQLAlchemy, with legacy fallback #5950 (comment))

@hayd @mangecoeur @jreback @danielballan @y-p Opinions?

@jreback
Contributor

jreback commented Feb 11, 2014

it seems that you can use read_table if you can reflect the table definition;
otherwise you have to use read_sql and then figure out types

well it seems that pandas should then try to use read_table if possible (e.g. try to reflect the table) then fall back to read_sql?

why is this a problem or not possible?

it is really confusing to have 2 functions which basically do the same thing depending on an implementation detail

@mangecoeur
Contributor

@jreback I addressed some of the reasons here, and some more reasons below
#5950 (comment)

I don't see this as confusing at all, and the functionality is distinct - you have low-level SQL querying and higher-level "table selection" apis.
You can always reflect a table if you have its name, so there's no need for a read_sql fallback in that sense.

On the other hand, if you issue a read_sql, you will be using an SQL query that may not return values stored in a db TABLE - e.g. you might do something like
"SELECT COUNT(tbl1.foo), SUM(tbl2.bar) FROM tbl1, tbl2 WHERE tbl1.otherID = tbl2.ID GROUP BY tbl2.groupable_name"
And none of the returned values will be data stored in the DB; instead they are calculated on the fly. You can't reflect type information beforehand, so you can't rely on the same type conversion.

If you used read_table in the case where you only wanted one table (and we'll see below why that's not straightforward) you would suddenly find you had different type-casting rules, because instead of the rules that apply to read_sql (limited type coercion) you get the ones that apply to read_table. You should never create that sort of unexpected behaviour.

In any case, how would you know if you were only selecting one table? You would have to parse the SQL string to find out that "SELECT * FROM table_name" (and its functionally equivalent alternatives, of which there are a few) meant you want everything from table_name. That means you need to introduce an SQL parser and deal with the differences between SQL dialects, which completely defeats the point of using SQLAlchemy.

Next, there are keyword arguments which have meaning for read_table but not for read_sql, and vice versa. E.g. "flavor" is meaningless for read_table, "columns" is likewise for read_sql. If you tried to merge the functions, these options would have a different effect depending on the value of the input arg - that to me is really bad API design; it's as if the accelerator pedal of your car turned into the brake pedal depending on whether you are wearing a hat. It's just nuts!

Having 2 functions which represent different levels of abstraction from raw SQL completely eliminates these problems. It leaves lots of room to add interesting functionality to read_table and to take advantage of the higher level of abstraction offered by SQLAlchemy's expression language.

All we need to do is figure out nice API names. If a top-level function is required (though IMHO it's not that important) I think read_sql_table or read_db_table as aliases of sql.read_table would be fine. If it weren't for backward-compatibility issues I would prefer even read_table to be called read_sql and the old read_sql to be query_sql or similar.

@jreback
Contributor

jreback commented Feb 11, 2014

@mangecoeur ok...I hear ya...so to summarize: read_sql returns a table (that may or may not be defined, but pandas will try extra hard to infer dtypes), while the currently named read_table will essentially perform arbitrary sql and not do anything special with the result. ok

The big question here is whether to expose the currently named read_table at the top level (I think it's ok)....but with what name.

Since we are breaking.....why don't we use the names you suggest

I actually think the top-level names ARE important as most users will simply use those by default. In general you don't get too many diving into the modules (nor should they necessarily). If these are both useful, then they can both be top-level.

read_sql_table and read_sql_query; can simply map read_sql to read_sql_table (with a deprecation) so things will 'work'.
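Such a shim could look roughly like this (a hypothetical sketch; the warning class and message are assumptions, and read_sql_table is the proposed new function):

import warnings

def read_sql(sql, con, **kwargs):
    # deprecated alias: forward to the new, explicit name
    warnings.warn("read_sql is deprecated, use read_sql_table instead",
                  FutureWarning, stacklevel=2)
    return read_sql_table(sql, con, **kwargs)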

@hayd
Contributor

hayd commented Feb 11, 2014

I had the same thought: separate into read_sql_query and read_sql_table (I think the latter will be used more).

@jreback Isn't it the other way around? read_sql_table takes a table_name and infers dtypes based on the dtypes of the sql table. read_sql_query takes arbitrary sql and has a go at inferring?

I still think there is a case for read_sql to do read_sql_table(sql) if has_table(sql) else read_sql_query(sql)... I don't think we should shy away from lots of (optional) args, we certainly don't elsewhere!

Definitely the functionality of read_sql_table should be toplevel.

@jreback
Contributor

jreback commented Feb 11, 2014

@hayd oh...I see I did put them 'backwards'. Ok if we go with read_sql_query and read_sql_table then map read_sql to read_sql_table (maybe with a deprecation on read_sql directly)....everyone ok with that?

seems minimum pain

@jorisvandenbossche
Member Author

@mangecoeur About the top-level function: that's the design choice made earlier in pandas, and I think it is important to be consistent on this: the main read_.. function in the top-level namespace.

Indeed a good summary: there are two cases for reading sql data:

  • full table (with currently read_table) -> infer types from table definition via sqlalchemy types
  • custom sql query (with currently read_sql) -> try to infer types just from result set

And due to the potential differences between the results of both methods, it is indeed important to clearly identify those two cases in the docs (whether as two cases in one function, or two cases in two functions).
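To make the two cases concrete, a minimal sketch using the current names (the engine URL and table name are placeholders):

import pandas.io.sql as sql
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

# case 1: whole table, dtypes taken from the table definition
df1 = sql.read_table("my_table", engine)

# case 2: arbitrary query, dtypes inferred from the result set
df2 = sql.read_sql("SELECT a, COUNT(*) AS n FROM my_table GROUP BY a", engine)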

For me it is ok to go with read_sql_query and read_sql_table.
But @jreback, what do you mean by mapping read_sql to read_sql_table? Isn't that just the backwards-compatibility problem, that read_sql already exists and has different behaviour? But if this deprecation is OK, then we can actually just use read_sql instead of read_sql_table as the name for the function that reads whole tables (unless we find read_sql_table better as it is more explicit).

@jreback
Contributor

jreback commented Feb 11, 2014

I think that I am referring to these top-level functions reversed. so ignore me!

@jreback jreback added this to the 0.14.0 milestone Feb 12, 2014
@jorisvandenbossche
Member Author

Something else, @mangecoeur: is it also the intention that users could use one of the objects directly? (meaning: does this have added value, so that it would be useful to mention it in the docs)

Because now, in the docs there is a mention of PandasSQLWithEngine (just above this header: http://pandas-docs.github.io/pandas-docs-travis/io.html#querying), but I think this is certainly outdated and maybe not intended to be in the docs? (possibly a leftover from the rebasing complexity; there were some merge conflicts in the docs that I manually, and maybe incorrectly, resolved).

@mangecoeur
Contributor

@jorisvandenbossche well spotted, that should be updated to read PandasSQLAlchemy, and maybe also to include PandasSQLTable. But the OO API probably still needs work anyway. It's very useful if you need finer control for certain use cases (I use it for debugging to check the create table statements before doing an insert, for example)

@jreback
Contributor

jreback commented Feb 14, 2014

@mangecoeur you might want to put a small explanation at the top of the sql.py file for these 'internal' classes. clearly they will not be externally exposed, but just like anything else, I am sure someone will find a use for them!

@mangecoeur
Contributor

@jreback Not sure what you mean by "externally exposed". They should remain accessible through pandas.io.sql for anyone who needs them (that includes me). The whole module probably needs more docstrings in general, the classes in particular, but they might change a little still so I guess I would wait until most of the bugs are ironed out.

@jreback
Contributor

jreback commented Feb 14, 2014

@mangecoeur I just mean they aren't in the public API (which they are not now)

@hayd
Contributor

hayd commented Feb 14, 2014

@jreback The original intention was for one to be in public api. Just like you create a HDF5Store and query it, you'd create a PandasSQL (perhaps should rename to SQLStore or something?) and then do read_sql and write_sql against this.

IMO you want to have one constructor (in the global API) which you can pass either a conn or engine to, since these can be reused.
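In usage, that could look something like this (purely hypothetical; SQLStore does not exist and the method names are made up for illustration):

# hypothetical API, analogous to HDFStore
store = SQLStore(engine_or_conn)          # reusable connection/engine holder
df = store.read_sql("SELECT * FROM foo")  # query against the store
store.write_sql(df, "foo_copy")           # write back through the same store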

@jorisvandenbossche
Member Author

See also the discussion in a previous issue on this topic, around #4163 (comment)

@jreback
Contributor

jreback commented Feb 14, 2014

@hayd ok...that makes sense then...sure....I like SQLStore 👍

@mangecoeur
Contributor

@jreback would that be SQLStore to refer to what is now the PandasSqlEngine (for access to tables and queries of tables in a DB) or to the PandasSqlTable (that handles mapping between a single DataFrame and db table)?

@jreback
Contributor

jreback commented Feb 20, 2014

@mangecoeur I think that would be PandasSqlEngine as it holds access to possibly multiple tables.

caveat, I haven't really looked at this in detail. Idea being this would be the first thing a user would use if they, say, want to open a db then interactively work on it (as opposed to using the read_sql functionality, which is analogous to read_hdf).

@jorisvandenbossche
Member Author

OK, we have to come to a decision here. Options:

  • option 1:
    • keep read_sql as is and rename read_table to read_sql_table
    • drawback: 'main' function (the read_table functionality) is not read_sql
  • option 2:
    • rename read_sql to read_sql_query and read_table to read_sql_table. read_sql will eventually be deprecated.
    • drawback: we don't use the shortest/nicest read_sql name
  • option 3:
    • rename read_table to read_sql, and the original read_sql to read_sql_query (or sql.read_query), and in the beginning read_sql will still accept a sql query statement for backwards compatibility.
    • drawback: this is actually merging the functionality of both (for the backwards compatibility), and see the objections of @mangecoeur to this above.
  • option 4:
    • or we have a sql.read_table and sql.read_query, and then only a top level read_sql which delegates to sql.read_table/sql.read_query depending on the input.
    • drawback: different names at the top level and in the sql module are confusing (as users will see both in code depending on how they are imported). + drawback of option 3

Other options? Or combinations?
What do you prefer?

@jreback @mangecoeur @hayd @jtratner @danielballan @cpcloud

@hayd
Contributor

hayd commented Apr 9, 2014

There's a tweak to option 4 that I rate: have a top-level read_sql which does both, and which is still called read_sql in io.sql. We split the two functions (neither of which are top level) so that read_sql is defined as:

def read_sql(sql, con):  # first arg more honestly called sql_or_table_name
    # dispatch on whether the string names an existing table
    return (read_sql_table(sql, con) if has_table(sql, con)
            else read_sql_query(sql, con))

It could take an optional kwarg if you wanted to be more explicit (like we do with the regex kwarg in select):

read_sql(table='foo')
read_sql(query='SELECT * FROM foo')

@jreback
Contributor

jreback commented Apr 9, 2014

I think a modified option 2 might be the best bet:

pd.read_sql_table and pd.read_sql_query, simple, clear what they are doing, and you don't need to have any sql.* based functions.

then

pd.read_sql is simply an alias for (or maybe calls) pd.read_sql_table. This is by far the most common operation (right?) that people will want to do, e.g. I have a table, read it in.

maybe we come back in a later version to allow something along the lines of what @hayd is suggesting (where you can specify a query directly in read_sql).

I don't think we need to deprecate read_sql at all in this case.

so maybe call this option 2a.

@danielballan
Contributor

I like 2a. I like what @hayd is suggesting for read_sql. I agree, @jreback, that reading a table in its entirety is the more common operation, but making read_sql an alias for read_sql_table would break lots of old code.

If we implement 2a and do what @hayd suggests, all at once, nothing breaks.

@jorisvandenbossche
Member Author

@jreback problem with your idea is that read_sql is at this moment an alias for the future read_sql_query, so when we also want to be able to use read_sql for read_sql_table, then you have what @hayd suggested.

So, we have a pd.read_sql_table and a pd.read_sql_query, but

  • for backwards compatibility, pd.read_sql will in every case have to pass a query to read_sql_query, so the question is do we also pass a table name to read_sql_table (as @hayd suggests for read_sql, and is what @danielballan proposes)
  • if we do that last, then we have three top-level functions (pd.read_sql/read_sql_table/read_sql_query). Not too much (in light of the proliferation of read_.. functions)?
  • What do we recommend/use in the docs? Consistently using read_sql_table/read_sql_query and only mention that you can also use read_sql for both functions as a convenience alias? Or use read_sql in the docs?

@jreback
Contributor

jreback commented Apr 10, 2014

is it always unambiguous what read_sql should do?

meaning that it can easily figure out whether to send to read_sql_query or read_sql_table?

@jorisvandenbossche
Member Author

ok

  • execute/get_schema: we keep
  • tquery/uquery: agreed, we deprecate (and for tquery, we can also just point to execute(..).fetchall() I think)
  • table_exists/has_table: a reason to change it to has_table could be that sqlalchemy uses has_table (as a method of the engine), so it is actually a bit superfluous in pandas: engine.has_table(..) vs pd.io.sql.has_table(...., engine)
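For the tquery case, the replacement would be along these lines (a sketch with an in-memory sqlite connection):

import sqlite3
import pandas.io.sql as sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")

# instead of the deprecated sql.tquery("SELECT a FROM t", conn):
rows = sql.execute("SELECT a FROM t", conn).fetchall()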

@jorisvandenbossche
Member Author

The above has been implemented in the meantime in different PRs (at the moment I left both has_table and table_exists as aliases).
I have only two remaining questions:

Now there should be a last round about the OO interface. But I am not really familiar with that (not using it, so I don't have a strong opinion on what it should look like) and I don't have time for that this week, so I propose we leave this for the next release? Unless someone else wants to tackle this.

@jreback
Contributor

jreback commented May 13, 2014

parse_dates only matters if dates are kept as strings, yes? Going forward they will not be, so I am not sure how useful embedding parsing is (as you can always parse strings with to_datetime).
Actually, parse_dates should only be necessary if you are not reading meta-data?

@jorisvandenbossche
Member Author

Yes, but datetime support is not that fantastic in all databases, so that could be a reason to keep your dates as strings in a database (e.g. in sqlite you have to store them as strings). And even if there is a datetime type, I don't think everyone in the wild uses it.

Actually, I would quite like to be able to use parse_dates={'DateCol': '%Y-%m-%d %H:%M:%S'} in read_csv :-)
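For reference, the dict form in read_sql_query looks roughly like this (a sketch; the table and column names are placeholders, and it assumes the dates are stored as strings):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# map column name -> strftime format (or kwargs for to_datetime)
df = pd.read_sql_query(
    "SELECT * FROM events",
    conn,
    parse_dates={"DateCol": "%Y-%m-%d %H:%M:%S"},
)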

@jreback
Contributor

jreback commented May 13, 2014

yes, that would not be hard for read_csv (I think there is an issue for that). ok then
you probably need to have this as an option when you are writing as well then?

@jorisvandenbossche
Member Author

@danielballan @mangecoeur any comments on the meta kwarg in read_sql_table? Otherwise I would personally remove it.

@danielballan
Contributor

I've never used it and I don't completely understand its purpose in sqlalchemy, so I don't have an informed opinion.

@jorisvandenbossche
Member Author

well, that's my problem too .. :-)
As I understand it, meta reflects all tables, and a description (columns, types, etc.) of those tables in your database. With the standard read_sql function, this is recreated every time (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L781), which could be costly if you have a large database or have to do this a lot of times.

But I personally think that this would be a reason to use the OO API (the PandasSQLAlchemy object); we should keep the basic function as simple as possible, and the meta keyword there is not needed.
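For context, the reflection that gets redone on every call is roughly this (a sketch; the URL and table name are placeholders):

from sqlalchemy import create_engine, MetaData

engine = create_engine("postgresql://user:pass@host/db")  # placeholder URL

# reflect once and reuse -- the kind of state the OO API could hold
meta = MetaData(bind=engine)
meta.reflect(only=["my_table"])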

@danielballan
Contributor

That seems sensible to me. Without the input of an experienced user of "meta," future generations can't fault us for removing it. :-) I say go for it.

@jorisvandenbossche
Member Author

@mangecoeur I suppose you put it in; if you have any objections to removing it, please speak up (and we can always easily add it back in the future if it is wanted)

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.14.1, 0.14.0 May 16, 2014
@jorisvandenbossche
Member Author

Bumped this to 0.14.1 for the remaining OO API part

@jorisvandenbossche
Member Author

Removed it in jorisvandenbossche@c5c78d5 in #7120

@mangecoeur
Contributor

@jorisvandenbossche @danielballan wait a sec! I sent my feedback via email but it doesn't seem to have shown up here. The reason for being able to specify meta is that it lets you specify the "schema" argument, for use with DBs that support schemas, such as Postgres: http://docs.sqlalchemy.org/en/rel_0_8/core/metadata.html#sqlalchemy.schema.MetaData

We should make sure this is still possible, though, since it's an advanced use case, it doesn't need to be in the functional API; it should remain in the OO api. I can also imagine scenarios where you might want to manually mess with the Metadata object before passing it.
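For example, roughly like this (a sketch; the URL and schema name are placeholders):

from sqlalchemy import create_engine, MetaData

engine = create_engine("postgresql://user:pass@host/db")  # placeholder URL

# a MetaData bound to a non-default Postgres schema; passing this through
# the OO api keeps schema support without a kwarg in the functional api
meta = MetaData(bind=engine, schema="analytics")
meta.reflect()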

@jorisvandenbossche
Member Author

@mangecoeur Thanks for the feedback! I only removed it in the functional api, so it is still in the OO api.
I think that, if you have to manually change the Metadata object, then you should use the OO api, to keep the functional api as simple as possible.

Now, we still have to clean up the OO api a bit.

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue May 16, 2014
- remove meta kwarg from read_sql_table (see discussion in pandas-dev#6300)
- remove flavor kwarg from read_sql (not necessary + not there in 0.13, so would have been API change)
- update docstring of to_sql in generic with latest changes
- enhance docstring of get_schema
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.15.0, 0.14.1 Jul 1, 2014
@jorisvandenbossche
Member Author

The last! round is the OO API, but I created a new issue for that (this was becoming a bit too lengthy), so closing this. Further discussion in #7960.
