ENH: SQL multiindex support #6735

jorisvandenbossche · 2014-03-29T11:28:29Z

Further work on #6292.

This PR:

fixes the columns argument in read_table (it wasn't passed through) and adds a test for it
adds multi-index support to to_sql, by simply calling reset_index on the frame (+ tests)
adds multi-index support to read_table (+ tests)
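
For orientation, a round trip of the kind described above might look like the sketch below. It is not code from this PR: it assumes a recent pandas and an in-memory SQLite database, uses `read_sql_query` rather than the `sql.read_table` discussed here, and the table and column names are made up for illustration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

frame = pd.DataFrame(
    {"value": [1.5, 2.5, 3.5]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["key", "num"]
    ),
)

# The named index levels are written as ordinary columns "key" and "num".
frame.to_sql("demo_multiindex", conn)

# SQL keeps no record of which columns formed the index,
# so the caller names them explicitly when reading back.
restored = pd.read_sql_query(
    "SELECT * FROM demo_multiindex", conn, index_col=["key", "num"]
)
conn.close()
```

Passing a list to `index_col` is what rebuilds the MultiIndex on the reading side.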

jreback · 2014-03-29T11:32:46Z

pandas/io/sql.py

+                        "Length of 'index_label' should match number of "
+                        "levels, which is {0}".format(nlevels))
+                else:
+                    self.frame.index.names = index_label


you shouldn't set this here as it's on the original frame, instead use a temp variable

Hmm, yes of course. Is there a way to do it like this (changing the index names, and resetting the index) without making a full copy of the entire dataframe?

Otherwise, maybe the previous logic to not reset the index, but just keeping track of the index names to insert them manually if needed in the table create statement and adding the index values when iterating over the rows of the dataframe, was better?
It is a little more complicated than doing reset_index and then just writing that frame, but it avoids an extra copy (only a copy with itertuples)? I don't know how important this is.

self.frame.index.set_names(names) will return a new index. I don't think it's a big deal to copy this. reset_index does it. But you shouldn't modify the incoming data at all.

@jorisvandenbossche the copy from set_names() is very minimal (it doesn't copy underlying data, just creates new object wrappers) - so it's fine to do that.
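
A small sketch of the point made here, assuming a recent pandas: `Index.set_names` returns a new index object and leaves the caller's frame untouched. The frame and the level names are invented for the example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),
)

# set_names returns a new index object; nothing is assigned back to `frame`,
# so the caller's data keeps its original (unnamed) levels.
renamed_index = frame.index.set_names(["key", "num"])

original_names = list(frame.index.names)
new_names = list(renamed_index.names)
```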

jreback · 2014-03-29T11:35:26Z

I think you should raise if the multi index doesn't have named levels
I do that in hdfstore - it can work but it's relying on the level naming scheme and so a bit unexpected
better to force the user to name levels
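
The check suggested here could be sketched roughly as follows. `require_named_levels` is a hypothetical helper written for illustration, not a pandas function, and the error message wording is an assumption.

```python
import pandas as pd


def require_named_levels(frame):
    """Hypothetical check: refuse a MultiIndex that has unnamed levels."""
    index = frame.index
    if isinstance(index, pd.MultiIndex) and any(
        name is None for name in index.names
    ):
        raise ValueError(
            "all MultiIndex levels must be named; "
            "name the levels or pass index_label"
        )


named = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["key", "num"]),
)
unnamed = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]),
)

require_named_levels(named)  # passes silently
```

A single unnamed index is deliberately let through in this sketch, which matches the question asked in the next comment.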

jorisvandenbossche · 2014-03-29T11:37:59Z

@jreback Only for multi-index when it has no names? And for a single index leave it? (so 'index' is used in that case)

jorisvandenbossche · 2014-03-29T11:40:41Z

@jreback the tests fail on Travis because assertListEqual is not known, but they pass on my local machine. Any idea?

jreback · 2014-03-29T11:46:04Z

I use assertEqual for lists - though assertListEqual sounds better

jorisvandenbossche · 2014-03-29T11:47:42Z

And the strange thing is, it only broke on Travis for one of the builds, while the sql tests are at least also tested in one of the other builds.

jorisvandenbossche · 2014-03-29T11:48:56Z

Aha, assertListEqual is new in Python 2.7. Just use assertEqual then?
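
A minimal sketch of the portable form: `assertEqual` accepts lists directly, so the test does not depend on the list-specific method. The test class and values are made up for the example.

```python
import unittest


class TestColumnNames(unittest.TestCase):
    def test_columns(self):
        expected = ["index", "value"]
        actual = ["index", "value"]
        # assertEqual compares lists element-wise and is available
        # on Python versions that lack assertListEqual.
        self.assertEqual(actual, expected)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestColumnNames)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```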

jreback · 2014-03-29T11:52:49Z

look at HDFStore.Table.validate_multiindex

I do allow None levels (but then convert them back when reading)

only problem was levels overlapping with column names

jtratner · 2014-03-30T05:17:27Z

pandas/io/tests/test_sql.py

        sql.to_sql(temp_frame, 'test_index_label', self.conn)
        frame = sql.read_table('test_index_label', self.conn)
-        self.assertEqual(frame.columns[0], 'pandas_index')
+        self.assertEqual(frame.columns[0], 'index')


what's the benefit to changing the naming here? Not a big deal, just might be nice to enumerate the reason

To be consistent with other places in pandas (eg hdf uses 'index/level_0/1' I think, and I am now following the names that are given in reset_index). As far as I know this is the only place where pandas_index is used. It is also new (in 0.13 writing the index was not included in to_sql), so not really changing the behaviour for the user. See also #6642 (comment) for some discussion.
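
The `reset_index` naming referred to here can be checked with a short sketch (assuming a recent pandas; the frames are invented for the example).

```python
import pandas as pd

single = pd.DataFrame({"value": [1, 2]})
multi = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]),
)

# An unnamed single index becomes a column called "index";
# unnamed MultiIndex levels become "level_0", "level_1", ...
single_cols = list(single.reset_index().columns)
multi_cols = list(multi.reset_index().columns)
```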

jorisvandenbossche · 2014-03-31T11:50:56Z

Pushed new version with slightly other approach.

But some questions on the API:

should we allow a MultiIndex without level names? (suggested by @jreback to raise and force user to provide index_labels). Now they get the default names level_0, level_1, ... like in reset_index.
if we detect columns with names like level_0 or index, should we set them as index when reading the data? (personally I would say we leave this to the user, as we can never be fully sure)
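
To make the two questions concrete, here is a sketch of what a user sees with unnamed levels. It assumes a recent pandas and an in-memory SQLite database, and uses `read_sql_query` rather than the `read_table` of this PR; the table name is made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

unnamed = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]),  # no level names
)
unnamed.to_sql("demo_unnamed", conn)

# Read back without index_col: the former index levels are plain columns
# carrying the default reset_index-style names.
plain = pd.read_sql_query("SELECT * FROM demo_unnamed", conn)
conn.close()

# Restoring the index is an explicit choice made by the caller,
# not something guessed from the column names.
restored = plain.set_index(["level_0", "level_1"])
```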

@mangecoeur @hayd

jorisvandenbossche · 2014-03-31T11:53:50Z

pandas/io/sql.py

-                data_list.append(data)
+            for idx_label in self.index[::-1]:
+                keys = keys.insert(0, idx_label)
+            temp = self.frame.reset_index()


The only problem with this is that if (maybe an unlikely case) the user has a multi-index level name that is the same as a column name (and then reset_index will fail), but provides another name via index_label to overcome this, this will still fail (as I can't give the labels to set as argument to reset_index)

Fixed by first setting the index names on a copy of self.frame and then calling reset_index. Then it will only generate an error if the names are overlapping.
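
The fix described here can be sketched as below. `with_index_as_columns` is a hypothetical stand-in for the relevant part of the writer, not the code in pandas/io/sql.py, and the frame is constructed so that an index level clashes with a column name.

```python
import pandas as pd


def with_index_as_columns(frame, index_label=None):
    """Hypothetical helper: move the index into columns without touching `frame`."""
    temp = frame.copy()
    if index_label is not None:
        # Rename the levels on the copy first, so a user-supplied label
        # can resolve a clash between a level name and a column name.
        temp.index = temp.index.set_names(index_label)
    return temp.reset_index()


clashing = pd.DataFrame(
    {"a": [10, 20]},
    index=pd.MultiIndex.from_tuples([(1, "x"), (2, "y")], names=["a", "b"]),
)

# Without index_label, reset_index would raise because "a" already exists.
flat = with_index_as_columns(clashing, index_label=["key_a", "key_b"])
```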

jreback · 2014-03-31T12:51:19Z

@jorisvandenbossche

since SQL cannot really store metadata, it's a bit non-trivial to automatically figure out the index_col, whether it is index or level_n. The bigger problem is that a user will not know how to specify the index columns in the first place.

Maybe best not to allow non-named levels at all?

jorisvandenbossche · 2014-04-02T08:43:07Z

@mangecoeur Would you have some time to review this? (as you wrote most of this code)

jorisvandenbossche · 2014-04-14T11:15:14Z

Since there isn't more feedback, I am going to merge this to not hold up further PRs. But of course, if there is still feedback, I'm glad to hear it! It can always be adapted afterwards.

@jreback I am going to follow the same approach as reset_index, so allowing non-named index levels. That seems the most consistent with the rest of pandas. And also not trying to guess what the index was when reading the data back in from sql.

The only problem I still have is the legacy mode. I didn't add multi-index support to legacy to_sql, so the support to write the index is only half baked. It is also new functionality since 0.13.1, so actually an API change for the legacy mode. But I will open a separate issue for that.

jreback · 2014-04-14T11:19:31Z

@jorisvandenbossche all fine

FYI if you reset and get an error, I think in HDFStore I raise
that - meaning the reset fails because of duplicate index/columns

jorisvandenbossche · 2014-04-14T11:45:02Z

@jreback Ah, yes, good idea, I caught the exception and expanded the error message.
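
Roughly, catching the exception and expanding the message could look like the sketch below. `reset_index_or_explain` is a hypothetical helper and the message wording is an assumption, not a quote from the merged code.

```python
import pandas as pd


def reset_index_or_explain(frame):
    """Hypothetical wrapper: re-raise reset_index failures with more context."""
    try:
        return frame.reset_index()
    except ValueError as err:
        # reset_index fails when an index level name matches a column name.
        raise ValueError(
            "duplicate name in index/columns: {0}".format(err)
        )


clashing = pd.DataFrame(
    {"a": [10, 20]},
    index=pd.Index([1, 2], name="a"),
)

try:
    reset_index_or_explain(clashing)
    message = None
except ValueError as err:
    message = str(err)
```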

ENH: SQL multiindex support

jorisvandenbossche mentioned this pull request Mar 29, 2014

ENH: sql support with SQLAlchemy - follow-up #6292

Closed

17 tasks

jreback reviewed Mar 29, 2014

jtratner reviewed Mar 30, 2014

jorisvandenbossche reviewed Mar 31, 2014

jreback added the SQL label Apr 4, 2014

jreback added this to the 0.14.0 milestone Apr 4, 2014

jorisvandenbossche mentioned this pull request Apr 11, 2014

BUG/API: converting invalid column names in to_sql #6796

Closed

SQL: fix + test handling columns argument in read_table

c7928f6

jorisvandenbossche added 2 commits April 14, 2014 13:43

ENH/TST SQL: add multi-index support to to_sql

0a26065

ENH/TST SQL: add multi-index support to read_table

18bd0d6

jorisvandenbossche added a commit that referenced this pull request Apr 14, 2014

Merge pull request #6735 from jorisvandenbossche/sql-multiindex

ad1f47d

ENH: SQL multiindex support

jorisvandenbossche merged commit ad1f47d into pandas-dev:master Apr 14, 2014

jorisvandenbossche mentioned this pull request Apr 14, 2014

API: SQL legacy mode to_sql 'index' kwarg behaviour #6881

Closed
