Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: SQL multiindex support #6735

Merged

Conversation

jorisvandenbossche
Copy link
Member

Further work on #6292.

This adds:

  • fixes the columns argument (wasn't passed) + added test for that
  • adds multi-index support in to_sql, by just using reset_index on the frame (+ tests)
  • adds multi-index support in read_table (+ tests)

"Length of 'index_label' should match number of "
"levels, which is {0}".format(nlevels))
else:
self.frame.index.names = index_label
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't set this here as it's on the original frame, instead use a temp variable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes of course. Is there a way to do it like this (changing the index names, and resetting the index) without making a full copy of the entire dataframe?

Otherwise, maybe the previous logic to not reset the index, but just keeping track of the index names to insert them manually if needed in the table create statement and adding the index values when iterating over the rows of the dataframe, was better?
It is a little bit more complicated than doing reset_index and then just writing that frame, but this avoids an extra copy? (only a copy whith itertuples) I don't know to what extent this is important

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.frame.index.set_names(names) will return a new index. I don't think its a big deal to copy this. reset_index does it. But shouldn't modify the incoming data at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche the copy from set_names() is very minimal (it doesn't copy underlying data, just creates new object wrappers) - so it's fine to do that.

@jreback
Copy link
Contributor

jreback commented Mar 29, 2014

I think you should raise if the multi index doesn't have named levels
I do that in hdfstore - it can work but it's relying in the level naming scheme and so a bit unexpected
better to force the user to name levels

@jorisvandenbossche
Copy link
Member Author

@jreback Only for multi-index when it has no names? And for a single index leave it? (so 'index' is used in that case)

@jorisvandenbossche
Copy link
Member Author

@jreback the tests fail on assertListEqual not known on Travis, but this did work on my local machine. Any idea?

@jreback
Copy link
Contributor

jreback commented Mar 29, 2014

I use assertEqual for lists - though assertLisstEqual sounds better

@jorisvandenbossche
Copy link
Member Author

And the strange thing is, it only broke on Travis for one of the builds, while the sql tests are at least also tested in one of the other builds.

@jorisvandenbossche
Copy link
Member Author

Aha, assertListEqual is new in Python 2.7. Just use assertEqual then?

@jreback
Copy link
Contributor

jreback commented Mar 29, 2014

look at HDFStore.Table.validate_mulitindex

I do allow none levels (but then convert them back when reading)

only problem was levels overlapping with column names

sql.to_sql(temp_frame, 'test_index_label', self.conn)
frame = sql.read_table('test_index_label', self.conn)
self.assertEqual(frame.columns[0], 'pandas_index')
self.assertEqual(frame.columns[0], 'index')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the benefit to changing the naming here? Not a big deal, just might be nice to enumerate the reason

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with other places in pandas (eg hdf uses 'index/level_0/1' I think, and I am now following the names that are given in reset_index). As far as I know this is the only place where pandas_index is used. It is also new (in 0.13 writing the index was not included in to_sql), so not really changing the behaviour for the user. See also #6642 (comment) for some discussion.

@jorisvandenbossche
Copy link
Member Author

Pushed new version with slightly other approach.

But some questions on the API:

  • should we allow a MultiIndex without level names? (suggested by @jreback to raise and force user to provide index_labels). Now they get the default names level_0, level_1, ... like in reset_index.
  • if we detect columns with names like level_0 or index, should we set them as index when reading the data? (personally I would say we leave this to the user, as we can never be fully sure)

@mangecoeur @hayd

data_list.append(data)
for idx_label in self.index[::-1]:
keys = keys.insert(0, idx_label)
temp = self.frame.reset_index()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem with this is that if (maybe an unlikely case) the user has a multi-index level name that is the same as a column name (and then reset_index will fail), but provides another name via index_label to overcome this, this will still fail (as I can't give the labels to set as argument to reset_index)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed by first setting the index names on a copy of self.frame and then calling reset_index. The it will generate an error if names are overlapping.

@jreback
Copy link
Contributor

jreback commented Mar 31, 2014

@jorisvandenbossche

since SQL cannot really store meta data, its a bit non-trivial to automatically figure out the index_col whether it is index or level_n. Bigger problem is that a user will not know how to specify the index columns in the first place.

Maybe best not to allow non-named levels at all?

@jorisvandenbossche
Copy link
Member Author

@mangecoeur Would you have some time to review this? (as you wrote most of this code)

@jorisvandenbossche
Copy link
Member Author

Since there isn't more feedback, I am going to merge this to not hold up further PRs. But off course, if there still feedback, glad to hear it! It can always be adapted afterwards.

@jreback I am going to take the same way as reset_index, so allowing non-named index levels. That seems the most consistent with rest of pandas. And also not trying to guess what the index was when reading the data back in from sql.

The only problem I still have is the legacy mode. I didn't add multi-index support to legacy to_sql, so the support to write the index is only half baked. It is also new functionality since 0.13.1, so actually an API change for the legacy mode. But I will open a seperate issue for that.

@jreback
Copy link
Contributor

jreback commented Apr 14, 2014

@jorisvandenbossche all fine

FYI if u reset and get an error I think in HDFStore I raise
that - meaning reset dailies because of duplicate index/columns

@jorisvandenbossche
Copy link
Member Author

@jreback Ah, yes, good idea, I catched the exception and expanded the error message.

jorisvandenbossche added a commit that referenced this pull request Apr 14, 2014
@jorisvandenbossche jorisvandenbossche merged commit ad1f47d into pandas-dev:master Apr 14, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants