ENH: SQL multiindex support #6735
"Length of 'index_label' should match number of "
"levels, which is {0}".format(nlevels))
else:
    self.frame.index.names = index_label
you shouldn't set this here as it's on the original frame, instead use a temp variable
Hmm, yes of course. Is there a way to do it like this (changing the index names, and resetting the index) without making a full copy of the entire dataframe?
Otherwise, maybe the previous logic was better: not resetting the index, but just keeping track of the index names to insert them manually in the table create statement if needed, and adding the index values when iterating over the rows of the dataframe?
It is a little bit more complicated than doing reset_index and then just writing that frame, but it avoids an extra copy (only a copy with itertuples). I don't know to what extent this is important.
self.frame.index.set_names(names) will return a new index. I don't think it's a big deal to copy this; reset_index does it. But you shouldn't modify the incoming data at all.
@jorisvandenbossche the copy from set_names() is very minimal (it doesn't copy underlying data, just creates new object wrappers) - so it's fine to do that.
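As a minimal sketch of the point above (the frame and level names here are made up for illustration), set_names() returns a new index object and leaves the names on the caller's frame untouched:

```python
import pandas as pd

# Hypothetical example frame with a named two-level index.
frame = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["k1", "k2"]),
)

# set_names() returns a new index; the frame's own index keeps its names.
renamed = frame.index.set_names(["x", "y"])

original_names = list(frame.index.names)
new_names = list(renamed.names)
```

Assigning to frame.index.names directly, as in the diff under review, would instead change the names on the frame that the user passed in.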
I think you should raise if the multi index doesn't have named levels
@jreback Only for multi-index when it has no names? And for a single index leave it? (so 'index' is used in that case)
@jreback the tests fail on
I use assertEqual for lists - though assertListEqual sounds better
And the strange thing is, it only broke on Travis for one of the builds, while the sql tests are at least also tested in one of the other builds.
Aha, assertListEqual is new in Python 2.7. Just use assertEqual then?
look at HDFStore.Table.validate_multiindex. I do allow None levels (but then convert them back when reading); the only problem was levels overlapping with column names
sql.to_sql(temp_frame, 'test_index_label', self.conn)
frame = sql.read_table('test_index_label', self.conn)
self.assertEqual(frame.columns[0], 'pandas_index')
self.assertEqual(frame.columns[0], 'index')
what's the benefit to changing the naming here? Not a big deal, just might be nice to enumerate the reason
To be consistent with other places in pandas (e.g. hdf uses 'index/level_0/1' I think, and I am now following the names that are given in reset_index). As far as I know this is the only place where pandas_index is used. It is also new (in 0.13 writing the index was not included in to_sql), so not really changing the behaviour for the user. See also #6642 (comment) for some discussion.
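For reference, a small sketch of the default names that reset_index gives to unnamed index levels (the frames below are made up for illustration):

```python
import pandas as pd

# Unnamed single-level index: reset_index inserts a column called 'index'.
single = pd.DataFrame({"a": [1, 2]})
single_cols = list(single.reset_index().columns)

# Unnamed MultiIndex: reset_index inserts 'level_0', 'level_1', ...
multi = pd.DataFrame(
    {"a": [1, 2]},
    index=pd.MultiIndex.from_tuples([(0, "x"), (1, "y")]),
)
multi_cols = list(multi.reset_index().columns)
```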
Pushed new version with slightly other approach. But some questions on the API:
data_list.append(data)
for idx_label in self.index[::-1]:
    keys = keys.insert(0, idx_label)
temp = self.frame.reset_index()
The only problem with this is the (maybe unlikely) case where the user has a multi-index level name that is the same as a column name, so reset_index will fail. If the user provides another name via index_label to overcome this, it will still fail, as I can't pass the labels to set as an argument to reset_index.
Fixed by first setting the index names on a copy of self.frame and then calling reset_index. Then it will generate an error only if the names are overlapping.
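A rough sketch of that fix, not the actual code in the PR: the helper name frame_with_index_columns is hypothetical, and a plain frame.copy() stands in for whatever copy the implementation uses.

```python
import pandas as pd


def frame_with_index_columns(frame, index_label=None):
    """Return a new frame with the index levels turned into columns.

    Hypothetical helper for illustration; the caller's frame is not modified.
    """
    temp = frame.copy()
    if index_label is not None:
        # Rename the levels on the copy only, before resetting the index.
        temp.index = temp.index.set_names(index_label)
    return temp.reset_index()


# Index level 'a' clashes with column 'a', so a plain reset_index would fail.
frame = pd.DataFrame(
    {"a": [10, 20]},
    index=pd.MultiIndex.from_tuples([(0, "x"), (1, "y")], names=["a", "b"]),
)

# Supplying a non-overlapping label for the first level avoids the clash.
out = frame_with_index_columns(frame, index_label=["id", "b"])
```

Without index_label, the same call raises a ValueError from reset_index because the level name and column name overlap.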
since SQL cannot really store meta data, it's a bit non-trivial to automatically figure out the Maybe best not to allow non-named levels at all?
@mangecoeur Would you have some time to review this? (as you wrote most of this code)
Since there isn't more feedback, I am going to merge this to not hold up further PRs. But of course, if there is still feedback, glad to hear it! It can always be adapted afterwards. @jreback I am going to take the same way as The only problem I still have is the legacy mode. I didn't add multi-index support to legacy
@jorisvandenbossche all fine FYI if u reset and get an error I think in HDFStore I raise
@jreback Ah, yes, good idea, I caught the exception and expanded the error message.
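One way this could look, as a sketch only: the function name reset_index_or_explain and the message wording are assumptions, not the text used in pandas.

```python
import pandas as pd


def reset_index_or_explain(frame):
    """Hypothetical wrapper: reset the index, expanding the error on a name clash."""
    try:
        return frame.reset_index()
    except ValueError as err:
        # Assumed wording; the actual message in the PR may differ.
        raise ValueError(
            "duplicate name in index/columns: {0}".format(err)
        )


# Index name 'a' overlaps with column 'a', so reset_index raises ValueError.
clashing = pd.DataFrame({"a": [1, 2]}, index=pd.Index([0, 1], name="a"))

try:
    reset_index_or_explain(clashing)
    message = None
except ValueError as err:
    message = str(err)
```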
ENH: SQL multiindex support
Further work on #6292.
This adds:
columns argument (wasn't passed) + added test for that
multi-index support in to_sql, by just using reset_index on the frame (+ tests)