Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reindexing doesn't work as expected for MultiIndexs #12181

Closed
PeterKucirek opened this issue Jan 29, 2016 · 2 comments
Closed

Reindexing doesn't work as expected for MultiIndexs #12181

PeterKucirek opened this issue Jan 29, 2016 · 2 comments
Labels
Duplicate Report Duplicate issue or pull request Enhancement MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@PeterKucirek
Copy link

Using Series.reindex() does not work when the Series uses a MultiIndex and the desired re-index also uses a MultiIndex. An error message reports that the join is ambiguous, even when it is not.

In the following example, I'm setting up some schedule data for individuals who belong to households. Three tables will be created: households, persons, and activities. There is an implicit hierarchy: activities belong to persons which belong to households. Their indexes reflect this relationship.

import pandas as pd
import numpy as np

households = pd.DataFrame({'dwelling_type': np.random.choice(['house', 'apartment'], size=5, replace=True)})
households.index.name = 'hh_id'

person_ids = [0,1,0,1,2,3,0,0,0,1]
person_hhlds = np.repeat(households.index.values, [2,4,1,1,2])
persons_index = pd.MultiIndex.from_arrays([person_hhlds, person_ids], names=['hh_id', 'person_id'])

persons = pd.DataFrame(index=persons_index)
persons['sex'] = np.random.choice(list('FM'), size=10, replace=True)
persons['age'] = np.random.randint(18,60,10)

activity_ids = [0,0,0,1,0,0,1,2,0,0,0,1,0,0,1,2,3]
activity_persons = np.repeat(persons_index.get_level_values('person_id').values, [1,1,2,1,3,1,1,2,1,4])
activity_hhlds = np.repeat(persons_index.get_level_values('hh_id').values, [1,1,2,1,3,1,1,2,1,4])
activity_index = pd.MultiIndex.from_arrays([activity_hhlds, activity_persons, activity_ids], 
                                           names=['hh_id', 'person_id','activity_id'])

activities = pd.DataFrame(index=activity_index)
activities['type'] = np.random.choice(['work', 'school', 'shop'], size=len(activities), replace=True)
activities['zone'] = np.random.randint(1,10, len(activities))

Going from a regular Index to a MultiIndex works as expected. In this case, I'm trying to join dwelling_type data from the households table to the persons table:

person_dwelling_type = households.dwelling_type.reindex(persons.index, level='hh_id')
print pd.DataFrame({'person_dwelling_type': person_dwelling_type})
>>
hh_id  person_id
0      0                house
       1                house
1      0            apartment
       1            apartment
       2            apartment
       3            apartment
2      0            apartment
3      0                house
4      0            apartment
       1            apartment
Name: dwelling_type, dtype: object

But when I try to do a similar operation, joining age from persons to activities, I get an error:

activity_age = persons.age.reindex(activities.index, level=['hh_id', 'person_id'])
print activity_age 
>> ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-e2b526341a93> in <module>()
----> 1 activity_age = persons.age.reindex(activities.index, level=['hh_id', 'person_id'])
      2 pd.DataFrame({'activity_age': activity_age})

C:\Anaconda\lib\site-packages\pandas\core\series.pyc in reindex(self, index, **kwargs)
   2266     @Appender(generic._shared_docs['reindex'] % _shared_doc_kwargs)
   2267     def reindex(self, index=None, **kwargs):
-> 2268         return super(Series, self).reindex(index=index, **kwargs)
   2269 
   2270     @Appender(generic._shared_docs['fillna'] % _shared_doc_kwargs)

C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs)
   1960         # perform the reindex on the axes
   1961         return self._reindex_axes(axes, level, limit, tolerance,
-> 1962                                   method, fill_value, copy).__finalize__(self)
   1963 
   1964     def _reindex_axes(self, axes, level, limit, tolerance, method,

C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   1974             new_index, indexer = ax.reindex(
   1975                 labels, level=level, limit=limit, tolerance=tolerance,
-> 1976                 method=method)
   1977 
   1978             axis = self._get_axis_number(a)

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit, tolerance)
   5269             target, indexer, _ = self._join_level(target, level, how='right',
   5270                                                   return_indexers=True,
-> 5271                                                   keep_order=False)
   5272         else:
   5273             if self.equals(target):

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in _join_level(self, other, level, how, return_indexers, keep_order)
   2407 
   2408         if isinstance(self, MultiIndex) and isinstance(other, MultiIndex):
-> 2409             raise TypeError('Join on level between two MultiIndex objects '
   2410                             'is ambiguous')
   2411 

TypeError: Join on level between two MultiIndex objects is ambiguous

This seems to be a completely unintuitive result, given that the relationship is unambiguously one-to-many, even though MultiIndex's are used.

I'm able to work around this by dropping levels and replacing Indexs, but to me it seems that this should just be an easy thing to accomplish.

@jreback
Copy link
Contributor

jreback commented Jan 30, 2016

This is really a NotImplementedError, see the docs here

and can be done by the simple expedient of an actual merge.

In [43]: pd.merge(persons.age.reset_index(),activities.reset_index(),on=['hh_id','person_id']).set_index(['hh_id','person_id'])
Out[43]: 
                 age  activity_id    type  zone
hh_id person_id                                
0     0           45            0    work     5
      1           23            0    work     8
1     0           25            0    work     4
      0           25            1    work     5
      1           28            0    work     8
      2           33            0    shop     3
      2           33            1    shop     3
      2           33            2    shop     7
      3           36            0    shop     6
2     0           45            0    work     7
3     0           58            0  school     3
      0           58            1  school     9
4     0           58            0  school     5
      1           34            0    shop     7
      1           34            1    work     5
      1           34            2    work     9
      1           34            3    work     8

If you'd like to implement would be a welcome PR

@jreback
Copy link
Contributor

jreback commented Jan 30, 2016

this a dupe of #6360 , where a partial soln is implemented.

@jreback jreback closed this as completed Jan 30, 2016
@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode MultiIndex Duplicate Report Duplicate issue or pull request labels Jan 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Enhancement MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

2 participants