MultiIndex - Comparison with Mixed Frequencies (and other FUBAR) #17112

Closed
jbrockmendel opened this issue Jul 29, 2017 · 10 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas MultiIndex Period Period data type

Comments

@jbrockmendel
Member

Setup:

import pandas as pd

index = pd.Index(['PCE']*4, name='Variable')
data = [
    pd.Period('2018Q2'),
    pd.Period('2021', freq='5A-Dec'),
    pd.Period('2026', freq='10A-Dec'),
    pd.Period('2017Q2')
    ]
ser = pd.Series(data, index=index, name='Period')

In the real-life version of this issue, 'Period' is a column in a DataFrame and I need to append it as a new level to the index. The snippets here show the problem(s) in both py2 and py3, but for reasons unknown df.set_index('Period', append=True) goes through fine in py2.

The large majority of Period values are quarterly-frequency.

py2

>>> pd.__version__
'0.20.2'
>>> ser.sort_values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1710, in sort_values
    argsorted = _try_kind_sort(arr[good])
  File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
    return arr.argsort(kind=kind)
  File "pandas/_libs/period.pyx", line 725, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11842)
pandas._libs.period.IncompatibleFrequency: Input has different freq=10A-DEC from Period(freq=Q-DEC)

>>> ser.to_frame()
         Period
Variable       
PCE      2018Q2
PCE        2021
PCE        2026
PCE      2017Q2
>>> ser.to_frame().set_index('Period', append=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2836, in set_index
    index = MultiIndex.from_arrays(arrays, names=names)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/multi.py", line 1100, in from_arrays
    labels, levels = _factorize_from_iterables(arrays)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2193, in _factorize_from_iterables
    return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2165, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 310, in __init__
    raise NotImplementedError("> 1 ndim Categorical are not "
NotImplementedError: > 1 ndim Categorical are not supported at this time

No idea why it thinks Categorical is relevant here. That doesn't happen in py3.

For the purposes of sort_values, refusing to sort might make sense. But when all I care about is set_index, I'm pretty indifferent to the ordering.

py3

>>> pd.__version__
'0.20.2'
>>> ser.sort_values()
pandas._libs.period.IncompatibleFrequency: Input has different freq=Q-DEC from Period(freq=5A-DEC)

During handling of the above exception, another exception occurred:
SystemError: <built-in function isinstance> returned a result with an error set
[...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1710, in sort_values
    argsorted = _try_kind_sort(arr[good])
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
    return arr.argsort(kind=kind)
  File "pandas/_libs/period.pyx", line 723, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11713)
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 375, in __ne__
    return not self == other
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 364, in __eq__
    if isinstance(other, compat.string_types):
SystemError: <built-in function isinstance> returned a result with an error set

>>> ser.to_frame().set_index('Period', append=True)
pandas._libs.period.IncompatibleFrequency: Input has different freq=Q-DEC from Period(freq=5A-DEC)

During handling of the above exception, another exception occurred:
SystemError: <built-in function isinstance> returned a result with an error set
[...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2836, in set_index
    index = MultiIndex.from_arrays(arrays, names=names)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 1100, in from_arrays
    labels, levels = _factorize_from_iterables(arrays)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 2193, in _factorize_from_iterables
    return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 2193, in <listcomp>
    return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 2165, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 298, in __init__
    codes, categories = factorize(values, sort=True)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/algorithms.py", line 567, in factorize
    assume_unique=True)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/algorithms.py", line 486, in safe_sort
    sorter = values.argsort()
  File "pandas/_libs/period.pyx", line 723, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11713)
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 375, in __ne__
    return not self == other
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 364, in __eq__
    if isinstance(other, compat.string_types):
SystemError: <built-in function isinstance> returned a result with an error set

I have no idea what to make of this.

A problem that I have not been able to replicate with a copy/pasteable subset of the data:

>>> mi = pd.MultiIndex.from_arrays([period.index, period])
>>> mi
[... prints roughly what we'd expect...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 800, in shape
    return self._values.shape
  File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 860, in _values
    return self.values
  File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 667, in values
    self._tuples = lib.fast_zip(values)
  File "pandas/_libs/lib.pyx", line 549, in pandas._libs.lib.fast_zip (pandas/_libs/lib.c:10513)
ValueError: all arrays must be same length

>>> mi.names
FrozenList(['Variable', None])
>>> mi[0]
('CPROF', 'Period')
>>> mi[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 1377, in __getitem__
    if lab[key] == -1:
IndexError: index 1 is out of bounds for axis 0 with size 1

AFAICT it took the name 'Period' and made that the only value in the new level of the MultiIndex. Really no idea what's going on here.

@gfyoung gfyoung added the Datetime Datetime data dtype label Jul 29, 2017
@gfyoung
Member

gfyoung commented Jul 29, 2017

Yikes! I can't really follow your last code snippet (not sure what period is). That being said, it does appear that all of your issues are stemming from a frequency incompatibility one way or the other.

No idea why it thinks Categorical is relevant here. That doesn't happen in py3.

Yeah...that does look a little weird. Can you try first upgrading to 0.20.3 and see if anything changes on that end? If not, then we most certainly should improve the error message.

@gfyoung gfyoung added the Error Reporting Incorrect or improved errors from pandas label Jul 29, 2017
@jbrockmendel
Member Author

jbrockmendel commented Jul 29, 2017

(not sure what period is)

That's the real-life column (2770 rows). Looks a lot like ser from the snippet, but I haven't figured out a snippet that demonstrates the problem. It looks like it's caused by passing a single-column DataFrame to from_arrays:

index = pd.Index(['CPROF', 'HOUSING', 'INDPROD', 'NGDP', 'PGDP'])
data = [pd.Period('1968Q4')]*5
df = pd.DataFrame(data, index=index, columns=['Period'])
mi = pd.MultiIndex.from_arrays([df.index, df])

>>> mi
MultiIndex(levels=[['CPROF', 'HOUSING', 'INDPROD', 'NGDP', 'PGDP'], ['Period']],
           labels=[[0, 1, 2, 3, 4], [0]])
>>> mi.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 800, in shape
    return self._values.shape
  File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 860, in _values
    return self.values
  File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 667, in values
    self._tuples = lib.fast_zip(values)
  File "pandas/_libs/lib.pyx", line 549, in pandas._libs.lib.fast_zip (pandas/_libs/lib.c:10513)
ValueError: all arrays must be same length

On the plus side, it's clearly a user error (this guy). Ideally it'd be caught in __init__ though.
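Iterating over a DataFrame yields its column labels, which presumably explains why the second level ended up holding only 'Period'. An up-front check along the following lines would reject such input early. This is a minimal sketch, not pandas code: `check_arrays` and `FakeFrame` are hypothetical names, and `FakeFrame` only imitates the two properties of a single-column DataFrame that matter here (`ndim == 2` and a length of 5).

```python
def check_arrays(arrays):
    """Raise ValueError unless every input is 1-D and all lengths match."""
    lengths = []
    for position, arr in enumerate(arrays):
        # Objects without an ndim attribute (lists, tuples) count as 1-D.
        ndim = getattr(arr, "ndim", 1)
        if ndim != 1:
            raise ValueError(
                "array at position %d has %d dimensions; expected 1"
                % (position, ndim)
            )
        lengths.append(len(arr))
    if len(set(lengths)) > 1:
        raise ValueError("all arrays must be same length, got %s" % lengths)


class FakeFrame:
    """Hypothetical stand-in for a single-column DataFrame."""

    ndim = 2

    def __len__(self):
        return 5


try:
    check_arrays([["CPROF", "HOUSING", "INDPROD", "NGDP", "PGDP"], FakeFrame()])
    message = None
except ValueError as err:
    message = str(err)
```

With a check like this, the error would surface when the index is built and would name the offending input, instead of appearing later when `shape` is accessed.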

Can you try first upgrading to 0.20.3 and see if anything changes on that end?

Unchanged.

@jbrockmendel
Member Author

I still don't get why MultiIndex.from_arrays needs to go through Categorical, but a partial fix can be made in Categorical.__init__:

        if categories is None:
            try:
                codes, categories = factorize(values, sort=True)
            except TypeError:
                codes, categories = factorize(values, sort=False)
                if ordered:
                    # raise, as we don't have a sortable data structure and so
                    # the user should give us one by specifying categories
                    raise TypeError("'values' is not ordered, please "
                                    "explicitly specify the categories order "
                                    "by passing in a categories argument.")
            except ValueError:

                # FIXME
                raise NotImplementedError("> 1 ndim Categorical are not "
                                          "supported at this time")

Especially in py3, we expect unsortable errors to be TypeErrors, but the error that gets raised when trying to compare Periods with different frequencies is _libs.period.IncompatibleFrequency, which subclasses ValueError.

Having the except TypeError: above also catch IncompatibleFrequency gets us one step closer to correctness. But then it raises immediately because ordered is True here. Any idea why MultiIndex.from_arrays is requiring an ordered Categorical?
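The exception mismatch can be reproduced without pandas. The sketch below is a stand-in under stated assumptions, not the pandas implementation: `IncompatibleFrequency`, `ToyPeriod`, and `sort_or_fallback` are hypothetical names that only mirror the shape of the try/except quoted above.

```python
class IncompatibleFrequency(ValueError):
    """Stand-in for the pandas exception, which subclasses ValueError."""


class ToyPeriod:
    """Hypothetical minimal period: an ordinal plus a frequency string."""

    def __init__(self, ordinal, freq):
        self.ordinal = ordinal
        self.freq = freq

    def __lt__(self, other):
        if self.freq != other.freq:
            raise IncompatibleFrequency(
                "Input has different freq=%s from ToyPeriod(freq=%s)"
                % (other.freq, self.freq)
            )
        return self.ordinal < other.ordinal


def sort_or_fallback(values, extra_errors=()):
    """Try to sort; on an 'unsortable' error, keep the original order.

    extra_errors lists exception types to treat like TypeError.
    """
    try:
        return sorted(values), "sorted"
    except (TypeError,) + tuple(extra_errors):
        return list(values), "unsorted"


mixed = [ToyPeriod(1, "Q-DEC"), ToyPeriod(2, "5A-DEC"), ToyPeriod(0, "Q-DEC")]

# Catching only TypeError lets the ValueError subclass escape.
try:
    sort_or_fallback(mixed)
    outcome = "handled"
except ValueError:
    outcome = "escaped as ValueError"

# Treating IncompatibleFrequency like TypeError reaches the fallback.
result, how = sort_or_fallback(mixed, extra_errors=(IncompatibleFrequency,))
```

In this toy version the fallback simply keeps the input order. In the Categorical code quoted above, the equivalent branch would still raise when ordered is True, which is the remaining obstacle.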

@gfyoung
Member

gfyoung commented Jul 30, 2017

¯\_(ツ)_/¯

Feel free to experiment and see what happens when you loosen that restriction 😄

@jbrockmendel
Member Author

There are more effective approaches than trial and error. Someone somewhere knows why this decision was made in the first place.

@gfyoung
Member

gfyoung commented Jul 31, 2017

Someone somewhere knows why this decision was made in the first place.

Perhaps, but I'm assuming worst case in that we don't remember anymore why that is the case.

@jreback
Contributor

jreback commented Jul 31, 2017

I still don't get why MultiIndex.from_arrays needs to go through Categorical, but a partial fix can be made in Categorical.__init__:

well, you need to factorize things when you construct a MI.
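Factorizing here means splitting each array into a list of distinct values (the levels) and integer positions into that list (the labels, or codes). A minimal sketch of that step, assuming plain hashable values and a hypothetical helper `factorize_first_seen`; it is not the pandas implementation, and it deliberately skips the sorting step that the tracebacks above fail in:

```python
def factorize_first_seen(values):
    """Return (codes, levels) with levels in order of first appearance.

    Only hashing and equality are used, so the values never need to be
    compared with < or >.
    """
    positions = {}
    codes = []
    levels = []
    for value in values:
        if value not in positions:
            positions[value] = len(levels)
            levels.append(value)
        codes.append(positions[value])
    return codes, levels


# Strings stand in for Period objects in this illustration.
codes, levels = factorize_first_seen(["2018Q2", "2021", "2026", "2018Q2"])
# codes  -> [0, 1, 2, 0]
# levels -> ["2018Q2", "2021", "2026"]
```

Whether unsorted levels are acceptable for a MultiIndex is exactly the open question in this thread; the sketch only shows that factorizing by itself does not require an ordering.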

Not really sure what this issue is about, it has gone off on tangents. Can you provide a narrow clear example.

@jreback jreback removed Bug Error Reporting Incorrect or improved errors from pandas Datetime Datetime data dtype labels Jul 31, 2017
@jreback
Contributor

jreback commented Jul 31, 2017

@gfyoung don't tag things until it is clear what they are.

@jbrockmendel
Member Author

Not really sure what this issue is about, it has gone off on tangents. Can you provide a narrow clear example.

  1. Period.__richcmp__ currently raises IncompatibleFrequency when trying to compare periods with unequal frequencies. This breaks things that do sorting under the hood. It should be changed.
  • 1b) The main impediment here is agreeing on a convention for what the ordering should look like. I suggest lexicographic ordering (start_time, freq), which is equivalent to (start_time, end_time)
  2. Even if it isn't changed, the errors that it raises are misleading and inconsistent.
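The ordering suggested in 1b can be sketched as a sort key. This is an illustration under stated assumptions, not a proposed implementation: `Span` and `span_sort_key` are hypothetical names, and the date ranges are written out by hand rather than computed by pandas.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for a period: a label plus its first and last day.
Span = namedtuple("Span", ["label", "start", "end"])


def span_sort_key(span):
    """Order by start date, breaking ties with the end date."""
    return (span.start, span.end)


# Date ranges are written by hand for illustration, not computed by pandas.
spans = [
    Span("quarter from 2018-04", date(2018, 4, 1), date(2018, 6, 30)),
    Span("five years from 2021", date(2021, 1, 1), date(2025, 12, 31)),
    Span("one year from 2021", date(2021, 1, 1), date(2021, 12, 31)),
    Span("quarter from 2017-04", date(2017, 4, 1), date(2017, 6, 30)),
]

ordered = sorted(spans, key=span_sort_key)
labels = [s.label for s in ordered]
```

Two spans that start on the same day are ordered by where they end, so the shorter one sorts first. Spans of different lengths never raise, because only dates are compared.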

Setup:

import pandas as pd

index = pd.Index(['PCE']*4, name='Variable')
data = [
    pd.Period('2018Q2'),
    pd.Period('2021', freq='5A-Dec'),
    pd.Period('2026', freq='10A-Dec'),
    pd.Period('2017Q2')
    ]
ser = pd.Series(data, index=index, name='Period')

Clear Error:

>>> ser.sort_values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1710, in sort_values
    argsorted = _try_kind_sort(arr[good])
  File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
    return arr.argsort(kind=kind)
  File "pandas/_libs/period.pyx", line 725, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11842)
pandas._libs.period.IncompatibleFrequency: Input has different freq=10A-DEC from Period(freq=Q-DEC)

Incorrect Error Message (I think because IncompatibleFrequency subclasses ValueError and not TypeError)

>>> ser.to_frame().set_index('Period', append=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2836, in set_index
    index = MultiIndex.from_arrays(arrays, names=names)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/multi.py", line 1100, in from_arrays
    labels, levels = _factorize_from_iterables(arrays)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2193, in _factorize_from_iterables
    return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2165, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 310, in __init__
    raise NotImplementedError("> 1 ndim Categorical are not "
NotImplementedError: > 1 ndim Categorical are not supported at this time

py3

>>> ser.sort_values()
pandas._libs.period.IncompatibleFrequency: Input has different freq=Q-DEC from Period(freq=5A-DEC)

During handling of the above exception, another exception occurred:
SystemError: <built-in function isinstance> returned a result with an error set
[...]
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1710, in sort_values
    argsorted = _try_kind_sort(arr[good])
  File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
    return arr.argsort(kind=kind)
  File "pandas/_libs/period.pyx", line 723, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11713)
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 375, in __ne__
    return not self == other
  File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 364, in __eq__
    if isinstance(other, compat.string_types):
SystemError: <built-in function isinstance> returned a result with an error set

@toobaz
Member

toobaz commented Aug 26, 2017

This results in the same error:

In [2]: pd.Index([pd.Timestamp('2000-01-03 00:00:00', freq='B'),
                  pd.Period('2000-01-03', 'B'),
                  pd.Period('2000-01-03', 'B')]).sort_values()
[...]
SystemError: <built-in function isinstance> returned a result with an error set
