TypeError for mixed-type indices in python3 #20900

Closed
normanius opened this issue May 1, 2018 · 6 comments

normanius commented May 1, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
table = pd.DataFrame(data=np.random.randn(5,2), index=[4,1,'foo',3,'bar'])
table = table.sort_index()

Problem description

The code above works in Python 2. In Python 3, however, a TypeError is raised:

TypeError: '<' not supported between instances of 'str' and 'int'

The reason for this is described here: in Python 3, sequences of mixed types can no longer be sorted directly.
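
A minimal pure-Python sketch of the underlying behaviour, independent of pandas: the same labels fail under a plain `sorted` call, and sorting succeeds once an explicit key makes the elements comparable. The choice of `key=str` is only an illustrative assumption; it orders integers by their string form, not numerically.

```python
labels = [4, 1, 'foo', 3, 'bar']

# Python 3 refuses to compare int and str, so a plain sort raises TypeError.
try:
    sorted(labels)
    raised = False
except TypeError:
    raised = True

# An explicit key makes every element comparable (here: by its string form).
by_string = sorted(labels, key=str)
print(raised, by_string)
```

Because the key compares string forms, a label such as 10 would sort before 2, which is why the workaround further down zero-pads integers.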

Expected Output

No exception when calling sort_index in python3.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 10.0.0
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

normanius (Author) commented May 1, 2018

I am currently using this workaround, which works for my kind of index (ints and strings). I also posted the problem on Stack Overflow.

import numpy as np
import pandas as pd

def stringifiedSortIndex(table):
    # 1) Stringify the index.
    _stringifyIdx = _StringifyIdx()
    table.index = table.index.map(_stringifyIdx)
    # 2) Sort the index.
    table = table.sort_index()
    # 3) Destringify the sorted table.
    _stringifyIdx.revert = True
    table.index = table.index.map(_stringifyIdx)
    # Return the sorted table with its original index labels restored.
    return table

class _StringifyIdx(object):
    def __init__(self):
        self._destringifyMap = dict()
        self.revert = False
    def __call__(self, idx):
        if not self.revert:
            return self._stringifyIdx(idx)
        else:
            return self._destringifyIdx(idx)

    # Stringify whatever needs to be converted.
    # In this example: only ints are stringified.
    @staticmethod
    def _stringify(x):
        if isinstance(x,int):
            x = '%03d' % x
            destringify = int
        else:
            destringify = lambda x: x
        return x, destringify

    def _stringifyIdx(self, idx):
        if isinstance(idx, tuple):
            idx = list(idx)
            destr = [None]*len(idx)
            for i,x in enumerate(idx):
                idx[i], destr[i] = self._stringify(x)
            idx = tuple(idx)
            destr = tuple(destr)
        else:
            idx, destr = self._stringify(idx)
        if self._destringifyMap is not None:
            self._destringifyMap[idx] = destr
        return idx

    def _destringifyIdx(self, idx):
        if idx not in self._destringifyMap:
            raise ValueError(("Index to destringify has not been stringified "
                              "by this class instance. Index must not change "
                              "between stringification and destringification."))
        destr = self._destringifyMap[idx]
        if isinstance(idx, tuple):
            assert(len(destr)==len(idx))
            idx = tuple(d(i) for d,i in zip(destr, idx))
        else:
            idx = destr(idx)
        return idx


# Build the table.
index = [(10,3),(10,1),(2,2),('foo',4),('bar',5)]
index = pd.MultiIndex.from_tuples(index)
data = np.random.randn(len(index),2)
table = pd.DataFrame(data=data, index=index)
idx = pd.IndexSlice

table = stringifiedSortIndex(table)
print(table)

# Now, the table rows can be accessed as usual.
table.loc[idx[10],:]
table.loc[idx[:10],:]
table.loc[idx[:'bar',:],:]
table.loc[idx[:,:2],:]

# This also works for a table with a simple (single-level) index.
table = pd.DataFrame(data=data, index=[4,1,'foo',3,'bar'])
table = stringifiedSortIndex(table)
table[:'bar']

TomAugspurger (Contributor) commented

FYI, the example in your original post doesn't run because index isn't defined.

I think we deliberately choose to follow how the language does things here. NumPy does the same ("sorts" on Py2, raises on Py3).

normanius (Author) commented May 1, 2018

> FYI, the example in your original post doesn't run because index isn't defined.

Fixed!

> I think we deliberately choose to follow how the language does things here. NumPy does the same ("sorts" on Py2, raises on Py3).

I agree that the new ordering policy in Python 3 is a global change that does not necessarily need to be compensated for in every case. However, because the index is represented internally by a Categorical object (and not just plain lists), and because indices need to be lex-sorted for certain operations on the table (for example, the notationally efficient slicing operators), I expect pandas to provide a means of sorting the indices so that these elementary operations work.

The alternative would be to disallow mixed-type indices altogether, or to provide a cookbook recipe on how to deal with them.

normanius (Author) commented

Here is an example with mixed-type indices that requires the table to be lex-sorted and currently fails on Python 3.

import pandas as pd
import numpy as np
index = [(10,3),(10,1),(2,2),('foo',4),('bar',5)]
index = pd.MultiIndex.from_tuples(index)
data = np.random.randn(len(index),2)
table = pd.DataFrame(data=data, index=index)

idx=pd.IndexSlice
table.loc[idx[:10,:],:]
# The last line will raise an UnsortedIndexError because 
# 'foo' and 'bar' appear in the wrong order.

jreback (Contributor) commented May 1, 2018

This is using pandas in a very non-idiomatic way, as mixed types are not easily represented except by the object dtype. That said, the internal sorting mechanisms for object types could use safe_sort (see pandas.core.sorting), which handles these mixed-type cases correctly.
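
One way to picture a type-aware ordering of this kind, sketched in plain Python rather than with pandas internals: numbers are placed before strings, and each group is sorted among itself. The helper name `mixed_key` is hypothetical, and this is not a claim about how `safe_sort` is actually implemented.

```python
def mixed_key(label):
    """Sort key for a label: numbers first, then strings (hypothetical helper)."""
    if isinstance(label, tuple):
        # Multi-level labels: build the key element by element.
        return tuple(mixed_key(x) for x in label)
    # (False, n) sorts before (True, s), so an int is never compared with a str.
    return (isinstance(label, str), label)

flat = sorted([4, 1, 'foo', 3, 'bar'], key=mixed_key)
multi = sorted([(10, 3), (10, 1), (2, 2), ('foo', 4), ('bar', 5)], key=mixed_key)
print(flat)
print(multi)
```

Unlike a string-based key, this keeps integers in numeric order (2 before 10) without any zero-padding.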

mroeschke (Member) commented

This looks like a duplicate of #17010. Closing.
