
BUG: Index.drop_duplicates() is inconsistent for unhashable values #60925

Open · 2 of 3 tasks
camriddell opened this issue Feb 13, 2025 · 3 comments

Labels: Bug, Error Reporting (Incorrect or improved errors from pandas)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

## example A
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

print(df.columns.drop_duplicates())
# Traceback (most recent call last):
#   File "/home/cameron/.vim-excerpt", line 5, in <module>
#     print(df.columns.drop_duplicates())
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3117, in drop_duplicates
#     if self.is_unique:
#        ^^^^^^^^^^^^^^
#   File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 2346, in is_unique
#     return self._engine.is_unique
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "index.pyx", line 266, in pandas._libs.index.IndexEngine.is_unique.__get__
#   File "index.pyx", line 271, in pandas._libs.index.IndexEngine._do_unique_check
#   File "index.pyx", line 333, in pandas._libs.index.IndexEngine._ensure_mapping_populated
#   File "pandas/_libs/hashtable_class_helper.pxi", line 7115, in pandas._libs.hashtable.PyObjectHashTable.map_locations
# TypeError: unhashable type: 'list'


## --------
## example B
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

# hasattr triggers a side effect, after which `df.columns.drop_duplicates()` works.
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates())
# Index(['a', ['b', 'c']], dtype='object')

Issue Description

pandas.Index.drop_duplicates() inconsistently raises TypeError: unhashable type: 'list' when its values contain a list. The error does not seem to prevent the underlying uniqueness computation from happening. In addition to the reproducible example above, the behavior can be demonstrated directly on an Index object.

If we call .drop_duplicates() when the Index contains unhashable values, we observe a TypeError:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
idx.drop_duplicates() # TypeError: unhashable type: 'list'

But if we simply ignore the error and call .drop_duplicates() a second time, it succeeds and removes the duplicated entries, including the unhashable ones:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
try:
    idx.drop_duplicates()    # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

Here we can see that the underlying Index engine sets its hash table mapping even though the original call to drop_duplicates fails. We know the mapping was set because the second attempt at .drop_duplicates works:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx._engine.mapping)   # None
try:
    idx.drop_duplicates()    # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx._engine.mapping)   # <pandas._libs.hashtable.PyObjectHashTable>
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

Finally, it appears that attribute checking on a pandas.DataFrame causes the PyObjectHashTable to be constructed for the column index. This is likely due to the shared code path between __getattr__ and __getitem__.

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns._engine.mapping)   # None
hasattr(df, 'hello_world')
print(df.columns._engine.mapping)   # <pandas._libs.hashtable.PyObjectHashTable>
print(df.columns.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

Expected Behavior

I expect Index.drop_duplicates() to behave the same regardless of whether an attribute has been checked first. The following two snippets should produce equivalent results (whether that is raising an error or returning a result):

import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

print(df.columns.drop_duplicates()) # Currently produces → TypeError

## --------
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

hasattr(df, 'hello_world')
print(df.columns.drop_duplicates()) # Currently produces → Index(['a', ['b', 'c']], dtype='object')

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.7
python-bits : 64
OS : Linux
OS-release : 6.6.52-1-lts
Version : #1 SMP PREEMPT_DYNAMIC Wed, 18 Sep 2024 19:02:04 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.2.2
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.2.0
html5lib : None
hypothesis : 6.125.3
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.1
qtpy : None
pyqt5 : None

rhshadrach (Member) commented Feb 13, 2025

Thanks for the report! This should raise consistently. Further investigations and PRs to fix are welcome!

Edit for visibility: As @MarcoGorelli points out below, this should even raise on index construction!

The source of the issue appears to be here:

self.mapping = self._make_hash_table(len(values))
self.mapping.map_locations(values, self.mask)

We create an unpopulated hash table, and then fail on the map_locations line. However, this only happens when self.is_mapping_populated is False, which uses:

@property
def is_mapping_populated(self) -> bool:
    return self.mapping is not None
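
For illustration, a minimal pure-Python sketch of that failure mode (EngineSketch and its methods are made-up names, not the actual Cython engine):

class EngineSketch:
    def __init__(self, values):
        self.values = values
        self.mapping = None

    @property
    def is_mapping_populated(self) -> bool:
        return self.mapping is not None

    def ensure_mapping_populated(self):
        if not self.is_mapping_populated:
            self.mapping = {}                 # table assigned first...
            for i, val in enumerate(self.values):
                self.mapping[val] = i         # ...then raises on the list

engine = EngineSketch(['a', ['b', 'c'], ['b', 'c']])
try:
    engine.ensure_mapping_populated()         # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(engine.is_mapping_populated)            # True, despite the failed populate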

Still, I do not see how this could then return correct results. It is certainly inefficient.

import pandas as pd

tuples = [(k,) for k in range(20000)] + [(0,)]
idx = pd.Index(tuples)
%timeit idx.drop_duplicates()
# 445 μs ± 3.66 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

lists = [[k] for k in range(20000)] + [[0]]
idx = pd.Index(lists)
try:
    idx.drop_duplicates()
except TypeError:
    pass
%timeit idx.drop_duplicates()
# 3.03 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My guess is that the hash table is somehow degenerating into saying everything is a collision, and therefore doing an O(n) lookup (where n is the size of the hash table).
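
As an aside, full-collision hashing degrades this way in general; here is a pure-Python illustration using an ordinary dict (AlwaysCollides is a made-up class, not pandas internals):

import time

class AlwaysCollides:
    def __init__(self, k):
        self.k = k
    def __hash__(self):
        return 0                  # every instance lands in the same bucket
    def __eq__(self, other):
        return self.k == other.k

for n in (1_000, 2_000, 4_000):
    start = time.perf_counter()
    {AlwaysCollides(k): None for k in range(n)}   # each insert probes all prior entries
    elapsed = time.perf_counter() - start
    print(n, f"{elapsed:.3f}s")   # time roughly quadruples as n doubles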

johnmwu commented Feb 14, 2025

Took a look. Think I see what's happening.

The root cause is that values are put into hash tables in two different ways: one where hash(val) is called (and raises) and one where it isn't.

First path. Raises due to calling hash().

for i in range(n):
    val = values[i]
    hash(val)
    k = kh_put_pymap(self.table, <PyObject*>val, &ret)
    self.table.vals[k] = i

Second path. Does not raise.

else:
    value = {{to_c_type}}(values[i])
    kh_put_{{ttype}}(table, value, &ret)
    out[i] = ret == 0

The reason there are two code paths is:

  1. In the first call to drop_duplicates(), computation of self.is_unique initializes idx._engine.mapping, which needs to call PyObjectHashTable.map_locations (the snippet linked for the first path):

     if self.is_unique:
         return self._view()
     return super().drop_duplicates(keep=keep)

  2. The second time around, self.is_unique is False. We then call super().drop_duplicates(), which, if you trace through, goes down the second code path. Here we build an entirely new hash table, identical (from what I can tell) to idx._engine.mapping, in the duplicated function of hashtable_func_helper.pxi. The natural question, of course, is why this is done.

As for what to do, I'm going on the premise given by @rhshadrach that we should raise both times. Intuitively, there is no need for strong support of unhashable values.

I'm going to suggest a few options, though I'm not yet familiar with the code so not sure which is best (if any).

  1. The minimum-code option is adding hash(values[i]) in the second path. This achieves the desired behavior (see the sketch after this list).
  2. In drop_duplicates, if the computation of self.is_unique failed the first time, should it evaluate to False on the second call? I would think not. In that case, the second call would go down exactly the same code path as the first and raise again.
  3. Reuse the already computed hash table when calling drop_duplicates. Not a direct fix, but it would remove the second code path.
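
For illustration, a minimal sketch of option 1 against the second path quoted above (the surrounding template context in hashtable_func_helper.pxi is assumed; in practice only the object-dtype instantiation would need the extra call):

else:
    value = {{to_c_type}}(values[i])
    # Assumed change: call hash() so unhashable values raise the same
    # TypeError as the first path, instead of being inserted silently.
    hash(values[i])
    kh_put_{{ttype}}(table, value, &ret)
    out[i] = ret == 0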

Thoughts on how to proceed?

MarcoGorelli (Member) commented:

Should this not raise even earlier? As in, should columns=['a', ['b', 'c'], ['b', 'c']] even be allowed in the DataFrame constructor? (I think not?)
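
For reference, both constructors currently accept these unhashable labels without raising:

import pandas as pd # 2.2.3

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx) # Index(['a', ['b', 'c'], ['b', 'c']], dtype='object')
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])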
