
BUG: Index.drop_duplicates() is inconsistent for unhashable values #60925

Open · 2 of 3 tasks
camriddell opened this issue Feb 13, 2025 · 3 comments

Labels: Bug, Error Reporting (Incorrect or improved errors from pandas)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

## example A
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

print(df.columns.drop_duplicates())
# Traceback (most recent call last):
#   File "/home/cameron/.vim-excerpt", line 5, in <module>
#     print(df.columns.drop_duplicates())
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3117, in drop_duplicates
#     if self.is_unique:
#        ^^^^^^^^^^^^^^
#   File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 2346, in is_unique
#     return self._engine.is_unique
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "index.pyx", line 266, in pandas._libs.index.IndexEngine.is_unique.__get__
#   File "index.pyx", line 271, in pandas._libs.index.IndexEngine._do_unique_check
#   File "index.pyx", line 333, in pandas._libs.index.IndexEngine._ensure_mapping_populated
#   File "pandas/_libs/hashtable_class_helper.pxi", line 7115, in pandas._libs.hashtable.PyObjectHashTable.map_locations
# TypeError: unhashable type: 'list'


## --------
## example B
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

# hasattr triggers a side effect, after which `df.columns.drop_duplicates()` works.
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates())
# Index(['a', ['b', 'c']], dtype='object')

Issue Description

pandas.Index.drop_duplicates() inconsistently raises TypeError: unhashable type: 'list' when its values contain a list. The error does not seem to prevent the underlying uniqueness computation from happening. In addition to the reproducible example above, the behavior can be demonstrated directly on an Index object.

If we call .drop_duplicates() when the Index contains unhashable values, we observe a TypeError:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
idx.drop_duplicates() # TypeError: unhashable type: 'list'

But if we simply ignore the error and call .drop_duplicates() a second time, it succeeds and removes the duplicated entries, including the unhashable ones:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
try:
    idx.drop_duplicates()    # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

Here we can see that the underlying Index engine sets its hash table mapping even though the original call to drop_duplicates fails. We know the mapping was set because the second attempt at .drop_duplicates works:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx._engine.mapping)   # None
try:
    idx.drop_duplicates()    # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx._engine.mapping)   # <pandas._libs.hashtable.PyObjectHashTable>
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

Finally, it appears that attribute checking on a pandas.DataFrame causes the PyObjectHashTable to be constructed for the column index. This is likely due to the shared code path between __getattr__ and __getitem__.

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns._engine.mapping)   # None
hasattr(df, 'hello_world')
print(df.columns._engine.mapping)   # <pandas._libs.hashtable.PyObjectHashTable>
print(df.columns.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

Expected Behavior

I expect Index.drop_duplicates() to behave the same regardless of whether an attribute has been checked first. The following two snippets should produce equivalent results (whether that is raising an error or returning a result):

import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

print(df.columns.drop_duplicates()) # Currently produces → TypeError

## --------
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

hasattr(df, 'hello_world')
print(df.columns.drop_duplicates()) # Currently produces → Index(['a', ['b', 'c']], dtype='object')

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.7
python-bits : 64
OS : Linux
OS-release : 6.6.52-1-lts
Version : #1 SMP PREEMPT_DYNAMIC Wed, 18 Sep 2024 19:02:04 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.2.2
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.2.0
html5lib : None
hypothesis : 6.125.3
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.1
qtpy : None
pyqt5 : None

rhshadrach (Member) commented Feb 13, 2025

Thanks for the report! This should raise consistently. Further investigations and PRs to fix are welcome!

Edit for visibility: As @MarcoGorelli points out below, this should even raise on index construction!

The source of the issue appears to be here:

self.mapping = self._make_hash_table(len(values))
self.mapping.map_locations(values, self.mask)

We create an unpopulated hash table, and then fail on the map_locations line. However, this only happens when self.is_mapping_populated is False, which uses:

@property
def is_mapping_populated(self) -> bool:
    return self.mapping is not None
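
For illustration, a minimal pure-Python sketch of that failure mode (EngineSketch and its methods are made-up names, not the actual Cython engine):

class EngineSketch:
    def __init__(self, values):
        self.values = values
        self.mapping = None

    @property
    def is_mapping_populated(self) -> bool:
        return self.mapping is not None

    def ensure_mapping_populated(self):
        if not self.is_mapping_populated:
            self.mapping = {}                 # table assigned first...
            for i, val in enumerate(self.values):
                self.mapping[val] = i         # ...then raises on the list

engine = EngineSketch(['a', ['b', 'c'], ['b', 'c']])
try:
    engine.ensure_mapping_populated()         # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(engine.is_mapping_populated)            # True, despite the failed populate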

Still, I do not see how this could then return correct results. It is certainly inefficient.

import pandas as pd

tuples = [(k,) for k in range(20000)] + [(0,)]
idx = pd.Index(tuples)
%timeit idx.drop_duplicates()
# 445 μs ± 3.66 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

lists = [[k] for k in range(20000)] + [[0]]
idx = pd.Index(lists)
try:
    idx.drop_duplicates()
except TypeError:
    pass
%timeit idx.drop_duplicates()
# 3.03 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My guess is that the hash table is somehow degenerating into saying everything is a collision, and therefore doing an O(n) lookup (where n is the size of the hash table).
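
As an aside, full-collision hashing degrades this way in general; here is a pure-Python illustration using an ordinary dict (AlwaysCollides is a made-up class, not pandas internals):

import time

class AlwaysCollides:
    def __init__(self, k):
        self.k = k
    def __hash__(self):
        return 0                  # every instance lands in the same bucket
    def __eq__(self, other):
        return self.k == other.k

for n in (1_000, 2_000, 4_000):
    start = time.perf_counter()
    {AlwaysCollides(k): None for k in range(n)}   # each insert probes all prior entries
    elapsed = time.perf_counter() - start
    print(n, f"{elapsed:.3f}s")   # time roughly quadruples as n doubles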

johnmwu commented Feb 14, 2025

Took a look. Think I see what's happening.

The root cause is that values are put into hash tables in two different ways: one where hash(val) is called (and raises) and one where it isn't.

First path. Raises due to calling hash().

for i in range(n):
    val = values[i]
    hash(val)
    k = kh_put_pymap(self.table, <PyObject*>val, &ret)
    self.table.vals[k] = i

Second path. Does not raise.

else:
    value = {{to_c_type}}(values[i])
    kh_put_{{ttype}}(table, value, &ret)
    out[i] = ret == 0

The reason there are two code paths is:

  1. In the first call to drop_duplicates(), computation of self.is_unique initializes idx._engine.mapping, which needs to call PyObjectHashTable.map_locations (the snippet linked for the first path):

     if self.is_unique:
         return self._view()
     return super().drop_duplicates(keep=keep)

  2. The second time around, self.is_unique is False. We then call super().drop_duplicates(), which, if you trace through, goes down the second code path. Here we build an entirely new hash table, identical (from what I can tell) to idx._engine.mapping, in the duplicated function of hashtable_func_helper.pxi. The natural question, of course, is why this is done.

As for what to do, I'm going on the premise given by @rhshadrach that we should raise both times. Intuitively, there is no need for strong support of unhashable values.

I'm going to suggest a few options, though I'm not yet familiar with the code so not sure which is best (if any).

  1. The minimum-code option is adding hash(values[i]) in the second path. This achieves the desired behavior (see the sketch after this list).
  2. In drop_duplicates, if the computation of self.is_unique failed the first time, should it evaluate to False on the second call? I would think not. In that case, the second call would go down exactly the same code path as the first and raise again.
  3. Reuse the already computed hash table when calling drop_duplicates. Not a direct fix, but it would remove the second code path.
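
For illustration, a minimal sketch of option 1 against the second path quoted above (the surrounding template context in hashtable_func_helper.pxi is assumed; in practice only the object-dtype instantiation would need the extra call):

else:
    value = {{to_c_type}}(values[i])
    # Assumed change: call hash() so unhashable values raise the same
    # TypeError as the first path, instead of being inserted silently.
    hash(values[i])
    kh_put_{{ttype}}(table, value, &ret)
    out[i] = ret == 0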

Thoughts on how to proceed?

MarcoGorelli (Member) commented:

Should this not raise even earlier? As in, should columns=['a', ['b', 'c'], ['b', 'c']] even be allowed in the DataFrame constructor? (I think not?)
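
For reference, both constructors currently accept these unhashable labels without raising:

import pandas as pd # 2.2.3

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx) # Index(['a', ['b', 'c'], ['b', 'c']], dtype='object')
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])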
