PERF: Unreliable performance of .loc with big non-unique index dataframe
#54550
Labels
Indexing
Related to indexing on series/frames, not to indexes themselves
Performance
Memory or execution speed performance
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Hello everyone,
In this example, I create a big DataFrame with a non-unique index. I then make sure the index is sorted and that it has already been queried once (you could confirm, but since the first calls to loc are always slower, I assume pandas is doing some kind of lazy optimization behind the scenes).

Then the problem. I get very good performance (~0.5 ms) until I ask for more than 4 index labels, at which point the time jumps to ~500 ms (a 1000x slowdown). If I instead look up one label at a time and concat afterwards, performance stays in the same order of magnitude (there is a small linear increase from the loop + concat of course, and it does get weird for 4...).

Here is the output for pandas version 2.1.0rc0+10.gfc308235f (but the same problem arises with pandas==2.0.3 and 1.5.3):

I would expect loc to have steady performance since, in my understanding, it should behave much like a hashmap lookup. Even if that is not the case, I think pandas would definitely benefit from a stable way of querying multiple index labels.

Thanks in advance for the help!
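For readers who want to try this, the setup described above can be sketched roughly as follows. This is not the original reproducer: the frame size, the number of distinct labels, the column name and the helper names (make_frame, loc_many, loc_one_by_one, best_time, benchmark) are all assumptions chosen for illustration. It compares a single loc call with a list of labels against one loc call per label followed by concat.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=1_000_000, n_keys=10_000, seed=0):
    """Build a frame with a sorted, non-unique integer index (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    index = np.sort(rng.integers(0, n_keys, size=n_rows))
    return pd.DataFrame({"value": rng.random(n_rows)}, index=index)


def loc_many(df, keys):
    """Single .loc call with a list of labels."""
    return df.loc[keys]


def loc_one_by_one(df, keys):
    """One .loc call per label, then concat. Each label is wrapped in a
    list so that every lookup returns a DataFrame."""
    return pd.concat([df.loc[[key]] for key in keys])


def best_time(func, df, keys, repeat):
    """Best wall-clock time in seconds over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(df, keys)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(df, max_keys=8, repeat=5):
    """Return (n_labels, seconds_loc_many, seconds_loc_one_by_one) rows."""
    # Warm-up query, so that any lazily built index structures already exist.
    df.loc[[df.index[0]]]
    labels = df.index.unique()[:max_keys].tolist()
    rows = []
    for k in range(1, max_keys + 1):
        keys = labels[:k]
        rows.append(
            (
                k,
                best_time(loc_many, df, keys, repeat),
                best_time(loc_one_by_one, df, keys, repeat),
            )
        )
    return rows
```

Calling benchmark(make_frame()) and printing the rows should make it possible to see whether the single-call timing jumps once more than 4 labels are requested. The exact numbers, and whether the jump shows up at all, may depend on the pandas version and on the frame size.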
Installed Versions
INSTALLED VERSIONS
commit : fc30823
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.6-76060206-generic
Version : #202303130630 168547333822.04~995127e SMP PREEMPT_DYNAMIC Tue M
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.1.0rc0+10.gfc308235f
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Prior Performance
No response