
PERF: Unreliable performance of .loc with big non-unique index dataframe #54550

Closed
3 tasks done
l3robot opened this issue Aug 15, 2023 · 2 comments · Fixed by #54746
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@l3robot
Contributor

l3robot commented Aug 15, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import random
import time

import pandas as pd


if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes.
    table_size = 10_000_000
    num_index = 1_000_000
    df = pd.DataFrame([1] * table_size)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Seems like we need to pre-query the index to force optimizations.
    df.loc[[0, 1, 2, 131, 341414]]
    df.loc[[3434]]
    
    for i in range(10):
        indexes = random.sample(df.index.tolist(), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once: num_indexes={i+1} => {measure:.5f}s")

    print("---")

    for i in range(10):
        indexes = random.sample(df.index.tolist(), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time: num_indexes={i+1} => {measure:.5f}s")

Hello everyone,

In this example, I create a big DataFrame with non-unique indexes. I then make sure the index is sorted and has already been queried (the first calls to .loc are always slower, so I assume pandas builds some index structure lazily on first lookup).

Now the problem: I get very good performance (~0.5 ms) up to 4 labels, but at 5 labels the time jumps drastically to ~500 ms (×1000). If I instead look up one label at a time and concatenate afterwards, performance stays in the same order of magnitude (there is a small linear increase from the loop and concat, of course, and an odd outlier at 4). Here is the output for pandas version 2.1.0rc0+10.gfc308235f (the same problem arises with pandas 2.0.3 and 1.5.3):

With all at once: num_indexes=1 => 0.00039s                                                                                             
With all at once: num_indexes=2 => 0.00045s                                                                                             
With all at once: num_indexes=3 => 0.00041s                                                                                             
With all at once: num_indexes=4 => 0.00041s                                                                                             
With all at once: num_indexes=5 => 0.47082s                                                                                             
With all at once: num_indexes=6 => 0.48561s                                                                                             
With all at once: num_indexes=7 => 0.47594s                                                                                             
With all at once: num_indexes=8 => 0.47567s                                                                                             
---                                                                                                                                     
With one at a time: num_indexes=1 => 0.00066s                                                                                           
With one at a time: num_indexes=2 => 0.00086s                                                                                           
With one at a time: num_indexes=3 => 0.00103s                                                                                           
With one at a time: num_indexes=4 => 0.05667s                                                                                           
With one at a time: num_indexes=5 => 0.00156s                                                                                           
With one at a time: num_indexes=6 => 0.00160s                                                                                           
With one at a time: num_indexes=7 => 0.00181s                                                                                           
With one at a time: num_indexes=8 => 0.00200s

I would expect .loc to have steady performance since, in my understanding, it should behave roughly like a hashmap lookup. Even if that is not the case, I think pandas would definitely benefit from a stable way of querying multiple labels.
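As a workaround, the binary searches can be done by hand with numpy's searchsorted, which stays O(m·log n) regardless of how many labels are requested. This is a sketch under the assumptions of the reproducer above (monotonically increasing, non-unique index); `loc_many` is a name I made up, not a pandas API:

```python
import numpy as np
import pandas as pd


def loc_many(df: pd.DataFrame, keys) -> pd.DataFrame:
    """Select all rows matching each key from a DataFrame whose index is
    sorted (monotonically increasing) and possibly non-unique, using one
    pair of binary searches per key instead of .loc's list lookup."""
    values = df.index.values
    parts = []
    for key in keys:
        # lo:hi is the contiguous block of rows whose index equals key.
        lo = np.searchsorted(values, key, side="left")
        hi = np.searchsorted(values, key, side="right")
        parts.append(df.iloc[lo:hi])
    return pd.concat(parts)


# Usage on a small sorted, non-unique index:
df = pd.DataFrame({"x": range(6)}, index=[0, 0, 1, 1, 2, 3])
print(loc_many(df, [0, 2]))
```

This sidesteps the threshold entirely, at the cost of a Python-level loop over the requested labels.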

Thanks in advance for the help!

Installed Versions

INSTALLED VERSIONS

commit : fc30823
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.6-76060206-generic
Version : #202303130630168547333822.04~995127e SMP PREEMPT_DYNAMIC Tue M
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.0rc0+10.gfc308235f
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

@l3robot l3robot added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Aug 15, 2023
@l3robot
Contributor Author

l3robot commented Aug 19, 2023

After a bit of research, I found that the behavior originates in PR #22826, at this line, and in particular this comment.

In that comment, @rtlee9 gives an excellent explanation of the reasoning behind the threshold. The default behavior is a linear search (O(n)) over the pandas index; #22826 added a binary-search optimization for monotonically increasing indexes, which runs in O(m·log n), where m is the number of target labels.

I'm proposing we increase the limit, which was conservatively set to 5. We could raise it to at least n / (2·log2(n)), which follows from requiring m·log2(n) < n and dividing by 2 to keep some margin. I'll open a PR for that.
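For illustration, the proposed cutoff can be computed like this (a sketch; the function name is mine, not a pandas internal):

```python
import math


def max_targets_for_binary_search(n: int) -> int:
    """Largest m for which m * log2(n) < n / 2, i.e. the point below
    which m binary searches (O(m * log n)) still beat one linear scan
    (O(n)), with a factor-of-2 safety margin."""
    return int(n / (2 * math.log2(n)))


for n in (1_000, 1_000_000, 10_000_000):
    print(n, max_targets_for_binary_search(n))
```

For the 10-million-row reproducer above this would allow on the order of hundreds of thousands of target labels, far above the fixed limit of 5.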

@lithomas1 lithomas1 added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2023
@rhshadrach
Member

cc @lukemanley - I believe you've done a lot of work here, thought you might be interested.
