
PERF: Unreliable performance of .loc with big non-unique index dataframe #54550

Closed
3 tasks done
l3robot opened this issue Aug 15, 2023 · 2 comments · Fixed by #54746
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@l3robot
Contributor

l3robot commented Aug 15, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import random
import time

import pandas as pd


if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes.
    table_size = 10_000_000
    num_index = 1_000_000
    df = pd.DataFrame([1] * table_size)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Seems like we need to pre-query the index to force optimizations.
    df.loc[[0, 1, 2, 131, 341414]]
    df.loc[[3434]]
    
    for i in range(10):
        indexes = random.sample(df.index.tolist(), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once: num_indexes={i+1} => {measure:.5f}s")

    print("---")

    for i in range(10):
        indexes = random.sample(df.index.tolist(), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time: num_indexes={i+1} => {measure:.5f}s")

Hello everyone,

In this example, I create a big DataFrame with non-unique indexes. I then make sure the index is sorted and has already been queried (the first calls to .loc are always slower, so I assume pandas builds some index structure lazily on first lookup).

Now the problem: I get very good performance (~0.5 ms) up to 4 labels, but at 5 labels the time jumps drastically to ~500 ms (×1000). If I instead look up one label at a time and concatenate afterwards, performance stays in the same order of magnitude (there is a small linear increase from the loop and concat, of course, and an odd outlier at 4). Here is the output for pandas version 2.1.0rc0+10.gfc308235f (the same problem arises with pandas 2.0.3 and 1.5.3):

With all at once: num_indexes=1 => 0.00039s                                                                                             
With all at once: num_indexes=2 => 0.00045s                                                                                             
With all at once: num_indexes=3 => 0.00041s                                                                                             
With all at once: num_indexes=4 => 0.00041s                                                                                             
With all at once: num_indexes=5 => 0.47082s                                                                                             
With all at once: num_indexes=6 => 0.48561s                                                                                             
With all at once: num_indexes=7 => 0.47594s                                                                                             
With all at once: num_indexes=8 => 0.47567s                                                                                             
---                                                                                                                                     
With one at a time: num_indexes=1 => 0.00066s                                                                                           
With one at a time: num_indexes=2 => 0.00086s                                                                                           
With one at a time: num_indexes=3 => 0.00103s                                                                                           
With one at a time: num_indexes=4 => 0.05667s                                                                                           
With one at a time: num_indexes=5 => 0.00156s                                                                                           
With one at a time: num_indexes=6 => 0.00160s                                                                                           
With one at a time: num_indexes=7 => 0.00181s                                                                                           
With one at a time: num_indexes=8 => 0.00200s

I would expect .loc to have steady performance since, in my understanding, it should behave roughly like a hashmap lookup. Even if that is not the case, I think pandas would definitely benefit from a stable way of querying multiple labels.
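As a workaround, the binary searches can be done by hand with numpy's searchsorted, which stays O(m·log n) regardless of how many labels are requested. This is a sketch under the assumptions of the reproducer above (monotonically increasing, non-unique index); `loc_many` is a name I made up, not a pandas API:

```python
import numpy as np
import pandas as pd


def loc_many(df: pd.DataFrame, keys) -> pd.DataFrame:
    """Select all rows matching each key from a DataFrame whose index is
    sorted (monotonically increasing) and possibly non-unique, using one
    pair of binary searches per key instead of .loc's list lookup."""
    values = df.index.values
    parts = []
    for key in keys:
        # lo:hi is the contiguous block of rows whose index equals key.
        lo = np.searchsorted(values, key, side="left")
        hi = np.searchsorted(values, key, side="right")
        parts.append(df.iloc[lo:hi])
    return pd.concat(parts)


# Usage on a small sorted, non-unique index:
df = pd.DataFrame({"x": range(6)}, index=[0, 0, 1, 1, 2, 3])
print(loc_many(df, [0, 2]))
```

This sidesteps the threshold entirely, at the cost of a Python-level loop over the requested labels.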

Thanks in advance for the help!

Installed Versions

INSTALLED VERSIONS

commit : fc30823
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.6-76060206-generic
Version : #202303130630168547333822.04~995127e SMP PREEMPT_DYNAMIC Tue M
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.0rc0+10.gfc308235f
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

@l3robot l3robot added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Aug 15, 2023
@l3robot
Contributor Author

l3robot commented Aug 19, 2023

After a bit of research, I found that the behavior originates in PR #22826, at this line, and in particular this comment.

In that comment, @rtlee9 gives an excellent explanation of the reasoning behind the threshold. The default behavior is a linear search (O(n)) over the pandas index; #22826 added a binary-search optimization for monotonically increasing indexes, which runs in O(m·log n), where m is the number of target labels.

I'm proposing we increase the limit, which was conservatively set to 5. We could raise it to at least n / (2·log2(n)), which follows from requiring m·log2(n) < n and dividing by 2 to keep some margin. I'll open a PR for that.
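For illustration, the proposed cutoff can be computed like this (a sketch; the function name is mine, not a pandas internal):

```python
import math


def max_targets_for_binary_search(n: int) -> int:
    """Largest m for which m * log2(n) < n / 2, i.e. the point below
    which m binary searches (O(m * log n)) still beat one linear scan
    (O(n)), with a factor-of-2 safety margin."""
    return int(n / (2 * math.log2(n)))


for n in (1_000, 1_000_000, 10_000_000):
    print(n, max_targets_for_binary_search(n))
```

For the 10-million-row reproducer above this would allow on the order of hundreds of thousands of target labels, far above the fixed limit of 5.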

@lithomas1 lithomas1 added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2023
@rhshadrach
Member

cc @lukemanley - I believe you've done a lot of work here, thought you might be interested.
