
PERF: Potential room for optimizing the performance of .loc with a big non-unique masked index dataframe #56759

Alexia-I opened this issue Jan 7, 2024 · 0 comments
Labels: Needs Triage, Performance


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Hello, I am writing to suggest a potential improvement in the time efficiency of pandas. This pertains to a performance issue similar to the one highlighted in #54550. However, the corresponding PR, #54746, does not adjust the threshold for get_indexer_non_unique in the MaskedIndexEngine class. My suggestion is to revise that limit to match the strategy adopted in #54746, specifically setting it to len(targets) < (n / (2 * n.bit_length())). I believe this adjustment could noticeably improve .loc lookups of a few labels against a large masked, non-unique index.

I am willing to create a pull request for this if you believe it would be beneficial.
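For clarity, here is a minimal sketch of the proposed heuristic and of the cheap lookup path it would select. This is not the actual pandas internals; the function names and the toy `take_sorted` helper are my own illustration of the idea, assuming a sorted non-unique index.

```python
import numpy as np

def prefer_per_label_lookup(n_rows: int, n_targets: int) -> bool:
    """Proposed cutoff: k repeated binary searches cost roughly
    O(k * log n), while the bulk get_indexer_non_unique path is O(n),
    so prefer per-label lookups while k < n / (2 * log2(n))."""
    return n_targets < n_rows / (2 * n_rows.bit_length())

def take_sorted(values: np.ndarray, index: np.ndarray, labels) -> np.ndarray:
    """Toy version of the cheap path on a sorted non-unique index:
    one pair of binary searches per requested label."""
    out = []
    for lab in labels:
        lo = np.searchsorted(index, lab, side="left")
        hi = np.searchsorted(index, lab, side="right")
        out.append(values[lo:hi])
    return np.concatenate(out)
```

With a 10M-row frame, `10_000_000.bit_length()` is 24, so the cutoff sits around `10_000_000 / 48 ≈ 208_333`; lookups of 5-10 labels, as in the benchmark below, would take the cheap path.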

import random
import time
import pandas as pd
import numpy as np

if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes and some NaN values
    table_size = 10_000_000
    num_index = 1_000_000
    data = [1] * table_size
    # Introduce NaNs into the data
    for _ in range(table_size // 10):  # introduce NaNs in roughly 10% of rows (positions may repeat)
        data[random.randint(0, table_size - 1)] = np.nan
    df = pd.DataFrame(data)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Pre-query the index to force optimizations.
    df.loc[[5, 6, 7, 456, 65743]]
    df.loc[[1000]]

    # Testing 'df.loc' with all at once using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once (masked data): num_indexes={i+1} => {measure:.5f}s")

    print("---")

    # Testing 'df.loc' one at a time using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time (masked data): num_indexes={i+1} => {measure:.5f}s")

Printed result:


With all at once (masked data): num_indexes=1 => 0.00045s
With all at once (masked data): num_indexes=2 => 0.00048s
With all at once (masked data): num_indexes=3 => 0.00052s
With all at once (masked data): num_indexes=4 => 0.00050s
With all at once (masked data): num_indexes=5 => 0.64931s
With all at once (masked data): num_indexes=6 => 0.65066s
With all at once (masked data): num_indexes=7 => 0.65181s
With all at once (masked data): num_indexes=8 => 0.80003s
With all at once (masked data): num_indexes=9 => 0.65251s
With all at once (masked data): num_indexes=10 => 0.66629s
---
With one at a time (masked data): num_indexes=1 => 0.00081s
With one at a time (masked data): num_indexes=2 => 0.00134s
With one at a time (masked data): num_indexes=3 => 0.00114s
With one at a time (masked data): num_indexes=4 => 0.00173s
With one at a time (masked data): num_indexes=5 => 0.00132s
With one at a time (masked data): num_indexes=6 => 0.00191s
With one at a time (masked data): num_indexes=7 => 0.20084s
With one at a time (masked data): num_indexes=8 => 0.00180s
With one at a time (masked data): num_indexes=9 => 0.00201s
With one at a time (masked data): num_indexes=10 => 0.00169s

Installed Versions

commit : a671b5a
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98 SMP Mon Oct 9 16:43:45 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

Prior Performance

No response
