
PERF: Potential room for optimizing the performance of .loc with a big non-unique masked index dataframe #56759

Alexia-I opened this issue Jan 7, 2024 · 0 comments
Labels: Needs Triage, Performance


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Hello, I am writing to suggest a potential improvement in the time efficiency of pandas. This pertains to a performance issue similar to the one highlighted in #54550. However, the corresponding PR, #54746, does not adjust the threshold for get_indexer_non_unique in the MaskedIndexEngine class. My suggestion is to revise that limit to match the strategy adopted in #54746, specifically setting it to len(targets) < (n / (2 * n.bit_length())). I believe this adjustment could noticeably improve .loc lookups of a few labels against a large masked, non-unique index.

I am willing to create a pull request for this if you believe it would be beneficial.
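For clarity, here is a minimal sketch of the proposed heuristic and of the cheap lookup path it would select. This is not the actual pandas internals; the function names and the toy `take_sorted` helper are my own illustration of the idea, assuming a sorted non-unique index.

```python
import numpy as np

def prefer_per_label_lookup(n_rows: int, n_targets: int) -> bool:
    """Proposed cutoff: k repeated binary searches cost roughly
    O(k * log n), while the bulk get_indexer_non_unique path is O(n),
    so prefer per-label lookups while k < n / (2 * log2(n))."""
    return n_targets < n_rows / (2 * n_rows.bit_length())

def take_sorted(values: np.ndarray, index: np.ndarray, labels) -> np.ndarray:
    """Toy version of the cheap path on a sorted non-unique index:
    one pair of binary searches per requested label."""
    out = []
    for lab in labels:
        lo = np.searchsorted(index, lab, side="left")
        hi = np.searchsorted(index, lab, side="right")
        out.append(values[lo:hi])
    return np.concatenate(out)
```

With a 10M-row frame, `10_000_000.bit_length()` is 24, so the cutoff sits around `10_000_000 / 48 ≈ 208_333`; lookups of 5-10 labels, as in the benchmark below, would take the cheap path.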

import random
import time
import pandas as pd
import numpy as np

if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes and some NaN values
    table_size = 10_000_000
    num_index = 1_000_000
    data = [1] * table_size
    # Introduce NaNs into the data
    for _ in range(table_size // 10):  # introduce NaNs in roughly 10% of rows (positions may repeat)
        data[random.randint(0, table_size - 1)] = np.nan
    df = pd.DataFrame(data)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Pre-query the index to force optimizations.
    df.loc[[5, 6, 7, 456, 65743]]
    df.loc[[1000]]

    # Testing 'df.loc' with all at once using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once (masked data): num_indexes={i+1} => {measure:.5f}s")

    print("---")

    # Testing 'df.loc' one at a time using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time (masked data): num_indexes={i+1} => {measure:.5f}s")

Printed result:


With all at once (masked data): num_indexes=1 => 0.00045s
With all at once (masked data): num_indexes=2 => 0.00048s
With all at once (masked data): num_indexes=3 => 0.00052s
With all at once (masked data): num_indexes=4 => 0.00050s
With all at once (masked data): num_indexes=5 => 0.64931s
With all at once (masked data): num_indexes=6 => 0.65066s
With all at once (masked data): num_indexes=7 => 0.65181s
With all at once (masked data): num_indexes=8 => 0.80003s
With all at once (masked data): num_indexes=9 => 0.65251s
With all at once (masked data): num_indexes=10 => 0.66629s
---
With one at a time (masked data): num_indexes=1 => 0.00081s
With one at a time (masked data): num_indexes=2 => 0.00134s
With one at a time (masked data): num_indexes=3 => 0.00114s
With one at a time (masked data): num_indexes=4 => 0.00173s
With one at a time (masked data): num_indexes=5 => 0.00132s
With one at a time (masked data): num_indexes=6 => 0.00191s
With one at a time (masked data): num_indexes=7 => 0.20084s
With one at a time (masked data): num_indexes=8 => 0.00180s
With one at a time (masked data): num_indexes=9 => 0.00201s
With one at a time (masked data): num_indexes=10 => 0.00169s

Installed Versions

commit : a671b5a
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98 SMP Mon Oct 9 16:43:45 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

Prior Performance

No response
