
Left join large memory usage regression #18106

Open · 2 tasks done
CHDev93 opened this issue Aug 8, 2024 · 7 comments

Labels: invalid (A bug report that is not actually a bug), python (Related to Python Polars)

CHDev93 commented Aug 8, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

# polars_join_bug_mwe.py 

import time

import numpy as np
import polars as pl


#@profile
def get_index_with_time(idx_df: pl.DataFrame, data_df: pl.DataFrame) -> pl.DataFrame:
    time_coords_df = data_df.select(pl.col("time").unique().sort())
    return idx_df.join(time_coords_df, how="cross")


#@profile
def process_data(time_idx_df, data_df):
    tmp_df = time_idx_df.join(
        data_df,
        how="left",
        left_on=["sensor", "date", "time"],
        right_on=["_sensor", "_date", "time"],
        coalesce=True,
    )
    time.sleep(2)

    # _sensor, _date, time, _marker, 0, 1, 2, 3
    data_df = tmp_df.drop("sensor", "date", "time").fill_null(0)
    return data_df


#@profile
def get_data():

    sensor_df = pl.DataFrame({"sensor": np.arange(1000)})
    date_df = pl.DataFrame({"date": np.arange(250)})

    # sensor, date
    idx_df = sensor_df.join(date_df, how="cross")

    id_df = pl.DataFrame({"_sensor": np.arange(1000)})
    day_df = pl.DataFrame({"_date": np.arange(250)})
    times_df = pl.DataFrame({"time": np.arange(150)})

    # _sensor, _date, time, _marker, 0, 1, 2, 3
    data_df = (
        id_df.join(day_df, how="cross")
        .join(times_df, how="cross")
        .with_columns(
            pl.lit(0).alias("_marker"),
            pl.lit(0.0).cast(pl.Float32).alias("0"),
            pl.lit(1.0).cast(pl.Float32).alias("1"),
            pl.lit(2.0).cast(pl.Float32).alias("2"),
            pl.lit(3.0).cast(pl.Float32).alias("3"),
        )
    )

    return idx_df, data_df


#@profile
def main():
    import time

    start = time.perf_counter()
    idx_df, data_df = get_data()

    time_idx_df = get_index_with_time(idx_df, data_df)
    print(f"Augment index: {time.perf_counter() - start:.3f}s")

    data_df = process_data(time_idx_df, data_df)
    print(f"Process data: {time.perf_counter() - start:.3f}s")

    return data_df


def sleepy():
    import time

    time.sleep(5)


if __name__ == "__main__":
    df = main()

    # I get weird Keyboard interrupts when running polars through memory-profiler without some sleeps
    sleepy()

I installed memory-profiler and ran the script above under mprof to produce the memory profiles below.

Log output

join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
Augment index: 1.441s
join parallel: true
LEFT join dataframes finished
Process data: 7.592s

Issue description

I'm doing a join of two tables on a compound key of [int, int, int]. Newer versions of Polars (including polars==1.4.1) use much more memory than I'd expect. I confirmed this by rolling back to polars==0.19.19, which used significantly less memory for the same script.
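To put rough numbers on the frames involved, here is a minimal sketch (reusing the functions from polars_join_bug_mwe.py above; DataFrame.estimated_size only reports the materialized in-memory footprint, not the peak during the join):

# Hedged sketch: quantify input/output frame sizes from the repro script above.
from polars_join_bug_mwe import get_data, get_index_with_time, process_data

idx_df, data_df = get_data()
time_idx_df = get_index_with_time(idx_df, data_df)
print(f"idx_df:      {idx_df.estimated_size('mb'):.1f} MB")
print(f"data_df:     {data_df.estimated_size('mb'):.1f} MB")
print(f"time_idx_df: {time_idx_df.estimated_size('mb'):.1f} MB")

out_df = process_data(time_idx_df, data_df)
print(f"joined:      {out_df.estimated_size('mb'):.1f} MB")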

Expected behavior

I'd expect a left join for this problem to use roughly 2x the space of the left table, as it did in 0.19.19. I ran the script with Python's memory-profiler package and used the following commands:

Running: mprof run --python -o polars_141_small.prof -M --include-children python polars_join_bug_mwe.py
Plotting: mprof plot polars_141_small.prof --title polars_1_4_1 -o polars_141_small.png -w 0,12
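As a rough cross-check independent of memory-profiler, peak RSS can also be read from the OS at the end of the run (a minimal sketch, Linux-only; ru_maxrss is reported in KiB there):

import resource

# Peak resident set size of this process so far (KiB on Linux).
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / 1024:.1f} MiB")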

[two attached mprof memory usage plots, generated with the commands above]

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             2.6.3
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                2.1.0+cu118
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
CHDev93 added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Aug 8, 2024
CHDev93 (Author) commented Aug 8, 2024

Other things of note

  • you can uncomment the @profile decorators when running with mprof (no need to import anything) to get the little colored function markers; an explicit-import variant is sketched below
  • memory usage in this small example isn't a huge deal, but my actual problem is 100s of GB and this ends up failing with OOM errors
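For completeness, a minimal sketch of the explicit variant, importing profile from memory_profiler directly so the line-by-line report is also printed under plain python (the frame names here are made up, not the repro's):

# Hedged sketch: explicit memory_profiler usage; prints a per-line memory
# report when the decorated function runs, even outside `mprof run`.
from memory_profiler import profile

import numpy as np
import polars as pl


@profile
def build_and_join():
    left = pl.DataFrame({"k": np.arange(1_000_000)})
    right = pl.DataFrame({"k": np.arange(1_000_000), "v": np.ones(1_000_000)})
    return left.join(right, on="k", how="left")


if __name__ == "__main__":
    build_and_join()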

corwinjoy (Contributor) commented Nov 21, 2024

OK, I can confirm that we see this memory regression with the latest dev version of Polars.
The latest version uses significantly more memory, while run time is comparable.
I ran the following script with @profile enabled. I had to remove coalesce=True on line 17 of the script, since it is not supported in 0.19.19 (a version-tolerant wrapper is sketched after the commands).
cc @adamreeve

#dev version: 1.12.0
source .venv/bin/activate
make build-dist-release
mprof run --python -o polars_1120_small.prof  -M --include-children python polars_join_bug_mwe.py 
# Augment index: 0.684s
# Process data: 5.261s
mprof plot polars_1120_small.prof --title polars_1_12_0 -o polars_1_12_0_small.png -w 0,12

# Compare with 0.19.19
pip uninstall polars 
pip install --force-reinstall -v "polars==0.19.19"
mprof run --python -o polars_1919_small.prof  -M --include-children python polars_join_bug_mwe.py 
# Augment index: 0.859s
# Process data: 5.787s
mprof plot polars_1919_small.prof --title polars_0_19_19 -o polars_0_19_19_small.png -w 0,12
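Since coalesce= is rejected by 0.19.19, a small version-tolerant wrapper can keep one script working against both versions (a hypothetical helper, not part of the benchmark above; older Polars coalesces left-join keys by default anyway):

import polars as pl

def left_join_compat(left: pl.DataFrame, right: pl.DataFrame, *, left_on, right_on) -> pl.DataFrame:
    # Newer Polars accepts coalesce=True explicitly; 0.19.19 raises TypeError
    # on the unknown keyword and coalesces the join keys by default.
    try:
        return left.join(right, how="left", left_on=left_on, right_on=right_on, coalesce=True)
    except TypeError:
        return left.join(right, how="left", left_on=left_on, right_on=right_on)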

corwinjoy (Contributor) commented Nov 21, 2024

[attached mprof plots: polars_0_19_19_small, polars_1_12_0_small]

ritchie46 (Member) commented
I cannot reproduce. First bump is 1.14, second bump is 0.19.19

[attached mprof memory plot: 1.14 vs 0.19.19]

Performance difference:

1.14: 4.899s
0.19.19: 5.804s

CHDev93 (Author) commented Nov 22, 2024

@corwinjoy the plot for 0.19.19 shows process_data being essentially instant. Are you sure the same thing is being run in both cases? The graph for polars 1.12.0 looks far too slow compared with 0.19.19.

corwinjoy (Contributor) commented Nov 22, 2024

@CHDev93 The same script was being run in both cases, but I agree that the time for process_data looked odd in mprof. I think the sleep commands were throwing off the timings there. Re-running 0.19.19 with the sleep commands removed, I get more reasonable timings. However, the memory issue remains, as discussed on Discord. It looks like @ritchie46 has some ideas there.
[attached mprof plot: polars_0_19_19_small]

The mprof command outputs (with sleep removed, using 0.19.19):

Augment index: 0.879s
Process data: 3.900s

ritchie46 (Member) commented Nov 23, 2024

Yes, we now go into the row encoding for the group-by. The new algorithm is faster once the data no longer fits in your cache. You must increase the dataset (depending on the beefiness of your machine) to see the effect. Latest Polars is 1.3-1.5x faster for me, but does indeed require more memory (~1.5x).

The old code had to go, though. The memory requirements will improve with the new streaming engine and with fixed-size row encoding, which we plan to add as well.

#19929 will also improve this if we land it.

In any case, it isn't a bug but the cost of our new algorithm. We have to be able to remove old code branches if they hurt us and sometimes this has a different memory footprint.
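For anyone hitting OOM on the larger problem in the meantime, one workaround sketch is to route the join through the lazy API and request the streaming engine (assuming collect(streaming=True) is available in the installed version; memory behaviour differs from the default in-memory engine and not every plan runs fully streaming):

import polars as pl

def process_data_streaming(time_idx_df: pl.DataFrame, data_df: pl.DataFrame) -> pl.DataFrame:
    # Hedged sketch: same join as process_data in the repro, expressed lazily.
    lazy = (
        time_idx_df.lazy()
        .join(
            data_df.lazy(),
            how="left",
            left_on=["sensor", "date", "time"],
            right_on=["_sensor", "_date", "time"],
            coalesce=True,
        )
        .drop("sensor", "date", "time")
        .fill_null(0)
    )
    # streaming=True requests the out-of-core engine where the plan supports it.
    return lazy.collect(streaming=True)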

ritchie46 added the invalid (A bug report that is not actually a bug) label and removed the bug (Something isn't working) and needs triage (Awaiting prioritization by a maintainer) labels on Nov 23, 2024