
Culling in HashJoinP2PLayer can blow up memory #8196

Closed
phofl opened this issue Sep 20, 2023 · 0 comments · Fixed by #8197

phofl commented Sep 20, 2023

Describe the issue:

_cull_dependencies in HashJoinP2PLayer creates a defaultdict that grows huge with the number of partitions. We have 10476 parts_out and 10476 partitions, which creates relatively big sets. The memory footprint of deps is over 10 GB.
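To illustrate the blow-up: since every output partition of the P2P hash join depends on every input partition, a per-key dependency mapping materializes O(parts_out × n_partitions) entries. A scaled-down sketch (hypothetical names, not the actual HashJoinP2PLayer code):

```python
import sys
from collections import defaultdict

# Scaled down from the 10476 x 10476 in the report.
n_out = n_in = 1000

# Hypothetical key names; the shape mirrors what _cull_dependencies builds:
# one set of all transfer keys per output partition.
deps = defaultdict(set)
for out_part in range(n_out):
    for in_part in range(n_in):
        deps[f"join-{out_part}"].add(f"transfer-{in_part}")

# Rough lower bound: the set containers alone, ignoring the strings they hold.
set_bytes = sum(sys.getsizeof(s) for s in deps.values())
print(f"{len(deps)} sets, ~{set_bytes / 1e6:.0f} MB in set containers alone")
```

At 1000×1000 the set containers alone take tens of MB; scaling both dimensions by ~10x (to 10476) multiplies that by ~100, which is consistent with the multi-GB footprint observed.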

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from distributed import LocalCluster


def read_data(filename):
    path = "s3://coiled-runtime-ci/tpch-scale-1000/" + filename + "/"
    return dd.read_parquet(path, engine="pyarrow")


if __name__ == "__main__":
    cluster = LocalCluster()
    client = cluster.get_client()

    lineitem_ds = read_data("lineitem")
    orders_ds = read_data("orders")
    customer_ds = read_data("customer")

    jn1 = customer_ds.merge(orders_ds, left_on="c_custkey", right_on="o_custkey")
    jn2 = jn1.merge(lineitem_ds, left_on="o_orderkey", right_on="l_orderkey")
    jn2["revenue"] = jn2.l_extendedprice * (1 - jn2.l_discount)
    total = jn2.groupby(["l_orderkey", "o_orderdate", "o_shippriority"])[
        "revenue"
    ].sum()
    total.reset_index().head(10)[
        ["l_orderkey", "revenue", "o_orderdate", "o_shippriority"]
    ]

Sorry for the non-ideal reproducer.

Anything else we need to know?:

Can we reduce the memory footprint here somehow?

cc @fjetter

Environment:

  • Dask version: main
  • Python version: 3.11
  • Operating System: mac
  • Install method (conda, pip, source):