Fuse more aggressively if parquet files are tiny #1029
Conversation
```diff
@@ -674,7 +676,7 @@ class GroupByReduction(Reduction, GroupByBase):
     _chunk_cls = GroupByChunk

     def _tune_down(self):
-        if len(self.by) > 1 and self.operand("split_out") is None:
+        if self.operand("split_out") is None:
```
Wouldn't we always shuffle now?
Yep
Except if `split_out=1` is set explicitly.
We just had a conversation about this and agreed to go with this automatic behavior. It means that some groupby operations will perform a bit worse, since we are forcing a shuffle that is not strictly necessary.
For large output results the shuffle is a necessity, and for tiny output results the additional shuffle step adds only a marginal performance penalty in our testing, since it operates on the already-reduced data.
It is the safer choice, and most users will not want to (or be able to) dig in deep enough to set this parameter themselves, so this is a good default.
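A minimal sketch of the user-facing behavior under discussion, assuming a standard dask / dask-expr installation (the data and column names are illustrative, not from the PR):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [0, 1, 0, 1], "b": [0, 0, 1, 1], "x": range(4)})
ddf = dd.from_pandas(pdf, npartitions=2)

# split_out left at its default (None): with this change, the optimizer
# is free to lower the multi-column groupby to a shuffle-based groupby.
auto = ddf.groupby(["a", "b"]).x.sum()

# split_out=1 pinned explicitly: opts out of the automatic shuffle and
# keeps the single-output tree reduction.
pinned = ddf.groupby(["a", "b"]).x.sum(split_out=1)
```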
dask_expr/io/parquet.py (outdated)
```python
for col in approx_stats["columns"]:
    total_uncompressed += col["total_uncompressed_size"]
    if col["path_in_schema"] in col_op:
        after_projection += col["total_uncompressed_size"]

total_uncompressed = max(total_uncompressed, 75_000_000)
```
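For context, a hedged standalone sketch of the statistic this diff computes: the total uncompressed size of a parquet file and the share that survives a column projection. It uses pyarrow's parquet metadata API directly; the function name and arguments are illustrative, only the accumulation logic and the 75 MB floor come from the diff above.

```python
import pyarrow.parquet as pq

def projected_fraction(path, columns):
    """Share of a file's uncompressed bytes kept by a column projection."""
    meta = pq.ParquetFile(path).metadata
    total_uncompressed = 0
    after_projection = 0
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            col = row_group.column(ci)
            total_uncompressed += col.total_uncompressed_size
            if col.path_in_schema in columns:
                after_projection += col.total_uncompressed_size
    # Clamp to the same floor the diff applies (75 MB by default).
    total_uncompressed = max(total_uncompressed, 75_000_000)
    return after_projection / total_uncompressed
```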
I suggest exposing this as a config value.
I'll write docs for this tomorrow, or later today after I finish the blog post (we need release notes for the trivial shuffles anyway).
```diff
@@ -821,6 +821,8 @@ def sample_statistics(self, n=3):
     ixs = []
     for i in range(0, nfrags, stepsize):
         sort_ix = finfo_argsort[i]
+        # TODO: This is crude but the most conservative estimate
+        sort_ix = sort_ix if sort_ix < nfrags else 0
```
see #1032
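A standalone sketch of the sampling loop being patched, under stated assumptions: the `stepsize` formula and the surrounding function are guesses for illustration; only the loop body and the clamp come from the diff.

```python
import numpy as np

def sample_indices(sizes, n=3):
    """Pick ~n positions evenly spaced through the size-sorted fragment order."""
    nfrags = len(sizes)
    finfo_argsort = np.argsort(sizes)
    stepsize = max(nfrags // n, 1)  # assumption; not from the PR
    ixs = []
    for i in range(0, nfrags, stepsize):
        sort_ix = finfo_argsort[i]
        # Clamp out-of-range indices to 0: crude, but the most
        # conservative fallback (see #1032).
        sort_ix = sort_ix if sort_ix < nfrags else 0
        ixs.append(sort_ix)
    return ixs
```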
dask_expr/io/parquet.py (outdated)
```python
min_size = (
    dask.config.get("dataframe.parquet.minimum-partition-size") or 75_000_000
)
```
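A short usage sketch of the config knob from the diff, assuming the key is registered as shown above (the override value is illustrative):

```python
import dask

# Override the knob for a block of code; the `or 75_000_000` fallback in
# the diff covers the case where the key resolves to None.
with dask.config.set({"dataframe.parquet.minimum-partition-size": 128_000_000}):
    min_size = (
        dask.config.get("dataframe.parquet.minimum-partition-size") or 75_000_000
    )
    assert min_size == 128_000_000
```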
I just noticed that we already have a parameter that more or less matches this functionality: blocksize, which is used in the legacy parquet reader to control how row groups are concatenated. It's not a perfect match, but a very close one. I'm fine with keeping things as they are for now, but wanted to document this for posterity.
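For reference, a hedged illustration of that related knob, assuming a dask version whose `read_parquet` accepts `blocksize` (the path and size are placeholders):

```python
import dask.dataframe as dd

# blocksize bounds how much parquet data is aggregated into one partition
# in the legacy reader; a rough analogue of the new minimum-partition-size.
ddf = dd.read_parquet("s3://bucket/dataset/", blocksize="128MiB")
```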
This PR does a few things:
Here is an A/B test for this (the fuse tag):
https://github.com/coiled/benchmarks/actions/runs/8709180542