dask.dataframe.read_*: change default blocksize to 128 MiB #9850

Open · crusaderky opened this issue Jan 19, 2023 · 1 comment
Labels: dataframe · enhancement · good second issue · io

crusaderky commented Jan 19, 2023

• Oversized chunks cause the RAM-per-host requirements of a workload to shoot up.
• High variance in the chunk size of a dataset confuses the duration prediction algorithm in the distributed scheduler, and probably a few other heuristics that I can't think of right now.

It is exceedingly common in real life to have to load files from disk that are far too large for comfort and/or vary wildly in size within the same dataset. I recently had to load a dataset where the files varied between 22 MiB and 830 MiB each. Once decompressed, the 830 MiB partitions took in excess of 6 GiB of RAM each to load.

New dask users are unlikely to be immediately aware of this problem, which will cause their computation to crash, and they may not notice that there's a simple parameter that would fix it.
Experienced dask users are also not guaranteed to notice that there are a dozen huge files among a dataset of 1000+ modestly sized ones.

Proposed design

In an effort to improve user experience, I believe we should change the default blocksize for all dask.dataframe.read_* functions from False (1 file = 1 partition) to 128 MiB, consistent with the default chunk size in dask.array.
Note that 128 MiB is the size on disk, which can be substantially different from the size in memory. However, I believe that ending up with partitions that are somewhat too large or too small, but bounded, is much preferable to dealing with unbounded partitions.
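
For illustration, a short sketch of what users can already do explicitly today with a byte-based reader such as read_csv, and the dask.array default that the proposal would mirror (the file path is hypothetical):

```python
import dask
import dask.dataframe as dd

# Explicitly cap the on-disk block size at 128 MiB when reading CSVs;
# blocksize accepts a byte string (parsed with dask.utils.parse_bytes).
df = dd.read_csv("data/part-*.csv", blocksize="128MiB")

# The proposed default mirrors dask.array's default chunk size:
print(dask.config.get("array.chunk-size"))  # "128MiB"
```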

Special cases

Some partitions on disk are up to 830 MiB in size; they will be split into 128 MiB partitions, but only after being loaded in full. This will cause your RAM usage to spike during data load. If you produced this dataset yourself, consider saving it in smaller partitions (e.g. by calling repartition just before writing to disk). Explicitly set the blocksize parameter to disable this warning.

(only print the warning if there are partitions exceeding 256 MiB).
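
A minimal sketch of how such a check might look; the helper name and where it would hook into the readers are hypothetical:

```python
import warnings

WARN_THRESHOLD = 256 * 2**20  # 256 MiB, as proposed above


def _maybe_warn_oversized_files(file_sizes, blocksize_was_given):
    """Hypothetical helper: warn if any file exceeds 256 MiB on disk and the
    user did not pass blocksize explicitly."""
    if blocksize_was_given:
        return
    oversized = [size for size in file_sizes if size > WARN_THRESHOLD]
    if oversized:
        warnings.warn(
            f"{len(oversized)} file(s) exceed 256 MiB on disk "
            f"(largest: {max(oversized) / 2**20:.0f} MiB). They will be split "
            "into 128 MiB partitions, but only after being loaded in full, "
            "which may cause a RAM spike. Explicitly set the blocksize "
            "parameter to disable this warning."
        )
```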

  • read_hdf has chunksize (number of rows) instead of blocksize (number of bytes). We should add a mutually exclusive blocksize parameter, which would crudely estimate the size of one row in bytes as the size of the file in bytes divided by the number of rows in the file (see the sketch below). The default when neither is explicitly stated should change from chunksize=1_000_000 to blocksize="128MiB".
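
A rough sketch of that blocksize-to-chunksize conversion; the helper is hypothetical and assumes a table-format HDF5 store so the row count can be read cheaply:

```python
import os

import pandas as pd
from dask.utils import parse_bytes


def _blocksize_to_chunksize(path, key, blocksize="128MiB"):
    """Hypothetical helper: convert a blocksize in bytes into a chunksize in
    rows, using bytes-per-row ~= file size / number of rows in the file."""
    file_size = os.path.getsize(path)
    with pd.HDFStore(path, mode="r") as store:
        nrows = store.get_storer(key).nrows  # available for table-format data
    bytes_per_row = file_size / max(nrows, 1)
    return max(1, int(parse_bytes(blocksize) / bytes_per_row))
```

For example, a 1 GiB file with 8 million rows gives roughly 134 bytes per row, so a 128 MiB blocksize maps to about one million rows per chunk.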
crusaderky commented:
XREF (for parquet only): #9637
