dask.dataframe.read_*: change default blocksize to 128 MiB #9850

Open · crusaderky opened this issue Jan 19, 2023 · 1 comment
Labels: dataframe · enhancement · good second issue · io

crusaderky commented Jan 19, 2023

• Oversized chunks cause the RAM-per-host requirements of a workload to shoot up.
• High variance in the chunk size of a dataset confuses the duration prediction algorithm in the distributed scheduler, and probably a few other heuristics that I can't think of right now.

It is exceedingly common in real life to have to load files from disk that are far too large for comfort and/or vary wildly in size within the same dataset. I recently had to load a dataset where the files varied between 22 MiB and 830 MiB each. Once decompressed, the 830 MiB partitions took in excess of 6 GiB of RAM each to load.

New dask users are unlikely to be immediately aware of this problem, which will cause their computation to crash, and they may not notice that there's a simple parameter that would fix it.
Experienced dask users are also not guaranteed to notice that there are a dozen huge files among a dataset of 1000+ modestly sized ones.

Proposed design

In an effort to improve user experience, I believe we should change the default blocksize for all dask.dataframe.read_* functions from False (1 file = 1 partition) to 128 MiB, consistent with the default chunk size in dask.array.
Note that 128 MiB is the size on disk, which can be substantially different from the size in memory. However, I believe that ending up with partitions that are somewhat too large or too small, but bounded, is much preferable to dealing with unbounded partitions.
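
For illustration, a short sketch of what users can already do explicitly today with a byte-based reader such as read_csv, and the dask.array default that the proposal would mirror (the file path is hypothetical):

```python
import dask
import dask.dataframe as dd

# Explicitly cap the on-disk block size at 128 MiB when reading CSVs;
# blocksize accepts a byte string (parsed with dask.utils.parse_bytes).
df = dd.read_csv("data/part-*.csv", blocksize="128MiB")

# The proposed default mirrors dask.array's default chunk size:
print(dask.config.get("array.chunk-size"))  # "128MiB"
```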

Special cases

Some partitions on disk are up to 830 MiB in size; they will be split into 128 MiB partitions, but only after being loaded in full. This will cause your RAM usage to spike during data load. If you produced this dataset yourself, consider saving it in smaller partitions (e.g. by calling repartition just before writing to disk). Explicitly set the blocksize parameter to disable this warning.

(only print the warning if there are partitions exceeding 256 MiB).
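
A minimal sketch of how such a check might look; the helper name and where it would hook into the readers are hypothetical:

```python
import warnings

WARN_THRESHOLD = 256 * 2**20  # 256 MiB, as proposed above


def _maybe_warn_oversized_files(file_sizes, blocksize_was_given):
    """Hypothetical helper: warn if any file exceeds 256 MiB on disk and the
    user did not pass blocksize explicitly."""
    if blocksize_was_given:
        return
    oversized = [size for size in file_sizes if size > WARN_THRESHOLD]
    if oversized:
        warnings.warn(
            f"{len(oversized)} file(s) exceed 256 MiB on disk "
            f"(largest: {max(oversized) / 2**20:.0f} MiB). They will be split "
            "into 128 MiB partitions, but only after being loaded in full, "
            "which may cause a RAM spike. Explicitly set the blocksize "
            "parameter to disable this warning."
        )
```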

  • read_hdf has chunksize (number of rows) instead of blocksize (number of bytes). We should add a mutually exclusive blocksize parameter, which would crudely estimate the size of one row in bytes as the size of the file in bytes divided by the number of rows in the file (see the sketch below). The default when neither is explicitly stated should change from chunksize=1_000_000 to blocksize="128MiB".
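
A rough sketch of that blocksize-to-chunksize conversion; the helper is hypothetical and assumes a table-format HDF5 store so the row count can be read cheaply:

```python
import os

import pandas as pd
from dask.utils import parse_bytes


def _blocksize_to_chunksize(path, key, blocksize="128MiB"):
    """Hypothetical helper: convert a blocksize in bytes into a chunksize in
    rows, using bytes-per-row ~= file size / number of rows in the file."""
    file_size = os.path.getsize(path)
    with pd.HDFStore(path, mode="r") as store:
        nrows = store.get_storer(key).nrows  # available for table-format data
    bytes_per_row = file_size / max(nrows, 1)
    return max(1, int(parse_bytes(blocksize) / bytes_per_row))
```

For example, a 1 GiB file with 8 million rows gives roughly 134 bytes per row, so a 128 MiB blocksize maps to about one million rows per chunk.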
crusaderky commented:
XREF (for parquet only): #9637
