dask.dataframe.read_*: change default blocksize to 128 MiB #9850
Labels: dataframe, enhancement, good second issue, io
Oversized chunks cause the RAM-per-host requirements of a workload to shoot up.
High variance in the chunk size of a dataset confuses the duration prediction algorithm in the distributed scheduler - and probably a few other heuristics that I can't think of right now.
It is exceedingly common in real life to have to load files from disk that are far too large for comfort and/or vary wildly in size within the same dataset. I recently had to load a dataset where the files varied between 22 MiB and 830 MiB each. Once decompressed, the 830 MiB partitions took in excess of 6 GiB of RAM each to load.
New dask users are unlikely to be immediately aware of this problem, which will cause their computation to crash, and they may not notice that there is a simple parameter that would fix it.
Experienced dask users are also not guaranteed to notice that there are a dozen huge files among a dataset of 1000+ modestly sized ones.
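For context, a minimal sketch (not part of the proposal) of how one can spot oversized partitions today; the file pattern is hypothetical, and `blocksize=None` is passed explicitly to force the 1 file = 1 partition behaviour discussed here:

```python
import dask.dataframe as dd

# Hypothetical dataset of CSV files of wildly varying size.
# blocksize=None -> one partition per file, so partition size tracks file size.
df = dd.read_csv("data/part-*.csv", blocksize=None)

# In-memory bytes per partition; oversized partitions show up as outliers.
sizes = df.map_partitions(lambda part: part.memory_usage(deep=True).sum()).compute()
print(sizes.describe())
```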
Proposed design
In an effort to improve user experience, I believe we should change the default `blocksize` for all `dask.dataframe.read_*` functions from False (1 file = 1 partition) to 128 MiB, consistent with the default chunk size in `dask.array`.

Note that 128 MiB is the size on disk, which can be substantially different from the size in memory. However, I believe that ending up with somewhat too large or too small, but constrained, partitions is much preferable to having to deal with unconstrained partitions.
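To make the proposed change concrete, a hedged sketch of the call sites involved; the file pattern is hypothetical:

```python
import dask.dataframe as dd

# Today: blocksize=None/False gives one partition per file, so partition
# size is dictated entirely by the input files.
df_unbounded = dd.read_csv("data/part-*.csv", blocksize=None)

# Proposal: the default would behave as if the user had written this,
# cutting large files into ~128 MiB (on-disk) blocks, mirroring the
# 128 MiB default chunk size of dask.array.
df_bounded = dd.read_csv("data/part-*.csv", blocksize="128MiB")

print(df_unbounded.npartitions, df_bounded.npartitions)
```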
Special cases
`read_parquet` and `read_json` (for non-line JSON) don't support `blocksize`. It should be added (Add blocksize to read_parquet and read_json (non-line json) #9849). (Only print the warning if there are partitions exceeding 256 MiB.)
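Until that lands, one possible stopgap (a sketch, not the proposed implementation) is to rebalance after loading; note that `repartition(partition_size=...)` targets in-memory size rather than on-disk size, so it only approximates the behaviour proposed above:

```python
import dask.dataframe as dd

# Partitioning initially follows the parquet files / row groups on disk.
df = dd.read_parquet("data/*.parquet")

# Rebalance so that no partition greatly exceeds the target (in-memory) size.
df = df.repartition(partition_size="128MiB")
```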
`read_hdf` has `chunksize` (number of rows) instead of `blocksize` (number of bytes). We should add a mutually exclusive `blocksize` parameter, which would crudely estimate `size of 1 row in bytes = size of the file in bytes / number of rows in the file`. The default when neither is explicitly stated should change from `chunksize=1_000_000` to `blocksize="128MiB"`.
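To illustrate the crude estimate, a hedged sketch of translating a byte-based `blocksize` into the row-based `chunksize` that `read_hdf` understands today; the path and key are hypothetical, and `HDFStore.get_storer(...).nrows` assumes a table-format store:

```python
import os
import pandas as pd
import dask.dataframe as dd

path, key = "data/measurements.h5", "/data"   # hypothetical file and key
blocksize = 128 * 2**20                       # 128 MiB in bytes

# size of 1 row in bytes = size of the file in bytes / number of rows in the file
with pd.HDFStore(path, mode="r") as store:
    nrows = store.get_storer(key).nrows
row_bytes = os.path.getsize(path) / nrows

# Equivalent row count for the requested blocksize.
chunksize = max(1, int(blocksize / row_bytes))
df = dd.read_hdf(path, key, chunksize=chunksize)
```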