libcurl error on scanning huge pyarrow dataset over s3 #9505
Labels

- `A-io-cloud`: Area: reading/writing to cloud storage
- `blocked`: Cannot be worked on due to external dependencies, or significant new internal features needed first
- `bug`: Something isn't working
- `python`: Related to Python Polars
Polars version checks

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of Polars.
Issue description
For a huge `pyarrow.dataset.Dataset` over S3, scanning with `.scanner` works well, but `pl.scan_pyarrow_dataset` raises the following libcurl error. I have several such datasets, each with ~100,000 files (naively hived), and all of them succeeded with `pyarrow` but failed with `polars`. When restricted to ~200 files, both work. Some observations:

- When the dataset is created via a credential-included URI (`https://<access_key>:<secret_key>@<bucket>/<common_prefix>/?region=<region>&endpoint_override=<endpoint_override>`, without setting the `filesystem` argument), the pyarrow dataset gets a bit slower, but `pl.scan_pyarrow_dataset` hangs forever. I suspect this is due to the serialization cost of the dataset: for my dataset, the pickled dataset is >1 GB when created via the credential-included URI, versus <1 KB when created with `S3FileSystem`.
- `dataset(files[:272], format="parquet", filesystem=fs)` always works and `dataset(files[:273], format="parquet", filesystem=fs)` always fails, for all of my datasets and with randomly shuffled files (see the sketch after this list). All paths (S3 keys, I mean) have the same length, so I suspect that some paths are passed truncated when there are too many of them. (At least that is what the error message implies?)
- It does not matter whether `streaming=True` or `False`: if a dataset succeeds with `streaming=True` then it always succeeds with `streaming=False` (and the other way round).
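A minimal sketch of the checks behind the first two observations; the region, endpoint, bucket, and prefix are placeholders, and the file listing is just one way to build `files`:

```python
import pickle

import polars as pl
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Placeholders for the real region/endpoint/bucket/prefix.
fs = pafs.S3FileSystem(region="<region>", endpoint_override="<endpoint_override>")
selector = pafs.FileSelector("<bucket>/<common_prefix>", recursive=True)
files = [info.path for info in fs.get_file_info(selector) if info.is_file]

# 272 files always work, 273 always fail, regardless of which files they are.
ok = pads.dataset(files[:272], format="parquet", filesystem=fs)
bad = pads.dataset(files[:273], format="parquet", filesystem=fs)
pl.scan_pyarrow_dataset(ok).head().collect()   # succeeds
pl.scan_pyarrow_dataset(bad).head().collect()  # raises the libcurl error

# The filesystem-backed dataset pickles to <1 KB; the same dataset created
# from a credential-included URI pickles to >1 GB.
print(len(pickle.dumps(ok)))
```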
I can confirm that this error exists for all Polars versions I have used so far (effectively all of `polars>=0.16`).

Reproducible example
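A minimal sketch of the failing pattern, with placeholder region, endpoint, bucket, and prefix; any naively hived layout with ~100,000 parquet files under a common prefix reproduces it for me:

```python
import polars as pl
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Placeholders for the real region/endpoint/bucket/prefix; ~100,000
# naively hived parquet files live under the common prefix.
fs = pafs.S3FileSystem(region="<region>", endpoint_override="<endpoint_override>")
selector = pafs.FileSelector("<bucket>/<common_prefix>", recursive=True)
files = [info.path for info in fs.get_file_info(selector) if info.is_file]

ds = pads.dataset(files, format="parquet", filesystem=fs)

ds.scanner().head(5)                          # plain pyarrow: works
pl.scan_pyarrow_dataset(ds).head().collect()  # polars: libcurl error
```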
Expected behavior
`pl.scan_pyarrow_dataset(ds).head().collect()` works whenever `ds.head(5)` works.

Installed versions