libcurl error on scanning huge pyarrow dataset over s3 #9505

Closed
cjackal opened this issue Jun 22, 2023 · 4 comments
Labels: A-io-cloud (Area: reading/writing to cloud storage), blocked (Cannot be worked on due to external dependencies, or significant new internal features needed first), bug (Something isn't working), python (Related to Python Polars)

Comments

@cjackal
Contributor

cjackal commented Jun 22, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

For a huge pyarrow.dataset.Dataset over S3, scanning with .scanner() works fine, but pl.scan_pyarrow_dataset raises the following libcurl error.

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[2], line 16
     14 files = ds.files
     15 ds = dataset(files[:300], format="parquet", filesystem=fs)
---> 16 pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()

File /mnt/venvs/main/lib/python3.10/site-packages/polars/lazyframe/frame.py:1504, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1493     common_subplan_elimination = False
   1495 ldf = self._ldf.optimization_toggle(
   1496     type_coercion,
   1497     predicate_pushdown,
   (...)
   1502     streaming,
   1503 )
-> 1504 return wrap_df(ldf.collect())

ComputeError: OSError: When reading information for key '<redacted - string of length 190>' in bucket '<redacted - string of length 27>': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode 43, A libcurl function was given a bad argument

I have several such datasets, each with ~100,000 files (naively hive-partitioned), and all of them succeed with pyarrow but fail with polars. When restricted to ~200 files, both work. Some observations:

  • When using a credential-included URI (https://<access_key>:<secret_key>@<bucket>/<common_prefix>/?region=<region>&endpoint_override=<endpoint_override>, without setting the filesystem argument), the pyarrow dataset gets a bit slower, but pl.scan_pyarrow_dataset hangs forever. I suspect this is due to the serialization cost of the dataset (for my dataset, the pickled dataset is >1 GB when created via a credential-included URI, versus <1 kB when created with S3FileSystem).
  • The file-count threshold is exactly 272: dataset(files[:272], format="parquet", filesystem=fs) always works and dataset(files[:273], format="parquet", filesystem=fs) always fails (for all of my datasets, even with randomly shuffled files). All paths (S3 keys, I mean) have the same length, so I suspect that some paths are passed truncated when there are too many of them. (At least that is what the error message implies?)
  • It does not matter whether streaming=True or False: if a dataset succeeds with streaming=True, it always succeeds with streaming=False (and vice versa).
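For anyone wanting to reproduce the threshold hunt above on their own bucket, the search can be automated with a simple bisection. A minimal sketch, assuming a `works(n)` predicate that builds a dataset from the first n files and reports whether the scan succeeds (the lambda below is only a stand-in mimicking the behavior I observed, not a real S3 check):

```python
def find_threshold(works, lo, hi):
    """Binary search for the largest n in [lo, hi] for which works(n) is True.

    Assumes works(lo) is True and works(hi) is False.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid  # scan with `mid` files succeeded; threshold is higher
        else:
            hi = mid  # scan failed; threshold is at or below `mid`
    return lo

# Stand-in predicate mimicking the observed behavior: scans of up to
# 272 files succeed, 273 or more fail.
print(find_threshold(lambda n: n <= 272, 1, 300))  # prints 272
```

In the real scenario, `works` would wrap `dataset(files[:n], ...)` plus the `pl.scan_pyarrow_dataset(...).collect()` call in a try/except.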

I can confirm that this error exists in every polars version I have used so far (effectively all of polars>=0.16).

Reproducible example

Due to the nature of the issue, I doubt I can build a minimal reproducer with a free-of-charge S3 account. The datasets are behind a company firewall (completely closed), but I will try any suggestions at work.


import polars as pl
from pyarrow.dataset import dataset
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(
    access_key=access_key,
    secret_key=secret_key,
    region=region,
    endpoint_override=endpoint_override,
)
ds = dataset("my-bucket/common-prefix", format="parquet", partitioning="hive", filesystem=fs)
ds.head(10)  # succeeds in a few seconds
pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()  # fails after >10 min with the libcurl error
files = ds.files  # full list of URIs
ds = dataset(files[:300], format="parquet", filesystem=fs)  # restrict the number of files
ds.head(10)  # still works
pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()  # fails after >10 min with the libcurl error
ds = dataset(files[:272], format="parquet", filesystem=fs)
pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()  # now it works

Expected behavior

pl.scan_pyarrow_dataset(ds).head().collect() works whenever ds.head(5) works.

Installed versions

--------Version info---------
Polars:      0.18.3
Index type:  UInt32
Platform:    # it's ubuntu jammy latest, I forgot to copy the output
Python:      3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]

----Optional dependencies----
numpy:       1.24.3
pandas:      1.5.2
pyarrow:     12.0.1
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  <not installed>
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>
@cjackal cjackal added bug Something isn't working python Related to Python Polars labels Jun 22, 2023
@ritchie46
Member

This error is on pyarrow's side. Can you open an issue upstream?

@cjackal
Contributor Author

cjackal commented Jun 23, 2023

@ritchie46 Upstream issue opened!

The libcurl error is potentially related to this comment; is there a chance that pickling/unpickling the dataset makes a mess of the S3 connection cleanup process?
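One cheap way to test the serialization suspicion is to measure the pickle round-trip of the dataset object directly. A minimal sketch of a generic helper (`pickle_stats` is a hypothetical name; in the real scenario `obj` would be the pyarrow dataset, and the list below is just a placeholder so the snippet runs standalone):

```python
import pickle

def pickle_stats(obj):
    """Round-trip an object through pickle and report the payload size.

    Returns (size_in_bytes, restored_object). A surprisingly large payload
    (e.g. >1 GB for a dataset built from a credential-included URI) would
    point at the serialization path as the culprit.
    """
    payload = pickle.dumps(obj)
    return len(payload), pickle.loads(payload)

# Placeholder object standing in for the dataset.
size, restored = pickle_stats(["s3://bucket/key-%d" % i for i in range(3)])
print(size, restored[0])
```

Comparing the reported size for a dataset built from a credentialed URI versus one built with an explicit S3FileSystem should show the >1 GB vs <1 kB gap I described above.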

@stinodego stinodego added the blocked Cannot be worked on due to external dependencies, or significant new internal features needed first label Oct 14, 2023
@shomilj

shomilj commented Nov 16, 2023

@cjackal - did you make any progress on investigating this? We're seeing this in a different context and are unable to make any progress debugging it (ray-project/ray#41137).

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added the A-io-cloud Area: reading/writing to cloud storage label Jan 20, 2024
@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Mar 29, 2024
@stinodego
Member

I'm closing this as the issue is not on Polars' side.

@stinodego stinodego closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 29, 2024