libcurl error on scanning huge pyarrow dataset over s3 #9505

Closed
cjackal opened this issue Jun 22, 2023 · 4 comments
Labels: A-io-cloud (Area: reading/writing to cloud storage), blocked (Cannot be worked on due to external dependencies, or significant new internal features needed first), bug (Something isn't working), python (Related to Python Polars)

Comments

@cjackal
Contributor

cjackal commented Jun 22, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

For a huge pyarrow.dataset.Dataset over S3, scanning with .scanner() works fine, but pl.scan_pyarrow_dataset raises the following libcurl error.

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[2], line 16
     14 files = ds.files
     15 ds = dataset(files[:300], format="parquet", filesystem=fs)
---> 16 pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()

File /mnt/venvs/main/lib/python3.10/site-packages/polars/lazyframe/frame.py:1504, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1493     common_subplan_elimination = False
   1495 ldf = self._ldf.optimization_toggle(
   1496     type_coercion,
   1497     predicate_pushdown,
   (...)
   1502     streaming,
   1503 )
-> 1504 return wrap_df(ldf.collect())

ComputeError: OSError: When reading information for key '<redacted - string of length 190>' in bucket '<redacted - string of length 27>': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode 43, A libcurl function was given a bad argument

I have several such datasets, each with ~100,000 files (naively hive-partitioned), and all of them succeed with pyarrow but fail with polars. When restricted to ~200 files, both work. Some observations:

  • When using a credential-included URI (https://<access_key>:<secret_key>@<bucket>/<common_prefix>/?region=<region>&endpoint_override=<endpoint_override>, without setting the filesystem argument), the pyarrow dataset gets a bit slower, but pl.scan_pyarrow_dataset hangs forever. I suspect this is due to the serialization cost of the dataset (for my dataset, the pickled dataset is >1 GB when created via a credential-included URI, versus <1 kB when created with S3FileSystem).
  • The file-count threshold is exactly 272: dataset(files[:272], format="parquet", filesystem=fs) always works and dataset(files[:273], format="parquet", filesystem=fs) always fails (for all of my datasets, even with randomly shuffled files). All paths (S3 keys, I mean) have the same length, so I suspect that some paths are passed truncated when there are too many of them. (At least that is what the error message implies?)
  • It does not matter whether streaming=True or False: if a dataset succeeds with streaming=True, it always succeeds with streaming=False (and vice versa).
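For anyone wanting to reproduce the threshold hunt above on their own bucket, the search can be automated with a simple bisection. A minimal sketch, assuming a `works(n)` predicate that builds a dataset from the first n files and reports whether the scan succeeds (the lambda below is only a stand-in mimicking the behavior I observed, not a real S3 check):

```python
def find_threshold(works, lo, hi):
    """Binary search for the largest n in [lo, hi] for which works(n) is True.

    Assumes works(lo) is True and works(hi) is False.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(mid):
            lo = mid  # scan with `mid` files succeeded; threshold is higher
        else:
            hi = mid  # scan failed; threshold is at or below `mid`
    return lo

# Stand-in predicate mimicking the observed behavior: scans of up to
# 272 files succeed, 273 or more fail.
print(find_threshold(lambda n: n <= 272, 1, 300))  # prints 272
```

In the real scenario, `works` would wrap `dataset(files[:n], ...)` plus the `pl.scan_pyarrow_dataset(...).collect()` call in a try/except.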

I can confirm that this error exists in every polars version I have used so far (effectively all of polars>=0.16).

Reproducible example

Due to the nature of the issue, I doubt I can build a minimal reproducer with a free-of-charge S3 account. The datasets are behind a company firewall (completely closed), but I will try any suggestions at work.


import polars as pl
from pyarrow.dataset import dataset
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(
    access_key=access_key,
    secret_key=secret_key,
    region=region,
    endpoint_override=endpoint_override,
)
ds = dataset("my-bucket/common-prefix", format="parquet", partitioning="hive", filesystem=fs)
ds.head(10)  # succeeds in a few seconds
pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()  # fails after >10 min with the libcurl error
files = ds.files  # full list of URIs
ds = dataset(files[:300], format="parquet", filesystem=fs)  # restrict the number of files
ds.head(10)  # still works
pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()  # fails after >10 min with the libcurl error
ds = dataset(files[:272], format="parquet", filesystem=fs)
pl.scan_pyarrow_dataset(ds).select(pl.col("area").head()).collect()  # now it works

Expected behavior

pl.scan_pyarrow_dataset(ds).head().collect() works whenever ds.head(5) works.

Installed versions

--------Version info---------
Polars:      0.18.3
Index type:  UInt32
Platform:    # it's ubuntu jammy latest, I forgot to copy the output
Python:      3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]

----Optional dependencies----
numpy:       1.24.3
pandas:      1.5.2
pyarrow:     12.0.1
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  <not installed>
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>
@cjackal cjackal added bug Something isn't working python Related to Python Polars labels Jun 22, 2023
@ritchie46
Member

This error is on pyarrow's side. Can you open an issue upstream?

@cjackal
Contributor Author

cjackal commented Jun 23, 2023

@ritchie46 Upstream issue opened!

The libcurl error is potentially related to this comment; is there a chance that pickling/unpickling the dataset makes a mess of the S3 connection cleanup process?
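One cheap way to test the serialization suspicion is to measure the pickle round-trip of the dataset object directly. A minimal sketch of a generic helper (`pickle_stats` is a hypothetical name; in the real scenario `obj` would be the pyarrow dataset, and the list below is just a placeholder so the snippet runs standalone):

```python
import pickle

def pickle_stats(obj):
    """Round-trip an object through pickle and report the payload size.

    Returns (size_in_bytes, restored_object). A surprisingly large payload
    (e.g. >1 GB for a dataset built from a credential-included URI) would
    point at the serialization path as the culprit.
    """
    payload = pickle.dumps(obj)
    return len(payload), pickle.loads(payload)

# Placeholder object standing in for the dataset.
size, restored = pickle_stats(["s3://bucket/key-%d" % i for i in range(3)])
print(size, restored[0])
```

Comparing the reported size for a dataset built from a credentialed URI versus one built with an explicit S3FileSystem should show the >1 GB vs <1 kB gap I described above.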

@stinodego stinodego added the blocked Cannot be worked on due to external dependencies, or significant new internal features needed first label Oct 14, 2023
@shomilj

shomilj commented Nov 16, 2023

@cjackal - did you make any progress on investigating this? We're seeing this in a different context and are unable to make any progress debugging it (ray-project/ray#41137).

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added the A-io-cloud Area: reading/writing to cloud storage label Jan 20, 2024
@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Mar 29, 2024
@stinodego
Member

I'm closing this as the issue is not on Polars' side.

@stinodego stinodego closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 29, 2024