[REVIEW] Enable writing to `s3` storage in chunked parquet writer #10769

galipremsagar · 2022-05-02T20:53:36Z

Resolves: #10522
This PR:

Enables s3 writing support in ParquetDatasetWriter
Add's a work-around to reading an s3 directory in cudf.read_parquet. Issue here: https://issues.apache.org/jira/browse/ARROW-16438
Introduces all the required s3 python library combinations that will work together with such that test_s3.py can be run locally on dev environments.
Improved the default s3fs error logs by changing the log level to DEBUG in pytests.(S3FS_LOGGING_LEVEL)

codecov · 2022-05-02T22:11:10Z

Codecov Report

Merging #10769 (808f976) into branch-22.06 (8d861ce) will decrease coverage by 0.12%.
The diff coverage is 93.12%.

❗ Current head 808f976 differs from pull request most recent head a11f507. Consider uploading reports for the commit a11f507 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10769      +/-   ##
================================================
- Coverage         86.40%   86.27%   -0.13%     
================================================
  Files               143      144       +1     
  Lines             22448    22565     +117     
================================================
+ Hits              19396    19468      +72     
- Misses             3052     3097      +45

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/lists.py	`91.66% <ø> (-0.42%)`	⬇️
python/cudf/cudf/core/column/string.py	`88.78% <ø> (-0.31%)`	⬇️
python/cudf/cudf/core/frame.py	`93.41% <ø> (ø)`
python/cudf/cudf/core/indexed_frame.py	`91.70% <ø> (ø)`
python/cudf/cudf/io/__init__.py	`100.00% <ø> (ø)`
python/cudf/cudf/io/parquet.py	`90.83% <86.60%> (-1.87%)`	⬇️
python/cudf/cudf/core/index.py	`92.06% <88.88%> (-0.25%)`	⬇️
python/cudf/cudf/core/scalar.py	`89.01% <90.90%> (-0.31%)`	⬇️
python/cudf/cudf/core/dataframe.py	`93.77% <96.29%> (+0.08%)`	⬆️
python/cudf/cudf/__init__.py	`90.69% <100.00%> (+0.22%)`	⬆️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 566f29a...a11f507. Read the comment docs.

conda/recipes/cudf/meta.yaml

ajschmidt8

Approving ops-codeowner file changes

rjzamora

Thanks @galipremsagar!

conda/environments/cudf_dev_cuda11.5.yml

python/cudf/cudf/io/parquet.py

python/cudf/cudf/tests/test_s3.py

bdice · 2022-05-09T16:40:05Z

python/cudf/cudf/tests/test_s3.py

+@pytest.fixture(scope="session")
+def endpoint_port():
+    # Return a free port per worker session.
+    sock = socket.socket()
+    sock.bind(("", 0))
+    return sock.getsockname()[1]


Does this resource get closed cleanly?

https://docs.python.org/3/library/socket.html#socket.socket.close

Sockets are automatically closed when they are garbage-collected, but it is recommended to close() them explicitly, or to use a with statement around them.

We are binding the socket to listen on "" which maps to listening on all addresses (including public addresses, which is a bad safety practice). As far as I can tell, we only access it from 127.0.0.1. The socket should be bound to 127.0.0.1 instead.

Suggested change

@pytest.fixture(scope="session")

def endpoint_port():

# Return a free port per worker session.

sock = socket.socket()

sock.bind(("", 0))

return sock.getsockname()[1]

@pytest.fixture(scope="session")

def endpoint_port():

# Return a free port per worker session.

with socket.socket() as sock:

sock.bind(("127.0.0.1", 0))

yield sock.getsockname()[1]

I did make the above change locally but it seems to error consistently, did it work for you?:

During handling of the above exception, another exception occurred: s3_base = 'http://127.0.0.1:47913/', s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:47913/'}} pdf = Integer Float Integer2 String Boolean 0 2345 9.001 2345 Alpha True 1 11987 8.343 106 Beta False 2 9027 6.000 2088 Gamma True 3 9027 2.781 789277 Delta False bytes_per_thread = 32 @pytest.mark.parametrize("bytes_per_thread", [32, 1024]) def test_read_csv(s3_base, s3so, pdf, bytes_per_thread): # Write to buffer fname = "test_csv_reader.csv" bname = "csv" buffer = pdf.to_csv(index=False) # Use fsspec file object > with s3_context(s3_base=s3_base, bucket=bname, files={fname: buffer}): python/cudf/cudf/tests/test_s3.py:150: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ../envs/cudfdev/lib/python3.8/contextlib.py:113: in __enter__ return next(self.gen) python/cudf/cudf/tests/test_s3.py:105: in s3_context client.create_bucket(Bucket=bucket, ACL="public-read-write") ../envs/cudfdev/lib/python3.8/site-packages/botocore/client.py:395: in _api_call return self._make_api_call(operation_name, kwargs) ../envs/cudfdev/lib/python3.8/site-packages/botocore/client.py:711: in _make_api_call http, parsed_response = self._make_request( ../envs/cudfdev/lib/python3.8/site-packages/botocore/client.py:731: in _make_request return self._endpoint.make_request(operation_model, request_dict) ../envs/cudfdev/lib/python3.8/site-packages/botocore/endpoint.py:107: in make_request return self._send_request(request_dict, operation_model) ../envs/cudfdev/lib/python3.8/site-packages/botocore/endpoint.py:183: in _send_request while self._needs_retry(attempts, operation_model, request_dict, ../envs/cudfdev/lib/python3.8/site-packages/botocore/endpoint.py:305: in _needs_retry responses = self._event_emitter.emit( ../envs/cudfdev/lib/python3.8/site-packages/botocore/hooks.py:357: in emit return self._emitter.emit(aliased_event_name, **kwargs) ../envs/cudfdev/lib/python3.8/site-packages/botocore/hooks.py:228: in emit return self._emit(event_name, kwargs) ../envs/cudfdev/lib/python3.8/site-packages/botocore/hooks.py:211: in _emit response = handler(**kwargs) ../envs/cudfdev/lib/python3.8/site-packages/botocore/retryhandler.py:192: in __call__ if self._checker(**checker_kwargs): ../envs/cudfdev/lib/python3.8/site-packages/botocore/retryhandler.py:265: in __call__ should_retry = self._should_retry(attempt_number, response, ../envs/cudfdev/lib/python3.8/site-packages/botocore/retryhandler.py:292: in _should_retry return self._checker(attempt_number, response, caught_exception) ../envs/cudfdev/lib/python3.8/site-packages/botocore/retryhandler.py:331: in __call__ checker_response = checker(attempt_number, response, ../envs/cudfdev/lib/python3.8/site-packages/botocore/retryhandler.py:231: in __call__ return self._check_caught_exception( ../envs/cudfdev/lib/python3.8/site-packages/botocore/retryhandler.py:374: in _check_caught_exception raise caught_exception ../envs/cudfdev/lib/python3.8/site-packages/botocore/endpoint.py:249: in _do_get_response http_response = self._send(request) ../envs/cudfdev/lib/python3.8/site-packages/botocore/endpoint.py:321: in _send return self.http_session.send(request) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <botocore.httpsession.URLLib3Session object at 0x7efe37194bb0> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:47913/csv, headers={'x-amz-acl': b'public-re...nvocation-id': b'23e9f9cc-531f-4d50-ad98-8ca7ab3f4a60', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}> def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url) urllib_response = conn.urlopen( method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), ) http_response = botocore.awsrequest.AWSResponse( request.url, urllib_response.status, urllib_response.headers, urllib_response, ) if not request.stream_output: # Cause the raw stream to be exhausted immediately. We do it # this way instead of using preload_content because # preload_content will never buffer chunked responses http_response.content return http_response except URLLib3SSLError as e: raise SSLError(endpoint_url=request.url, error=e) except (NewConnectionError, socket.gaierror) as e: > raise EndpointConnectionError(endpoint_url=request.url, error=e) E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:47913/csv" ../envs/cudfdev/lib/python3.8/site-packages/botocore/httpsession.py:434: EndpointConnectionError ---------------------------------------------------- Captured stderr setup --------------------------------

FWIW, I've seen it close connections to the ports very consistently with the existing approach too.

I didn't test this before suggesting, but I'm not sure what's going wrong. I wonder if the context manager is yielding the port and immediately exiting the context / closing the resource? Might need to refresh myself on how pytest fixtures interact with context managers, or add some print statements to check what's happening. Alternatively, you can try this instead.

Suggested change

@pytest.fixture(scope="session")

def endpoint_port():

# Return a free port per worker session.

sock = socket.socket()

sock.bind(("", 0))

return sock.getsockname()[1]

@pytest.fixture(scope="session")

def endpoint_port():

# Return a free port per worker session.

sock = socket.socket()

sock.bind(("127.0.0.1", 0))

yield sock.getsockname()[1]

sock.close()

@pytest.fixture(scope="session") def endpoint_port(): # Return a free port per worker session. sock = socket.socket() sock.bind(("127.0.0.1", 0)) yield sock.getsockname()[1] sock.close()

⬆️ This too gives the same error, but this works:

@pytest.fixture(scope="session") def endpoint_port(): # Return a free port per worker session. sock = socket.socket() sock.bind(("127.0.0.1", 0)) port = sock.getsockname()[1] sock.close() return port

I've pushed this change, let me know if that looks good to you.

python/cudf/cudf/tests/test_s3.py

bdice · 2022-05-09T16:47:57Z

python/dask_cudf/dask_cudf/io/tests/test_s3.py

+@pytest.fixture(scope="session")
+def endpoint_port():
+    # Return a free port per worker session.
+    sock = socket.socket()
+    sock.bind(("", 0))
+    return sock.getsockname()[1]


Suggested change

@pytest.fixture(scope="session")

def endpoint_port():

# Return a free port per worker session.

sock = socket.socket()

sock.bind(("", 0))

return sock.getsockname()[1]

@pytest.fixture(scope="session")

def endpoint_port():

# Return a free port per worker session.

with socket.socket() as sock:

sock.bind(("127.0.0.1", 0))

yield sock.getsockname()[1]

Co-authored-by: Bradley Dice <[email protected]>

bdice

I have a one suggestion related to variable naming that I'd like to see addressed before merging.

The other suggestions are related to the moto server: avoiding repeated definitions of hosts/URIs, and avoiding subprocess for the server (suggesting to use its Python API instead). Those moto related suggestions can be handled in a follow-up PR.

bdice · 2022-05-10T15:58:15Z

python/cudf/cudf/tests/test_s3.py

-            5000
-            if worker_id == "master"
-            else 5550 + int(worker_id.lstrip("gw"))
-        )
        endpoint_uri = f"http://127.0.0.1:{endpoint_port}/"

        proc = subprocess.Popen(


It looks like we're using moto in Stand-alone server mode. Is it possible to use the Python API for this server instead of calling subprocess? http://docs.getmoto.org/en/latest/docs/server_mode.html#start-within-python

I think it'd look like this:

from moto.server import ThreadedMotoServer server = ThreadedMotoServer(ip_address="127.0.0.1", port=endpoint_port) server.start() yield endpoint_uri # run tests server.stop()

bdice · 2022-05-10T15:58:59Z

python/cudf/cudf/tests/test_s3.py

    """
    Returns s3 storage options to pass to fsspec
    """
-    endpoint_port = (
-        5000 if worker_id == "master" else 5550 + int(worker_id.lstrip("gw"))
-    )
    endpoint_uri = f"http://127.0.0.1:{endpoint_port}/"


We're hard-coding the endpoint URI in a few locations in this file. Can we re-use the host/port/endpoint values instead? (Possibly with a fixture.)

python/cudf/cudf/tests/test_s3.py

bdice · 2022-05-10T16:04:21Z

python/dask_cudf/dask_cudf/io/tests/test_s3.py

+def endpoint_port():
+    # Return a free port per worker session.
+    sock = socket.socket()
+    sock.bind(("127.0.0.1", 0))


Same here - would be nice to define 127.0.0.1 once and re-use that value and the endpoint URI.

bdice · 2022-05-10T16:06:46Z

python/dask_cudf/dask_cudf/io/tests/test_s3.py

+    sock = socket.socket()
+    sock.bind(("127.0.0.1", 0))
+    port = sock.getsockname()[1]
+    sock.close()


Wait, is this just to get a port number of an open port? If so, I was misunderstanding how this worked earlier. The context manager I suggested was probably failing because the socket was still held by sock when you needed to close it and give up that port number to the moto server. If that is correct, you aren't supposed to hold the socket resource with a context manager / yield fixture. Please use this approach and ensure that sock is closed before returning the port number.

galipremsagar · 2022-05-10T16:20:30Z

Thanks @bdice, I've resolved bucket renaming review. For the rest of the comments, I'm opening another PR.

galipremsagar · 2022-05-10T18:29:35Z

@gpucibot merge

…server (#10822) This is a follow-up PR to address review comments from here: #10769 (review) This PR: - [x] Uses `ThreadedMotoServer` instead of using `subprocess.open` to create a new server, this way it is guaranteed to close the server upon exit. - [x] Add's IP address fixture instead of having it hard-coded at multiple places. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #10822

enable s3 writes in chunked parquet writer

a9ff579

galipremsagar added bug Something isn't working 2 - In Progress Currently a work in progress Python Affects Python cuDF API. non-breaking Non-breaking change labels May 2, 2022

galipremsagar self-assigned this May 2, 2022

github-actions bot added the conda label May 2, 2022

updates

8176093

galipremsagar mentioned this pull request May 2, 2022

[REVIEW] Upgrade s3fs & moto rapidsai/integration#472

Merged

galipremsagar marked this pull request as ready for review May 2, 2022 23:17

galipremsagar requested review from a team as code owners May 2, 2022 23:17

galipremsagar requested review from vyasr and skirui-source May 2, 2022 23:17

galipremsagar changed the title ~~[WIP] Enable s3 writes in chunked parquet writer~~ [REVIEW] Enable s3 writes in chunked parquet writer May 2, 2022

galipremsagar added 4 - Needs cuDF (Python) Reviewer and removed 2 - In Progress Currently a work in progress labels May 2, 2022

galipremsagar requested review from vuule and devavret May 2, 2022 23:27

galipremsagar added the 4 - Needs cuIO Reviewer label May 2, 2022

galipremsagar changed the title ~~[REVIEW] Enable s3 writes in chunked parquet writer~~ [REVIEW] Enable writing to s3 storage in chunked parquet writer May 2, 2022

ayushdg reviewed May 3, 2022

View reviewed changes

conda/recipes/cudf/meta.yaml Outdated Show resolved Hide resolved

remove s3fs hard requirement

7569db6

galipremsagar requested a review from ayushdg May 3, 2022 01:39

ajschmidt8 approved these changes May 3, 2022

View reviewed changes

galipremsagar added 2 commits May 6, 2022 09:50

remove s3fs and fix a pytest

a79e6d5

tweak todo comment

a11b658

galipremsagar requested a review from rjzamora May 6, 2022 17:11

rjzamora approved these changes May 6, 2022

View reviewed changes

update docs

aa7e050

galipremsagar added 4 - Needs cuDF (Python) Reviewer and removed 4 - Needs cuDF (Python) Reviewer labels May 6, 2022

ayushdg approved these changes May 7, 2022

View reviewed changes

bdice requested changes May 9, 2022

View reviewed changes

galipremsagar and others added 7 commits May 9, 2022 13:43

Apply suggestions from code review

23eccb6

Co-authored-by: Bradley Dice <[email protected]>

Merge remote-tracking branch 'upstream/branch-22.06' into 10522

867b8de

remove explicit flask dependency

14a100a

use 127.0.0.1 explicitly

3591b3e

context manager style

75c5de3

merge

e42796d

close port

2a743a2

bdice approved these changes May 10, 2022

View reviewed changes

bname to bucket

a11f507

rapids-bot bot merged commit 0fcd364 into rapidsai:branch-22.06 May 10, 2022

galipremsagar mentioned this pull request May 10, 2022

[REVIEW] Use ThreadedMotoServer instead of subprocess in spinning up s3 server #10822

Merged

2 tasks

charlesbluca mentioned this pull request May 20, 2022

Relax jsonschema testing dependency dask-contrib/dask-sql#546

Merged

vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Python) Reviewer labels Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[REVIEW] Enable writing to `s3` storage in chunked parquet writer #10769

[REVIEW] Enable writing to `s3` storage in chunked parquet writer #10769

galipremsagar commented May 2, 2022 •

edited

Loading

codecov bot commented May 2, 2022 •

edited

Loading

ajschmidt8 left a comment

rjzamora left a comment

bdice May 9, 2022

galipremsagar May 9, 2022

galipremsagar May 9, 2022

bdice May 10, 2022

galipremsagar May 10, 2022

bdice May 9, 2022

bdice left a comment

bdice May 10, 2022 •

edited

Loading

bdice May 10, 2022

bdice May 10, 2022

bdice May 10, 2022 •

edited

Loading

galipremsagar commented May 10, 2022

galipremsagar commented May 10, 2022

[REVIEW] Enable writing to s3 storage in chunked parquet writer #10769

[REVIEW] Enable writing to s3 storage in chunked parquet writer #10769

Conversation

galipremsagar commented May 2, 2022 • edited Loading

codecov bot commented May 2, 2022 • edited Loading

Codecov Report

ajschmidt8 left a comment

Choose a reason for hiding this comment

rjzamora left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bdice left a comment

Choose a reason for hiding this comment

bdice May 10, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bdice May 10, 2022 • edited Loading

Choose a reason for hiding this comment

galipremsagar commented May 10, 2022

galipremsagar commented May 10, 2022

[REVIEW] Enable writing to `s3` storage in chunked parquet writer #10769

[REVIEW] Enable writing to `s3` storage in chunked parquet writer #10769

galipremsagar commented May 2, 2022 •

edited

Loading

codecov bot commented May 2, 2022 •

edited

Loading

bdice May 10, 2022 •

edited

Loading

bdice May 10, 2022 •

edited

Loading