
[FEA] Support S3 writes in chunked writer - ParquetDatasetWriter #10522

Closed
sauravdev opened this issue Mar 28, 2022 · 2 comments · Fixed by #10769
Labels
cuIO cuIO issue feature request New feature or request Python Affects Python cuDF API.

Comments

@sauravdev

Parquet writes to external storage such as S3 should be possible using `ParquetDatasetWriter`; right now it errors out.

@sauravdev sauravdev added Needs Triage Need team to review and classify feature request New feature or request labels Mar 28, 2022
@GregoryKimball
Contributor

Thanks @sauravdev for sharing this use case. Would you please post a code sample and the error message, if any?

@GregoryKimball GregoryKimball added the cuIO cuIO issue label Mar 28, 2022
@sauravdev
Author

Sure, here is a sample (imports moved to the top and missing ones added):

import time

from streamz import Stream
from cudf.io.parquet import ParquetDatasetWriter

stream = Stream.from_kafka_batched(
    kafka_input_topic, consumer_conf, asynchronous=True,
    poll_interval="60s", max_batch_size=90000,
    engine="cudf", dask=True,
)

def cudf_passthrough_to_parquet_chunk_test(gdf):
    batch_process_start_time = time.time()
    size = len(gdf)
    # Fails here: ParquetDatasetWriter cannot open an s3:// output path
    with ParquetDatasetWriter("s3://samplepath/", partition_cols=["samplecol"], index=False) as cw:
        cw.write_table(gdf)
    batch_process_finish_time = time.time()
    return (batch_process_start_time, batch_process_finish_time, size)

output = stream.map(cudf_passthrough_to_parquet_chunk_test).gather().sink_to_list()
stream.start()
output

and here is the error:
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7f27e7e25e50>>, <Future finished exception=RuntimeError('cuDF failure at: ../src/io/utilities/data_sink.cpp:36: Cannot open output file')>)

@galipremsagar galipremsagar self-assigned this Apr 12, 2022
@galipremsagar galipremsagar added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Apr 27, 2022
rapids-bot bot pushed a commit that referenced this issue May 10, 2022
Resolves: #10522
This PR:

- [x] Enables `s3` writing support in `ParquetDatasetWriter`
- [x] Adds a work-around for reading an `s3` directory in `cudf.read_parquet`. Issue here: https://issues.apache.org/jira/browse/ARROW-16438
- [x] Pins the required combinations of `s3` Python libraries that work together, so that `test_s3.py` can be run locally in dev environments.
- [x] Improves the default `s3fs` error logs by setting the log level to `DEBUG` in pytests (`S3FS_LOGGING_LEVEL`).

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Ayush Dattagupta (https://github.com/ayushdg)
  - Bradley Dice (https://github.com/bdice)

URL: #10769
3 participants