[Feature Request]: Allow customization of filename and sharding for dataframe IOs. #22923

robertwb · 2022-08-26T20:40:25Z

What would you like to happen?

Other sinks, such as TextIO and FileSink, allow customization of output filenames and sharding. The dataframe IOs should allow the same.
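For reference, WriteToText in the Python SDK takes num_shards, shard_name_template, and file_name_suffix arguments. The sketch below is a pure-Python illustration of how such a template expands into a file name. expand_shard_template is a hypothetical helper written for this thread, not Beam's implementation, and it assumes the default template is -SSSSS-of-NNNNN.

```python
import re


def expand_shard_template(prefix, shard, num_shards,
                          template="-SSSSS-of-NNNNN", suffix=""):
    """Hypothetical helper (not part of Beam): build one shard's file name.

    Each run of 'S' becomes the zero-padded shard index and each run of 'N'
    becomes the zero-padded shard count, padded to the length of the run.
    """
    def fill(match):
        run = match.group(0)
        value = shard if run[0] == "S" else num_shards
        return str(value).zfill(len(run))

    return prefix + re.sub(r"S+|N+", fill, template) + suffix


if __name__ == "__main__":
    # Default-style template: one name per shard.
    for i in range(2):
        print(expand_shard_template("out", i, 2, suffix=".csv"))
    # Empty template: no shard marker in the name at all.
    print(expand_shard_template("out", 0, 1, template="", suffix=".csv"))
```

An empty template is what the linked Stack Overflow question is after: a single output file whose name carries no shard marker.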

Issue Priority

Priority: 2

Issue Component

Component: dsl-dataframe

robertwb · 2022-08-26T20:57:11Z

Context: https://stackoverflow.com/questions/73498119/apache-beam-dataframe-write-csv-to-gcs-without-shard-name-template

This fixes issue apache#22923.

jzxu · 2023-11-30T08:28:10Z

Hi, I noticed that despite #22925 being merged, DeferredDataFrame.to_csv() still doesn't respect the num_shards argument. Minimal test case:

from typing import NamedTuple
import apache_beam as beam
from apache_beam.dataframe import convert

# Schema'd row type so the PCollection can be converted to a dataframe.
class Row(NamedTuple):
  x: int

with beam.Pipeline('DirectRunner') as p:
  c = (p | beam.Create([Row(x=i) for i in range(1000000)]))
  df = convert.to_dataframe(c)
  # Expectation: two output shards. Observed: one.
  df.to_csv('/tmp/apache_beam_test.csv', index=False, num_shards=2)

Running this with apache_beam 2.50.0 results in a single shard being written.
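To check the shard count after a run, a small sketch like the following could be used. list_shards is a hypothetical helper, and it assumes the output files are named by appending a shard marker to the path passed to to_csv (for example /tmp/apache_beam_test.csv-00000-of-00001), which is my understanding of the default naming but is worth verifying locally.

```python
import glob
import os
import tempfile


def list_shards(path_prefix):
    """Hypothetical helper: return every file whose name starts with path_prefix."""
    # glob.escape guards against special characters in the prefix itself.
    return sorted(glob.glob(glob.escape(path_prefix) + "*"))


if __name__ == "__main__":
    # Demo against fake files, so no pipeline run is needed.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "apache_beam_test.csv")
        for marker in ("-00000-of-00002", "-00001-of-00002"):
            open(prefix + marker, "w").close()
        print(len(list_shards(prefix)), "shard file(s) found")
```

Against the test case above, calling list_shards('/tmp/apache_beam_test.csv') after the pipeline finishes should show whether num_shards=2 was honored.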

jconlon · 2024-07-18T21:25:32Z

I'm having the same problem with df.to_parquet.

robertwb added the "new feature" and "awaiting triage" labels Aug 26, 2022

robertwb self-assigned this Aug 26, 2022

github-actions bot added the "dataframe", "dsl", and "P2" labels and removed the "awaiting triage" label Aug 26, 2022

robertwb mentioned this issue Aug 26, 2022

[BEAM-22923] Allow sharding specification for dataframe writes. #22925

Merged

4 tasks

github-actions bot added the stale label Oct 26, 2022

damccorm removed the stale label Dec 2, 2022

langner mentioned this issue Jan 15, 2025

Support sharding in WriteToFiles (tested for to_csv) #33612

Open
