
Optimize translation when Schema information is available in Spark Structured Streaming runner #19989

Closed
damccorm opened this issue Jun 4, 2022 · 1 comment



damccorm commented Jun 4, 2022

The Spark Structured Streaming runner supports Datasets that already carry Schema information, which Spark uses to optimize jobs (via Catalyst). This issue is to implement schema-aware translations of the transforms in the runner so that we benefit from the performance improvements Spark performs internally.

Note that we may also need to map Beam's core internal representations, such as WindowedValue, so that intermediate steps of the pipeline can be optimized as well.
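As an illustrative sketch of what such a mapping means (plain Java, deliberately not the actual Beam or Spark APIs — the `WindowedLike` class and its `toRow`/`fromRow` helpers are hypothetical), the point is to flatten a WindowedValue-style wrapper into a row with a fixed, known schema, the shape a schema-aware engine like Catalyst can exploit instead of treating the element as opaque serialized bytes:

```java
import java.time.Instant;
import java.util.Objects;

// Hypothetical stand-in for Beam's WindowedValue<T>: an element
// plus its windowing metadata (timestamp and window).
final class WindowedLike<T> {
    final T value;
    final Instant timestamp;
    final String windowId;

    WindowedLike(T value, Instant timestamp, String windowId) {
        this.value = value;
        this.timestamp = timestamp;
        this.windowId = windowId;
    }

    // Flatten into a positional row with a fixed schema
    // (value, timestamp_millis, window_id) -- the kind of shape a
    // schema-aware engine could use to prune columns or push down
    // predicates instead of deserializing the whole element.
    Object[] toRow() {
        return new Object[] { value, timestamp.toEpochMilli(), windowId };
    }

    @SuppressWarnings("unchecked")
    static <T> WindowedLike<T> fromRow(Object[] row) {
        return new WindowedLike<>((T) row[0],
                Instant.ofEpochMilli((Long) row[1]),
                (String) row[2]);
    }
}

public class SchemaMappingSketch {
    public static void main(String[] args) {
        WindowedLike<String> wv =
                new WindowedLike<>("hello", Instant.ofEpochMilli(1000L), "w0");
        Object[] row = wv.toRow();
        WindowedLike<String> back = WindowedLike.fromRow(row);
        // The mapping is lossless: value and metadata survive the round trip.
        System.out.println(Objects.equals(wv.value, back.value)
                && wv.timestamp.equals(back.timestamp)
                && wv.windowId.equals(back.windowId));
    }
}
```

In real runner code this role is played by a Spark `Encoder` (or `ExpressionEncoder`) for `WindowedValue<T>`; the sketch only shows why exposing the fields as a schema, rather than a serialized blob, is what makes Catalyst's optimizations applicable.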

Imported from Jira BEAM-9451. Original Jira may contain additional context.
Reported by: iemejia.


mosche commented Oct 21, 2022

Fixed with #22445

@mosche closed this as not planned on Oct 21, 2022.