
[Spark Dataset runner] Break lineage of dataset to reduce Spark planning overhead in case of large query plans #25187

Merged: 1 commit merged into apache:master from the tpcds_83_2022-12-28 branch on Feb 3, 2023

Conversation

@mosche (Member) commented Jan 26, 2023

#24711 removed some unnecessary caching when processing MultiParDos with additional but unconsumed outputs. With storage level "Memory Only", caching was done via RDDs. This also had a positive side effect: it breaks the lineage of the dataset. This is particularly beneficial for complex query plans such as the one generated for TPC-DS query 83 (due to #24314).

With caching removed, performance dropped for query 83.
See http://metrics.beam.apache.org/d/tkqc0AdGk2/tpc-ds-spark-classic-new-sql?orgId=1&viewPanel=38&from=1670799600000&to=1673305199000

This change tracks a rough complexity estimate of the datasets and breaks lineage where necessary by converting a dataset to an RDD and back again.
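
A minimal sketch of this lineage-breaking idea, assuming Spark's Java API (the class and method names here are illustrative, not the actual identifiers in this PR): converting a Dataset to an RDD and back replaces the Dataset's accumulated logical plan with a plain scan over that RDD.

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Encoder;
  import org.apache.spark.sql.SparkSession;

  class LineageSketch {
    // Round-tripping through an RDD resets the Dataset's logical plan to a
    // simple scan of that RDD, so Catalyst no longer has to analyze the full
    // upstream query plan whenever the Dataset is reused.
    static <T> Dataset<T> breakLineage(SparkSession session, Dataset<T> dataset, Encoder<T> encoder) {
      return session.createDataset(dataset.rdd(), encoder);
    }
  }

The trade-off is that Catalyst can no longer optimize across the cut, so lineage should only be broken once a plan grows large enough that planning overhead dominates.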



@mosche (Member Author) commented Jan 26, 2023

Run Spark Runner Tpcds Tests

3 similar comments followed on Jan 26 and 27, 2023, re-triggering the same tests.

@mosche force-pushed the tpcds_83_2022-12-28 branch from ecd030c to 00cd938 on January 27, 2023 09:20
@mosche changed the title from "[Spark Dataset runner] Experiment to limit execution plan complexity" to "[Spark Dataset runner] Break lineage of dataset to reduce Spark planning overhead in case of large query plans" on Jan 27, 2023
@mosche marked this pull request as ready for review on January 27, 2023 09:43
@mosche (Member Author) commented Jan 27, 2023

R: @aromanenko-dev
R: @echauchot

@github-actions (Contributor):

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@echauchot (Contributor):

@aromanenko-dev I'll be on vacation starting tonight, can you take care of this review?

@aromanenko-dev (Contributor):

@echauchot Sure, I'll take a look

@aromanenko-dev self-requested a review on January 27, 2023 17:21
@mosche (Member Author) commented Feb 1, 2023

@aromanenko-dev kind ping :)

@aromanenko-dev (Contributor) left a review comment:

Thanks, looks fine to me, only several notes; PTAL.

  TRANSFORM_TRANSLATORS.put(Combine.PerKey.class, new CombinePerKeyTranslatorBatch<>());
  TRANSFORM_TRANSLATORS.put(Combine.Globally.class, new CombineGloballyTranslatorBatch<>());
- TRANSFORM_TRANSLATORS.put(Impulse.class, new ImpulseTranslatorBatch());
+ TRANSFORM_TRANSLATORS.put(Impulse.class, new ImpulseTranslatorBatch(0));
@aromanenko-dev (Contributor) commented on the diff:

How were all these complexity factors (here and below) estimated?

@mosche (Member Author) replied:

These are mostly just relative factors comparing what the translator adds to the DAG; no exact science needed here.

@aromanenko-dev (Contributor):

Does it depend on the translated pipeline?

@mosche (Member Author):

No, it doesn't... this is only about a single PTransform (and its translation) looked at in isolation.

@aromanenko-dev (Contributor) commented Feb 1, 2023:

Then it could just be hardcoded as a constant value in every transform translator class; I don't see a reason to have a dedicated constructor for this.

@mosche (Member Author):

That's the way I implemented it initially... I personally find it easier to have an overview of all values in a single place, but happy to revert.

@mosche (Member Author):

OK, done @aromanenko-dev
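
For illustration, a rough sketch of the bookkeeping discussed in this thread; all names and the threshold value are hypothetical, and per the discussion above the actual factors live as constants in the individual translator classes:

  // Hypothetical accumulator for a dataset's complexity estimate.
  class ComplexityTracker {
    // Illustrative cut-off; the real value is a tuning decision.
    private static final int MAX_PLAN_COMPLEXITY = 100;
    private int complexity = 0;

    // Each translator contributes a fixed, relative factor describing roughly
    // how much it adds to the query DAG, independent of the pipeline.
    void add(int translatorFactor) {
      complexity += translatorFactor;
    }

    // Once the estimate crosses the threshold, the dataset is converted to an
    // RDD and back to truncate its lineage.
    boolean shouldBreakLineage() {
      return complexity > MAX_PLAN_COMPLEXITY;
    }
  }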

@mosche (Member Author) commented Feb 1, 2023

@aromanenko-dev please have another look.

@aromanenko-dev (Contributor) left a review comment:

LGTM
@aromanenko-dev (Contributor) left a second review comment:

LGTM
Feel free to self-merge it once tests are green.

@mosche (Member Author) commented Feb 1, 2023

Run Spark StructuredStreaming ValidatesRunner

1 similar comment followed.

@mosche (Member Author) commented Feb 2, 2023

Thanks @aromanenko-dev, I have to look into the ValidatesRunner tests, not sure what's going on :(

@mosche (Member Author) commented Feb 2, 2023

🤯 WTH, below is the output of the following statements; somehow rdd.persist() corrupts the data 💥

  // Evaluate the persisted dataset twice, its (unpersisted) RDD view twice,
  // and the persisted RDD twice. Per the output below, only the persisted
  // RDD evaluations return corrupted values.
  dataset.persist(storageLevel);

  System.out.println("\nContent of persisted dataset (1st eval)");
  dataset.foreach(printValue);

  System.out.println("\nContent of persisted dataset (2nd eval)");
  dataset.foreach(printValue);

  System.out.println("\nContent of rdd (1st eval)");
  dataset.rdd().foreach(printValue);

  System.out.println("\nContent of rdd (2nd eval)");
  dataset.rdd().foreach(printValue);

  System.out.println("\nContent of persisted rdd (1st eval)");
  dataset.rdd().persist().foreach(printValue);

  System.out.println("\nContent of persisted rdd (2nd eval)");
  dataset.rdd().persist().foreach(printValue);

Output:
Content of persisted dataset (1st eval)
key=k3 {@1797166500}, value=[0] {@621459392}
key=k5 {@522613115}, value=[2147483647, -2147483648] {@50229760}
key=k1 {@935655811}, value=[3, 4] {@1059339408}
key=k2 {@1698407107}, value=[66, -33] {@1059339408}

Content of persisted dataset (2nd eval)
key=k3 {@491772294}, value=[0] {@1519705721}
key=k1 {@1801297116}, value=[3, 4] {@929237815}
key=k2 {@1278247606}, value=[66, -33] {@929237815}
key=k5 {@1642777670}, value=[2147483647, -2147483648] {@822592932}

Content of rdd (1st eval)
key=k5 {@1706950017}, value=[2147483647, -2147483648] {@935984436}
key=k1 {@830413362}, value=[3, 4] {@1431942185}
key=k3 {@1807383422}, value=[0] {@1220036903}
key=k2 {@1334173112}, value=[66, -33] {@1431942185}

Content of rdd (2nd eval)
key=k1 {@1977617218}, value=[3, 4] {@1966593054}
key=k2 {@131385719}, value=[66, -33] {@1966593054}
key=k3 {@966124643}, value=[0] {@4762028}
key=k5 {@1477114665}, value=[2147483647, -2147483648] {@942837752}

Content of persisted rdd (1st eval)
key=k3 {@2050847337}, value=[0] {@83038769}
key=k5 {@906982325}, value=[2147483647, -2147483648] {@714957170}
key=k1 {@654493203}, value=[66, -33] {@1747607045}
key=k2 {@287079803}, value=[66, -33] {@1747607045}

Content of persisted rdd (2nd eval)
key=k1 {@654493203}, value=[66, -33] {@1747607045}
key=k2 {@287079803}, value=[66, -33] {@1747607045}
key=k5 {@906982325}, value=[2147483647, -2147483648] {@714957170}
key=k3 {@2050847337}, value=[0] {@83038769}

@aromanenko-dev (Contributor):

@mosche Are you sure that it's caused by your changes?

@mosche (Member Author) commented Feb 2, 2023

@aromanenko-dev It's triggered by this optional part: https://github.com/apache/beam/pull/25187/files#diff-4df56f442668d45bf7269e0bc379e95298b178c2f3072c72f30c1c0c296caed9R301-R304

I could simply remove that and keep caching as a dataset. But there are also other places where caching is done on the RDD rather than the dataset if the storage level is MEMORY_ONLY. It's a bit concerning to not know why or how this bug is triggered :(
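
For context, a hedged sketch of the two caching paths being contrasted here; class and method names are illustrative and this is not the PR's exact code:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Encoder;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.storage.StorageLevel;

  class CachingSketch {
    static <T> Dataset<T> cache(
        SparkSession session, Dataset<T> dataset, Encoder<T> encoder, StorageLevel level) {
      if (StorageLevel.MEMORY_ONLY().equals(level)) {
        // RDD-level caching; as a side effect this also breaks the dataset's
        // lineage. This is the path where the corrupted values above appeared.
        return session.createDataset(dataset.rdd().persist(level), encoder);
      }
      // Dataset-level caching keeps the logical plan (and lineage) intact.
      return dataset.persist(level);
    }
  }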

@mosche (Member Author) commented Feb 3, 2023

Tracked this down to #25296, will fix the bug first and then get back to this PR.

…ng overhead in case of large complex query plans (relates to apache#24710 and apache#23845)
@mosche force-pushed the tpcds_83_2022-12-28 branch from 3116d46 to 39f241a on February 3, 2023 09:42
@mosche (Member Author) commented Feb 3, 2023

Run Spark StructuredStreaming ValidatesRunner

@mosche (Member Author) commented Feb 3, 2023

Run Java PreCommit

@mosche (Member Author) commented Feb 3, 2023

Run Spark Runner Tpcds Tests

1 similar comment followed.

@mosche merged commit 72781cb into apache:master on Feb 3, 2023
@mosche deleted the tpcds_83_2022-12-28 branch on February 3, 2023 11:42
@Abacn (Contributor) commented Mar 4, 2023

Hi @mosche and @aromanenko-dev, either this PR or #25297 likely caused the SparkStructuredStreaming Batch load test to fail since Feb 2: https://ci-beam.apache.org/view/LoadTests/job/beam_LoadTests_Java_ParDo_SparkStructuredStreaming_Batch/1016/

The error message is java.lang.StackOverflowError, from an infinite recursion. Are any actions being taken?

Update: #25297 is excluded because the StackOverflowError is still seen after reverting that PR: https://ci-beam.apache.org/view/LoadTests/job/beam_LoadTests_Java_ParDo_SparkStructuredStreaming_Batch_PR/5/

@mosche (Member Author) commented Mar 6, 2023

Thanks @Abacn, having a look.

@mosche (Member Author) commented Mar 6, 2023

Fixed here: #25732
