Initial DaskRunner for Beam. #1

Closed
wants to merge 142 commits into from

Conversation

alxmrs
Owner

@alxmrs alxmrs commented Jun 22, 2022

Here, I've created a minimum viable Apache Beam runner for Dask. My approach is to visit a Beam Pipeline and translate PCollections into Dask Bags, and PTransforms into Bag methods.
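To make the translation idea concrete, here is a minimal, standard-library-only sketch. It is not the code in this PR and does not use the real Beam or Dask APIs: `ToyBag`, `TRANSLATIONS`, and `translate` are hypothetical names, and the "pipeline" is just a list of `(transform name, function)` pairs. It only illustrates the shape of the approach: each transform is looked up and applied as a method on a lazy collection, and nothing runs until `compute()` is called.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ToyBag:
    """Hypothetical stand-in for a lazy collection (not dask.bag.Bag)."""

    source: Tuple[Any, ...]
    ops: Tuple[Tuple[str, Callable[[Any], Any]], ...] = ()

    def map(self, fn: Callable[[Any], Any]) -> "ToyBag":
        # Record the operation; do not run it yet.
        return ToyBag(self.source, self.ops + (("map", fn),))

    def filter(self, fn: Callable[[Any], Any]) -> "ToyBag":
        return ToyBag(self.source, self.ops + (("filter", fn),))

    def compute(self) -> List[Any]:
        # Replay the recorded operations in order.
        items = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


# Assumed mapping from transform names to ToyBag method names.
TRANSLATIONS = {"Map": "map", "Filter": "filter"}


def translate(source: Iterable[Any], transforms) -> ToyBag:
    """Fold each (name, fn) transform onto the matching bag method."""
    bag = ToyBag(tuple(source))
    for name, fn in transforms:
        bag = getattr(bag, TRANSLATIONS[name])(fn)
    return bag


pipeline = [("Map", lambda x: x * 2), ("Filter", lambda x: x > 2)]
bag = translate([1, 2, 3], pipeline)  # builds the graph only
result = bag.compute()                # runs it
```

A real runner would also have to handle keyed operations such as GroupByKey, side inputs, and windowing, which this sketch leaves out.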

In this version, I have supported enough operations to make the test pipeline asserts work. The tests themselves are not comprehensive. Further, there are many Bag operations that could be translated for greater efficiency.

CC: @pabloem

Fixes: apache#18962



github-actions bot added the "python" label (Pull requests that update Python code) Jun 23, 2022
@cisaacstern
Collaborator

cisaacstern commented Jun 23, 2022

There's some ambiguity about how we can connect Dask to the runner. The two options are:

  • To create a full Bag graph up front, then call compute.

This is conceptually aligned with what the DirectRunner does. AFAICT, it:

  1. Instantiates a visitor which is passed into pipeline.visit
  2. The visit_transform method of the visitor adds items to a few different iterable attributes of self
  3. Those iterables are then passed to the executor in a few places here

  • To create pseudo graph structures, and evaluate as we traverse the graph.

I don't fully grok what is meant by "pseudo graph structure" here, but the RayRunner evaluates during its visitation cycle:

  1. The visitor and executor are collapsed into a single object
  2. The hybrid visitor/executor's visit_transform method executes during visitation, using a translations dict to map beam operations onto their ray equivalents
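The difference between the two options can be sketched with a toy model. This is only an illustration under stated assumptions: it does not reproduce Beam's visitor API or the DirectRunner/RayRunner code, and `visit`, `APPLY`, `CollectingVisitor`, `execute`, and `EagerVisitor` are hypothetical names. The first visitor only records transforms so a separate executor can run them afterwards; the second applies each transform, via a translations dict, while the pipeline is being visited.

```python
from typing import Any, Callable, Dict, List, Tuple

Node = Tuple[str, Callable[[Any], Any]]


def visit(pipeline: List[Node], visitor) -> None:
    """Toy traversal: hand every node to the visitor in order."""
    for node in pipeline:
        visitor.visit_transform(node)


# Assumed translations dict: transform name -> how to apply it to a list.
APPLY: Dict[str, Callable[[Callable, List[Any]], List[Any]]] = {
    "Map": lambda fn, items: [fn(x) for x in items],
    "Filter": lambda fn, items: [x for x in items if fn(x)],
}


class CollectingVisitor:
    """Option 1: only record what was visited; run it later."""

    def __init__(self) -> None:
        self.steps: List[Node] = []

    def visit_transform(self, node: Node) -> None:
        self.steps.append(node)


def execute(steps: List[Node], items: List[Any]) -> List[Any]:
    """Separate executor that runs the recorded steps."""
    for name, fn in steps:
        items = APPLY[name](fn, items)
    return items


class EagerVisitor:
    """Option 2: visitor and executor collapsed into one object."""

    def __init__(self, items: List[Any]) -> None:
        self.items = list(items)

    def visit_transform(self, node: Node) -> None:
        name, fn = node
        self.items = APPLY[name](fn, self.items)


pipeline = [("Map", lambda x: x + 1), ("Filter", lambda x: x % 2 == 0)]

collector = CollectingVisitor()
visit(pipeline, collector)                 # nothing has run yet
deferred_result = execute(collector.steps, [1, 2, 3])

eager = EagerVisitor([1, 2, 3])
visit(pipeline, eager)                     # work happens during the visit
eager_result = eager.items
```

Both paths produce the same output here; the trade-off is where the work happens, and whether the whole graph is available to inspect or optimize before anything executes.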

It looks like 7cbf5fa lays a foundation for following the RayRunner approach, which makes sense to me!

cc @rabernat

@cisaacstern
Collaborator

Also just thought I'd link the original meeting notes here, so we have all relevant documents in one place:

https://docs.google.com/document/d/1Awj_eNmH-WRSte3bKcCcUlQDiZ5mMKmCO_xV-mHWAak/edit#

@alxmrs
Owner Author

alxmrs commented Jul 2, 2022

Hey @cisaacstern! Check this out now -- I've made some good progress today. The Beam pipeline doesn't return the proper result yet, but after inspecting it in the debugger, it does seem to pass Beam/Python operations to Dask Bag executors.

@cisaacstern
Collaborator

Awesome, @alxmrs! I should have a moment to take a closer look at this later this week.

alxmrs changed the title from "WIP: Created a skeleton dask runner implementation." to "Initial DaskRunner on Beam." Jul 8, 2022
alxmrs changed the title from "Initial DaskRunner on Beam." to "Initial DaskRunner for Beam." Jul 8, 2022
robertwb and others added 12 commits July 8, 2022 16:14
…n of google-cloud-bigquery-storage and google-cloud-core
This covers everything through section 4, Transforms.

Also implements missing transforms CoGroupByKey and Partition and
fixes a bug in DoubleCoder.
…/interactive/extensions/apache-beam-jupyterlab-sidepanel (apache#22200)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Updated Regex and test

* Updated Regex and test
…apache#22134)

This is better aligned with standard javascript convention.

Often libraries have the default version be asynchronous, and
name the Sync one explicitly, we are marking the async ones as
they are by far the most common and usable for pipeline
construction.
@alxmrs alxmrs closed this in 9e475f6 Sep 5, 2022
alxmrs pushed a commit that referenced this pull request Oct 9, 2022
* [Tour of Beam][Frontend][apache#22600] TourScreen layout

* theme setup

* Replaced ThemeProvider with ThemeSwitchNotifier

* header with theme mode switcher and logo

* page container with header & footer

* theme mode tests

* renamed the directory to tour-of-beam

* compressed beam_logo.png

* added missing license comments

* rudimentary layout of the first screen

* review comments fixes #1

* moved notifyListeners inside then

* responsive todo

* split into 2 simple functions

* deleted redundant constants &
replaced 2018 text theme with 2021

* styling refinement

* home screen layout

* clickable sign in text

* font weights fix

* removed _getBaseFontTheme function

* fixed border and bg color

* color fixes

* difficulty component

* _LastModuleBody

* todo in test

* footer border

* fixed overflows

* replaced Project prefix with Tob

* replaced then with await

* inferred type

* started translation of the home screen

* sorted translations

* Complexity comments

* comment fixes

* home screen translations

* sign in overlay

* import fix

* integration test does not fail

* playground_components package with
dismissible_overlay

* missing license

* removed _dots from build

* widgets refinement

* renamed home screen to welcome screen

* deleted copyWith

* _SdkButton

* trailing comma & pubspec formatting

* license and lints

* license

* removed license from .metadata

* pubspec formatting

* total lints update

* changed from tour_of_beam to
tour-of-beam in build.gradle.kts

* license check

* _SdkButton mimics Radio button

* renamed MyApp to TourOfBeamApp

* onChanged mimics Radio button

Tour of Beam frontend blank project

[Tour of Beam][Frontend][apache#22600] TourScreen layout

TourScreen layout (apache#22600)

common theme, constants, split view

missing license

flutter_gen, summary layout details

content layout details

no functional widgets in split view

main screen todos & translation

main screen todos & translation

comment fixes #1

ExpansionTileWrapper

SplitViewController

lists in tour screen widgets

comment fixes #1 (31.08)

split view package in PGC

fixed button overflow

splitter theme color

comment fixes #2 (31.08)

gradlew check

welcome screen overflow test (apache#22600)

SDK dropdown (apache#22600)

flexible complete unit OutlinedButton (apache#22600)

renamed PageContainer to TobScaffold

dropdown style refinement

DropdownButton implicit type

sdk instead of e

licenses apache#22600

renamed _ShrinkedTour to _NarrowTour apache#22600

tour screen style refinement apache#22600

BeamDivider in PGC apache#22600

removed todo, added license apache#22600

built with text apache#22600

_WideWelcome with IntrinsicHeight (apache#22600)

Co-Authored-By: darkhan.nausharipov <[email protected]>

* addressing review comments apache#22600

replaced magic numbers apache#22600

comments (apache#22600)

comments apache#22600

comments apache#22600

comments apache#22600

comments apache#22600

comments, flutter 3.3.0 upgrade apache#22600

renamed ActionPadding to ActionVerticalPadding apache#22600

actions formatting apache#22600

* branded sign in buttons apache#22600

* _BrandedSignInButtons apache#22600

* _Divider color apache#22600

* profile apache#22600

* moved split_view from PGC into ToB apache#22600

* indentation fix apache#22600

* split ProfileContent into widgets apache#22600

* Extract playground components to a separate package (apache#22600)

* Minor fixes (apache#22600)

* Address review issues (apache#22600)

* Upgrade Flutter to v3.3.2 (apache#22600)

* Add precommit Gradle task for playground_components, add code generation to frontend Gradle task, remove generated mocks, fix linter issues (apache#22600)

* startTour button (apache#22600)

* lint fixes (apache#22600)

* Fix highlighting for Python and SCIO (apache#22600)

Co-authored-by: darkhan.nausharipov <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support for a Dask runner