Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add code change guide contributor-doc #30879

Merged
merged 9 commits into from
Apr 18, 2024
Merged

Add code change guide contributor-doc #30879

merged 9 commits into from
Apr 18, 2024

Conversation

Abacn
Copy link
Contributor

@Abacn Abacn commented Apr 5, 2024

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@Abacn Abacn force-pushed the codechangemd branch 3 times, most recently from d916d86 to dceb59e Compare April 12, 2024 02:53
@Abacn Abacn marked this pull request as ready for review April 12, 2024 02:59
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn
Copy link
Contributor Author

Abacn commented Apr 12, 2024

R: @rszper

Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
```
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar
```
The jar is located in the build output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The jar is located in the build output
The jar is located in the build output.

./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar
```
The jar is located in the build output
2. Pass the pipelineOption
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Pass the pipelineOption
2. Use the following command to pass `pipelineOption`:

--dataflowWorkerJar=/.../beam-runners-google-cloud-dataflow-java-legacy-worker-2.XX.0-SNAPSHOT.jar
```

* [Dataflow runner v2] If `sdks/java/harness` or its dependency (like `sdks/java/core`) are changed:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* [Dataflow runner v2] If `sdks/java/harness` or its dependency (like `sdks/java/core`) are changed:
* If you're using Dataflow Runner v2 and `sdks/java/harness` or its dependency (like `sdks/java/core`) have changed, do the following:

```

* [Dataflow runner v2] If `sdks/java/harness` or its dependency (like `sdks/java/core`) are changed:
1. Build SDK harness container
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Build SDK harness container
1. Use the following command to build the SDK harness container:

"us.gcr.io/apache-beam-testing/beam_java11_sdk:2.49.0-custom" # change to your container registry
docker push "us.gcr.io/apache-beam-testing/beam_java11_sdk:2.49.0-custom"
```
2. Run a Runner v2 pipeline with the following options:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Run a Runner v2 pipeline with the following options:
2. Run the pipeline with the following options:

--sdkContainerImage="us.gcr.io/apache-beam-testing/beam_java11_sdk:2.49.0-custom"
```

# Python Guide
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Python Guide
# Python guide


# Python Guide

Beam Python SDK is distributed as a single wheel (which is much simpler than Java SDK), and the Python development is consequently more straightforward.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Beam Python SDK is distributed as a single wheel (which is much simpler than Java SDK), and the Python development is consequently more straightforward.
The Beam Python SDK is distributed as a single wheel, which is more straightforward than the Java SDK. Python development is consequently more less complicated.


## Console (shell) setup

In the following, working directory is set to `sdks/python`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In the following, working directory is set to `sdks/python`.
These instructions explain how to configure your console. In this example, the working directory is set to `sdks/python`.


In the following, working directory is set to `sdks/python`.

1. Install python interpreter with pyenv (recommended)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Install python interpreter with pyenv (recommended)
1. Recommended: Install the Python interpreter by using `pyenv`. Use the following commands:

a. install prerequisites
b. `curl https://pyenv.run | bash`
c. `pyenv install 3.X` (a supported Python version, refer to python_version in [project property](https://github.com/apache/beam/blob/master/gradle.properties))
2. Setup and activate virtual environment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Setup and activate virtual environment
2. Use the following commands to set up and activate the virtual environment:


2. Install the Beam SDK under your Python virtual environment with necessary extensions (e.g. `pip install /path/to/apache-beam.tar.gz[gcp]`)

3. Initiate your Python script running the Pipeline with command like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
3. Initiate your Python script running the Pipeline with command like
3. Initiate your Python script, running the Pipeline with command similar to the following example:

python my_pipeline.py --runner=DataflowRunner --sdk_location=/path/to/apache-beam.tar.gz --project=my_project --region=us-central1 --temp_location=gs://my-bucket/temp ...
```

Tips for Dataflow runner:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Tips for Dataflow runner:
Tips for using the Dataflow runner:


Tips for Dataflow runner:

* Python worker will install the provided Apache Beam SDK before processing work items, so usually no need to provide custom worker container unless your GCP VM does not have Internet access and transient dependencies are changed from the officially released container images. In this case, refer to "Building containers for modified SDK code" instruction above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Python worker will install the provided Apache Beam SDK before processing work items, so usually no need to provide custom worker container unless your GCP VM does not have Internet access and transient dependencies are changed from the officially released container images. In this case, refer to "Building containers for modified SDK code" instruction above
* The Python worker installs the Apache Beam SDK before processing work items. Therefore, you don't usually need to provide a custom worker container. If your Google Cloud VM doesn't have Internet access and transient dependencies are changed from the officially released container images, you do need to provide a custom worker container. In this case, see the section "Build containers for modified SDK code."


* Python worker will install the provided Apache Beam SDK before processing work items, so usually no need to provide custom worker container unless your GCP VM does not have Internet access and transient dependencies are changed from the officially released container images. In this case, refer to "Building containers for modified SDK code" instruction above

* Installing Beam Python SDK from source is known to be slow (3.5 min for `n1-standard-1` machine). Alternatively, if the host machine is amd64 architecture, one can build a wheel (instead of a tarball) with command with command (e.g. `./gradle :sdks:python:bdistPy311linux` for Python 3.11) and pass the built wheel to `--sdk_location`. The installation should then just take seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Installing Beam Python SDK from source is known to be slow (3.5 min for `n1-standard-1` machine). Alternatively, if the host machine is amd64 architecture, one can build a wheel (instead of a tarball) with command with command (e.g. `./gradle :sdks:python:bdistPy311linux` for Python 3.11) and pass the built wheel to `--sdk_location`. The installation should then just take seconds.
* Installing the Beam Python SDK from source can be slow (3.5 minutes for a`n1-standard-1` machine). As an alternative, if the host machine uses amd64 architecture, you can build a wheel instead of a tarball by using a command similar to `./gradle :sdks:python:bdistPy311linux` (for Python 3.11). Pass the built wheel using the `--sdk_location` option. That installation completes in seconds.


### Caveat - `save_main_session`

* `NameError` when running DoFn on remote runner
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `NameError` when running DoFn on remote runner
* `NameError` when running `DoFn` on remote runner

Abacn and others added 2 commits April 12, 2024 16:00
Copy link
Contributor

@rszper rszper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better! Just a bit more cleanup.

contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
contributor-docs/code-change-guide.md Outdated Show resolved Hide resolved
Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of good information here, but it's unclear if it's targeted at the casual (I'm just trying to contribute a single transform) vs. advanced (I just got hired to develop Beam for a living) developer. Also, how much redundancy is there with the existing docs/wiki?

* `sdks/java/core` Java core
* `sdks/java/harness` SDK harness (entrypoint of SDK container)

* `runners` runner supports, written in Java. For example,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are only relevant for Java. It might make sense to state that for everything but Java, all relevant files are in sdks/LANG, whereas for Java the code is split up more widely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this information to code path:

"Java code paths are mainly found in two directories:" ...

"For SDKS in other language, all relevant files are in sdks/LANG, for example," ...


To check if your environment is set up, follow these steps:

Your `PATH` needs to have the following elements configured.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not needed for all development, e.g. one can do much Go/Python development without java installed, and vice versa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified to "Depending on the languages involved, your PATH needs to have ..." and for each item below there is a comment about when it is needed

* A Go environment. Install the latest Go version.
* Needed for Go SDK development and SDK container change (for all SDKs), because
the container entrypoint scripts are written in Go.
* A Docker environment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need docker (and, consequently, go) for cross-language anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I noted "some cross-language functionality (if you run a SDK container image)". To clarify further, added note "not required since Beam 2.53.0")

@Abacn
Copy link
Contributor Author

Abacn commented Apr 16, 2024

There's a lot of good information here, but it's unclear if it's targeted at the casual (I'm just trying to contribute a single transform) vs. advanced (I just got hired to develop Beam for a living) developer. Also, how much redundancy is there with the existing docs/wiki?

It aims to help both casual and future long-term developers quickstart. As for redundancy, it is a combination of several internal docs (see internal tracker 283183339). But every often we find user want to need to patch Beam code to fix issues on their side, and thus need a public facing guide, and ideally a self contained one (that easy to follow)

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a "last updated" note somewhere in this file? Any thoughts on keeping things up to date as Beam evolves?

@Abacn
Copy link
Contributor Author

Abacn commented Apr 18, 2024

Could you add a "last updated" note somewhere in this file? Any thoughts on keeping things up to date as Beam evolves?

added "last updated" timestamp. Keeping up-to-date is not easy. Our wiki has some similar instructions but largely outdated (they refer largely to Jenkins). I try to avoid language that known to be outdated in the future, e.g. specific java and python version. If not possible noted like "Java 8 as of 2024"

@Abacn Abacn merged commit 3e52e35 into apache:master Apr 18, 2024
4 checks passed
@Abacn Abacn deleted the codechangemd branch April 18, 2024 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants