diff --git a/website/www/site/content/en/documentation/io/built-in/google-bigquery.md b/website/www/site/content/en/documentation/io/built-in/google-bigquery.md index 26ca0baec0cf..769b05741345 100644 --- a/website/www/site/content/en/documentation/io/built-in/google-bigquery.md +++ b/website/www/site/content/en/documentation/io/built-in/google-bigquery.md @@ -261,7 +261,7 @@ BigQuery's exported JSON format. {{< paragraph class="language-py" >}} ***Note:*** `BigQuerySource()` is deprecated as of Beam SDK 2.25.0. Before 2.25.0, to read from -a BigQuery table using the Beam SDK, you will apply a `Read` transform on a `BigQuerySource`. For example, +a BigQuery table using the Beam SDK, apply a `Read` transform on a `BigQuerySource`. For example, `beam.io.Read(beam.io.BigQuerySource(table_spec))`. {{< /paragraph >}} @@ -397,8 +397,8 @@ for the destination table(s): whether the destination table must exist or can be created by the write operation. * The destination table's write disposition. The write disposition specifies - whether the data you write will replace an existing table, append rows to an - existing table, or write only to an empty table. + whether the data you write replaces an existing table, appends rows to an + existing table, or writes only to an empty table. In addition, if your write operation creates a new BigQuery table, you must also supply a table schema for the destination table. @@ -512,7 +512,7 @@ use a string that contains a JSON-serialized `TableSchema` object. To create a table schema in Python, you can either use a `TableSchema` object, or use a string that defines a list of fields. Single string based schemas do not support nested fields, repeated fields, or specifying a BigQuery mode for -fields (the mode will always be set to `NULLABLE`). +fields (the mode is always set to `NULLABLE`). {{< /paragraph >}} #### Using a TableSchema @@ -539,7 +539,7 @@ To create and use a table schema as a `TableSchema` object, follow these steps. 2. Create and append a `TableFieldSchema` object for each field in your table. -3. Next, use the `schema` parameter to provide your table schema when you apply +3. Use the `schema` parameter to provide your table schema when you apply a write transform. Set the parameter’s value to the `TableSchema` object. {{< /paragraph >}} @@ -728,8 +728,8 @@ The following examples use this `PCollection` that contains quotes. The `writeTableRows` method writes a `PCollection` of BigQuery `TableRow` objects to a BigQuery table. Each element in the `PCollection` represents a single row in the table. This example uses `writeTableRows` to write elements to a -`PCollection`. The write operation creates a table if needed; if the -table already exists, it will be replaced. +`PCollection`. The write operation creates a table if needed. If the +table already exists, it is replaced. {{< /paragraph >}} {{< highlight java >}} @@ -745,7 +745,7 @@ table already exists, it will be replaced. {{< paragraph class="language-py" >}} The following example code shows how to apply a `WriteToBigQuery` transform to write a `PCollection` of dictionaries to a BigQuery table. The write operation -creates a table if needed; if the table already exists, it will be replaced. +creates a table if needed. If the table already exists, it is replaced. {{< /paragraph >}} {{< highlight py >}} @@ -759,8 +759,8 @@ The `write` transform writes a `PCollection` of custom typed objects to a BigQue table. 
Use `.withFormatFunction(SerializableFunction)` to provide a formatting
function that converts each input element in the `PCollection` into a
`TableRow`. This example uses `write` to write a `PCollection`. The
-write operation creates a table if needed; if the table already exists, it will
-be replaced.
+write operation creates a table if needed. If the table already exists, it is
+replaced.
{{< /paragraph >}}

{{< highlight java >}}
@@ -786,7 +786,7 @@ BigQuery Storage Write API for Python SDK currently has some limitations on supp
{{< /paragraph >}}

{{< paragraph class="language-py" >}}
-**Note:** If you want to run WriteToBigQuery with Storage Write API from the source code, you need to run `./gradlew :sdks:java:io:google-cloud-platform:expansion-service:build` to build the expansion-service jar. If you are running from a released Beam SDK, the jar will already be included.
+**Note:** If you want to run WriteToBigQuery with Storage Write API from the source code, you need to run `./gradlew :sdks:java:io:google-cloud-platform:expansion-service:build` to build the expansion-service jar. If you are running from a released Beam SDK, the jar is already included.

**Note:** Auto sharding is not currently supported for Python's Storage Write API exactly-once mode on DataflowRunner.

@@ -877,32 +877,33 @@ Similar to streaming inserts, `STORAGE_WRITE_API` supports dynamically determini
the number of parallel streams to write to BigQuery (starting 2.42.0). You can explicitly enable this using
[`withAutoSharding`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withAutoSharding--).

-***Note:*** `STORAGE_WRITE_API` will default to dynamic sharding when
+`STORAGE_WRITE_API` defaults to dynamic sharding when
`numStorageWriteApiStreams` is set to 0 or is unspecified.

-***Note:*** Auto sharding with `STORAGE_WRITE_API` is supported on Dataflow's legacy runner, but **not** on Runner V2
+***Note:*** Auto sharding with `STORAGE_WRITE_API` is supported on Dataflow's legacy runner, but **not** on Runner v2.

{{< /paragraph >}}

When using `STORAGE_WRITE_API`, the `PCollection` returned by
[`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--)
-will contain the rows that failed to be written to the Storage Write API sink.
+contains the rows that failed to be written to the Storage Write API sink.

#### At-least-once semantics

If your use case allows for potential duplicate records in the target table, you
can use the
[`STORAGE_API_AT_LEAST_ONCE`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#STORAGE_API_AT_LEAST_ONCE)
-method. Because this method doesn’t persist the records to be written to
-BigQuery into its shuffle storage (needed to provide the exactly-once semantics
-of the `STORAGE_WRITE_API` method), it is cheaper and results in lower latency
-for most pipelines. If you use `STORAGE_API_AT_LEAST_ONCE`, you don’t need to
+method. This method doesn’t persist the records to be written to
+BigQuery in shuffle storage, a step that the `STORAGE_WRITE_API` method
+requires to provide exactly-once semantics. Therefore, for most
+pipelines, this method is less expensive and results in lower latency.
+If you use `STORAGE_API_AT_LEAST_ONCE`, you don’t need to
specify the number of streams, and you can’t specify the triggering frequency.
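+
+The following is a minimal sketch of an at-least-once write. It reuses the
+`quotes` collection, table spec, schema, and format function from the earlier
+examples; those names are assumptions carried over for illustration, not new API:
+
+{{< highlight java >}}
+quotes.apply(
+    BigQueryIO.<Quote>write()
+        .to(tableSpec)
+        .withSchema(tableSchema)
+        .withFormatFunction(
+            (Quote elem) ->
+                new TableRow().set("source", elem.source).set("quote", elem.quote))
+        // Trade exactly-once guarantees for lower cost and latency.
+        .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE));
+{{< /highlight >}}
+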
Auto sharding is not applicable for `STORAGE_API_AT_LEAST_ONCE`.

When using `STORAGE_API_AT_LEAST_ONCE`, the `PCollection` returned by
[`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--)
-will contain the rows that failed to be written to the Storage Write API sink.
+contains the rows that failed to be written to the Storage Write API sink.

#### Quotas

diff --git a/website/www/site/content/en/documentation/sdks/python-unrecoverable-errors.md b/website/www/site/content/en/documentation/sdks/python-unrecoverable-errors.md
index 4e5d94ce8a8d..4fbb739e7ec7 100644
--- a/website/www/site/content/en/documentation/sdks/python-unrecoverable-errors.md
+++ b/website/www/site/content/en/documentation/sdks/python-unrecoverable-errors.md
@@ -16,46 +16,58 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Unrecoverable Errors in Beam Python
+# Unrecoverable errors in Beam Python

-## What is an Unrecoverable Error?
+Unrecoverable errors are issues that occur at job start-up time and
+prevent jobs from ever running successfully. The problem usually stems
+from a misconfiguration. This page describes common errors and
+provides troubleshooting information.

-An unrecoverable error is an issue at job start-up time that will
-prevent a job from ever running successfully, usually due to some kind
-of misconfiguration. Solving these issues when they occur is key to
-successfully running a Beam Python pipeline.
+## Job submission or Python runtime version mismatch {#python-version-mismatch}

-## Common Unrecoverable Errors
+If the Python version that you use to submit your job doesn't match the
+Python version used to build the worker container, the job doesn't run.
+The job fails immediately after job submission.

-### Job Submission/Runtime Python Version Mismatch
+To resolve this issue, ensure that the Python version used to submit the job
+matches the Python container version.

-If the Python version used for job submission does not match the
-Python version used to build the worker container, the job will not
-execute. Ensure that the Python version being used for job submission
-and the container Python version match.
+## Dependency resolution failures with pip {#dependency-resolution-failures}

-### PIP Dependency Resolution Failures
+During worker start-up, the worker might fail and, depending on the
+runner, try to restart.

-During worker start-up, dependencies are checked and installed in
-the worker container before accepting work. If a pipeline requires
-additional dependencies not already present in the runtime environment,
-they are installed here. If there’s an issue during this process
-(e.g. a dependency version cannot be found, or a worker cannot
-connect to PyPI) the worker will fail and may try to restart
-depending on the runner. Ensure that dependency versions provided in
-your requirements.txt file exist and can be installed locally before
-submitting jobs.
+Before workers accept work, dependencies are checked and installed in
+the worker container. If a pipeline requires
+dependencies not already present in the runtime environment,
+they are installed at this time.
+When a problem occurs during this process, you might encounter
+dependency resolution failures.

-### Dependency Version Mismatches
+Examples of problems include the following:

-When additional dependencies like `torch`, `transformers`, etc. are not
-specified via a requirements_file or preinstalled in a custom container
-then the worker might fail to deserialize (unpickle) the user code.
-This can result in `ModuleNotFound` errors. If dependencies are installed
-but their versions don't match the versions in submission environment,
-pipeline might have `AttributeError` messages.
+- A dependency version can't be found.
+- A worker can't connect to PyPI.

-Ensure that the required dependencies at runtime and in the submission
-environment are the same along with their versions. For better visibility,
-debug logs are added specifying the dependencies at both stages starting in
-Beam 2.52.0. For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies
\ No newline at end of file
+To resolve this issue, before submitting your job, ensure that the
+dependency versions provided in your `requirements.txt` file exist
+and that you can install them locally.
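+
+For example, a quick preflight check in a clean virtual environment can
+confirm that every pinned version exists and installs cleanly. This is a
+sketch; the environment path is an assumption:
+
+{{< highlight sh >}}
+# Create a throwaway environment and try the exact install the worker performs.
+python -m venv /tmp/beam-deps-check
+source /tmp/beam-deps-check/bin/activate
+pip install -r requirements.txt
+{{< /highlight >}}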
+
+## Dependency version mismatches {#dependency-version}
+
+When your pipeline has dependency version mismatches, you might
+see `ModuleNotFoundError` or `AttributeError` messages.
+
+- `ModuleNotFoundError` occurs when additional dependencies,
+  such as `torch` and `transformers`, are neither specified in a
+  `requirements_file` nor preinstalled in a custom container.
+  In this case, the worker might fail to deserialize (unpickle) the user code.
+
+- Your pipeline might raise `AttributeError` when dependencies
+  are installed but their versions don't match the versions in the
+  submission environment.
+
+To resolve these problems, ensure that the required dependencies and their versions are the same
+at runtime and in the submission environment. To help you identify these issues,
+in Apache Beam 2.52.0 and later versions, debug logs specify the dependencies at both stages.
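+
+One way to keep the two environments aligned is to stage the same pinned
+requirements file that you verified locally, as in the following sketch
+(the file name is an assumption):
+
+{{< highlight py >}}
+import apache_beam as beam
+from apache_beam.options.pipeline_options import PipelineOptions
+
+# Workers install the same pinned versions that were verified locally.
+options = PipelineOptions(requirements_file="requirements.txt")
+
+with beam.Pipeline(options=options) as pipeline:
+    pipeline | beam.Create(["hello", "beam"]) | beam.Map(print)
+{{< /highlight >}}
+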
+For more information, see
+[Control the dependencies the pipeline uses](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies).
\ No newline at end of file
diff --git a/website/www/site/content/en/get-started/_index.md b/website/www/site/content/en/get-started/_index.md
index c436129b066a..8aa6ff626c42 100644
--- a/website/www/site/content/en/get-started/_index.md
+++ b/website/www/site/content/en/get-started/_index.md
@@ -21,17 +21,18 @@ limitations under the License.

# Get Started with Apache Beam

-Learn to use Beam to create data processing pipelines that run on supported processing back-ends:
+Learn how to use Beam to create data processing pipelines that run on supported processing back-ends.

-## [Tour of Beam](https://tour.beam.apache.org)
+## Tour of Beam

-Learn Beam with an interactive tour with learning topics covering core Beam concepts
-from simple ones to more advanced ones.
+[Learn Beam with an interactive tour](https://tour.beam.apache.org).
+Topics include core Beam concepts, from simple to advanced.
You can try examples, do exercises, and solve challenges along the learning journey.

-## [Beam Overview](/get-started/beam-overview)
+## Beam Overview

-Learn about the Beam model, the currently available Beam SDKs and Runners, and Beam's native I/O connectors.
+Read the [Apache Beam Overview](/get-started/beam-overview) to learn about the Beam model,
+the currently available Beam SDKs and runners, and Beam's native I/O connectors.

## Quickstarts for Java, Python, Go, and TypeScript

@@ -49,10 +50,15 @@ See detailed walkthroughs of complete Beam pipelines.
- [WordCount](/get-started/wordcount-example): Simple example pipelines that demonstrate basic Beam programming, including debugging and testing
- [Mobile Gaming](/get-started/mobile-gaming-example): A series of more advanced pipelines that demonstrate use cases in the mobile gaming domain

-## [Downloads and Releases](/get-started/downloads)
+## Downloads and Releases

-Find download links and information on the latest Beam releases, including versioning and release notes.
+Find download links and information about the latest Beam releases, including versioning and release notes,
+on the [Apache Beam Downloads](/get-started/downloads) page.

-## [Support](/get-started/support)
+## Support

-Find resources, such as mailing lists and issue tracking, to help you use Beam. Ask questions and discuss topics via [Stack Overflow](https://stackoverflow.com/questions/tagged/apache-beam) or on Beam's [Slack Channel](https://apachebeam.slack.com).
+- Find resources to help you use Beam, such as mailing lists and issue tracking,
+  on the [Support](/get-started/support) page.
+- Ask questions and discuss topics on
+  [Stack Overflow](https://stackoverflow.com/questions/tagged/apache-beam)
+  or in the Beam [Slack Channel](https://apachebeam.slack.com).

diff --git a/website/www/site/content/en/get-started/downloads.md b/website/www/site/content/en/get-started/downloads.md
index b564a5801cd8..cc71f3101eb1 100644
--- a/website/www/site/content/en/get-started/downloads.md
+++ b/website/www/site/content/en/get-started/downloads.md
@@ -19,7 +19,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Apache Beam™ Downloads
+# Apache Beam® Downloads

> Beam SDK {{< param release_latest >}} is the latest released version.