Skip to content

Commit

Permalink
Editorial pass on content (apache#29094)
Browse files Browse the repository at this point in the history
  • Loading branch information
rszper authored and Kanishk Karanawat committed Oct 21, 2023
1 parent 6c3cca4 commit 2c7310f
Show file tree
Hide file tree
Showing 4 changed files with 82 additions and 63 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -261,7 +261,7 @@ BigQuery's exported JSON format.

{{< paragraph class="language-py" >}}
***Note:*** `BigQuerySource()` is deprecated as of Beam SDK 2.25.0. Before 2.25.0, to read from
a BigQuery table using the Beam SDK, you will apply a `Read` transform on a `BigQuerySource`. For example,
a BigQuery table using the Beam SDK, apply a `Read` transform on a `BigQuerySource`. For example,
`beam.io.Read(beam.io.BigQuerySource(table_spec))`.
{{< /paragraph >}}

Expand Down Expand Up @@ -397,8 +397,8 @@ for the destination table(s):
whether the destination table must exist or can be created by the write
operation.
* The destination table's write disposition. The write disposition specifies
whether the data you write will replace an existing table, append rows to an
existing table, or write only to an empty table.
whether the data you write replaces an existing table, appends rows to an
existing table, or writes only to an empty table.

In addition, if your write operation creates a new BigQuery table, you must also
supply a table schema for the destination table.
Expand Down Expand Up @@ -512,7 +512,7 @@ use a string that contains a JSON-serialized `TableSchema` object.
To create a table schema in Python, you can either use a `TableSchema` object,
or use a string that defines a list of fields. Single string based schemas do
not support nested fields, repeated fields, or specifying a BigQuery mode for
fields (the mode will always be set to `NULLABLE`).
fields (the mode is always set to `NULLABLE`).
{{< /paragraph >}}

#### Using a TableSchema
Expand All @@ -539,7 +539,7 @@ To create and use a table schema as a `TableSchema` object, follow these steps.

2. Create and append a `TableFieldSchema` object for each field in your table.

3. Next, use the `schema` parameter to provide your table schema when you apply
3. Use the `schema` parameter to provide your table schema when you apply
a write transform. Set the parameter’s value to the `TableSchema` object.
{{< /paragraph >}}

Expand Down Expand Up @@ -728,8 +728,8 @@ The following examples use this `PCollection` that contains quotes.
The `writeTableRows` method writes a `PCollection` of BigQuery `TableRow`
objects to a BigQuery table. Each element in the `PCollection` represents a
single row in the table. This example uses `writeTableRows` to write elements to a
`PCollection<TableRow>`. The write operation creates a table if needed; if the
table already exists, it will be replaced.
`PCollection<TableRow>`. The write operation creates a table if needed. If the
table already exists, it is replaced.
{{< /paragraph >}}

{{< highlight java >}}
Expand All @@ -745,7 +745,7 @@ table already exists, it will be replaced.
{{< paragraph class="language-py" >}}
The following example code shows how to apply a `WriteToBigQuery` transform to
write a `PCollection` of dictionaries to a BigQuery table. The write operation
creates a table if needed; if the table already exists, it will be replaced.
creates a table if needed. If the table already exists, it is replaced.
{{< /paragraph >}}

{{< highlight py >}}
Expand All @@ -759,8 +759,8 @@ The `write` transform writes a `PCollection` of custom typed objects to a BigQue
table. Use `.withFormatFunction(SerializableFunction)` to provide a formatting
function that converts each input element in the `PCollection` into a
`TableRow`. This example uses `write` to write a `PCollection<String>`. The
write operation creates a table if needed; if the table already exists, it will
be replaced.
write operation creates a table if needed. If the table already exists, it is
replaced.
{{< /paragraph >}}

{{< highlight java >}}
Expand All @@ -786,7 +786,7 @@ BigQuery Storage Write API for Python SDK currently has some limitations on supp
{{< /paragraph >}}

{{< paragraph class="language-py" >}}
**Note:** If you want to run WriteToBigQuery with Storage Write API from the source code, you need to run `./gradlew :sdks:java:io:google-cloud-platform:expansion-service:build` to build the expansion-service jar. If you are running from a released Beam SDK, the jar will already be included.
**Note:** If you want to run WriteToBigQuery with Storage Write API from the source code, you need to run `./gradlew :sdks:java:io:google-cloud-platform:expansion-service:build` to build the expansion-service jar. If you are running from a released Beam SDK, the jar is already included.

**Note:** Auto sharding is not currently supported for Python's Storage Write API exactly-once mode on DataflowRunner.

Expand Down Expand Up @@ -877,32 +877,33 @@ Similar to streaming inserts, `STORAGE_WRITE_API` supports dynamically determini
the number of parallel streams to write to BigQuery (starting 2.42.0). You can
explicitly enable this using [`withAutoSharding`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withAutoSharding--).

***Note:*** `STORAGE_WRITE_API` will default to dynamic sharding when
`STORAGE_WRITE_API` defaults to dynamic sharding when
`numStorageWriteApiStreams` is set to 0 or is unspecified.

***Note:*** Auto sharding with `STORAGE_WRITE_API` is supported on Dataflow's legacy runner, but **not** on Runner V2
***Note:*** Auto sharding with `STORAGE_WRITE_API` is supported by Dataflow, but **not** on Runner v2.
{{< /paragraph >}}

When using `STORAGE_WRITE_API`, the `PCollection` returned by
[`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--)
will contain the rows that failed to be written to the Storage Write API sink.
contains the rows that failed to be written to the Storage Write API sink.

#### At-least-once semantics

If your use case allows for potential duplicate records in the target table, you
can use the
[`STORAGE_API_AT_LEAST_ONCE`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#STORAGE_API_AT_LEAST_ONCE)
method. Because this method doesn’t persist the records to be written to
BigQuery into its shuffle storage (needed to provide the exactly-once semantics
of the `STORAGE_WRITE_API` method), it is cheaper and results in lower latency
for most pipelines. If you use `STORAGE_API_AT_LEAST_ONCE`, you don’t need to
method. This method doesn’t persist the records to be written to
BigQuery into its shuffle storage, which is needed to provide the exactly-once semantics
of the `STORAGE_WRITE_API` method. Therefore, for most pipelines, using this method is often
less expensive and results in lower latency.
If you use `STORAGE_API_AT_LEAST_ONCE`, you don’t need to
specify the number of streams, and you can’t specify the triggering frequency.

Auto sharding is not applicable for `STORAGE_API_AT_LEAST_ONCE`.

When using `STORAGE_API_AT_LEAST_ONCE`, the `PCollection` returned by
[`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--)
will contain the rows that failed to be written to the Storage Write API sink.
contains the rows that failed to be written to the Storage Write API sink.

#### Quotas

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,46 +16,58 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Unrecoverable Errors in Beam Python
# Unrecoverable errors in Beam Python

## What is an Unrecoverable Error?
Unrecoverable errors are issues that occur at job start-up time and
prevent jobs from ever running successfully. The problem usually stems
from a misconfiguration. This page provides context about
common errors and troubleshooting information.

An unrecoverable error is an issue at job start-up time that will
prevent a job from ever running successfully, usually due to some kind
of misconfiguration. Solving these issues when they occur is key to
successfully running a Beam Python pipeline.
## Job submission or Python runtime version mismatch {#python-version-mismatch}

## Common Unrecoverable Errors
If the Python version that you use to submit your job doesn't match the
Python version used to build the worker container, the job doesn't run.
The job fails immediately after job submission.

### Job Submission/Runtime Python Version Mismatch
To resolve this issue, ensure that the Python version used to submit the job
matches the Python container version.

If the Python version used for job submission does not match the
Python version used to build the worker container, the job will not
execute. Ensure that the Python version being used for job submission
and the container Python version match.
## Dependency resolution failures with pip {#dependency-resolution-failures}

### PIP Dependency Resolution Failures
During worker start-up, the worker might fail and, depending on the
runner, try to restart.

During worker start-up, dependencies are checked and installed in
the worker container before accepting work. If a pipeline requires
additional dependencies not already present in the runtime environment,
they are installed here. If there’s an issue during this process
(e.g. a dependency version cannot be found, or a worker cannot
connect to PyPI) the worker will fail and may try to restart
depending on the runner. Ensure that dependency versions provided in
your requirements.txt file exist and can be installed locally before
submitting jobs.
Before workers accept work, dependencies are checked and installed in
the worker container. If a pipeline requires
dependencies not already present in the runtime environment,
they are installed at this time.
When a problem occurs during this process, you might encounter
dependency resolution failures.

### Dependency Version Mismatches
Examples of problems include the following:

When additional dependencies like `torch`, `transformers`, etc. are not
specified via a requirements_file or preinstalled in a custom container
then the worker might fail to deserialize (unpickle) the user code.
This can result in `ModuleNotFound` errors. If dependencies are installed
but their versions don't match the versions in submission environment,
pipeline might have `AttributeError` messages.
- A dependency version can't be found.
- A worker can't connect to PyPI.

Ensure that the required dependencies at runtime and in the submission
environment are the same along with their versions. For better visibility,
debug logs are added specifying the dependencies at both stages starting in
Beam 2.52.0. For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies
To resolve this issue, before submitting your job, ensure that the
dependency versions provided in your `requirements.txt` file exist
and that you can install them locally.

## Dependency version mismatches {#dependency-version}

When your pipeline has dependency version mismatches, you might
see `ModuleNotFound` errors or `AttributeError` messages.

- The `ModuleNotFound` errors occur when additional dependencies,
such as `torch` and `transformers`, are neither specified in a
`requirements_file` nor preinstalled in a custom container.
In this case, the worker might fail to deserialize (unpickle) the user code.

- Your pipeline might have `AttributeError` messages when dependencies
are installed but their versions don't match the versions in submission environment.

To resolve these problems, ensure that the required dependencies and their versions are the same
at runtime and in the submission environment. To help you identify these issues,
in Apache Beam 2.52.0 and later versions, debug logs specify the dependencies at both stages.
For more information, see
[Control the dependencies the pipeline uses](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies).
26 changes: 16 additions & 10 deletions website/www/site/content/en/get-started/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,17 +21,18 @@ limitations under the License.

# Get Started with Apache Beam

Learn to use Beam to create data processing pipelines that run on supported processing back-ends:
Learn how to use Beam to create data processing pipelines that run on supported processing back-ends.

## [Tour of Beam](https://tour.beam.apache.org)
## Tour of Beam

Learn Beam with an interactive tour with learning topics covering core Beam concepts
from simple ones to more advanced ones.
[Learn Beam with an interactive tour](https://tour.beam.apache.org).
Topics include core Beam concepts, from simple to advanced.
You can try examples, do exercises, and solve challenges along the learning journey.

## [Beam Overview](/get-started/beam-overview)
## Beam Overview

Learn about the Beam model, the currently available Beam SDKs and Runners, and Beam's native I/O connectors.
Read the [Apache Beam Overview](/get-started/beam-overview) to learn about the Beam model,
the currently available Beam SDKs and runners, and Beam's native I/O connectors.

## Quickstarts for Java, Python, Go, and TypeScript

Expand All @@ -49,10 +50,15 @@ See detailed walkthroughs of complete Beam pipelines.
- [WordCount](/get-started/wordcount-example): Simple example pipelines that demonstrate basic Beam programming, including debugging and testing
- [Mobile Gaming](/get-started/mobile-gaming-example): A series of more advanced pipelines that demonstrate use cases in the mobile gaming domain

## [Downloads and Releases](/get-started/downloads)
## Downloads and Releases

Find download links and information on the latest Beam releases, including versioning and release notes.
Find download links and information about the latest Beam releases, including versioning and release notes,
on the [Apache Beam Downloads](/get-started/downloads) page.

## [Support](/get-started/support)
## Support

Find resources, such as mailing lists and issue tracking, to help you use Beam. Ask questions and discuss topics via [Stack Overflow](https://stackoverflow.com/questions/tagged/apache-beam) or on Beam's [Slack Channel](https://apachebeam.slack.com).
- Find resources to help you use Beam, such as mailing lists and issue tracking,
on the [Support](/get-started/support) page.
- Ask questions and discuss topics on
[Stack Overflow](https://stackoverflow.com/questions/tagged/apache-beam)
or in the Beam [Slack Channel](https://apachebeam.slack.com).
2 changes: 1 addition & 1 deletion website/www/site/content/en/get-started/downloads.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Apache Beam&#8482; Downloads
# Apache Beam<sup>®</sup> Downloads

> Beam SDK {{< param release_latest >}} is the latest released version.
Expand Down

0 comments on commit 2c7310f

Please sign in to comment.