forked from apache/beam
Commit: Merge branch 'apache:master' into master
Showing 150 changed files with 6,207 additions and 378 deletions.
118 changes: 118 additions & 0 deletions
.github/workflows/IO_Iceberg.yml

```yaml
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: IcebergIO Unit Tests

on:
  push:
    tags: ['v*']
    branches: ['master', 'release-*']
    paths:
      - "sdks/java/io/iceberg/**"
      - ".github/workflows/IO_Iceberg.yml"
  pull_request_target:
    branches: ['master', 'release-*']
    paths:
      - "sdks/java/io/iceberg/**"
      - 'release/trigger_all_tests.json'
      - '.github/trigger_files/IO_Iceberg.json'
  issue_comment:
    types: [created]
  schedule:
    - cron: '15 1/6 * * *'
  workflow_dispatch:

# Setting explicit permissions for the action to avoid the default permissions which are `write-all` in case of pull_request_target event
permissions:
  actions: write
  pull-requests: write
  checks: write
  contents: read
  deployments: read
  id-token: none
  issues: write
  discussions: read
  packages: read
  pages: read
  repository-projects: read
  security-events: read
  statuses: read

# This allows a subsequently queued workflow run to interrupt previous runs
concurrency:
  group: '${{ github.workflow }} @ ${{ github.event.issue.number || github.event.pull_request.head.label || github.sha || github.head_ref || github.ref }}-${{ github.event.schedule || github.event.comment.id || github.event.sender.login }}'
  cancel-in-progress: true

env:
  GRADLE_ENTERPRISE_ACCESS_KEY: ${{ secrets.GE_ACCESS_TOKEN }}
  GRADLE_ENTERPRISE_CACHE_USERNAME: ${{ secrets.GE_CACHE_USERNAME }}
  GRADLE_ENTERPRISE_CACHE_PASSWORD: ${{ secrets.GE_CACHE_PASSWORD }}

jobs:
  IO_Iceberg:
    name: ${{ matrix.job_name }} (${{ matrix.job_phrase }})
    strategy:
      matrix:
        job_name: ["IO_Iceberg"]
    timeout-minutes: 60
    if: |
      github.event_name == 'push' ||
      github.event_name == 'pull_request_target' ||
      (github.event_name == 'schedule' && github.repository == 'apache/beam') ||
      github.event_name == 'workflow_dispatch' ||
      github.event.comment.body == 'Run Java_Amqp_IO_Direct PreCommit'
    runs-on: [self-hosted, ubuntu-20.04, main]
    steps:
      - uses: actions/checkout@v4
      - name: Setup repository
        uses: ./.github/actions/setup-action
        with:
          comment_phrase: ${{ matrix.job_phrase }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          github_job: ${{ matrix.job_name }} (${{ matrix.job_phrase }})
      - name: Setup environment
        uses: ./.github/actions/setup-environment-action
      - name: run Amqp IO build script
        uses: ./.github/actions/gradle-command-self-hosted-action
        with:
          gradle-command: :sdks:java:io:iceberg:build
          arguments: |
            -PdisableSpotlessCheck=true \
            -PdisableCheckStyle=true \
      - name: Archive JUnit Test Results
        uses: actions/upload-artifact@v4
        if: ${{ !success() }}
        with:
          name: JUnit Test Results
          path: "**/build/reports/tests/"
      - name: Publish JUnit Test Results
        uses: EnricoMi/publish-unit-test-result-action@v2
        if: always()
        with:
          commit: '${{ env.prsha || env.GITHUB_SHA }}'
          comment_mode: ${{ github.event_name == 'issue_comment' && 'always' || 'off' }}
          files: '**/build/test-results/**/*.xml'
      - name: Archive SpotBugs Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: SpotBugs Results
          path: '**/build/reports/spotbugs/*.html'
      - name: Publish SpotBugs Results
        uses: jwgmeligmeyling/[email protected]
        if: always()
        with:
          name: Publish SpotBugs
          path: '**/build/reports/spotbugs/*.html'
```
16 changes: 16 additions & 0 deletions
learning/prompts/documentation-lookup-nolinks/01_basic_learning_apache_beam.md
Prompt:
What is Apache Beam?

Response:
Apache Beam is an open-source unified programming model for data processing pipelines. It provides a set of language-specific SDKs that allow you to define and execute data processing workflows. Apache Beam allows you to write pipelines in one of the supported languages (Java, Python, Go, and Scala). The pipelines can be executed on one of the supported runners (Apache Flink, Apache Spark, and Google Cloud Dataflow).

Learning Apache Beam is a valuable skill for data processing and analysis. Here are some resources to get you started:
* The 'Get Started' section of the Apache Beam website is a perfect place to start. It provides an overview of Apache Beam and its key concepts.
* For a hands-on introduction, you can start by exploring the WordCount example provided in the 'Get Started' section, which demonstrates the basic concepts of Apache Beam.
* Tour of Beam is an interactive tour that teaches you core Beam concepts. It provides a sandbox environment to write and run pipelines while exploring different topics.
* Beam Playground offers an interactive space to experiment with Apache Beam transforms and examples without installing anything on your system.

After you have a basic understanding of Apache Beam, you can start to build your own pipelines. The Apache Beam website has a number of resources that can help you get started, including the Apache Beam Documentation and the Apache Beam Community. For those interested in the source code, you can explore the Apache Beam repository on GitHub.

By leveraging these resources, you can build a strong foundation in Apache Beam and start constructing your own data processing pipelines. Additionally, Apache Beam is an open-source project, and contributions are welcome. Whether you're fixing bugs, improving documentation, or adding new features, your work can benefit the entire Beam community. To learn more about how you can contribute, visit the 'Contribute' section of the Apache Beam website.
18 changes: 18 additions & 0 deletions
learning/prompts/documentation-lookup-nolinks/02_basic_pipelines.md
Prompt:
What is a pipeline in Apache Beam?

Response:
In Apache Beam, a pipeline encapsulates the entirety of a data processing task, including all the data and each step of the process. Essentially, it's a directed acyclic graph of transformations (`PTransform`) applied to data collections (`PCollection`).

The simplest pipelines in Apache Beam follow a linear flow of operations, typically adhering to a read-process-write pattern. However, pipelines can also be significantly more complex, featuring multiple input sources, output sinks, and transforms that can both read from and output to multiple `PCollection` objects. For more information about pipeline design and best practices, see the 'Common Pipeline Patterns' section in the Apache Beam documentation.
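
As a minimal sketch of such a branching pipeline (the step labels and sample data are illustrative only, not from the Beam documentation), one `PCollection` can feed two independent transforms:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # One input source...
    words = pipeline | 'CreateWords' >> beam.Create(['apache', 'beam', 'pipeline'])

    # ...branching into two independent transforms over the same PCollection.
    uppercased = words | 'Uppercase' >> beam.Map(str.upper)
    lengths = words | 'Lengths' >> beam.Map(len)
```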

To use Apache Beam, your driver program must first create an instance of the Beam SDK class `Pipeline` (typically in the `main()` function).

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    pass  # Build your pipeline here.
```

When you create your pipeline, you also need to set configuration options. You can set your pipeline’s configuration options programmatically. You can also set the options ahead of time, or read them from the command line, and then pass them to the `Pipeline` object when you create the object. This approach is often easier than setting the pipeline configuration options programmatically.
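
A minimal sketch of the command-line approach, assuming the script is invoked with flags such as `--runner`, forwards any unrecognized arguments to `PipelineOptions`:

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
# Parse your own arguments here; everything unknown is forwarded to Beam.
_, beam_args = parser.parse_known_args()

options = PipelineOptions(beam_args)
with beam.Pipeline(options=options) as pipeline:
    pass  # Build your pipeline here.
```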
31 changes: 31 additions & 0 deletions
learning/prompts/documentation-lookup-nolinks/03_basic_configuring_pipelines.md
Prompt:
How do I configure a pipeline in Apache Beam?

Response:
Configuring pipeline options in Apache Beam is crucial for specifying the execution environment, managing resources, and tailoring the pipeline's behavior to meet specific data processing needs and performance requirements.

You can set pipeline options programmatically or pass them in using the command line. These options include runner-specific settings, the job name, the project ID (for cloud runners), machine types, the number of workers, and more.

Apache Beam offers a variety of standard pipeline options that allow you to customize and optimize your data processing pipelines.

Beam SDKs include a command-line parser that you can use to set pipeline options. Use command-line arguments in the format `--<option>=<value>`. For example, the following command sets the `--runner` option to `DirectRunner` and the `--project` option to `my-project-id`:

```bash
python my-pipeline.py --runner=DirectRunner --project=my-project-id
```

To set pipeline options programmatically, use the `PipelineOptions` class. For example, the following code sets the `--runner` option to `DirectRunner` and the `--project` option to `my-project-id`, then passes the options to the `Pipeline` object:

```python
from apache_beam import Pipeline
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    project='my-project-id',
    runner='DirectRunner'
)
pipeline = Pipeline(options=options)  # Pass the options to the pipeline.
```

In addition to the standard pipeline options, you can add custom pipeline options. For a common pattern for configuring pipeline options, see the 'Pipeline option patterns' section in the Apache Beam documentation.
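
As a minimal sketch of the subclassing pattern for custom options (the `--input_topic` flag and its default are hypothetical, not part of the Beam API):

```python
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Hypothetical custom flag for illustration only.
        parser.add_argument('--input_topic', default='my-topic',
                            help='Topic for the pipeline to read from.')

# view_as interprets the parsed options as the custom subclass.
options = PipelineOptions().view_as(MyOptions)
```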

The WordCount example pipeline in the 'Get Started' section of the Apache Beam documentation demonstrates how to set pipeline options at runtime by using command-line options.
25 changes: 25 additions & 0 deletions
learning/prompts/documentation-lookup-nolinks/04_basic_pcollections.md
Prompt:
What is a `PCollection` in Apache Beam?

Response:
A `PCollection` in Apache Beam is an unordered bag of elements. Each `PCollection` is a potentially distributed, homogeneous data set or data stream, and is owned by the specific `Pipeline` object for which it is created. It’s the primary data structure used in Apache Beam pipelines to handle large-scale data processing in batch and streaming modes.

The following example shows how to create a `PCollection` using the `Create` transform:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    pcollection = pipeline | beam.Create([...])  # Create a PCollection
```

A `PCollection` can be either bounded or unbounded, making it versatile for different data source types. Bounded `PCollection` objects represent a finite data set, such as files or databases, ideal for batch processing. Unbounded `PCollection` objects, on the other hand, represent data streams that continuously grow over time, such as real-time event logs, suitable for stream processing.
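
As a brief illustration (the file path and topic name below are placeholders), a bounded `PCollection` typically comes from a finite source such as a text file, while an unbounded one comes from a streaming source:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Bounded: a finite text file (the path is a placeholder).
    lines = pipeline | 'ReadFile' >> beam.io.ReadFromText('input.txt')

# An unbounded source such as beam.io.ReadFromPubSub produces a stream that
# grows over time and generally requires streaming mode, for example:
#   messages = pipeline | beam.io.ReadFromPubSub(topic='projects/p/topics/t')
```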

Beam’s computational patterns and transforms are focused on situations where distributed data-parallel computation is required. Therefore, a `PCollection` has the following key characteristics:
- All elements must be of the same type (with support for structured types).
- Every `PCollection` has a coder, which is a specification of the binary format of the elements.
- Elements are immutable. They can't be altered after creation.
- Random access to individual elements of the collection isn't allowed.
- Beam encodes each element for distributed processing.

For more information about `PCollection` objects in Apache Beam and usage patterns, see the 'Beam Programming Guide: PCollections' section in the Apache Beam documentation.