
[FLINK-35745] add documentation for flink lineage #25762

Open · wants to merge 1 commit into base: master
Conversation

HuangZhenQiu (Contributor)

What is the purpose of the change

Add documentation for native lineage support in Flink, mainly for connector developers.

Brief change log

  • Add data_lineage.md under docs/internals for both English and Chinese
  • Improve the existing content in job_status_listener.md

Verifying this change

  • Built the docs locally and verified the rendered output end to end

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (docs)

@HuangZhenQiu (Contributor, Author)

(Screenshot: rendered documentation page, 2024-12-07)

@flinkbot (Collaborator) commented Dec 8, 2024

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build


# Native Lineage Support
Data lineage has gain more and more criticality in data ecosystem. As Apache Flink is widely used for data ingestion and ETL in Streaming Data Lake, we need
Contributor comment:

NITs
Data lineage has gain more and more criticality in data ecosystem
->
how about
As organisations look to govern their data ecosystems, understanding data lineage (where data is coming from and going to) becomes critical.

Contributor comment:

NIT: Lake -> Lakes

- `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
- `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.

Apache Flink provides a native lineage support for the community requirement by providing an internal lineage data model and [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) for
@davidradl (Contributor) commented Dec 9, 2024:

NIT: I suggest removing "for the community requirement".

developer to integrate lineage metadata into external lineage system, for example [OpenLineage](https://openlineage.io). When a job is created in Flink runtime, the JobCreatedEvent
Contributor comment:

NITs:
developer -> the developer
in Flink -> in the Flink


contains the Lineage Graph metadata will be sent to Job Status Listeners.
Contributor comment:

NIT: will -> that will
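
As a hedged illustration of this flow, here is a minimal sketch of a listener that pulls the lineage graph out of a `JobCreatedEvent` and forwards the dataset metadata to an external system. The interface names follow FLIP-314 and the current master branch; the exact packages may differ in a released version, and `report(...)` is a hypothetical hook:

```java
import org.apache.flink.core.execution.JobStatusChangedEvent;
import org.apache.flink.core.execution.JobStatusChangedListener;
import org.apache.flink.streaming.api.lineage.LineageDataset;
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.runtime.execution.JobCreatedEvent;

/** Sketch of a listener that reports lineage metadata when a job is created. */
public class LineageReportingListener implements JobStatusChangedListener {

    @Override
    public void onEvent(JobStatusChangedEvent event) {
        // Only the job-created event carries the lineage graph.
        if (event instanceof JobCreatedEvent) {
            JobCreatedEvent created = (JobCreatedEvent) event;
            for (LineageVertex source : created.lineageGraph().sources()) {
                for (LineageDataset dataset : source.datasets()) {
                    report("input", dataset.namespace(), dataset.name());
                }
            }
            for (LineageVertex sink : created.lineageGraph().sinks()) {
                for (LineageDataset dataset : sink.datasets()) {
                    report("output", dataset.namespace(), dataset.name());
                }
            }
        }
    }

    private void report(String direction, String namespace, String name) {
        // Hypothetical hook: replace with a client for your lineage backend,
        // e.g. one that emits OpenLineage run events.
        System.out.printf("%s dataset %s/%s%n", direction, namespace, name);
    }
}
```

Such a listener would be registered through a `JobStatusChangedListenerFactory`, as described on the Job Status Listener page.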


# Lineage Data Model
Flink native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connector, and the second layer defines
@davidradl (Contributor) commented Dec 9, 2024:

I suggest a picture showing the layers at a component level.


the extended interfaces for Table and DataStream independently. The interface and class relationship are defined in the diagram below.
Contributor comment:

NIT: relationship -> relationships


{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}
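
As a complement to the diagram, the two layers can also be sketched in code. The following is a paraphrase of the interfaces proposed in FLIP-314 (names and signatures may differ slightly in the released API):

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.api.connector.source.Boundedness;
import org.apache.flink.table.catalog.CatalogBaseTable;
import org.apache.flink.table.catalog.ObjectPath;

// Layer 1: generic interfaces shared by all Flink jobs and connectors.
interface LineageGraph {
    List<SourceLineageVertex> sources(); // source vertices of the job
    List<LineageVertex> sinks();         // sink vertices of the job
    List<LineageEdge> relations();       // source-to-sink relations
}

interface LineageVertex {
    List<LineageDataset> datasets(); // datasets read or written by this vertex
}

interface LineageEdge {
    LineageVertex source();
    LineageVertex sink();
}

interface LineageDataset {
    String name();
    String namespace();
    Map<String, LineageDatasetFacet> facets(); // extensible metadata, e.g. schema
}

interface LineageDatasetFacet {
    String name();
}

// Layer 2: extensions defined independently for DataStream and Table jobs.
interface SourceLineageVertex extends LineageVertex {
    Boundedness boundedness(); // whether the source is bounded or unbounded
}

interface TableLineageDataset extends LineageDataset {
    CatalogBaseTable table();
    ObjectPath objectPath();
}
```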

By default, Table related lineage interfaces or classes are mainly used in Flink Table Runtime, thus Flink users doesn't need to touch these interfaces. Flink community will gradually support all
@davidradl (Contributor) commented Dec 9, 2024:

I'm curious about the word "mainly" in the sentence "are mainly used in Flink Table Runtime".
NIT: doesn't -> do not
NIT: Flink community -> The Flink community

of common connectors, such as Kafka, JDBC, Cassandra, Hive and so on. If you have customized connector defined, you need to have customized source/sink implements the LineageVertexProvider interface.
Contributor comment:

NITs:
of common -> of the common
such as -> including
remove and so on.
have customized -> have a customized
implements the -> implementations of the


Within a LineageVertex, a list of Lineage Dataset is defined as metadata for Flink source/sink.
Contributor comment:

NITs:
Lineage Dataset is -> Lineage Datasets are
for Flink -> for the Flink

For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).
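
To illustrate, here is a minimal sketch of a custom source exposing its lineage metadata. `MyCustomSource`, the `mydb://` URI scheme, and the connection fields are hypothetical; the `LineageVertexProvider`, `LineageVertex`, and `LineageDataset` interfaces follow FLIP-314:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.flink.streaming.api.lineage.LineageDataset;
import org.apache.flink.streaming.api.lineage.LineageDatasetFacet;
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.api.lineage.LineageVertexProvider;

/** Sketch: a custom source that exposes the dataset it reads for lineage tracking. */
public class MyCustomSource implements LineageVertexProvider {
    // In a real connector this class would also implement the Source interface.
    // Hypothetical connection settings for the external system being read:
    private final String host = "db.example.com";
    private final int port = 5432;
    private final String table = "inventory.products";

    @Override
    public LineageVertex getLineageVertex() {
        final LineageDataset dataset = new LineageDataset() {
            @Override
            public String name() {
                return table; // see the naming conventions below
            }

            @Override
            public String namespace() {
                return "mydb://" + host + ":" + port;
            }

            @Override
            public Map<String, LineageDatasetFacet> facets() {
                return Collections.emptyMap(); // facets can carry schema and other metadata
            }
        };
        return new LineageVertex() {
            @Override
            public List<LineageDataset> datasets() {
                return Collections.singletonList(dataset);
            }
        };
    }
}
```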

# Naming Conventions
For each of Lineage Dataset, we need to define its own name and namespace to distinguish different data store and corresponding instance used in the connector of a Flink application.
@davidradl (Contributor) commented Dec 9, 2024:

NITs:
each of Lineage Dataset -> each of the Lineage Datasets
remove own
comma after namespace
data store -> data stores
instance -> instances
What is a data store? Can this be a link to the definition?

Maybe "corresponding instance used in the connector of a Flink application." -> corresponding dynamic table associated with a Flink connector.

| Data Store | Connector Type | Namespace | Name |
|------------|-----------------|----------------------------------------|----------------------------------------------------------|
| Kafka | Kafka Connector | kafka://{bootstrap server host}:{port} | topic |
| MySQL | JDBC Connector | mysql://{host}:{port} | {database}.{table} |
Contributor comment:

Does JDBC need apache/flink-connector-jdbc#149 to be merged?


# Native Lineage Support
Contributor comment:

Is this OpenLineage? If so, we should say that and link the spec in the text.

| DB2 | JDBC Connector | db2://{host}:{port} | {database}.{table} |
| CrateDB | JDBC Connector | cratedb://{host}:{port} | {database}.{table} |

It is a running table. More and more naming info will be added after lineage integration is finished for a specific connector.
@davidradl (Contributor) commented Dec 9, 2024:

I am struggling with the sentence "It is a running table. More and more naming info will be added after lineage integration is finished for a specific connector." I am not sure what you are trying to say. When you say "running table", do you mean that lineage relates to how data flows through a dynamic table at runtime?
I am not sure what "more and more naming" means. I assume that when the connector adds the lineage capability, it associates a name with a table source/sink vertex. Is there more we need to say around this?

I wonder if the connector information should be authored in the appropriate connector repo and brought into the core Flink docs.
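
To make the conventions concrete, here are the namespace and name values a connector would report for two of the rows above (the broker and database addresses are made up):

```java
// Kafka source reading topic "orders" from bootstrap server broker1:9092:
String kafkaNamespace = "kafka://broker1:9092"; // kafka://{bootstrap server host}:{port}
String kafkaName = "orders";                    // topic

// JDBC sink writing to MySQL table "daily_totals" in database "sales" on db1:3306:
String jdbcNamespace = "mysql://db1:3306";      // mysql://{host}:{port}
String jdbcName = "sales.daily_totals";         // {database}.{table}
```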
