
[FLINK-35745] add documentation for flink lineage #25762

Open · wants to merge 1 commit into base: master
Conversation

HuangZhenQiu (Contributor)

What is the purpose of the change

Add documentation for native lineage support in Flink, mainly for connector developers.

Brief change log

  • Add data_lineage.md under docs/internals for both English and Chinese
  • Improve the existing content in job_status_listener.md

Verifying this change

  • Built the docs locally and verified the rendered output end to end

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (docs)

@HuangZhenQiu (Contributor, Author)

(Screenshot: rendered documentation page, 2024-12-07)

@flinkbot (Collaborator) commented Dec 8, 2024

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build


# Native Lineage Support
Data lineage has gain more and more criticality in data ecosystem. As Apache Flink is widely used for data ingestion and ETL in Streaming Data Lake, we need
Contributor comment:

NITs
Data lineage has gain more and more criticality in data ecosystem
->
how about
As organisations look to govern their data ecosystems, understanding data lineage (where data is coming from and going to) becomes critical.

Contributor comment:

NIT: Lake -> Lakes

- `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
- `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.

Apache Flink provides a native lineage support for the community requirement by providing an internal lineage data model and [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) for
@davidradl (Contributor) commented Dec 9, 2024:

NIT: I suggest removing "for the community requirement".

developer to integrate lineage metadata into external lineage system, for example [OpenLineage](https://openlineage.io). When a job is created in Flink runtime, the JobCreatedEvent
Contributor comment:

NITs:
developer -> the developer
in Flink -> in the Flink


contains the Lineage Graph metadata will be sent to Job Status Listeners.
Contributor comment:

NIT: will -> that will
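
As a hedged illustration of this flow, here is a minimal sketch of a listener that pulls the lineage graph out of a `JobCreatedEvent` and forwards the dataset metadata to an external system. The interface names follow FLIP-314 and the current master branch; the exact packages may differ in a released version, and `report(...)` is a hypothetical hook:

```java
import org.apache.flink.core.execution.JobStatusChangedEvent;
import org.apache.flink.core.execution.JobStatusChangedListener;
import org.apache.flink.streaming.api.lineage.LineageDataset;
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.runtime.execution.JobCreatedEvent;

/** Sketch of a listener that reports lineage metadata when a job is created. */
public class LineageReportingListener implements JobStatusChangedListener {

    @Override
    public void onEvent(JobStatusChangedEvent event) {
        // Only the job-created event carries the lineage graph.
        if (event instanceof JobCreatedEvent) {
            JobCreatedEvent created = (JobCreatedEvent) event;
            for (LineageVertex source : created.lineageGraph().sources()) {
                for (LineageDataset dataset : source.datasets()) {
                    report("input", dataset.namespace(), dataset.name());
                }
            }
            for (LineageVertex sink : created.lineageGraph().sinks()) {
                for (LineageDataset dataset : sink.datasets()) {
                    report("output", dataset.namespace(), dataset.name());
                }
            }
        }
    }

    private void report(String direction, String namespace, String name) {
        // Hypothetical hook: replace with a client for your lineage backend,
        // e.g. one that emits OpenLineage run events.
        System.out.printf("%s dataset %s/%s%n", direction, namespace, name);
    }
}
```

Such a listener would be registered through a `JobStatusChangedListenerFactory`, as described on the Job Status Listener page.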


# Lineage Data Model
Flink native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connector, and the second layer defines
@davidradl (Contributor) commented Dec 9, 2024:

I suggest a picture showing the layers at a component level.


the extended interfaces for Table and DataStream independently. The interface and class relationship are defined in the diagram below.
Contributor comment:

NIT: relationship -> relationships


{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}
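
As a complement to the diagram, the two layers can also be sketched in code. The following is a paraphrase of the interfaces proposed in FLIP-314 (names and signatures may differ slightly in the released API):

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.api.connector.source.Boundedness;
import org.apache.flink.table.catalog.CatalogBaseTable;
import org.apache.flink.table.catalog.ObjectPath;

// Layer 1: generic interfaces shared by all Flink jobs and connectors.
interface LineageGraph {
    List<SourceLineageVertex> sources(); // source vertices of the job
    List<LineageVertex> sinks();         // sink vertices of the job
    List<LineageEdge> relations();       // source-to-sink relations
}

interface LineageVertex {
    List<LineageDataset> datasets(); // datasets read or written by this vertex
}

interface LineageEdge {
    LineageVertex source();
    LineageVertex sink();
}

interface LineageDataset {
    String name();
    String namespace();
    Map<String, LineageDatasetFacet> facets(); // extensible metadata, e.g. schema
}

interface LineageDatasetFacet {
    String name();
}

// Layer 2: extensions defined independently for DataStream and Table jobs.
interface SourceLineageVertex extends LineageVertex {
    Boundedness boundedness(); // whether the source is bounded or unbounded
}

interface TableLineageDataset extends LineageDataset {
    CatalogBaseTable table();
    ObjectPath objectPath();
}
```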

By default, Table related lineage interfaces or classes are mainly used in Flink Table Runtime, thus Flink users doesn't need to touch these interfaces. Flink community will gradually support all
@davidradl (Contributor) commented Dec 9, 2024:

I'm curious about the word "mainly" in the sentence "are mainly used in Flink Table Runtime".
NIT: doesn't -> do not
NIT: Flink community -> The Flink community

of common connectors, such as Kafka, JDBC, Cassandra, Hive and so on. If you have customized connector defined, you need to have customized source/sink implements the LineageVertexProvider interface.
Contributor comment:

NITs:
of common -> of the common
such as -> including
remove and so on.
have customized -> have a customized
implements the -> implementations of the


Within a LineageVertex, a list of Lineage Dataset is defined as metadata for Flink source/sink.
Contributor comment:

NITs:
Lineage Dataset is -> Lineage Datasets are
for Flink -> for the Flink

For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).
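
To illustrate, here is a minimal sketch of a custom source exposing its lineage metadata. `MyCustomSource`, the `mydb://` URI scheme, and the connection fields are hypothetical; the `LineageVertexProvider`, `LineageVertex`, and `LineageDataset` interfaces follow FLIP-314:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.flink.streaming.api.lineage.LineageDataset;
import org.apache.flink.streaming.api.lineage.LineageDatasetFacet;
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.api.lineage.LineageVertexProvider;

/** Sketch: a custom source that exposes the dataset it reads for lineage tracking. */
public class MyCustomSource implements LineageVertexProvider {
    // In a real connector this class would also implement the Source interface.
    // Hypothetical connection settings for the external system being read:
    private final String host = "db.example.com";
    private final int port = 5432;
    private final String table = "inventory.products";

    @Override
    public LineageVertex getLineageVertex() {
        final LineageDataset dataset = new LineageDataset() {
            @Override
            public String name() {
                return table; // see the naming conventions below
            }

            @Override
            public String namespace() {
                return "mydb://" + host + ":" + port;
            }

            @Override
            public Map<String, LineageDatasetFacet> facets() {
                return Collections.emptyMap(); // facets can carry schema and other metadata
            }
        };
        return new LineageVertex() {
            @Override
            public List<LineageDataset> datasets() {
                return Collections.singletonList(dataset);
            }
        };
    }
}
```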

# Naming Conventions
For each of Lineage Dataset, we need to define its own name and namespace to distinguish different data store and corresponding instance used in the connector of a Flink application.
@davidradl (Contributor) commented Dec 9, 2024:

NITs:
each of Lineage Dataset -> each of the Lineage Datasets
remove own
comma after namespace
data store -> data stores
instance -> instances
What is a data store? Can this be a link to the definition?

Maybe "corresponding instance used in the connector of a Flink application." -> corresponding dynamic table associated with a Flink connector.

| Data Store | Connector Type | Namespace | Name |
|------------|-----------------|----------------------------------------|----------------------------------------------------------|
| Kafka | Kafka Connector | kafka://{bootstrap server host}:{port} | topic |
| MySQL | JDBC Connector | mysql://{host}:{port} | {database}.{table} |
Contributor comment:

Does JDBC need apache/flink-connector-jdbc#149 to be merged?


# Native Lineage Support
Contributor comment:

Is this OpenLineage? If so, we should say that and link the spec in the text.

| DB2 | JDBC Connector | db2://{host}:{port} | {database}.{table} |
| CrateDB | JDBC Connector | cratedb://{host}:{port} | {database}.{table} |

It is a running table. More and more naming info will be added after lineage integration is finished for a specific connector.
@davidradl (Contributor) commented Dec 9, 2024:

I am struggling with the sentence "It is a running table. More and more naming info will be added after lineage integration is finished for a specific connector." I am not sure what you are trying to say. When you say "running table", do you mean that lineage relates to how data flows through a dynamic table at runtime?
I am not sure what "more and more naming" means. I assume that when the connector adds the lineage capability, it associates a name with a table source/sink vertex. Is there more we need to say around this?

I wonder if the connector information should be authored in the appropriate connector repo and brought into the core Flink docs.
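
To make the conventions concrete, here are the namespace and name values a connector would report for two of the rows above (the broker and database addresses are made up):

```java
// Kafka source reading topic "orders" from bootstrap server broker1:9092:
String kafkaNamespace = "kafka://broker1:9092"; // kafka://{bootstrap server host}:{port}
String kafkaName = "orders";                    // topic

// JDBC sink writing to MySQL table "daily_totals" in database "sales" on db1:3306:
String jdbcNamespace = "mysql://db1:3306";      // mysql://{host}:{port}
String jdbcName = "sales.daily_totals";         // {database}.{table}
```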
