doc: Add concept document for Bidirectional Data Transfer #1398

Merged
merged 4 commits into from
Jul 10, 2024
@@ -0,0 +1,85 @@
# Overview: Endpoint Topologies

Bidirectional data transfers involve transmissions that can be sent by either the provider or consumer during the
transfer's lifetime. The provider sends data over a forward channel, while the consumer uses a response channel to send
data related to the forward transmission. For example, a provider sends parts data over the forward channel, while the
consumer sends data related to errors in the forward transmission via the response channel.

Bidirectional data transfers should be modeled using a single Dataspace Protocol *offer* and *contract agreement*. In
other words, a single offer represents the ability to send both forward and response messages, while an active contract
agreement can be used to initiate the transfer.

Bidirectional flows can be implemented using a variety of wire protocols, for example, HTTP or a messaging layer.
However, all scenarios correspond to one of two endpoint topologies:

- The consumer offers the forward channel endpoint, and the provider offers the response channel endpoint.
- The provider offers both the forward and response channel endpoints.

The Dataspace Protocol (DSP) defines two categories of data transfer: *push* and *pull*. The endpoint topologies
correlate to these categories as follows:

| Provider Push | Consumer Pull |
|---------------------------------------------------------------------------------------------|------------------------------------------------------------|
| Consumer offers the forward channel endpoint; provider offers the response channel endpoint | Provider offers the forward and response channel endpoints |

**In both cases, the provider offers the response channel.**
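
The mapping in the table can be captured in a small lookup, shown here as a minimal sketch. The `Party` and `TransferCategory` names are hypothetical and exist only for illustration; they are not EDC or DSP types.

```java
/** Party that offers (exposes) a channel endpoint. Hypothetical type for illustration. */
enum Party { PROVIDER, CONSUMER }

/** The two DSP transfer categories and who offers each endpoint, following the table above. */
enum TransferCategory {
    // Provider push: consumer offers the forward endpoint, provider offers the response endpoint.
    PROVIDER_PUSH(Party.CONSUMER, Party.PROVIDER),
    // Consumer pull: provider offers both endpoints.
    CONSUMER_PULL(Party.PROVIDER, Party.PROVIDER);

    private final Party forwardEndpointOwner;
    private final Party responseEndpointOwner;

    TransferCategory(Party forwardEndpointOwner, Party responseEndpointOwner) {
        this.forwardEndpointOwner = forwardEndpointOwner;
        this.responseEndpointOwner = responseEndpointOwner;
    }

    Party forwardEndpointOwner() {
        return forwardEndpointOwner;
    }

    Party responseEndpointOwner() {
        return responseEndpointOwner;
    }
}
```

Only the owner of the forward endpoint differs between the two categories; the response endpoint owner is the provider in both.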

## The Data Plane

The Data Plane establishes data transfer communication channels and endpoints using a *wire protocol*. There are many
ways to do this, two of which are described below.

**HTTP Endpoints**

The forward and response channels are separate endpoints. The endpoints may be static, where all messages in a
particular direction are sent to the same endpoint, which then uses a correlation mechanism to process them, for
example, `https://test.com/forwardChannel` and `https://test.com/responseChannel`. Or, the endpoints may be dynamic,
where a path part contains a correlation ID, for example, `https://test.com/transferId/forwardChannel`
and `https://test.com/transferId/responseChannel`.
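
The difference between the two HTTP layouts can be illustrated with a short sketch. `ChannelEndpoints` is a hypothetical helper, not an EDC class, and it assumes the transfer ID and channel name are already URL-safe.

```java
import java.net.URI;

/** Hypothetical helper that builds channel URLs for the two HTTP layouts described above. */
final class ChannelEndpoints {
    private ChannelEndpoints() {
    }

    /** Static layout: one endpoint per direction; correlation must happen elsewhere (e.g. in the message). */
    static URI staticEndpoint(URI base, String channel) {
        return base.resolve("/" + channel);
    }

    /** Dynamic layout: the correlation ID (here the transfer ID) is a path segment. */
    static URI dynamicEndpoint(URI base, String transferId, String channel) {
        // Assumption: transferId contains only URL-safe characters; otherwise resolve() throws.
        return base.resolve("/" + transferId + "/" + channel);
    }
}
```

For the base `https://test.com`, `staticEndpoint(base, "forwardChannel")` produces the static forward URL from the text, and `dynamicEndpoint(base, "transferId", "responseChannel")` produces the dynamic response URL.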

**Queues and Pub/Sub**

In this scenario, the forward channel is a *queue* or a pub/sub *topic* while the response channel is a *queue*. This is
a typical architecture used when designing systems with Message-Oriented-Middleware.

### Required Changes to the Data Plane Framework

The required changes to the Data Plane Framework to support bidirectional data transfers are minimal.

#### Response Endpoint `DataAddress`

The `DataAddress` in the `DataFlowResponseMessage` must contain a `https://w3id.org/edc/v0.0.1/ns/responseChannel`
property of type `DataAddress`. This `DataAddress` follows the same format as the outer `DataAddress` and represents the
response channel endpoint. For example, it may contain authorization data the consumer uses to access the response
channel endpoint.
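
As a rough illustration of the nesting, the sketch below models a `DataAddress` as a plain map. Only the `responseChannel` property name comes from this document; the `type`, `endpoint`, and `authorization` keys and their values are assumptions chosen for the example, and the real `DataAddress` class is not used here.

```java
import java.util.Map;

/** Illustrative only: models the nested response channel address with plain maps. */
final class ResponseChannelAddressExample {
    static final String EDC_NS = "https://w3id.org/edc/v0.0.1/ns/";
    static final String RESPONSE_CHANNEL = EDC_NS + "responseChannel";

    private ResponseChannelAddressExample() {
    }

    static Map<String, Object> forwardAddressWithResponseChannel() {
        // Inner address: same shape as the outer one (property names are assumed, not normative).
        Map<String, Object> responseChannel = Map.of(
                EDC_NS + "type", "HttpData",
                EDC_NS + "endpoint", "https://test.com/transferId/responseChannel",
                EDC_NS + "authorization", "<response-channel-token>");

        // Outer (forward) address carrying the response channel as a nested property.
        return Map.of(
                EDC_NS + "type", "HttpData",
                EDC_NS + "endpoint", "https://test.com/transferId/forwardChannel",
                RESPONSE_CHANNEL, responseChannel);
    }
}
```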

#### The DataPlaneManager

The `DataPlaneManagerImpl` and its collaborators will need to be refactored to generate response
channel `DataAddresses`:

- `DataPlaneManagerImpl` must be modified to return an EDR in the case of a provider PUSH. This EDR will only contain
a `https://w3id.org/edc/v0.0.1/ns/responseChannel` entry. The manager will delegate to `DataPlaneAuthorizationService`
to generate the response.
- `DataPlaneAuthorizationServiceImpl` must be enhanced to support `responseChannel` generation. This should be keyed off
the transfer type. As part of this process, `DataPlaneAuthorizationServiceImpl.createEndpointDataReference` must
generate a `responseChannel` endpoint by delegating to a new
method, `PublicEndpointGeneratorService.generateCallbackFor(sourceDataAddress)`. Access tokens can be generated
Contributor

What exactly is sourceDataAddress in this case? I struggle with the synchronicity between the push and pull scenarios here.

Contributor Author

Understanding this requires specialized EDC knowledge. The sourceDataAddress is the reference to the backend asset being transferred. This address is internal to the EDC deployment and not available externally (e.g., to a consumer). The endpoint generation service is responsible for interpreting the address and mapping it to a publicly available endpoint that is associated with retrieving the data.

from `DataPlaneAccessTokenService`.
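
The delegation described in this section could look roughly like the sketch below for the provider PUSH case. It deliberately avoids the real EDC interfaces: the `Function` stands in for the proposed `PublicEndpointGeneratorService.generateCallbackFor(sourceDataAddress)`, the `Supplier` stands in for token creation via `DataPlaneAccessTokenService`, and the `endpoint` and `authorization` keys are assumptions. Actual signatures in EDC will differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical stand-in for the authorization service's response channel generation. */
final class ResponseChannelSketch {
    static final String RESPONSE_CHANNEL = "https://w3id.org/edc/v0.0.1/ns/responseChannel";

    // Stand-in for the proposed generateCallbackFor(sourceDataAddress) method.
    private final Function<Map<String, String>, String> callbackGenerator;
    // Stand-in for access token generation.
    private final Supplier<String> tokenSource;

    ResponseChannelSketch(Function<Map<String, String>, String> callbackGenerator,
                          Supplier<String> tokenSource) {
        this.callbackGenerator = callbackGenerator;
        this.tokenSource = tokenSource;
    }

    /** Provider PUSH: the returned EDR-like map contains only the responseChannel entry. */
    Map<String, Object> createPushEdr(Map<String, String> sourceDataAddress) {
        Map<String, Object> responseChannel = new LinkedHashMap<>();
        // Map the internal source address to a publicly reachable callback endpoint.
        responseChannel.put("endpoint", callbackGenerator.apply(sourceDataAddress));
        // Attach the credential the consumer presents when using the response channel.
        responseChannel.put("authorization", tokenSource.get());

        Map<String, Object> edr = new LinkedHashMap<>();
        edr.put(RESPONSE_CHANNEL, responseChannel);
        return edr;
    }
}
```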

#### Technical Considerations

The above changes can work with both DSP pull and push scenarios. However, it is important to note a potential race
condition that could be introduced in PUSH transfers. Namely, provider-pushed data could potentially arrive before the
DSP start message containing the response channel `DataAddress` is received by the client. This is due to the nature of
asynchronous communications. In this case, the client would either need to skip sending a response or store the response
messages to send when it receives the response channel `DataAddress`.
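
The second option, storing responses until the response channel `DataAddress` arrives, can be sketched as follows. `BufferingResponseSender` is a hypothetical client-side helper; the transport callback and the use of plain strings for addresses and messages are simplifications made for this example.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiConsumer;

/** Hypothetical helper: holds response messages until the response channel address is known. */
final class BufferingResponseSender {
    private final Queue<String> pending = new ArrayDeque<>();
    // Transport callback taking (responseChannelAddress, message); e.g. an HTTP POST in practice.
    private final BiConsumer<String, String> transport;
    // Null until the DSP start message carrying the response channel address has been processed.
    private String responseAddress;

    BufferingResponseSender(BiConsumer<String, String> transport) {
        this.transport = transport;
    }

    /** Sends immediately if the address is known; otherwise stores the message for later. */
    synchronized void send(String message) {
        if (responseAddress == null) {
            pending.add(message);
            return;
        }
        transport.accept(responseAddress, message);
    }

    /** Called when the start message arrives; flushes stored messages in arrival order. */
    synchronized void onResponseChannelAddress(String address) {
        responseAddress = address;
        String message;
        while ((message = pending.poll()) != null) {
            transport.accept(address, message);
        }
    }
}
```

An unbounded buffer is used here for brevity. A real client would likely cap the buffer or drop messages, which corresponds to the "skip sending a response" option.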

Contributor

Some remarks:

  1. The concept concentrates on the interaction between two connectors and how they exchange the data. I would expect some information on how the data is processed, i.e., how it is stored and connected to the transfer, as well as how it can be retrieved from the connectors.
  2. Your remark on the race condition in the push case makes clear that there are separate messages/streams: one that pushes the data to the consumer and one that contains the return channel. I assume that the latter is part of the negotiation the two data planes perform to start the transfer. Is this the relation to DataPlaneManagerImpl, which would, in the push case, create an EDR token? I assume that this addition means the EDR representing the response channel is simply an add-on to the existing protocol message.
  3. Is there any implication for a combination where two data planes communicate, but only one of them is capable of handling a response channel? If the consumer side wants to use that, I do not see an issue, but in the other direction, the consumer suddenly gets more information. Does it handle that gracefully?

Contributor Author

@jimmarino jimmarino Jul 5, 2024

Comments:

  1. How the data is processed does not concern the fact that communication is bi-directional. Bi-directional channels do not need to be concerned with what happens to the data after it is received or any qualities of service (e.g., reliability) associated with the particular wire protocol.
  2. The transfer type is advertised in the DCAT Distribution linked to the Offer, and that carries the fact that the wire protocol is bi-directional. Hence, there is a need for Catena-X (or another dataspace/project, etc.) to standardize a transfer type. The response channel endpoint information is contained within the forward DataAddress.
  3. A client data plane must support one of the wire protocols associated with an offer via DCAT Distributions. Otherwise, it will not have access to the data.

Contributor

Do I understand correctly that a data transfer with a response channel would require a new transfer type, i.e., it doubles the number of transfer types?

As far as I understood the original requirement, the channel is about giving feedback on the received data, e.g., to indicate that the data quality is poor. Is this really related to the data transfer? What is actually observed is a mismatch on the consumer side between the expectations based on the offer and the concrete data received. Wouldn't that be a concept on the DSP level, as the feedback is about contract fulfillment?

If the data is broken or incomplete, the consumer could simply reinitiate the data transfer, so that is not really a reason to use the response channel, right? So it is really about a higher-level concept on the received data, imho.

On the other hand, there could be many data transfers on the same contract, so if one transfer leads to poor quality, there is reason not to mark the whole contract with the feedback issue.

Still, the concept only describes a way of sending data back to the provider, but the intention of the requirement was to give feedback on the received data. In my opinion, this still requires a reaction to the data on the provider side, something like a label on the data transfer or a special state. Even if the message is not formalized at all, an indicator that there is feedback on the data transfer should be part of the concept. In the current state, the relation between the sent data and the metadata on the feedback channel gets lost after it is received.

Contributor Author

No, the concept is about how to represent a bidirectional data transfer. It does not involve qualities of service such as reliability, which involve retransmission (for example, all reliable messaging protocols require idempotency). Qualities of service are implemented by the underlying wire protocol used for the forward and response channels, for example, AMQP. The response channel would never be used to send quality of service information back to the provider. Rather, one use could be to send information about errors in the data sent via the forward channel.

The scope of this concept should be only to describe how forward and back channels are established between a consumer and producer. It should not discuss what purposes clients and producers use those channels for. That is the job of the particular transfer protocol that would use this feature.

The response channel lifetime is tied to the forward channel. For example, when the forward channel is closed, the
response channel will also be closed.
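
One way to picture the coupled lifetime is a wrapper that closes both channels together, as in the minimal sketch below. `BidirectionalTransfer` and the `Runnable` close hooks are hypothetical; how channels are actually closed depends on the wire protocol.

```java
/** Hypothetical wrapper that ties the response channel's lifetime to the forward channel. */
final class BidirectionalTransfer implements AutoCloseable {
    private final Runnable closeForward;
    private final Runnable closeResponse;
    private boolean closed;

    BidirectionalTransfer(Runnable closeForward, Runnable closeResponse) {
        this.closeForward = closeForward;
        this.closeResponse = closeResponse;
    }

    boolean isClosed() {
        return closed;
    }

    /** Closing the transfer closes the forward channel and then the response channel. */
    @Override
    public void close() {
        if (closed) {
            return; // Idempotent: a second close does nothing.
        }
        closed = true;
        try {
            closeForward.run();
        } finally {
            // The response channel is closed even if closing the forward channel fails.
            closeResponse.run();
        }
    }
}
```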

## Catena-X Standardization and Tractus-X Support

To achieve interoperability, Catena-X would need to standardize a bidirectional transfer type similar to its support of
HTTP push/pull and S3 types. This could then be implemented in Tractus-X EDC.
