configuration documentation #5

Merged
merged 11 commits into from
Aug 10, 2020
233 changes: 233 additions & 0 deletions docs/configuration_data_model.md
@@ -0,0 +1,233 @@
# Conduit Configuration Data Model

## Requirements

This section enumerates the full set of features we expect to give to these models. Only the ones with the (**MVP**) tag are to be included in the MVP.

### Persona: UI user

1. Test Connection (**MVP**)
1. Discover Schema (**MVP**)
1. Discover Schema with complex configuration (e.g. multi-nested file systems)
1. Sync Data
    1. Full refresh
    1. Append only - no concept of a primary key, simply adds new data to the end of a table. (**MVP**)
    1. Full deltas - detects when a record is already present in the data set and updates it.
    1. Historical mode - detects when a record is already present, groups it on a primary key, but retains old and new versions of the record. ([Fivetran historical mode docs](https://fivetran.com/docs/getting-started/feature/history-mode))
1. Support for "pull" connections. (**MVP**)
    1. These are all connections that can be polled.
1. Support for "push" connections.
    1. Fivetran supports push connections that accept data when the data provider emits the data (instead of polling for it).
1. Scheduled syncs
    1. Every X minutes / hours / days (**MVP**)
    1. Full Linux crontab scheduling
1. Ability to use any Singer tap / target by providing existing config, catalog, and state. (**MVP**)???
1. Transformations - allow basic transformations, e.g. upper-casing, column name changes, hashing of values, etc. Otherwise, data will be transported "as is".
1. Determine when a record was last synced in the target warehouse
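
To make the difference between the sync modes listed above concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the planned implementation: tables are plain lists of dicts, the `is_current` column is a hypothetical marker for the newest version of a record, and each batch is assumed to contain a given key at most once.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def append_only(table: List[Record], batch: List[Record]) -> List[Record]:
    """Append only: no primary key, new rows simply go at the end."""
    return table + batch


def full_deltas(table: List[Record], batch: List[Record], key: str) -> List[Record]:
    """Full deltas: a record whose key is already present is updated."""
    merged = {row[key]: row for row in table}
    for row in batch:
        merged[row[key]] = row  # overwrite the old version, or add a new key
    return list(merged.values())


def historical(table: List[Record], batch: List[Record], key: str) -> List[Record]:
    """Historical mode: keep old versions, flag only the newest as current."""
    incoming = {row[key] for row in batch}
    kept = [
        {**row, "is_current": row.get("is_current", True) and row[key] not in incoming}
        for row in table
    ]
    return kept + [{**row, "is_current": True} for row in batch]
```

With one existing row and a batch that re-sends that row plus a new one, `append_only` yields three rows, `full_deltas` yields two, and `historical` yields three with the superseded version flagged as no longer current.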

### Persona: OSS Contributor

1. Add a source _without_ needing to write HTML. They should be responsible for only 2 things:
    1. Define Configuration: define a JSON object that describes which properties need to be collected from the user. The UI then figures out how to render it.
cgardens marked this conversation as resolved.
    1. Implement: `testConnection`, `discoverSchema`, and `sync`. These functions should rely only on the configurations defined in the JSON and should return objects that match the interfaces described below.
Contributor:

I wonder if we should have more granularity in the sync

Contributor Author:

can you say more? you think the configuration should be more granular? or are you talking about splitting up the steps in the sync step more? if the latter, we can figure that out in sherif's work doc.

    1. (Note: Not doing this means that we need to create custom HTML pages for each integration.)
1. Support "easy" integration of Singer taps
    1. A well-documented path that is easy to follow for the creator of a Singer tap / target.
1. Documentation on how to contribute. Also describes the interface that the contributor must code against. (**MVP**)
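
As a rough illustration of the "define a JSON object, let the UI render it" requirement, the Python sketch below derives generic form-field descriptors from a json-schema-style configuration. The `WIDGETS` mapping and the descriptor keys (`name`, `label`, `widget`, `required`) are assumptions made for this example, not part of the proposed interface.

```python
from typing import Any, Dict, List

# Hypothetical mapping from json-schema types to UI widget kinds.
WIDGETS = {"string": "text", "integer": "number", "boolean": "checkbox"}


def form_fields(schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the top-level properties of a schema into form-field descriptors."""
    required = set(schema.get("required", []))
    fields = []
    for name, spec in schema.get("properties", {}).items():
        fields.append({
            "name": name,
            "label": spec.get("title", name),  # fall back to the property name
            "widget": WIDGETS.get(spec.get("type"), "text"),
            "required": name in required,
        })
    return fields
```

A contributor would then only supply the schema, and a single generic form component could render any integration without custom HTML.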

## User Flow

The basic flow will go as follows:
* Insert credentials for a source.
* Receive feedback on whether Dataline was able to reach the source with the given credentials.
* Insert credentials for a destination.
* Receive feedback on whether Dataline was able to reach the destination with the given credentials.
* Show intent to connect source to destination.
Contributor:

What does this point mean?

Contributor Author:

lol. yeah awkward phrasing. all steps so far have just been source or destination. this step is where you say i want to connect source X to destination Y.

* Receive the schema of the source.
* Select which part of the schema will be synced.
* Trigger a manual sync or input the schedule on which syncs should take place.
Contributor:

in the case the user triggers a manual sync, is this saying the line would be a one-time transient transfer?

Contributor Author:

it would just attempt to run a sync (using whatever existing configuration is, full_refresh or append).


## Source
Contributor:

Can we add a section about supported sources/destinations? here are some imo good MVP candidates:
Sources:

  1. Postgres
  2. S3 CSV
  3. MySQL

Destinations:

  1. BigQuery
  2. RedShift
  3. Postgres
  4. MySQL

Contributor:

We should also have one SaaS source

Contributor Author:

the goal here is to describe how configuration works, can we keep this conversation in the reqs doc: https://docs.google.com/document/d/1X6M3qhbg9E9adykdI8KmO3xV7mr0XK7O3jLbN5Z7ydw/edit#?


### Source Types

#### SourceConnectionConfiguration

Any credentials needed to establish a connection with the data source. This configuration will look different for each source. Dataline only enforces that it is valid json-schema. Here is an example of what one might look like for a Postgres tap.

```json
{
  "description": "all configuration information needed for creating a connection.",
  "type": "object",
  "required": ["host", "port", "user"],
  "properties": {
    "host": {
      "type": "string",
      "format": "hostname"
    },
    "port": {
      "type": "integer"
    },
    "user": {
      "type": "string",
      "minLength": 1,
      "maxLength": 63
    },
    "password": {
      "type": "string",
      "minLength": 1,
      "maxLength": 63
    },
    "database": {
      "type": "string"
    },
    "sshConnection": {
      "oneOf": [
        {
          "title": "https",
          "type": "null"
        },
        {
          "title": "ssh",
          "type": "object",
          "properties": {
            "sshHost": {
              "title": "ssh host",
              "type": "string"
            },
            "sshPort": {
              "title": "ssh port",
              "type": "integer"
            },
            "sshUser": {
              "title": "ssh user",
              "type": "string"
            },
            "publicKey": {
              "title": "public key",
              "type": "string"
            }
          }
        }
      ]
    }
  }
}
```
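
To show how such a schema constrains what a user submits, here is a deliberately small Python sketch that checks only `required`, primitive `type`, and string length bounds. It is not a full json-schema validator (a real implementation would use a proper library), and `SCHEMA` is a trimmed copy of the example above; `check_config` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Trimmed copy of the example schema, for illustration only.
SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["host", "port", "user"],
    "properties": {
        "host": {"type": "string"},
        "port": {"type": "integer"},
        "user": {"type": "string", "minLength": 1, "maxLength": 63},
    },
}

PYTHON_TYPES = {"string": str, "integer": int, "object": dict}


def check_config(schema: Dict[str, Any], config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = [
        f"missing required property: {name}"
        for name in schema.get("required", [])
        if name not in config
    ]
    for name, value in config.items():
        spec = schema.get("properties", {}).get(name, {})
        expected = PYTHON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (not isinstance(value, expected) or isinstance(value, bool)):
            errors.append(f"{name}: expected {spec['type']}")
        elif expected is str:
            low = spec.get("minLength", 0)
            high = spec.get("maxLength", len(value))
            if not low <= len(value) <= high:
                errors.append(f"{name}: length out of range")
    return errors
```

For instance, a configuration that passes the port as the string `"5432"` and omits `user` would be reported with two problems rather than being handed to the tap.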

#### StandardConnectionStatus
Contributor:

Why is it prefixed with Standard?

Contributor Author:

so the configuration for a connection is not standard, but i was imagining that when you implement a test connection check you need to return something against a standard interface (i.e. it connected or it didn't). is there something you feel is non standard here?

Contributor Author:

or is the question just what standard means? it means it's a configuration / interface that is the same for all taps or targets.


This is the output of the `testConnection` method. It is the same schema for ALL taps.

The type declaration can be found [here](../conduit-config/src/main/resources/json/StandardConnectionStatus.json).

#### StandardDiscoveryOutput

This is the output of the `discoverSchema` method. It is the same schema for ALL taps.

It includes a `schema` field, whose type will be reused elsewhere.

The type declaration can be found [here](../conduit-config/src/main/resources/json/StandardDiscoveryOutput.json).

### Source Methods

The source object needs to be able to do 2 things:

#### testConnection

Tests that the docker image can reach that source given the information provided by the user.

```
testConnection(SourceConnectionConfiguration) => StandardConnectionStatus
```
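
A rough Python sketch of this contract, under stated assumptions: it only checks that a TCP connection to `host`/`port` can be opened (it does not verify credentials), and the returned `status` / `message` keys are hypothetical stand-ins for whatever `StandardConnectionStatus.json` actually declares. The `connect` parameter exists only so the function can be exercised without a network.

```python
import socket
from typing import Any, Callable, Dict


def check_connection(
    config: Dict[str, Any],
    connect: Callable[..., Any] = socket.create_connection,
) -> Dict[str, str]:
    """Python rendering of testConnection: try to reach the configured host."""
    try:
        conn = connect((config["host"], config["port"]), timeout=5)
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return {"status": "failure", "message": str(exc)}
    conn.close()
    return {"status": "success", "message": ""}
```

Whatever the source-specific configuration looks like, the caller only ever sees the same two-field result, which is what lets the UI report success or failure uniformly.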

#### discoverSchema

Detects the schema that exists in the data source. We want the output to be standardized for easy consumption by the UI.

(note: if irrelevant to an integration, this can be a no op)

(note: we will need to write a converter to and from the Singer `catalog.json` format)

```
discoverSchema(SourceConnectionConfiguration) => StandardDiscoveryOutput
```
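
The converter mentioned in the note could look roughly like the Python sketch below, which maps a Singer catalog (`streams`, each with a `stream` name and a json-schema `schema`) onto a table/column structure. The output shape (`schema.tables[].columns[]` with `name` and `dataType`) is an assumption made for illustration; the actual shape is whatever `StandardDiscoveryOutput.json` declares.

```python
from typing import Any, Dict, List


def from_singer_catalog(catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a Singer catalog into a hypothetical standard discovery output."""
    tables: List[Dict[str, Any]] = []
    for stream in catalog.get("streams", []):
        columns = []
        properties = stream.get("schema", {}).get("properties", {})
        for name, spec in properties.items():
            # Singer types are often lists such as ["null", "integer"].
            # Defaulting a missing type to "string" is an assumption.
            types = spec.get("type", "string")
            if isinstance(types, str):
                types = [types]
            non_null = [t for t in types if t != "null"]
            columns.append({"name": name, "dataType": non_null[0] if non_null else "null"})
        tables.append({"name": stream["stream"], "columns": columns})
    return {"schema": {"tables": tables}}
```

The reverse direction (standard output back to a Singer catalog) would need to restore nullability and stream metadata, which this sketch discards.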

## Destination

### Destination Types

#### DestinationConnectionConfiguration

Same as [SourceConnectionConfiguration](#SourceConnectionConfiguration) but for the destination.

### Destination Methods

#### testConnection

Tests that the docker image can reach that destination given the information provided by the user.

```
testConnection(DestinationConnectionConfiguration) => StandardConnectionStatus
```

## Connection

### Connection Types

#### StandardSyncConfiguration

Configuration that is the SAME for all tap / target combinations. Describes the sync mode (full refresh or append) as well as which part of the schema will be synced.

The type declaration can be found [here](../conduit-config/src/main/resources/json/StandardSyncConfiguration.json).

(note: we may need to add some notion that some sources or destinations are only compatible with full_refresh)

#### StandardSyncSummary

This object tracks metadata on where the run ended. Our hope is that it can replace the State object (see [below](#State)) entirely. The reason to define this type now is so that in the UI we can provide feedback to the user on where the sync has gotten to.

The type declaration can be found [here](../conduit-config/src/main/resources/json/StandardSyncSummary.json).

#### State

This field will be treated as a json blob that will _only_ be used inside the implementation of the integration. This is our escape hatch for any special state that needs to be tracked for specific taps.

#### StandardScheduleConfiguration

This object defines the schedule for a given connection. It is the same for all taps / targets.

The type declaration can be found [here](../conduit-config/src/main/resources/json/StandardSyncSchedule.json).
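
For the "every X minutes / hours / days" case, the schedule arithmetic might be sketched in Python as follows. The field names `timeUnit` and `units` are hypothetical; the real ones are defined in the type declaration linked above, and crontab-style schedules are out of scope for this sketch.

```python
from datetime import datetime, timedelta
from typing import Any, Dict

SECONDS_PER_UNIT = {"minutes": 60, "hours": 3600, "days": 86400}


def next_run(schedule: Dict[str, Any], last_run: datetime) -> datetime:
    """Earliest time at which the next sync may start."""
    seconds = schedule["units"] * SECONDS_PER_UNIT[schedule["timeUnit"]]
    return last_run + timedelta(seconds=seconds)


def is_due(schedule: Dict[str, Any], last_run: datetime, now: datetime) -> bool:
    """True once the configured interval has elapsed since the last run."""
    return now >= next_run(schedule, last_run)
```

Because the schedule is independent of any tap or target, a single scheduler loop could evaluate `is_due` for every connection.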

### Connection Methods

The connected source object needs to be able to do 2 things:

#### (manual) sync

This includes detecting whether there is in fact new data to sync. If there is, it transfers it to the destination.

```
sync(
    SourceConnectionConfiguration,
    DestinationConnectionConfiguration,
    StandardSyncConfiguration,
    StandardSyncSummary,
    State
) => [StandardSyncSummary, State]
```
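
A minimal in-memory Python sketch of this contract, to show how the summary and state flow from one run to the next. Several things here are assumptions: the two connection configurations are replaced by plain lists standing in for the source and destination, `syncMode`, `recordsSynced`, and `cursor` are hypothetical field names, and the source is assumed to be ordered and append-only so that a row offset can serve as the cursor.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def sync(
    source_rows: List[Record],
    destination_rows: List[Record],
    sync_config: Dict[str, Any],
    summary: Dict[str, Any],
    state: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Copy rows to the destination and return the new summary and state."""
    if sync_config["syncMode"] == "full_refresh":
        new_rows = list(source_rows)
        destination_rows.clear()  # replace everything on each run
    else:  # "append": only rows after the saved cursor are new
        new_rows = source_rows[state.get("cursor", 0):]
    destination_rows.extend(new_rows)
    new_summary = {**summary, "recordsSynced": len(new_rows)}
    new_state = {"cursor": len(source_rows)}
    return new_summary, new_state
```

Feeding the returned summary and state back into the next call is what lets an append sync skip data it has already transferred.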

#### scheduleSync

This feature will require some additional configuration that will be standard across all pull sources. Syncs triggered by a scheduled sync will consume all of the same configuration as the manual sync.

```
scheduleSync(
    StandardScheduleConfiguration,
    SourceConnectionConfiguration,
    DestinationConnectionConfiguration,
    StandardSyncConfiguration,
    StandardSyncSummary,
    State
) => void
```