postgres cdc docs #2784

Merged
merged 11 commits into from
Apr 7, 2021
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -86,6 +86,7 @@
* [Incremental Sync](architecture/incremental.md)
* [Workers & Jobs](architecture/jobs.md)
* [Technical Stack](architecture/tech-stack.md)
* [Change Data Capture (CDC)](architecture/cdc.md)
* [Contributing to Airbyte](contributing-to-airbyte/README.md)
* [Code of Conduct](contributing-to-airbyte/code-of-conduct.md)
* [Developing Locally](contributing-to-airbyte/developing-locally.md)
33 changes: 33 additions & 0 deletions docs/architecture/cdc.md
@@ -0,0 +1,33 @@
# Change Data Capture (CDC)

## What is log-based incremental replication?

Review comment (Contributor): Very clear!

Many common databases support writing all record changes to log files for the purpose of replication. A consumer of these log files (such as Airbyte) can read them while keeping track of its current position in the log, and so capture all record changes coming from `DELETE`/`INSERT`/`UPDATE` statements.

## Syncing
The orchestration for syncing is similar to non-CDC database sources. After you select a sync interval, syncs are launched on that schedule. Each sync reads data from the log up to the time that the sync was started. We do not treat CDC sources as infinite streaming sources, so you should ensure that your syncs run frequently enough to consume the logs that are generated. The first time the sync is run, a snapshot of the current state of the data is taken. This is done using `SELECT` statements and is effectively a Full Refresh. Subsequent syncs use the logs to determine which changes took place since the last sync and replicate only those changes. Airbyte keeps track of the current log position between syncs.
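
The scheduling behavior described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not Airbyte's implementation: `SyncState`, `run_sync`, and the integer log positions are hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncState:
    # Hypothetical saved state: None until the first (snapshot) sync has run.
    log_position: Optional[int] = None


def run_sync(table_rows, change_log, state):
    """Simulate one scheduled sync and return (records, new_state).

    table_rows: current table contents as a list of dicts.
    change_log: list of (position, change) tuples ordered by position.
    """
    # A sync only reads changes that already exist when it starts.
    if change_log:
        end_position = change_log[-1][0]
    else:
        end_position = state.log_position or 0

    if state.log_position is None:
        # First sync: snapshot the table, effectively a Full Refresh.
        records = list(table_rows)
    else:
        # Later syncs: only the changes after the saved position.
        records = [
            change
            for position, change in change_log
            if state.log_position < position <= end_position
        ]
    return records, SyncState(log_position=end_position)
```

Calling `run_sync` repeatedly with the state returned by the previous call mirrors how a log position is carried from one sync to the next.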

A single sync might have some tables configured for Full Refresh replication and others for Incremental. If CDC is configured at the source level, all tables with Incremental selected will use CDC. All Full Refresh tables will replicate using the same process as non-CDC sources. However, these tables will still include CDC metadata columns by default.

The Airbyte Protocol outputs records from sources. Records from `UPDATE` statements appear the same way as records from `INSERT` statements. We support different options for how to sync this data into destinations using primary keys, so you can choose to append this data, delete in place, etc.

We add some metadata columns for CDC sources:
* `ab_cdc_lsn` is the point in the log where the record was retrieved
* `ab_cdc_updated_at` is the timestamp for the database transaction that resulted in this record change and is present for records from `DELETE`/`INSERT`/`UPDATE` statements
* `ab_cdc_deleted_at` is the timestamp for the database transaction that resulted in this record change and is only present for records from `DELETE` statements
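
As a rough illustration of how these columns can be used downstream, the sketch below folds CDC records into a table keyed by primary key. It assumes records arrive in log order and that the primary key column is named `id`; the function name `apply_cdc_records` is hypothetical and this is not how any particular Airbyte destination is implemented.

```python
def apply_cdc_records(snapshot, records, primary_key="id"):
    """Fold CDC records into a dict keyed by primary key.

    Records with a non-null `ab_cdc_deleted_at` remove the row;
    all other records insert or overwrite it.
    Assumes `records` are already in log order.
    """
    for record in records:
        key = record[primary_key]
        if record.get("ab_cdc_deleted_at") is not None:
            # Record produced by a DELETE statement: drop the row if present.
            snapshot.pop(key, None)
        else:
            # INSERT and UPDATE records look the same: keep the latest version.
            snapshot[key] = record
    return snapshot
```

An append-style destination would instead keep every record and leave this folding step to a later query.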

## Limitations
* CDC incremental is only supported for tables with primary keys. A CDC source can still replicate tables without primary keys as Full Refresh, or a separate non-CDC source can be configured for the same database to replicate those tables using standard incremental replication.
* Data must be in tables, not views.
* The modifications you are trying to capture must be made using `DELETE`/`INSERT`/`UPDATE`. For example, changes made by `TRUNCATE`/`ALTER` won't appear in the logs and therefore won't appear in your destination.
* We do not support schema changes automatically for CDC sources. We recommend resetting and resyncing data if you make a schema change.
* There are database-specific limitations. See the documentation pages for individual connectors for more information.
* The records produced by `DELETE` statements only contain primary keys. All other data fields are unset.

## Current Support
* [Postgres](../integrations/sources/postgres.md)

## Coming Soon
* [MySQL](../integrations/sources/mysql.md)
* [SQL Server / MSSQL](../integrations/sources/mssql.md)
* Oracle DB
* Please [create a ticket](https://github.com/airbytehq/airbyte/issues/new/choose) if you need CDC support on another database!
3 changes: 3 additions & 0 deletions docs/faq/technical-support.md
@@ -70,3 +70,6 @@ Depending on your Docker network configuration, you may not be able to connect t

If you are running into connection refused errors when running Airbyte via Docker Compose on Mac, try using `host.docker.internal` as the host. On Linux, you may have to modify `docker-compose.yml` and add a host that maps to your local machine using [`extra_hosts`](https://docs.docker.com/compose/compose-file/compose-file-v3/#extra_hosts).

## **Do you support change data capture (CDC) or logical replication for databases?**

We currently support [CDC for Postgres 10+](../integrations/sources/postgres.md). We are adding support for a few other databases in April/May 2021.
81 changes: 79 additions & 2 deletions docs/integrations/sources/postgres.md
@@ -51,8 +51,8 @@ Postgres data types are mapped to the following data types when synchronizing da
| :--- | :--- |
| Full Refresh Sync | Yes |
| Incremental - Append Sync | Yes |
| Replicate Incremental Deletes | Coming soon |
| Logical Replication \(WAL\) | Coming soon |
| Replicate Incremental Deletes | Yes |
| Logical Replication \(WAL\) | Yes |
| SSL Support | Yes |
| SSH Tunnel Connection | Coming soon |

@@ -97,5 +97,82 @@ GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO airbyte;
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO airbyte;
```

#### 3. Set up CDC \(Optional\)

Please read the section on CDC below for more information.

#### 4. That's it!

Your database user should now be ready for use with Airbyte.

## Change Data Capture (CDC) / Logical Replication / WAL Replication
We use [logical replication](https://www.postgresql.org/docs/10/logical-replication.html) of the Postgres write-ahead log (WAL) to incrementally capture deletes using the `pgoutput` plugin.

We do not require installing custom plugins like `wal2json` or `test_decoding`. We use `pgoutput`, which is included in Postgres 10+ by default.

Please read the [CDC docs](../../architecture/cdc.md) for an overview of how Airbyte approaches CDC.

### Should I use CDC for Postgres?
* If you need a record of deletions and can accept the limitations posted below, you should use CDC for Postgres.
* If your data set is small and you just want a snapshot of your table in the destination, consider using Full Refresh replication for your table instead of CDC.
* If the limitations prevent you from using CDC and your goal is to maintain a snapshot of your table in the destination, consider using non-CDC incremental and occasionally resetting the data and re-syncing.

### CDC Limitations
* Make sure to read our [CDC docs](../../architecture/cdc.md) to see limitations that impact all databases using CDC replication.
* CDC is only available for Postgres 10+.
* Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot.
* Log-based replication only works for master instances of Postgres.
* Using logical replication increases disk space used on the database server. The additional data is stored until it is consumed.

### Setting up CDC for Postgres

Follow one of these guides to enable logical replication:
* [Bare Metal, VMs (EC2/GCE/etc), Docker, etc.](#setting-up-cdc-on-bare-metal-vms-ec2gceetc-docker-etc)
* [AWS Postgres RDS or Aurora](#setting-up-cdc-on-aws-postgres-rds-or-aurora)
* [Azure Database for Postgres](#setting-up-cdc-on-azure-database-for-postgres)

Then, the Airbyte user for your instance needs to be granted `REPLICATION` and `LOGIN` permissions. Since we are using embedded Debezium under the hood for Postgres, we recommend reading the [permissioning section of the Debezium docs](https://debezium.io/documentation/reference/connectors/postgresql.html#postgresql-permissions) for more information on what is required.

Finally, you will need to create a replication slot. Here is the query used to create a replication slot called `airbyte_slot`:
```
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
```

This slot **must** use `pgoutput`.

After providing the name of this slot when configuring the source, you should be ready to sync data with CDC!

### Setting up CDC on Bare Metal, VMs (EC2/GCE/etc), Docker, etc.
Three settings must be configured in the `postgresql.conf` file for your database:

Review comment (Contributor): Is there any help we can give people on where they can find this file? I know it always takes me forever to find it.

```
wal_level = logical
max_wal_senders = 1
max_replication_slots = 1
```

* `wal_level` is the type of encoding used within the Postgres write-ahead log. This must be set to `logical` for Airbyte CDC.
* `max_wal_senders` is the maximum number of processes used for handling WAL changes. This must be at least one.
* `max_replication_slots` is the maximum number of replication slots that are allowed to stream WAL changes. This must be one if Airbyte will be the only service subscribing to WAL changes, or more if other services are also reading from the WAL.
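
The three requirements above can be checked mechanically before restarting. The following is a minimal sketch, not an Airbyte or Postgres tool: `check_cdc_settings` is a hypothetical helper that only understands the `name = value` form shown above and treats a missing setting as a problem, which is stricter than Postgres itself, since built-in defaults may already satisfy some of these requirements.

```python
def check_cdc_settings(conf_text):
    """Return a list of problems found in postgresql.conf-style text.

    Only handles `name = value` lines and `#` comments (a simplification).
    """
    settings = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            name, value = line.split("=", 1)
            settings[name.strip()] = value.strip().strip("'")

    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be logical")
    for name in ("max_wal_senders", "max_replication_slots"):
        value = settings.get(name, "")
        # Missing or non-numeric values are reported as problems in this sketch.
        if not value.isdigit() or int(value) < 1:
            problems.append(f"{name} must be at least 1")
    return problems
```

An empty list means the text contains all three settings with acceptable values.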

After setting these values you will need to restart your instance.

Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres).

### Setting up CDC on AWS Postgres RDS or Aurora
* Go to the `Configuration` tab for your DB cluster.
* Find your cluster parameter group. You will either edit the parameters for this group or create a copy of this parameter group to edit. If you create a copy you will need to change your cluster's parameter group before restarting.
* Within the parameter group page, search for `rds.logical_replication`. Select this row and click on the `Edit parameters` button. Set this value to `1`.
* Wait for a maintenance window to automatically restart the instance or restart it manually.
* Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres).

### Setting up CDC on Azure Database for Postgres
Use the Azure CLI to set the replication support level and restart the server:
```
az postgres server configuration set --resource-group group --server-name server --name azure.replication_support --value logical
az postgres server restart --resource-group group --name server
```

Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres).

### Setting up CDC on other platforms
If you are using a platform that is not listed above, please consider [contributing to our docs](https://github.com/airbytehq/airbyte/tree/master/docs) and providing setup instructions.