diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 1b5b90fc58ae7..5b6265f055c3b 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -89,6 +89,7 @@ * [High-level View](architecture/high-level-view.md) * [Workers & Jobs](architecture/jobs.md) * [Technical Stack](architecture/tech-stack.md) + * [Change Data Capture (CDC)](architecture/cdc.md) * [Contributing to Airbyte](contributing-to-airbyte/README.md) * [Code of Conduct](contributing-to-airbyte/code-of-conduct.md) * [Developing Locally](contributing-to-airbyte/developing-locally.md) diff --git a/docs/architecture/cdc.md b/docs/architecture/cdc.md new file mode 100644 index 0000000000000..5639e526946d1 --- /dev/null +++ b/docs/architecture/cdc.md @@ -0,0 +1,33 @@ +# Change Data Capture (CDC) + +## What is log-based incremental replication? +Many common databases support writing all record changes to log files for the purpose of replication. A consumer of these log files (such as Airbyte) can read these logs while keeping track of the current position within the logs in order to read all record changes coming from `DELETE`/`INSERT`/`UPDATE` statements. + +## Syncing +The orchestration for syncing is similar to non-CDC database sources. After selecting a sync interval, syncs are launched regularly. We read data from the log up to the time that the sync was started. We do not treat CDC sources as infinite streaming sources. You should ensure that your schedule for running these syncs is frequent enough to consume the logs that are generated. The first time the sync is run, a snapshot of the current state of the data will be taken. This is done using `SELECT` statements and is effectively a Full Refresh. Subsequent syncs will use the logs to determine which changes took place since the last sync and update those. Airbyte keeps track of the current log position between syncs. + +A single sync might have some tables configured for Full Refresh replication and others for Incremental. 
If CDC is configured at the source level, all tables with Incremental selected will use CDC. All Full Refresh tables will replicate using the same process as non-CDC sources. However, these tables will still include CDC metadata columns by default. + +The Airbyte Protocol outputs records from sources. Records from `UPDATE` statements appear the same way as records from `INSERT` statements. We support different options for how to sync this data into destinations using primary keys, so you can choose to append this data, delete in place, etc. + +We add some metadata columns for CDC sources: +* `ab_cdc_lsn` is the point in the log where the record was retrieved +* `ab_cdc_updated_at` is the timestamp for the database transaction that resulted in this record change and is present for records from `DELETE`/`INSERT`/`UPDATE` statements +* `ab_cdc_deleted_at` is the timestamp for the database transaction that resulted in this record change and is only present for records from `DELETE` statements + +## Limitations +* CDC incremental is only supported for tables with primary keys. A CDC source can still choose to replicate tables without primary keys as Full Refresh, or a non-CDC source can be configured for the same database to replicate the tables without primary keys using standard incremental replication. +* Data must be in tables, not views. +* The modifications you are trying to capture must be made using `DELETE`/`INSERT`/`UPDATE`. For example, changes made by `TRUNCATE`/`ALTER` won't appear in the logs and therefore won't appear in your destination. +* We do not automatically support schema changes for CDC sources. We recommend resetting and resyncing data if you make a schema change. +* There are database-specific limitations. See the documentation pages for individual connectors for more information. +* The records produced by `DELETE` statements only contain primary keys. All other data fields are unset. 
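As a rough illustration of how the metadata columns and the `DELETE` limitation above fit together, here is a minimal Python sketch of a "delete in place" style merge keyed on a primary key. This is not Airbyte's destination code. The `apply_cdc_records` helper, the `id` primary key, the integer `ab_cdc_lsn` values, the placeholder timestamps, and the in-memory dict standing in for a destination table are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def apply_cdc_records(
    table: Dict[Any, Record],
    records: Iterable[Record],
    primary_key: str = "id",  # hypothetical primary key column
) -> Dict[Any, Record]:
    """Apply CDC records to an in-memory table keyed by primary key."""
    # Process changes in log order so that later changes win.
    # Assumption: ab_cdc_lsn values are comparable (plain integers here).
    for record in sorted(records, key=lambda r: r["ab_cdc_lsn"]):
        key = record[primary_key]
        if record.get("ab_cdc_deleted_at") is not None:
            # DELETE records only carry the primary key, so drop the row.
            table.pop(key, None)
        else:
            # INSERT and UPDATE records look the same: replace the row.
            table[key] = record
    return table


# Made-up records: two inserts followed by a delete of the first row.
records = [
    {"id": 1, "name": "alice", "ab_cdc_lsn": 100,
     "ab_cdc_updated_at": "t1", "ab_cdc_deleted_at": None},
    {"id": 2, "name": "bob", "ab_cdc_lsn": 101,
     "ab_cdc_updated_at": "t2", "ab_cdc_deleted_at": None},
    {"id": 1, "ab_cdc_lsn": 102,
     "ab_cdc_updated_at": "t3", "ab_cdc_deleted_at": "t3"},
]

table = apply_cdc_records({}, records)
print(sorted(table))  # prints [2]
```

An append-style sync would instead keep every record, including the `DELETE` record, and leave it to downstream queries to interpret `ab_cdc_deleted_at`.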
+ +## Current Support +* [Postgres](../integrations/sources/postgres.md) + +## Coming Soon +* [MySQL](../integrations/sources/mysql.md) +* [SQL Server / MSSQL](../integrations/sources/mssql.md) +* Oracle DB +* Please [create a ticket](https://github.com/airbytehq/airbyte/issues/new/choose) if you need CDC support on another database! \ No newline at end of file diff --git a/docs/faq/technical-support.md b/docs/faq/technical-support.md index 5414277ac6768..b5025bbb86f5c 100644 --- a/docs/faq/technical-support.md +++ b/docs/faq/technical-support.md @@ -71,6 +71,10 @@ Depending on your Docker network configuration, you may not be able to connect t If you are running into connection refused errors when running Airbyte via Docker Compose on Mac, try using `host.docker.internal` as the host. On Linux, you may have to modify `docker-compose.yml` and add a host that maps to your local machine using [`extra_hosts`](https://docs.docker.com/compose/compose-file/compose-file-v3/#extra_hosts). +## **Do you support change data capture (CDC) or logical replication for databases?** + +We currently support [CDC for Postgres 10+](../integrations/sources/postgres.md). We are adding support for a few other databases in April/May 2021. + ## **Can I disable analytics in Airbyte?** Yes, you can control what's sent outside of Airbyte for analytics purposes. 
diff --git a/docs/integrations/sources/postgres.md b/docs/integrations/sources/postgres.md index 1b6f1055bb2e9..ba977efb62612 100644 --- a/docs/integrations/sources/postgres.md +++ b/docs/integrations/sources/postgres.md @@ -51,8 +51,8 @@ Postgres data types are mapped to the following data types when synchronizing da | :--- | :--- | | Full Refresh Sync | Yes | | Incremental - Append Sync | Yes | -| Replicate Incremental Deletes | Coming soon | -| Logical Replication \(WAL\) | Coming soon | +| Replicate Incremental Deletes | Yes | +| Logical Replication \(WAL\) | Yes | | SSL Support | Yes | | SSH Tunnel Connection | Coming soon | @@ -97,5 +97,93 @@ GRANT SELECT ON ALL TABLES IN SCHEMA TO airbyte; ALTER DEFAULT PRIVILEGES IN SCHEMA GRANT SELECT ON TABLES TO airbyte; ``` +#### 3. Set up CDC \(Optional\) + +Please read [the section on CDC below](#setting-up-cdc-for-postgres) for more information. + +#### 4. That's it! + Your database user should now be ready for use with Airbyte. +## Change Data Capture (CDC) / Logical Replication / WAL Replication +We use [logical replication](https://www.postgresql.org/docs/10/logical-replication.html) of the Postgres write-ahead log (WAL) to incrementally capture deletes using the `pgoutput` plugin. + +We do not require installing custom plugins like `wal2json` or `test_decoding`. We use `pgoutput`, which is included in Postgres 10+ by default. + +Please read the [CDC docs](../../architecture/cdc.md) for an overview of how Airbyte approaches CDC. + +### Should I use CDC for Postgres? +* If you need a record of deletions and can accept the limitations listed below, you should use CDC for Postgres. +* If your data set is small and you just want a snapshot of your table in the destination, consider using Full Refresh replication for your table instead of CDC. 
+* If the limitations prevent you from using CDC and your goal is to maintain a snapshot of your table in the destination, consider using non-CDC incremental replication and occasionally resetting the data and re-syncing. +* If your table has a primary key but doesn't have a reasonable cursor field for incremental syncing (e.g. `updated_at`), CDC allows you to sync your table incrementally. + +### CDC Limitations +* Make sure to read our [CDC docs](../../architecture/cdc.md) to see limitations that impact all databases using CDC replication. +* CDC is only available for Postgres 10+. +* Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot. Instructions on how to set up a replication slot can be found below. +* Log-based replication only works for master instances of Postgres. +* Using logical replication increases disk space used on the database server. The additional data is stored until it is consumed. + * We recommend setting frequent syncs for CDC in order to ensure that this data doesn't fill up your disk space. + * If you stop syncing a CDC-configured Postgres instance to Airbyte, you should delete the replication slot. Otherwise, it may fill up your disk space. + +### Setting up CDC for Postgres + +Follow one of these guides to enable logical replication: +* [Bare Metal, VMs (EC2/GCE/etc), Docker, etc.](#setting-up-cdc-on-bare-metal-vms-ec2gceetc-docker-etc) +* [AWS Postgres RDS or Aurora](#setting-up-cdc-on-aws-postgres-rds-or-aurora) +* [Azure Database for Postgres](#setting-up-cdc-on-azure-database-for-postgres) + +Then, the Airbyte user for your instance needs to be granted `REPLICATION` and `LOGIN` permissions. Since we are using embedded Debezium under the hood for Postgres, we recommend reading the [permissioning section of the Debezium docs](https://debezium.io/documentation/reference/connectors/postgresql.html#postgresql-permissions) for more information on what is required. 
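As one possible way to grant the permissions described above, the sketch below builds a standard Postgres `ALTER ROLE ... WITH REPLICATION LOGIN` statement for a given role. The helper names `quote_ident` and `replication_grant_sql` are hypothetical, and the role name `airbyte` is only assumed from the `GRANT` statements earlier on this page. Managed database services may restrict this statement or require a different mechanism, so treat this as a starting point and check your provider's documentation.

```python
def quote_ident(name: str) -> str:
    """Quote a Postgres identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def replication_grant_sql(role: str) -> str:
    """Build a statement giving an existing role REPLICATION and LOGIN."""
    # Assumption: the role already exists and the statement is run by a
    # user who is allowed to alter it (typically a superuser).
    return f"ALTER ROLE {quote_ident(role)} WITH REPLICATION LOGIN;"


print(replication_grant_sql("airbyte"))
# prints: ALTER ROLE "airbyte" WITH REPLICATION LOGIN;
```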
+ +Finally, you will need to create a replication slot. Here is the query used to create a replication slot called `airbyte_slot`: +``` +SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput'); +``` + +This slot **must** use `pgoutput`. + +After providing the name of this slot when configuring the source, you should be ready to sync data with CDC! + +### Setting up CDC on Bare Metal, VMs (EC2/GCE/etc), Docker, etc. +Some settings must be configured in the `postgresql.conf` file for your database. You can find the location of this file using `psql -U postgres -c 'SHOW config_file'` with the correct `psql` credentials specified. Alternatively, a custom file can be specified when running Postgres with the `-c` flag. For example, `postgres -c config_file=/etc/postgresql/postgresql.conf` runs Postgres with the config file at `/etc/postgresql/postgresql.conf`. + +If you are syncing data from a server using the `postgres` Docker image, you will need to mount a file and change the command to run Postgres with that config file. If you're just testing CDC behavior, you may want to use a modified version of a [sample `postgresql.conf`](https://github.com/postgres/postgres/blob/master/src/backend/utils/misc/postgresql.conf.sample). + +* `wal_level` is the type of encoding used within the Postgres write-ahead log. This must be set to `logical` for Airbyte CDC. +* `max_wal_senders` is the maximum number of processes used for handling WAL changes. This must be at least one. +* `max_replication_slots` is the maximum number of replication slots that are allowed to stream WAL changes. This must be at least one if Airbyte will be the only service subscribing to WAL changes, or more if other services are also reading from the WAL. + +Here is what these settings would look like in `postgresql.conf`: +``` +wal_level = logical +max_wal_senders = 1 +max_replication_slots = 1 +``` + +After setting these values you will need to restart your instance. 
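Before restarting, it can help to sanity-check the file. The following is a minimal Python sketch, not an Airbyte tool, that reads `postgresql.conf`-style text and reports which of the three settings above look wrong. The `parse_conf` and `check_cdc_settings` helpers are hypothetical, and the parser is deliberately simplified: it only handles `name = value` lines and treats everything after `#` as a comment. A setting missing from the file is reported as "not set" rather than wrong, because the server then falls back to its own default.

```python
from typing import Dict, List


def parse_conf(text: str) -> Dict[str, str]:
    """Parse simple `name = value` lines, ignoring `#` comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip("'")
    return settings


def check_cdc_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of problems with the CDC-related settings."""
    problems: List[str] = []
    wal_level = settings.get("wal_level")
    if wal_level is None:
        problems.append("wal_level is not set in this file")
    elif wal_level != "logical":
        problems.append(f"wal_level is {wal_level!r}, expected 'logical'")
    for name in ("max_wal_senders", "max_replication_slots"):
        value = settings.get(name)
        if value is None:
            problems.append(f"{name} is not set in this file")
        elif not value.isdigit() or int(value) < 1:
            problems.append(f"{name} is {value!r}, expected at least 1")
    return problems


sample = """
wal_level = logical        # required for CDC
max_wal_senders = 1
max_replication_slots = 1
"""
print(check_cdc_settings(parse_conf(sample)))  # prints []
```

Running `SHOW wal_level;` against the restarted instance remains the authoritative check, since the file on disk may not be the one the server actually loaded.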
+ +Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres). + +### Setting up CDC on AWS Postgres RDS or Aurora +* Go to the `Configuration` tab for your DB cluster. +* Find your cluster parameter group. You will either edit the parameters for this group or create a copy of this parameter group to edit. If you create a copy, you will need to change your cluster's parameter group before restarting. +* Within the parameter group page, search for `rds.logical_replication`. Select this row and click on the `Edit parameters` button. Set this value to `1`. +* Wait for a maintenance window to automatically restart the instance or restart it manually. +* Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres). + +### Setting up CDC on Azure Database for Postgres +Use the Azure CLI to enable logical replication and restart the server: +``` +az postgres server configuration set --resource-group group --server-name server --name azure.replication_support --value logical +az postgres server restart --resource-group group --name server +``` + +Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres). + +### Setting up CDC on Google CloudSQL + +Unfortunately, logical replication is not configurable for Google CloudSQL. You can indicate your support for this feature on the [Google Issue Tracker](https://issuetracker.google.com/issues/120274585). + +### Setting up CDC on other platforms +If you are using a platform that is not listed above, please consider [contributing to our docs](https://github.com/airbytehq/airbyte/tree/master/docs) and providing setup instructions.