postgres cdc docs #2784
Merged
c66f4bd
cdc docs
jrhizor e58cafa
Update docs/integrations/sources/postgres.md
jrhizor 7499b02
address gcp
jrhizor a85534b
learn too english
jrhizor bdc9777
add link
jrhizor 560ee45
add more disk space warnings
jrhizor 8cb1921
add additional cdc use case
jrhizor deebb27
add information on how to find postgresql.conf
jrhizor 6652379
add how to find the file
jrhizor 971a0e0
Merge branch 'jrhizor/debezium' into jrhizor/cdc-docs
jrhizor b5b4819
Merge branch 'jrhizor/debezium' into jrhizor/cdc-docs
jrhizor
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Change Data Capture (CDC) | ||
|
||
## What is log-based incremental replication? | ||
Many common databases support writing all record changes to log files for the purpose of replication. A consumer of these log files (such as Airbyte) can read the logs while keeping track of its current position within them, and so capture every record change coming from `DELETE`/`INSERT`/`UPDATE` statements. | ||
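
As a rough illustration of the idea above (a toy model, not how Airbyte or Postgres actually implement it), a log consumer can be pictured as a reader that remembers the last position it processed and only returns entries written after it. The `Change` and `LogConsumer` names and the integer positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    position: int   # hypothetical, monotonically increasing log position
    operation: str  # "INSERT", "UPDATE", or "DELETE"
    row: dict


class LogConsumer:
    """Toy consumer that remembers how far into the log it has read."""

    def __init__(self):
        self.position = 0  # nothing consumed yet

    def read_new_changes(self, log):
        # Return only the entries written after the saved position.
        new_changes = [c for c in log if c.position > self.position]
        if new_changes:
            self.position = max(c.position for c in new_changes)
        return new_changes


log = [
    Change(1, "INSERT", {"id": 1, "name": "a"}),
    Change(2, "UPDATE", {"id": 1, "name": "b"}),
]
consumer = LogConsumer()
first_batch = consumer.read_new_changes(log)   # both existing entries

log.append(Change(3, "DELETE", {"id": 1}))
second_batch = consumer.read_new_changes(log)  # only the new DELETE entry
```

Because the consumer stores its position, a later read picks up where the previous one stopped instead of re-reading the whole log.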
|
||
## Syncing | ||
The orchestration for syncing is similar to non-CDC database sources. After selecting a sync interval, syncs are launched regularly. We read data from the log up to the time that the sync was started. We do not treat CDC sources as infinite streaming sources. You should ensure that your schedule for running these syncs is frequent enough to consume the logs that are generated. The first time the sync is run, a snapshot of the current state of the data will be taken. This is done using `SELECT` statements and is effectively a Full Refresh. Subsequent syncs use the logs to determine which changes took place since the last sync and replicate only those changes. Airbyte keeps track of the current log position between syncs. | ||
|
||
A single sync might have some tables configured for Full Refresh replication and others for Incremental. If CDC is configured at the source level, all tables with Incremental selected will use CDC. All Full Refresh tables will replicate using the same process as non-CDC sources. However, these tables will still include CDC metadata columns by default. | ||
|
||
The Airbyte Protocol outputs records from sources. Records from `UPDATE` statements appear the same way as records from `INSERT` statements. We support different options for how to sync this data into destinations using primary keys, so you can choose to append this data, delete in place, etc. | ||
|
||
We add some metadata columns for CDC sources: | ||
* `ab_cdc_lsn` is the point in the log where the record was retrieved | ||
* `ab_cdc_updated_at` is the timestamp for the database transaction that resulted in this record change and is present for records from `DELETE`/`INSERT`/`UPDATE` statements | ||
* `ab_cdc_deleted_at` is the timestamp for the database transaction that resulted in this record change and is only present for records from `DELETE` statements | ||
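
To make the role of these columns concrete, here is a minimal sketch of how a downstream consumer might fold CDC records into a table snapshot keyed by primary key. It is not how Airbyte destinations are implemented: `apply_cdc_records` is a hypothetical helper, and the LSN and timestamp values are made up. It relies on two points from this page: `ab_cdc_deleted_at` is only set for `DELETE` records, and `DELETE` records carry only the primary key.

```python
def apply_cdc_records(snapshot, records, primary_key="id"):
    """Apply CDC records, ordered by log position, to a dict keyed by primary key."""
    for record in sorted(records, key=lambda r: r["ab_cdc_lsn"]):
        key = record[primary_key]
        if record.get("ab_cdc_deleted_at") is not None:
            # Record came from a DELETE: drop the row from the snapshot.
            snapshot.pop(key, None)
        else:
            # INSERT and UPDATE records look the same, so upsert by key.
            snapshot[key] = record
    return snapshot


# Made-up records for illustration only.
records = [
    {"id": 1, "name": "a", "ab_cdc_lsn": 100,
     "ab_cdc_updated_at": "2021-04-01T00:00:00Z", "ab_cdc_deleted_at": None},
    {"id": 2, "name": "x", "ab_cdc_lsn": 110,
     "ab_cdc_updated_at": "2021-04-01T00:01:00Z", "ab_cdc_deleted_at": None},
    {"id": 1, "name": "b", "ab_cdc_lsn": 120,
     "ab_cdc_updated_at": "2021-04-01T00:02:00Z", "ab_cdc_deleted_at": None},
    # DELETE record: only the primary key and the metadata columns are set.
    {"id": 2, "ab_cdc_lsn": 130,
     "ab_cdc_updated_at": "2021-04-01T00:03:00Z",
     "ab_cdc_deleted_at": "2021-04-01T00:03:00Z"},
]
snapshot = apply_cdc_records({}, records)
```

A consumer that wants a history of changes instead of a snapshot would append every record and keep the metadata columns for later filtering.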
|
||
## Limitations | ||
* CDC incremental is only supported for tables with primary keys. A CDC source can still replicate tables without primary keys as Full Refresh, or a separate non-CDC source can be configured for the same database to replicate those tables using standard incremental replication. | ||
* Data must be in tables, not views. | ||
* The modifications you are trying to capture must be made using `DELETE`/`INSERT`/`UPDATE`. For example, changes made with `TRUNCATE`/`ALTER` won't appear in the logs and therefore won't reach your destination. | ||
|
||
* We do not support schema changes automatically for CDC sources. We recommend resetting and resyncing data if you make a schema change. | ||
* There are database-specific limitations. See the documentation pages for individual connectors for more information. | ||
* The records produced by `DELETE` statements only contain primary keys. All other data fields are unset. | ||
|
||
## Current Support | ||
* [Postgres](../integrations/sources/postgres.md) | ||
|
||
## Coming Soon | ||
* [MySQL](../integrations/sources/mysql.md) | ||
* [SQL Server / MSSQL](../integrations/sources/mssql.md) | ||
* Oracle DB | ||
* Please [create a ticket](https://github.com/airbytehq/airbyte/issues/new/choose) if you need CDC support on another database! |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,8 +51,8 @@ Postgres data types are mapped to the following data types when synchronizing da | |
| :--- | :--- | | ||
| Full Refresh Sync | Yes | | ||
| Incremental - Append Sync | Yes | | ||
-| Replicate Incremental Deletes | Coming soon | | ||
-| Logical Replication \(WAL\) | Coming soon | | ||
+| Replicate Incremental Deletes | Yes | | ||
+| Logical Replication \(WAL\) | Yes | | ||
| SSL Support | Yes | | ||
| SSH Tunnel Connection | Coming soon | | ||
|
||
|
@@ -97,5 +97,82 @@ GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO airbyte; | |
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO airbyte; | ||
``` | ||
|
||
#### 3. Set up CDC \(Optional\) | ||
|
||
Please read the section on CDC below for more information. | ||
|
||
|
||
#### 4. That's it! | ||
|
||
Your database user should now be ready for use with Airbyte. | ||
|
||
## Change Data Capture (CDC) / Logical Replication / WAL Replication | ||
We use [logical replication](https://www.postgresql.org/docs/10/logical-replication.html) of the Postgres write-ahead log (WAL) to incrementally capture deletes using the `pgoutput` plugin. | ||
|
||
We do not require installing custom plugins like `wal2json` or `test_decoding`. We use `pgoutput`, which is included in Postgres 10+ by default. | ||
|
||
Please read the [CDC docs](../../architecture/cdc.md) for an overview of how Airbyte approaches CDC. | ||
|
||
### Should I use CDC for Postgres? | ||
|
||
* If you need a record of deletions and can accept the limitations listed below, you should use CDC for Postgres. | ||
* If your data set is small and you just want a snapshot of your table in the destination, consider using Full Refresh replication for your table instead of CDC. | ||
* If the limitations prevent you from using CDC and your goal is to maintain a snapshot of your table in the destination, consider using non-CDC incremental replication and occasionally resetting the data and re-syncing. | ||
|
||
### CDC Limitations | ||
* Make sure to read our [CDC docs](../../architecture/cdc.md) to see limitations that impact all databases using CDC replication. | ||
* CDC is only available for Postgres 10+. | ||
* Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot. | ||
|
||
* Log-based replication only works for master instances of Postgres. | ||
* Using logical replication increases disk space used on the database server. The additional data is stored until it is consumed. | ||
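
Since unconsumed WAL accumulates on disk, it can help to watch how far a slot lags behind the server. The sketch below is one possible approach, not something from the Airbyte connector: it assumes you fetch `pg_current_wal_lsn()` and the slot's `restart_lsn` from the `pg_replication_slots` view yourself (for example with the query in `SLOT_QUERY`), and it converts the two LSNs to byte positions to estimate how much WAL is being retained. The helper names are hypothetical.

```python
# Query (Postgres 10+) that returns the two LSNs used below. Running it is
# left to whatever client you already use; 'airbyte_slot' is the slot name
# used elsewhere on this page.
SLOT_QUERY = """
SELECT slot_name, pg_current_wal_lsn() AS current_lsn, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'airbyte_slot';
"""


def lsn_to_int(lsn):
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Approximate number of WAL bytes the slot is holding back."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


# Example with made-up LSN values.
lag = retained_wal_bytes("0/2000", "0/1000")
```

If this number keeps growing between syncs, the sync schedule is probably not frequent enough for the volume of changes being written.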
|
||
|
||
### Setting up CDC for Postgres | ||
|
||
Follow one of these guides to enable logical replication: | ||
* [Bare Metal, VMs (EC2/GCE/etc), Docker, etc.](#setting-up-cdc-on-bare-metal-vms-ec2gceetc-docker-etc) | ||
* [AWS Postgres RDS or Aurora](#setting-up-cdc-on-aws-postgres-rds-or-aurora) | ||
* [Azure Database for Postgres](#setting-up-cdc-on-azure-database-for-postgres) | ||
|
||
Then, the Airbyte user for your instance needs to be granted `REPLICATION` and `LOGIN` permissions. Since we are using embedded Debezium under the hood for Postgres, we recommend reading the [permissioning section of the Debezium docs](https://debezium.io/documentation/reference/connectors/postgresql.html#postgresql-permissions) for more information on what is required. | ||
|
||
Finally, you will need to create a replication slot. Here is the query used to create a replication slot called `airbyte_slot`: | ||
``` | ||
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput'); | ||
``` | ||
|
||
This slot **must** use `pgoutput`. | ||
|
||
After providing the name of this slot when configuring the source, you should be ready to sync data with CDC! | ||
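
The permission and slot steps above can be gathered into a small script. This is only a sketch under assumptions: `cdc_setup_statements` is a hypothetical helper, the user name `airbyte` comes from the grant statements earlier on this page, and `ALTER ROLE ... REPLICATION LOGIN` is standard Postgres syntax that a managed service may not allow you to run directly. The statements are returned as strings so you can review them and run them with your own client.

```python
import re


def cdc_setup_statements(user="airbyte", slot="airbyte_slot"):
    """Build the SQL statements for the role permissions and replication slot."""
    for name in (user, slot):
        # Keep to simple lowercase identifiers so nothing needs quoting.
        if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
            raise ValueError(f"unsupported identifier: {name!r}")
    return [
        f"ALTER ROLE {user} REPLICATION LOGIN;",
        f"SELECT pg_create_logical_replication_slot('{slot}', 'pgoutput');",
    ]


statements = cdc_setup_statements()
for statement in statements:
    print(statement)
```

The slot name passed here is the one you later provide when configuring the source.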
|
||
### Setting up CDC on Bare Metal, VMs (EC2/GCE/etc), Docker, etc. | ||
Three settings must be configured in the `postgresql.conf` file for your database: | ||
Review comment: is there any help we can give people on where they can find this file? i know it always takes me forever to find it.
||
``` | ||
wal_level = logical | ||
max_wal_senders = 1 | ||
max_replication_slots = 1 | ||
``` | ||
|
||
* `wal_level` determines how much information is written to the Postgres write-ahead log. This must be set to `logical` for Airbyte CDC. | ||
* `max_wal_senders` is the maximum number of processes used for handling WAL changes. This must be at least one. | ||
* `max_replication_slots` is the maximum number of replication slots that are allowed to stream WAL changes. This must be at least one if Airbyte is the only service subscribing to WAL changes, or more if other services are also reading from the WAL. | ||
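
If you are unsure where the file lives, running `SHOW config_file;` in a SQL session prints its path. Once you have the file, a quick check like the one below can confirm the three settings. This is a simplified sketch: `parse_conf` and `check_cdc_settings` are hypothetical helpers, the parser ignores include directives and `#` characters inside quoted values, and a setting missing from the file may still be set elsewhere (for example via `ALTER SYSTEM`), so missing entries are only flagged for manual review.

```python
import re

REQUIRED = ("wal_level", "max_wal_senders", "max_replication_slots")


def parse_conf(text):
    """Parse simple `name = value` lines from postgresql.conf-style text."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"(\w+)(?:\s*=\s*|\s+)(.+)", line)
        if match:
            settings[match.group(1).lower()] = match.group(2).strip().strip("'")
    return settings


def check_cdc_settings(settings):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for name in REQUIRED:
        if name not in settings:
            problems.append(f"{name} is not set in this file; check it with SHOW {name};")
    if "wal_level" in settings and settings["wal_level"] != "logical":
        problems.append("wal_level must be logical")
    for name in ("max_wal_senders", "max_replication_slots"):
        if name in settings and int(settings[name]) < 1:
            problems.append(f"{name} must be at least 1")
    return problems


conf_text = """
wal_level = replica        # not sufficient for CDC
max_wal_senders = 1
#max_replication_slots = 1
"""
problems = check_cdc_settings(parse_conf(conf_text))
for problem in problems:
    print(problem)
```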
|
||
After setting these values you will need to restart your instance. | ||
|
||
Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres). | ||
|
||
### Setting up CDC on AWS Postgres RDS or Aurora | ||
|
||
* Go to the `Configuration` tab for your DB cluster. | ||
* Find your cluster parameter group. You will either edit the parameters for this group or create a copy of this parameter group to edit. If you create a copy you will need to change your cluster's parameter group before restarting. | ||
* Within the parameter group page, search for `rds.logical_replication`. Select this row and click on the `Edit parameters` button. Set this value to `1`. | ||
* Wait for a maintenance window to automatically restart the instance or restart it manually. | ||
* Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres). | ||
|
||
### Setting up CDC on Azure Database for Postgres | ||
Use the Azure CLI to enable logical replication and restart the server: | ||
``` | ||
az postgres server configuration set --resource-group group --server-name server --name azure.replication_support --value logical | ||
az postgres server restart --resource-group group --name server | ||
``` | ||
|
||
Finally, [follow the rest of the steps above](#setting-up-cdc-for-postgres). | ||
|
||
### Setting up CDC on other platforms | ||
If you are using a platform not listed above, please consider [contributing to our docs](https://github.com/airbytehq/airbyte/tree/master/docs) and providing setup instructions. |
Very clear!