Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB #1354
Description
Load data fast from Postgres to Postgres with ConnectorX and DuckDB for bigger Postgres tables (GBs)
This example shows how to export and import data from Postgres to Postgres quickly with ConnectorX and DuckDB, because the default route generates insert statements in the normalization phase, which is very slow for large tables.
As it's an initial load, we first create a separate schema with a timestamp and then replace the existing schema with the new one.
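For orientation, the core idea can be sketched roughly as below. This is a minimal illustration under assumptions, not the PR's actual code: the environment variable names (`SOURCE_PG_URL`, `TARGET_PG_DSN`), the `public` target schema, the `table1`/`table2` table list, and the use of psycopg2 for the schema DDL are all placeholders.

```python
# Minimal sketch (assumptions throughout): initial load of a few tables from a
# source Postgres into a timestamped schema of a target Postgres, bypassing
# dlt's insert-statement normalization.
import os
from datetime import datetime

import connectorx as cx
import duckdb
import psycopg2

SOURCE_PG_URL = os.environ["SOURCE_PG_URL"]  # e.g. postgresql://user:pass@src-host:5432/src_db
TARGET_PG_DSN = os.environ["TARGET_PG_DSN"]  # e.g. dbname=tgt_db user=loader password=... host=tgt-host
TARGET_SCHEMA = "public"                     # schema that gets replaced at the end
STAGING_SCHEMA = f"{TARGET_SCHEMA}_{datetime.now():%Y%m%d%H%M%S}"

# 1. create the timestamped staging schema in the target database
pg = psycopg2.connect(TARGET_PG_DSN)
with pg, pg.cursor() as cur:
    cur.execute(f"CREATE SCHEMA {STAGING_SCHEMA}")

# 2. copy each table: ConnectorX -> Arrow -> DuckDB -> attached target Postgres
con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(f"ATTACH '{TARGET_PG_DSN}' AS tgt (TYPE POSTGRES)")
for table in ("table1", "table2"):
    # fast extract: whole table as an Arrow table, no row-by-row processing
    arrow_tbl = cx.read_sql(SOURCE_PG_URL, f"SELECT * FROM {table}", return_type="arrow")
    con.register("src_tbl", arrow_tbl)
    # DuckDB writes the Arrow data straight into the attached Postgres
    con.execute(f"CREATE TABLE tgt.{STAGING_SCHEMA}.{table} AS SELECT * FROM src_tbl")
    con.unregister("src_tbl")
con.close()

# 3. swap: retire the old schema and promote the freshly loaded one
with pg, pg.cursor() as cur:
    cur.execute(f"DROP SCHEMA IF EXISTS {TARGET_SCHEMA} CASCADE")
    cur.execute(f"ALTER SCHEMA {STAGING_SCHEMA} RENAME TO {TARGET_SCHEMA}")
pg.close()
```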
Related Issues
Timestamps of `9999-12-31 23:59:59.999999` could not be loaded as-is; we replaced them with NULL instead.
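If your source tables contain such sentinel timestamps, one way to neutralize them is in the extraction query itself. A small sketch; the table and column names are made up for illustration:

```python
import connectorx as cx

SOURCE_PG_URL = "postgresql://user:pass@src-host:5432/src_db"  # placeholder

# NULLIF turns the sentinel value into NULL right in the extraction query,
# so it never reaches Arrow/DuckDB (column/table names are placeholders).
query = """
    SELECT
        id,
        NULLIF(valid_to, '9999-12-31 23:59:59.999999'::timestamp) AS valid_to
    FROM table1
"""
arrow_tbl = cx.read_sql(SOURCE_PG_URL, query, return_type="arrow")
```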
Additional Context
Initial load only
This approach is tested and works well for the initial load (`--replace`); the incremental load (`--merge`) might need some adjustments (loading dlt's load tables, setting up the first run after the initial load, etc.).
Run it:
Start with the following command, but be aware that you need to define the environment variables in a `.env` file and adjust the table names (`table1` and `table2`):
I'm unsure if the normalization phase only generates insert statements when connector-x and parquet are used. We also used different source connectors like Oracle/MSSQL, and these were again bottlenecked in the normalization state with neverending creating insert-statement with a large table. Might there be a better way, as I would think this problem would appear to everyone using dlt (except if you only have small tables)? I'm sure I also didn't know all the right ways of using it, but I couldn't figure it out; that's why I created this workaround with DuckDB, which works very well for the initial load, but now we need to load the initial load every day, which was not the idea.
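For contrast, here is a hedged sketch of the default route this example works around: a plain dlt resource yielding the ConnectorX result into the Postgres destination, whose default insert_values file format makes normalization the bottleneck for large tables. Resource names and connection strings are placeholders; destination credentials are assumed to come from dlt's usual secrets/env configuration.

```python
import connectorx as cx
import dlt

SOURCE_PG_URL = "postgresql://user:pass@src-host:5432/src_db"  # placeholder

@dlt.resource(name="table1", write_disposition="replace")
def table1():
    # whole-table read via ConnectorX; with the default Postgres destination the
    # rows are re-serialized as insert statements during normalization, which is
    # the slow part for GB-sized tables
    yield cx.read_sql(SOURCE_PG_URL, "SELECT * FROM table1", return_type="arrow")

# destination credentials are read from dlt secrets / environment variables
pipeline = dlt.pipeline(
    pipeline_name="pg_to_pg_default",
    destination="postgres",
    dataset_name="public",
)
info = pipeline.run(table1)
print(info)
```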
Working example
Right now, you'd need to define the env variables pointing at existing Postgres databases. It would be nice to update the example with an existing public Postgres DB so that people could easily test it.
I'm happy to get feedback, but most importantly I wanted to share the code here so that people who have the same problem have an example of how to improve speed with DuckDB.