Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB #1354
Description
Load data fast from Postgres to Postgres with ConnectorX and DuckDB for bigger Postgres tables (GBs)
This example shows how to export and import data from Postgres to Postgres quickly with ConnectorX and DuckDB, because the default route generates insert statements in the normalization phase, which is very slow for large tables.
As it's an initial load, we first create a separate schema with a timestamp and then replace the existing schema with the new one.
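For orientation, the core idea can be sketched roughly as below. This is a minimal illustration under assumptions, not the PR's actual code: the environment variable names (`SOURCE_PG_URL`, `TARGET_PG_DSN`), the `public` target schema, the `table1`/`table2` table list, and the use of psycopg2 for the schema DDL are all placeholders.

```python
# Minimal sketch (assumptions throughout): initial load of a few tables from a
# source Postgres into a timestamped schema of a target Postgres, bypassing
# dlt's insert-statement normalization.
import os
from datetime import datetime

import connectorx as cx
import duckdb
import psycopg2

SOURCE_PG_URL = os.environ["SOURCE_PG_URL"]  # e.g. postgresql://user:pass@src-host:5432/src_db
TARGET_PG_DSN = os.environ["TARGET_PG_DSN"]  # e.g. dbname=tgt_db user=loader password=... host=tgt-host
TARGET_SCHEMA = "public"                     # schema that gets replaced at the end
STAGING_SCHEMA = f"{TARGET_SCHEMA}_{datetime.now():%Y%m%d%H%M%S}"

# 1. create the timestamped staging schema in the target database
pg = psycopg2.connect(TARGET_PG_DSN)
with pg, pg.cursor() as cur:
    cur.execute(f"CREATE SCHEMA {STAGING_SCHEMA}")

# 2. copy each table: ConnectorX -> Arrow -> DuckDB -> attached target Postgres
con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(f"ATTACH '{TARGET_PG_DSN}' AS tgt (TYPE POSTGRES)")
for table in ("table1", "table2"):
    # fast extract: whole table as an Arrow table, no row-by-row processing
    arrow_tbl = cx.read_sql(SOURCE_PG_URL, f"SELECT * FROM {table}", return_type="arrow")
    con.register("src_tbl", arrow_tbl)
    # DuckDB writes the Arrow data straight into the attached Postgres
    con.execute(f"CREATE TABLE tgt.{STAGING_SCHEMA}.{table} AS SELECT * FROM src_tbl")
    con.unregister("src_tbl")
con.close()

# 3. swap: retire the old schema and promote the freshly loaded one
with pg, pg.cursor() as cur:
    cur.execute(f"DROP SCHEMA IF EXISTS {TARGET_SCHEMA} CASCADE")
    cur.execute(f"ALTER SCHEMA {STAGING_SCHEMA} RENAME TO {TARGET_SCHEMA}")
pg.close()
```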
Related Issues
Timestamps of `9999-12-31 23:59:59.999999` could not be loaded as-is; we replaced them with NULL instead.
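If your source tables contain such sentinel timestamps, one way to neutralize them is in the extraction query itself. A small sketch; the table and column names are made up for illustration:

```python
import connectorx as cx

SOURCE_PG_URL = "postgresql://user:pass@src-host:5432/src_db"  # placeholder

# NULLIF turns the sentinel value into NULL right in the extraction query,
# so it never reaches Arrow/DuckDB (column/table names are placeholders).
query = """
    SELECT
        id,
        NULLIF(valid_to, '9999-12-31 23:59:59.999999'::timestamp) AS valid_to
    FROM table1
"""
arrow_tbl = cx.read_sql(SOURCE_PG_URL, query, return_type="arrow")
```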
Additional Context
Initial load only
This approach is tested and works well for the initial load (`--replace`); the incremental load (`--merge`) might need some adjustments (loading dlt's load tables, setting up the first run after the initial load, etc.).
Run it:
Start with the following command, but be aware that you need to define the environment variables in a `.env` file and adjust the table names (`table1` and `table2`):
I'm unsure if the normalization phase only generates insert statements when connector-x and parquet are used. We also used different source connectors like Oracle/MSSQL, and these were again bottlenecked in the normalization state with neverending creating insert-statement with a large table. Might there be a better way, as I would think this problem would appear to everyone using dlt (except if you only have small tables)? I'm sure I also didn't know all the right ways of using it, but I couldn't figure it out; that's why I created this workaround with DuckDB, which works very well for the initial load, but now we need to load the initial load every day, which was not the idea.
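For contrast, here is a hedged sketch of the default route this example works around: a plain dlt resource yielding the ConnectorX result into the Postgres destination, whose default insert_values file format makes normalization the bottleneck for large tables. Resource names and connection strings are placeholders; destination credentials are assumed to come from dlt's usual secrets/env configuration.

```python
import connectorx as cx
import dlt

SOURCE_PG_URL = "postgresql://user:pass@src-host:5432/src_db"  # placeholder

@dlt.resource(name="table1", write_disposition="replace")
def table1():
    # whole-table read via ConnectorX; with the default Postgres destination the
    # rows are re-serialized as insert statements during normalization, which is
    # the slow part for GB-sized tables
    yield cx.read_sql(SOURCE_PG_URL, "SELECT * FROM table1", return_type="arrow")

# destination credentials are read from dlt secrets / environment variables
pipeline = dlt.pipeline(
    pipeline_name="pg_to_pg_default",
    destination="postgres",
    dataset_name="public",
)
info = pipeline.run(table1)
print(info)
```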
Working example
Right now, you'd need to define the env variables pointing at existing Postgres databases. It would be nice to update the example with an existing public Postgres DB so that people could easily test it.
I'm happy to get feedback, but most importantly I wanted to share the code here so that people who have the same problem have an example of how to improve speed with DuckDB.