
Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB #1354

Conversation

@sspaeti (Contributor) commented May 14, 2024

Description

Load data fast from Postgres to Postgres with ConnectorX and DuckDB for bigger Postgres tables (GBs)

This example shows how to export and import data from Postgres to Postgres quickly with ConnectorX and DuckDB, because the default export generates INSERT statements in the normalization phase, which is very slow for large tables.

As it's an initial load, we first create a separate, timestamped schema and then replace the existing schema with the new one.
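
For illustration, here is a minimal sketch of the core idea (not the actual example script): read the source table in chunks with ConnectorX as Arrow tables and let DuckDB write Parquet, so the load never has to go through row-by-row INSERT statements. The connection string, table name, and chunk size are placeholders.

```python
# Minimal sketch of the core idea; not the actual example script.
import connectorx as cx
import duckdb

SOURCE_CONN = "postgresql://user:password@localhost:5432/source_db"  # placeholder
CHUNKSIZE = 1_000_000  # illustrative chunk size


def read_chunks(table: str):
    """Yield the source table as Arrow chunks read via ConnectorX."""
    offset = 0
    while True:
        query = f"SELECT * FROM {table} LIMIT {CHUNKSIZE} OFFSET {offset}"
        chunk = cx.read_sql(SOURCE_CONN, query, return_type="arrow")
        if chunk.num_rows == 0:
            break
        yield chunk
        offset += CHUNKSIZE


con = duckdb.connect()
for i, chunk in enumerate(read_chunks("table1")):
    # DuckDB scans the in-memory Arrow table directly and writes Parquet,
    # which can then be loaded without generating INSERT statements
    con.register("chunk", chunk)
    con.execute(f"COPY chunk TO 'table1_{i}.parquet' (FORMAT PARQUET)")
    con.unregister("chunk")
```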

Related Issues

Additional Context

Initial load only

This approach is tested and works well for the initial load (--replace); the incremental load (--merge) might need some adjustments (loading dlt's load tables, setting up the first run after the initial load, etc.).

Run it:

Start with the following command, but be aware that you need to define the environment variables in a .env file and adjust the table names (table1 and table2):

postgres_to_postgres_with_x_and_duckdb.py --replace

Default behavior of generating insert statements

I'm unsure whether the normalization phase only generates insert statements when ConnectorX and Parquet are used. We also used other source connectors such as Oracle and MSSQL, and these were again bottlenecked in the normalization stage, endlessly creating insert statements for a large table. Might there be a better way? I would think this problem affects everyone using dlt (unless you only have small tables). I'm sure I don't know all the right ways of using it and couldn't figure it out; that's why I created this workaround with DuckDB. It works very well for the initial load, but now we need to run the initial load every day, which was not the idea.

Working example

Right now, you'd need to define env variables pointing to existing Postgres DBs. It would be nice to update the example to use an existing public Postgres DB so that people could easily test it.

I welcome feedback, but most importantly I wanted to share the code here so that people with the same problem have an example of how to improve speed with DuckDB.


@sspaeti sspaeti changed the title Fast Postgres to Postgres with Connector-X and DuckDB Fast Postgres to Postgres Initial Load Example with Connector-X and DuckDB May 14, 2024
@rudolfix (Collaborator) commented May 15, 2024

@sspaeti this is SO good! @AstrakhantsevaAA please take a look. let's merge this

As a side note: after a discussion with you I realized how inefficient we are when dealing with parquet and databases, so we started implementing some shortcuts. In the case of postgres I discovered that you can actually stream parquet as csv via psycopg2, and it is at least 10x faster than insert (and normalization is skipped). Surely not as fast as what you have here, but on the other hand it is able to support merge loads.
https://dlthub.com/docs/dlt-ecosystem/destinations/postgres#fast-loading-with-arrow-tables-and-csv
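
A minimal sketch of that csv fast path, assuming current dlt APIs and that Postgres credentials are already configured via dlt secrets; the resource, pipeline name, and sample data below are illustrative:

```python
import dlt
import pyarrow as pa


@dlt.resource(name="items", write_disposition="replace")
def items():
    # Illustrative Arrow table; in practice these would be chunks read via ConnectorX
    yield pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})


pipeline = dlt.pipeline(
    pipeline_name="pg_to_pg_csv",
    destination="postgres",
    dataset_name="mydata",
)

# loader_file_format="csv" lets the postgres destination stream the data via COPY,
# skipping per-row INSERT statement generation during normalization
load_info = pipeline.run(items, loader_file_format="csv")
print(load_info)
```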

The next one is mssql and bcp, but we'd need to run an external tool for that, which is quite obnoxious.

@AstrakhantsevaAA AstrakhantsevaAA added ci from fork run ci workflows on a pr even if they are from a fork and removed ci from fork run ci workflows on a pr even if they are from a fork labels May 16, 2024
@ogierpaulmck

Hi, +1 on how to load mssql very fast!

@sspaeti (Contributor, Author) commented May 17, 2024

Hi, +1 on how to load mssql very fast!

@ogierpaulmck This example easily works for MSSQL, Oracle, and other SQLAlchemy-supported connections, too; we now use it with exactly those two as well. The only things you need to change are (see the sketch after this list):

  1. The connection string, e.g. to MSSQL or anything else.
     - For MSSQL we use conn = f"mssql://{MSSQL_USERNAME}:{quote_plus(MSSQL_PASSWORD)}@{MSSQL_HOST}:{MSSQL_PORT}/{table_definition['source_db']}?TrustServerCertificate=yes&Encrypt=yes"
     - For Oracle: oracle+oracledb://{POSTGRES_USERNAME}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/?service_name={POSTGRES_DATABASE}
  2. The LIMIT {CHUNKSIZE} OFFSET {offset} in read_sql_x_chunked needs to be adjusted for each database.
     - E.g. Oracle uses: OFFSET {CHUNKSIZE * i} ROWS FETCH NEXT {CHUNKSIZE} ROWS ONLY
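
To make those two adjustments concrete, here is a hedged sketch: build_conn_string and build_chunk_query are hypothetical helpers (the actual script uses read_sql_x_chunked), and the environment variable names are illustrative.

```python
# Hedged sketch of the per-database adjustments described above.
import os
from urllib.parse import quote_plus


def build_conn_string(db_type: str, database: str) -> str:
    """Build a source connection string; env var names are illustrative."""
    if db_type == "mssql":
        return (
            f"mssql://{os.environ['MSSQL_USERNAME']}:{quote_plus(os.environ['MSSQL_PASSWORD'])}"
            f"@{os.environ['MSSQL_HOST']}:{os.environ['MSSQL_PORT']}/{database}"
            "?TrustServerCertificate=yes&Encrypt=yes"
        )
    if db_type == "oracle":
        return (
            f"oracle+oracledb://{os.environ['ORACLE_USERNAME']}:{os.environ['ORACLE_PASSWORD']}"
            f"@{os.environ['ORACLE_HOST']}:{os.environ['ORACLE_PORT']}/?service_name={database}"
        )
    return (
        f"postgresql://{os.environ['POSTGRES_USERNAME']}:{os.environ['POSTGRES_PASSWORD']}"
        f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/{database}"
    )


def build_chunk_query(db_type: str, table: str, chunksize: int, i: int) -> str:
    """Paging syntax differs per database: Postgres uses LIMIT/OFFSET,
    Oracle (12c+) uses OFFSET ... ROWS FETCH NEXT ... ROWS ONLY."""
    if db_type == "oracle":
        return f"SELECT * FROM {table} OFFSET {chunksize * i} ROWS FETCH NEXT {chunksize} ROWS ONLY"
    return f"SELECT * FROM {table} LIMIT {chunksize} OFFSET {chunksize * i}"
```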

@dlt team: Maybe we could make this example even more generic for SQLAlchemy connections.

@sh-rp sh-rp added the ci from fork run ci workflows on a pr even if they are from a fork label May 21, 2024
@AstrakhantsevaAA AstrakhantsevaAA changed the base branch from devel to example/fast_postgres May 21, 2024 10:15
@AstrakhantsevaAA AstrakhantsevaAA merged commit da06e9c into dlt-hub:example/fast_postgres May 21, 2024
41 checks passed