
allow to import external files into pipeline extract storage and load them directly #1471

Open
rudolfix opened this issue Jun 16, 2024 · 0 comments
rudolfix commented Jun 16, 2024

Background
We often deal with large amounts of structured data that we want to load into a destination, e.g. parquet or CSV files (let's skip jsonl files for now). Currently dlt loads those into arrow tables (or lists of dicts) during extract, saves them as parquet or jsonl, and then moves or rewrites them during normalize. For structured data we could easily sniff the schema and just pass the files around, rewriting them only if the destination file format requires it. For example:

glob a folder with CSVs (possibly in a bucket) -> sniff the schema -> import (copy or link) the file into the extract folder -> regular further processing
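As a rough sketch of that flow (the helper name and the hard-link-then-copy fallback are assumptions for illustration, not dlt code), the import step could look like this:

```python
import glob
import os
import shutil

# Illustrative sketch, not dlt's implementation: glob local CSVs and hard-link
# (or, failing that, copy) them into an extract folder so the data itself is
# never parsed or rewritten at this stage.
def import_csv_glob(pattern: str, extract_dir: str) -> list[str]:
    os.makedirs(extract_dir, exist_ok=True)
    imported = []
    for path in sorted(glob.glob(pattern)):
        target = os.path.join(extract_dir, os.path.basename(path))
        try:
            os.link(path, target)  # link local files instead of copying
        except OSError:
            shutil.copy2(path, target)  # fall back to a copy (e.g. across filesystems)
        imported.append(target)
    return imported

files = import_csv_glob("data/*.csv", ".dlt/extract/imports")
```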

This is not exactly dlt's main use case but seems like a natural extension of #338

Tasks
The overall implementation idea is to create a resource, similar to filesystem, that handles schema sniffing, copying the files and importing them:

    • support globs like in filesystem; assume one glob -> one table schema
    • support parquet and csv import files
    • copy bucket files into extract storage, link local files into extract storage
    • sniff the csv schema; we already have sniffers in filesystem. Use duckdb, pandas or pyarrow - this may be configurable (see the sniffer sketch after this list)
    • support csv files in extract storage; this means we need to be able to move or rewrite them in normalize
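A minimal sketch of the configurable sniffer mentioned above, assuming a duckdb/pyarrow switch (the function name and the plain dict return shape are illustrative, not the sniffers dlt already ships in filesystem):

```python
import duckdb
import pyarrow.csv as pa_csv

# Hedged sketch of a configurable csv schema sniffer; the "engine" switch and
# the dict return type are assumptions for illustration.
def sniff_csv_schema(path: str, engine: str = "pyarrow") -> dict:
    if engine == "pyarrow":
        # the streaming reader parses only a sample block to infer the schema
        return {f.name: str(f.type) for f in pa_csv.open_csv(path).schema}
    if engine == "duckdb":
        # DESCRIBE over read_csv_auto yields (column_name, column_type, ...) rows
        rows = duckdb.sql(
            f"DESCRIBE SELECT * FROM read_csv_auto('{path}')"
        ).fetchall()
        return {name: dtype for name, dtype, *_ in rows}
    raise ValueError(f"unknown sniffer engine: {engine}")
```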

A challenge: CSVs may come in many different formats. We could use a good sniffer (e.g. duckdb) to detect them, but the problem is making the destinations understand them. Mind that the whole point of this feature is to not rewrite the files. Some ideas:

  • loader_file_format may be either a Literal or a TypedDict with actual settings, so we can carry the separator, escape characters and date formats for csv in it (see the sketch after this list)
  • we could allow people to provide part of the COPY COMMAND SQL for their particular csv
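A minimal sketch of the first idea, with illustrative field names (not a final dlt spec):

```python
from typing import Literal, TypedDict, Union

# Sketch of loader_file_format as "Literal or TypedDict"; the field names
# below are assumptions, not the final dlt configuration schema.
class CsvFormatSpec(TypedDict, total=False):
    file_format: Literal["csv"]
    delimiter: str
    quote_char: str
    escape_char: str
    date_format: str

LoaderFileFormat = Union[Literal["csv", "parquet", "jsonl"], CsvFormatSpec]

# a semicolon-separated csv described without rewriting the file
semicolon_csv: LoaderFileFormat = {
    "file_format": "csv",
    "delimiter": ";",
    "escape_char": "\\",
}
```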

Future work:

  • leave the files on the buckets and create reference jobs instead, so they are copied into the destination directly from the bucket

PR1
Part of this ticket is implemented in #998

    • import any file from within (and outside of) a resource by using dlt.mark.with_file_import (sketched below)
    • any file may be imported and passed to the normalizer, which will send it to the loader
    • it is up to the user to sniff the schema; dlt still loads data into tables even if the schema does not exist (bring your own table)
    • a variety of csv formats is supported in snowflake and postgres via embedded config (the challenge above is solved)
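A hedged usage sketch of the PR #998 mechanism; the exact with_file_import arguments and the file path are assumptions taken from the description above, not verified against the released API:

```python
import dlt

@dlt.resource(name="sales")
def external_csvs():
    # hand the file to the normalizer/loader as-is instead of parsing it;
    # sniffing the schema is left to the user ("bring your own table")
    yield dlt.mark.with_file_import("data/sales_2024.csv", "csv")

pipeline = dlt.pipeline(pipeline_name="file_imports", destination="postgres")
pipeline.run(external_csvs())
```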
@rudolfix rudolfix changed the title allow to import external files into pipeline extract storage and loading them directly allow to import external files into pipeline extract storage and load them directly Jun 16, 2024
@rudolfix rudolfix moved this from Todo to Planned in dlt core library Jun 19, 2024
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Jun 24, 2024
@rudolfix rudolfix self-assigned this Jun 24, 2024
@rudolfix rudolfix moved this from In Progress to Todo in dlt core library Jun 27, 2024