
allow to import external files into pipeline extract storage and load them directly #1471

Open
rudolfix opened this issue Jun 16, 2024 · 0 comments
rudolfix commented Jun 16, 2024

Background
We often deal with large amounts of structured data that we want to load into a destination, e.g. parquet or CSV files (let's skip jsonl files for now). Currently dlt loads those into arrow tables (or lists of dicts) during extract, saves them as parquet or jsonl, and then moves or rewrites them during normalize. For structured data we could easily sniff the schema and just pass the files around, rewriting them only if the destination file format requires it. For example:

glob a folder with CSVs (possibly in a bucket) -> sniff the schema -> import (copy or link) the file into the extract folder -> regular further processing
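As a rough sketch of that flow (the helper name and the hard-link-then-copy fallback are assumptions for illustration, not dlt code), the import step could look like this:

```python
import glob
import os
import shutil

# Illustrative sketch, not dlt's implementation: glob local CSVs and hard-link
# (or, failing that, copy) them into an extract folder so the data itself is
# never parsed or rewritten at this stage.
def import_csv_glob(pattern: str, extract_dir: str) -> list[str]:
    os.makedirs(extract_dir, exist_ok=True)
    imported = []
    for path in sorted(glob.glob(pattern)):
        target = os.path.join(extract_dir, os.path.basename(path))
        try:
            os.link(path, target)  # link local files instead of copying
        except OSError:
            shutil.copy2(path, target)  # fall back to a copy (e.g. across filesystems)
        imported.append(target)
    return imported

files = import_csv_glob("data/*.csv", ".dlt/extract/imports")
```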

This is not exactly dlt's main use case but seems like a natural extension of #338

Tasks
The overall implementation idea is to create a resource, similar to filesystem, that handles schema sniffing, copying the files and importing them:

    • support globs like in filesystem; assume one glob -> one table schema
    • support parquet and csv import files
    • copy bucket files into extract storage, link local files into extract storage
    • sniff the csv schema; we already have sniffers in filesystem. Use duckdb, pandas or pyarrow - this may be configurable (see the sniffer sketch after this list)
    • support csv files in extract storage; this means we need to be able to move or rewrite them in normalize
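A minimal sketch of the configurable sniffer mentioned above, assuming a duckdb/pyarrow switch (the function name and the plain dict return shape are illustrative, not the sniffers dlt already ships in filesystem):

```python
import duckdb
import pyarrow.csv as pa_csv

# Hedged sketch of a configurable csv schema sniffer; the "engine" switch and
# the dict return type are assumptions for illustration.
def sniff_csv_schema(path: str, engine: str = "pyarrow") -> dict:
    if engine == "pyarrow":
        # the streaming reader parses only a sample block to infer the schema
        return {f.name: str(f.type) for f in pa_csv.open_csv(path).schema}
    if engine == "duckdb":
        # DESCRIBE over read_csv_auto yields (column_name, column_type, ...) rows
        rows = duckdb.sql(
            f"DESCRIBE SELECT * FROM read_csv_auto('{path}')"
        ).fetchall()
        return {name: dtype for name, dtype, *_ in rows}
    raise ValueError(f"unknown sniffer engine: {engine}")
```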

A challenge: CSVs may come in many different formats. We could use a good sniffer (e.g. duckdb) to detect them, but the problem is making the destinations understand them. Mind that the whole point of this feature is to not rewrite the files. Some ideas:

  • loader_file_format may be either a Literal or a TypedDict with actual settings, so we can carry the separator, escape characters and date formats for csv in it (see the sketch after this list)
  • we could allow people to provide part of the COPY COMMAND SQL for their particular csv
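A minimal sketch of the first idea, with illustrative field names (not a final dlt spec):

```python
from typing import Literal, TypedDict, Union

# Sketch of loader_file_format as "Literal or TypedDict"; the field names
# below are assumptions, not the final dlt configuration schema.
class CsvFormatSpec(TypedDict, total=False):
    file_format: Literal["csv"]
    delimiter: str
    quote_char: str
    escape_char: str
    date_format: str

LoaderFileFormat = Union[Literal["csv", "parquet", "jsonl"], CsvFormatSpec]

# a semicolon-separated csv described without rewriting the file
semicolon_csv: LoaderFileFormat = {
    "file_format": "csv",
    "delimiter": ";",
    "escape_char": "\\",
}
```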

Future work:

  • leave the files on the buckets and create reference jobs instead, so they are copied into the destination directly from the bucket

PR1
Part of this ticket is implemented in #998

    • import any file from within (and outside of) a resource by using dlt.mark.with_file_import (sketched below)
    • any file may be imported and passed to the normalizer, which will send it to the loader
    • it is up to the user to sniff the schema; dlt still loads data into tables even if the schema does not exist (bring your own table)
    • a variety of csv formats is supported in snowflake and postgres via embedded config (the challenge above is solved)
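A hedged usage sketch of the PR #998 mechanism; the exact with_file_import arguments and the file path are assumptions taken from the description above, not verified against the released API:

```python
import dlt

@dlt.resource(name="sales")
def external_csvs():
    # hand the file to the normalizer/loader as-is instead of parsing it;
    # sniffing the schema is left to the user ("bring your own table")
    yield dlt.mark.with_file_import("data/sales_2024.csv", "csv")

pipeline = dlt.pipeline(pipeline_name="file_imports", destination="postgres")
pipeline.run(external_csvs())
```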
@rudolfix rudolfix changed the title allow to import external files into pipeline extract storage and loading them directly allow to import external files into pipeline extract storage and load them directly Jun 16, 2024
@rudolfix rudolfix moved this from Todo to Planned in dlt core library Jun 19, 2024
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Jun 24, 2024
@rudolfix rudolfix self-assigned this Jun 24, 2024
@rudolfix rudolfix moved this from In Progress to Todo in dlt core library Jun 27, 2024