Background
We often deal with large amounts of structured data that we want to load into a destination, e.g. parquet or csv files (let's skip jsonl files for now). Currently dlt will load those into arrow tables (or lists of dicts) during extract, save them as parquet or jsonl, and then move or rewrite them during normalize. For structured data we could easily sniff the schema and just pass the files around, rewriting them only if the destination file format requires it. For example:
glob a folder with csvs (possibly on a bucket) -> sniff the schema -> import (copy or link) the file into the extract folder -> regular further processing
This is not exactly dlt's main use case but it seems a natural extension to #338
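A minimal sketch of that flow for local files, assuming a plain folder glob; the function shape, the extract_folder argument and the link-then-copy fallback are illustrative, not existing dlt API:

```python
import glob
import os
import shutil

def import_csv_files(pattern: str, extract_folder: str):
    # hypothetical helper: glob matching files and bring them into extract storage
    for path in glob.glob(pattern):
        dest = os.path.join(extract_folder, os.path.basename(path))
        try:
            # local files can be hard-linked instead of copied
            os.link(path, dest)
        except OSError:
            # fall back to a real copy (cross-device links, downloaded bucket files)
            shutil.copy2(path, dest)
        # from here on the file is passed around as-is; normalize rewrites it
        # only if the destination file format requires it
        yield dest
```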
Tasks
The overall implementation idea is to create a resource, similar to filesystem, that would handle schema sniffing, copying of the files and importing them:
support globs like in filesystem; assume one glob -> one table schema
support parquet and csv import files
copy bucket files into extract storage, link local files into extract storage
sniff the csv schema. We already have sniffers in filesystem; use duckdb, pandas or pyarrow - this may be configurable (see the sketch after this task list)
support csv files in extract storage. This means we need to be able to move or rewrite them in normalize
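As referenced in the sniffing task above, a minimal sketch of two candidate sniffers, assuming duckdb and pyarrow are installed; neither helper is existing dlt code:

```python
import duckdb
import pyarrow.csv as pa_csv

def sniff_with_pyarrow(path: str):
    # the streaming reader infers the schema from the first block only
    return pa_csv.open_csv(path).schema

def sniff_with_duckdb(path: str):
    # DESCRIBE over read_csv_auto returns (column_name, column_type, ...) rows
    return duckdb.sql(f"DESCRIBE SELECT * FROM read_csv_auto('{path}')").fetchall()
```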
A challenge: csvs may come in many different dialects. We could use a good sniffer (e.g. duckdb) to detect them; the problem is making the destinations understand them. Mind that the whole point of this feature is to not rewrite the files. Some ideas:
loader_file_format may be either a Literal or a TypedDict with the actual settings, so we can carry the separator, escape characters and date formats for csv in it (see the sketch after these ideas)
we could allow people to provide the part of the COPY command SQL that describes their particular csv dialect.
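A hedged sketch of the first idea: keep accepting a plain Literal but also allow a TypedDict carrying the csv dialect. The field names below are illustrative, not an existing dlt type:

```python
from typing import Literal, TypedDict

class CsvFormatSpec(TypedDict, total=False):
    file_format: Literal["csv"]
    delimiter: str        # e.g. "," or "|"
    quote_char: str
    escape_char: str
    date_format: str      # format the destination should use when parsing dates
    include_header: bool

# loader_file_format could then be Union[Literal["csv", "parquet", "jsonl"], CsvFormatSpec],
# and a destination could parametrize its COPY command from these settings
```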
Future work:
leave the files on the buckets and create reference jobs instead, so they are copied into the destination directly from the bucket
PR1
Part of this ticket is implemented in #998
dlt.mark.with_file_import
dlt still loads data to tables even if the schema does not exist (bring your own table)
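A usage sketch based on my reading of #998; the file path and column hints are illustrative, and the exact signature of dlt.mark.with_file_import should be checked against the PR:

```python
import dlt

# columns are declared up front because no rows are inspected ("bring your own table")
@dlt.resource(
    name="orders",
    columns=[{"name": "id", "data_type": "bigint"}, {"name": "amount", "data_type": "double"}],
)
def orders():
    # yield a marker pointing at an already-written csv file instead of yielding rows,
    # so dlt imports the file without rewriting it
    yield dlt.mark.with_file_import("/data/orders_2024_06.csv", "csv")

pipeline = dlt.pipeline("import_files", destination="duckdb")
pipeline.run(orders())
```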