Toggle to Prevent Deletion of Files Loaded into the Data Warehouse #1603

Closed
neuromantik33 opened this issue Jul 17, 2024 · 0 comments · Fixed by #1717
Labels: community (this issue came from the Slack community workspace), support (this issue is monitored by a Solution Engineer)

@neuromantik33 (Contributor)

Feature description

We request a feature toggle in the dlt (data load tool) library that lets users prevent the deletion of files once they have been loaded into the final data warehouse. Keeping these files would provide a log of everything that was loaded and would make testing easier.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

For most of our data ingestion requirements, our pipeline design consists of the following stages:

  1. Extract (dlt): Extract data from third-party sources and store it in blob storage in various unstructured formats (JSON, Parquet, CSV).
  2. Load (dlt): Load the extracted data from blob storage into the input schema of our data warehouse. Move unstructured data to an archive path to prevent vendor lock-in and support full loads.
  3. Transform (dbt): Apply tests and business transformations to the tables created by dlt using dbt.

Currently, the staging storage that dlt uses to facilitate loading is not a true log or archive: its contents are deleted after the load operation. I would like to propose a feature toggle that lets users disable the deletion of files that have been loaded into the final data warehouse.

Proposed solution

Any class implementing

class SupportsStagingDestination:
    """Adds capability to support a staging destination for the load"""

    def should_load_data_to_staging_dataset_on_staging_destination(
        self, table: TTableSchema
    ) -> bool:
        return False

    def should_truncate_table_before_load_on_staging_destination(self, table: TTableSchema) -> bool:
        # the default is to truncate the tables on the staging destination...
        return True

should have an option to override the default behavior of should_truncate_table_before_load_on_staging_destination (the configuration key is yet to be determined). The toggle should of course default to False, which keeps the current deletion behavior, so that existing clients do not break.
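
To make the proposal concrete, here is a minimal sketch of how such a toggle could be wired in. It is an illustration under stated assumptions, not dlt's implementation: `StagingConfig`, `keep_staged_files`, and `ExampleClient` are hypothetical names (the real key is yet to be determined), `TTableSchema` is replaced by a plain dict alias, and the interface is copied locally so the snippet runs without dlt installed.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Simplified stand-in for dlt's TTableSchema, used only to keep the sketch self-contained.
TTableSchema = Dict[str, Any]


class SupportsStagingDestination:
    """Local copy of the interface quoted above."""

    def should_load_data_to_staging_dataset_on_staging_destination(
        self, table: TTableSchema
    ) -> bool:
        return False

    def should_truncate_table_before_load_on_staging_destination(
        self, table: TTableSchema
    ) -> bool:
        # current default: truncate the tables on the staging destination
        return True


@dataclass
class StagingConfig:
    # Hypothetical toggle name. False keeps today's behavior (staged files are removed).
    keep_staged_files: bool = False


class ExampleClient(SupportsStagingDestination):
    """Hypothetical destination client that consults the toggle."""

    def __init__(self, config: StagingConfig) -> None:
        self.config = config

    def should_truncate_table_before_load_on_staging_destination(
        self, table: TTableSchema
    ) -> bool:
        # Truncate unless the user explicitly asked to keep the staged files.
        return not self.config.keep_staged_files
```

With this shape, a client built from `StagingConfig()` behaves exactly as before, while `StagingConfig(keep_staged_files=True)` leaves the staged files in place so they can serve as an archive.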

Related issues

No response

@VioletM assigned and unassigned VioletM on Jul 17, 2024
@VioletM added the community label on Jul 17, 2024
@VioletM added the support label on Jul 26, 2024
@rudolfix moved this from Todo to Planned in dlt core library on Jul 29, 2024
@rudolfix moved this from Planned to In Progress in dlt core library on Aug 5, 2024
@VioletM linked a pull request on Aug 26, 2024 that will close this issue
@github-project-automation (bot) moved this from In Progress to Done in dlt core library on Aug 28, 2024