From 80ca47417fe694c5537cedb6eb594bcd3ce2491d Mon Sep 17 00:00:00 2001
From: HulmaNaseer <42720638+HulmaNaseer@users.noreply.github.com>
Date: Fri, 13 Dec 2024 11:47:40 +0100
Subject: [PATCH] explicitly adding docs for destination item size control (#2118)

* explicitly adding docs for destination item size control

* alena's feedback

* revised for explicit note

* Update docs/website/docs/reference/performance.md

---------

Co-authored-by: hulmanaseer00 <163604758+hulmanaseer00@users.noreply.github.com>
Co-authored-by: Alena Astrakhantseva
---
 docs/website/docs/reference/performance.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/website/docs/reference/performance.md b/docs/website/docs/reference/performance.md
index ab171ac069..1e58080200 100644
--- a/docs/website/docs/reference/performance.md
+++ b/docs/website/docs/reference/performance.md
@@ -48,9 +48,7 @@ Some file formats (e.g., Parquet) do not support schema changes when writing a s
 Below, we set files to rotate after 100,000 items written or when the filesize exceeds 1MiB.
-
-
-
+
 ### Disabling and enabling file compression
 Several [text file formats](../dlt-ecosystem/file-formats/) have `gzip` compression enabled by default. If you wish that your load packages have uncompressed files (e.g., to debug the content easily), change `data_writer.disable_compression` in config.toml. The entry below will disable the compression of the files processed in the `normalize` stage.
@@ -148,7 +146,10 @@ As before, **if you have just a single table with millions of records, you shoul
-Since the normalize stage uses a process pool to create load packages concurrently, adjusting the `file_max_items` and `file_max_bytes` settings can significantly impact load behavior. By setting a lower value for `file_max_items`, you reduce the size of each data chunk sent to the destination database, which can be particularly useful for managing memory constraints on the database server. Without explicit configuration of `file_max_items`, `dlt` writes all data rows into one large intermediary file, attempting to insert all data from this single file. Configuring `file_max_items` ensures data is inserted in manageable chunks, enhancing performance and preventing potential memory issues.
+The **normalize** stage in `dlt` uses a process pool to create load packages concurrently, and the settings for `file_max_items` and `file_max_bytes` play a crucial role in determining the size of data chunks. Lower values for these settings reduce the size of each chunk sent to the destination database, which is particularly helpful for managing memory constraints on the database server. By default, `dlt` writes all data rows into one large intermediary file, attempting to load all data at once. Configuring these settings enables file rotation, splitting the data into smaller, more manageable chunks. This not only improves performance but also minimizes memory-related issues when working with large tables containing millions of records.
+
+#### Controlling destination items size
+The intermediary files generated during the **normalize** stage are also used in the **load** stage. Therefore, adjusting `file_max_items` and `file_max_bytes` in the **normalize** stage directly impacts the size and number of data chunks sent to the destination, influencing loading behavior and performance.
 ### Parallel pipeline config example
 The example below simulates the loading of a large database table with 1,000,000 records. The **config.toml** below sets the parallelization as follows: