Bettertar #119

mikeknep · 2023-06-06T17:19:02Z

User-facing change (minor)

The output archives for transforms and synthetics (singleton objects we treat as "buckets" holding the outputs of multiple run_transforms / generate calls) are now "archives of archives" rather than archives of files organized in subdirectories:

# prev

synthetics_outputs.tar.gz
- t1/
  - synth_customer.csv
  - synth_location.csv
  - reports, etc.
- t2/
  - synth_customer.csv
  - synth_location.csv
  - reports, etc.

# new

synthetics_outputs.tar.gz
- t1.tar.gz
- t2.tar.gz

(where "t1" and "t2" are run identifiers; these can be user-supplied but default to the current timestamp)
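
For anyone consuming these archives directly, the practical effect is one extra level of unpacking. Below is a minimal sketch of listing the files for a single run under the new layout, using only the standard library. The helper name `list_run_files` is hypothetical and not part of this PR; it assumes each inner member is named `<run_id>.tar.gz`, as in the listing above.

```python
import tarfile
from pathlib import Path


def list_run_files(outer: Path, run_id: str) -> list[str]:
    """Return the member names stored inside `<run_id>.tar.gz` within `outer`.

    Hypothetical helper for illustration only; not part of this PR.
    """
    with tarfile.open(outer, "r:gz") as outer_tar:
        # extractfile raises KeyError if the run's archive is not present.
        inner_fileobj = outer_tar.extractfile(f"{run_id}.tar.gz")
        if inner_fileobj is None:
            raise ValueError(f"{run_id}.tar.gz is not a regular file")
        # The inner archive is read straight from the outer one,
        # without extracting it to disk first.
        with tarfile.open(fileobj=inner_fileobj, mode="r:gz") as inner_tar:
            return inner_tar.getnames()
```

Calling `list_run_files(Path("synthetics_outputs.tar.gz"), "t1")` would then return the file names that previously lived under `t1/`.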

Performance details

Previously, every archive was built with a single add_to_tar function, called once per item, which was woefully inefficient. For archives that are created once with no nesting (e.g. source data, synthetics training datasets) we now have archive_items, which takes the full list of items in one call. The timings below use the same set of non-trivial-sized tables mentioned in #117:

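from pathlib import Path

# `files` is the list of table file paths (the tables from #117)

# old: one add_to_tar call per file
tarpath = Path("./addtotar.tar.gz")
for f in files:
    fpath = Path(f)
    add_to_tar(tarpath, fpath, fpath.name)

# new: a single call for all files
archive_items(Path("./archiveitems.tar.gz"), files)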

Old: 20m 25.6s 🐢
New: 4m 41.5s 🚀
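
The implementations are not shown in this description, so the following is only a sketch of the single-pass idea, not the actual archive_items code. It assumes the items are regular files stored under their base names. One relevant fact about the standard library: `tarfile` cannot append to a gzip-compressed archive (mode `"a:gz"` is not supported), so any function that adds one item at a time to a `.tar.gz` has to do extra work per call. The name `archive_items_sketch` is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def archive_items_sketch(tarpath: Path, items: Iterable[Path]) -> None:
    """Write all `items` into a new gzip-compressed tar in one pass.

    Illustrative only; the real archive_items may differ.
    """
    # The archive is opened once and every item is streamed into it,
    # so no existing archive content is ever re-read or rewritten.
    with tarfile.open(tarpath, "w:gz") as tar:
        for item in items:
            item = Path(item)
            tar.add(item, arcname=item.name)
```

The design point is that the cost is proportional to the total size of the inputs, independent of how many items there are.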

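For the "bucket archives" described above, which accumulate results from multiple runs (run transforms multiple times, generate multiple times), the new archive_nested_dir scales much better as runs are added. In the timings below, each call to the old approach takes roughly 70 seconds longer than the previous one, while each call to the new approach takes only about 6 seconds longer: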

# contains 6 CSV files with sizes from 39–225 MB
test_dir = "test_dir"

# old
archive_of_subdirs = Path("archive_of_subdirs.tar.gz")
add_to_tar(archive_of_subdirs, test_dir, "t1")
add_to_tar(archive_of_subdirs, test_dir, "t2")
add_to_tar(archive_of_subdirs, test_dir, "t3")
add_to_tar(archive_of_subdirs, test_dir, "t4")


# new
archive_of_archives = Path("archive_of_archives.tar.gz")
archive_nested_dir(archive_of_archives, test_dir, "t1")
archive_nested_dir(archive_of_archives, test_dir, "t2")
archive_nested_dir(archive_of_archives, test_dir, "t3")
archive_nested_dir(archive_of_archives, test_dir, "t4")

Individual times for each operation:

Old
- 1m 7s
- 2m 17s
- 3m 27s
- 4m 38s
New
- 1m 12s
- 1m 18s
- 1m 24s
- 1m 30s
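
To make the "archive of archives" idea concrete, here is a rough sketch of how a directory could be added to a bucket archive as a single nested member. This is an assumption-laden illustration, not the PR's archive_nested_dir: the name `archive_nested_dir_sketch` is hypothetical, and because gzip-compressed tars cannot be appended to in place with `tarfile`, the sketch rebuilds the outer archive by copying its existing members across before adding the new one.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def archive_nested_dir_sketch(outer: Path, directory: Path, name: str) -> None:
    """Add `directory` to `outer` as one member named `<name>.tar.gz`.

    Illustrative only; the real archive_nested_dir may work differently.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmpdir = Path(tmp)

        # 1. Compress the run's directory into its own inner archive.
        inner = tmpdir / f"{name}.tar.gz"
        with tarfile.open(inner, "w:gz") as tar:
            for child in sorted(Path(directory).iterdir()):
                tar.add(child, arcname=child.name)

        # 2. Rebuild the outer archive: copy existing members over as
        #    opaque blobs, then add the new inner archive.
        rebuilt = tmpdir / "outer.tar.gz"
        with tarfile.open(rebuilt, "w:gz") as new_outer:
            if outer.exists():
                with tarfile.open(outer, "r:gz") as old_outer:
                    for member in old_outer:
                        new_outer.addfile(member, old_outer.extractfile(member))
            new_outer.add(inner, arcname=inner.name)

        # 3. Replace the old outer archive with the rebuilt one.
        shutil.move(str(rebuilt), str(outer))
```

In this sketch, earlier runs are carried over as already-compressed inner archives rather than as individual CSV files, which is one plausible reason a layout like this would get cheaper to extend.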

gracecvking

👍

mikeknep requested review from pimlock, tylersbray and gracecvking June 6, 2023 17:19

mikeknep added 2 commits June 6, 2023 14:35

More efficient archiving

68c0490

Update restore per changes to archiving implementation

c1c8a02

mikeknep force-pushed the bettertar branch from 7cb3a6a to c1c8a02 Compare June 6, 2023 19:36

gracecvking approved these changes Jun 7, 2023

pimlock approved these changes Jun 7, 2023

mikeknep merged commit 738361a into main Jun 7, 2023

mikeknep deleted the bettertar branch June 7, 2023 21:21
