Bettertar #119

Merged
merged 2 commits into from
Jun 7, 2023

Contributor

@mikeknep mikeknep commented Jun 6, 2023

User-facing change (minor)

The output archives for transforms and synthetics (singleton objects that we treat like "buckets" holding all the outputs from multiple run_transforms / generate calls) are now "archives of archives" instead of archives of files organized in subdirectories:

# prev

synthetics_outputs.tar.gz
- t1/
  - synth_customer.csv
  - synth_location.csv
  - reports, etc.
- t2/
  - synth_customer.csv
  - synth_location.csv
  - reports, etc.

# new

synthetics_outputs.tar.gz
- t1.tar.gz
- t2.tar.gz

(Here "t1" and "t2" are run identifiers; they can be user-supplied but default to the current timestamp.)
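With the new layout, reading one run's files takes two steps: open the outer archive, then open the inner archive for that run. The following is a minimal sketch using only the standard-library tarfile module. The helper names list_run_ids and list_run_files are hypothetical and not part of this PR, and the sketch assumes each inner archive is named "<run id>.tar.gz" as in the listing above.

```python
import tarfile
from pathlib import Path

INNER_SUFFIX = ".tar.gz"


def list_run_ids(outer_path: Path) -> list[str]:
    """Return the run identifiers stored in an archive of archives."""
    with tarfile.open(outer_path, "r:gz") as outer:
        return [
            name.removesuffix(INNER_SUFFIX)
            for name in outer.getnames()
            if name.endswith(INNER_SUFFIX)
        ]


def list_run_files(outer_path: Path, run_id: str) -> list[str]:
    """Return the member names inside one run's inner archive."""
    with tarfile.open(outer_path, "r:gz") as outer:
        # extractfile raises KeyError if the run is missing and returns
        # None if the member is not a regular file.
        inner_file = outer.extractfile(f"{run_id}{INNER_SUFFIX}")
        if inner_file is None:
            raise ValueError(f"{run_id}{INNER_SUFFIX} is not a regular file")
        with tarfile.open(fileobj=inner_file, mode="r:gz") as inner:
            return inner.getnames()
```

Only the inner archive for the requested run is opened; the other runs stay packed.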

Performance details

Previously, every archive was built with a single add_to_tar function, which was very slow. For archives that are created once with no nesting (e.g. source data, synthetics training datasets), we now have archive_items. Using the same set of non-trivially sized tables mentioned in #117:

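from pathlib import Path

# `files` holds the paths of the tables mentioned in #117

# old: one add_to_tar call per file
tarpath = Path("./addtotar.tar.gz")
for f in files:
    fpath = Path(f)
    add_to_tar(tarpath, fpath, fpath.name)

# new: one archive_items call for all files
archive_items(Path("./archiveitems.tar.gz"), files)

  • Old: 20m 25.6s 🐢
  • New: 4m 41.5s 🚀 (roughly 4.4x faster)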
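The implementations are not shown in this description, so the following is only a sketch of the single-pass idea, not the actual archive_items. It assumes the gain comes from opening the compressed archive once and writing every file in one pass. For context, Python's tarfile cannot open a compressed archive in append mode, so a helper that adds one file at a time to a .tar.gz presumably has to rewrite what is already there; that is an assumption about add_to_tar. The name archive_items_sketch is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def archive_items_sketch(targz: Path, items: Iterable[Path]) -> None:
    """Write all items into a new gzip-compressed tar in a single pass."""
    # The archive is opened exactly once, however many items there are.
    with tarfile.open(targz, "w:gz") as tar:
        for item in items:
            item = Path(item)
            # Store each file under its bare name, as the loop above does.
            tar.add(item, arcname=item.name)
```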

For the "bucket archives" described above, which hold results from multiple runs (running transforms or generate several times), the new archive_nested_dir scales much better as runs are added:

# contains 6 CSV files with sizes from 39–225 MB
test_dir = "test_dir"

# old
archive_of_subdirs = Path("archive_of_subdirs.tar.gz")
add_to_tar(archive_of_subdirs, test_dir, "t1")
add_to_tar(archive_of_subdirs, test_dir, "t2")
add_to_tar(archive_of_subdirs, test_dir, "t3")
add_to_tar(archive_of_subdirs, test_dir, "t4")


# new
archive_of_archives = Path("archive_of_archives.tar.gz")
archive_nested_dir(archive_of_archives, test_dir, "t1")
archive_nested_dir(archive_of_archives, test_dir, "t2")
archive_nested_dir(archive_of_archives, test_dir, "t3")
archive_nested_dir(archive_of_archives, test_dir, "t4")
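One way to produce an archive of archives is sketched below. This is an illustration under stated assumptions, not the code merged in this PR: it packs the run directory into its own inner .tar.gz, then rebuilds the outer archive by copying the existing inner archives across as opaque members and adding the new one. The function name archive_nested_dir_sketch and the temporary ".tmp" file name are hypothetical.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def archive_nested_dir_sketch(outer: Path, src_dir: Path, run_id: str) -> None:
    """Add src_dir to the outer archive as a single '<run_id>.tar.gz' member."""
    outer = Path(outer)
    rebuilt = outer.with_name(outer.name + ".tmp")  # same directory as outer

    with tempfile.TemporaryDirectory() as tmp:
        # Step 1: pack the run directory into its own compressed archive.
        inner = Path(tmp) / f"{run_id}.tar.gz"
        with tarfile.open(inner, "w:gz") as tar:
            for child in sorted(Path(src_dir).iterdir()):
                tar.add(child, arcname=child.name)

        # Step 2: rebuild the outer archive. Earlier runs are copied as
        # whole members; their contents are never unpacked.
        with tarfile.open(rebuilt, "w:gz") as new_outer:
            if outer.exists():
                with tarfile.open(outer, "r:gz") as old_outer:
                    for member in old_outer:
                        new_outer.addfile(member, old_outer.extractfile(member))
            new_outer.add(inner, arcname=inner.name)

    # Step 3: swap the rebuilt archive into place.
    os.replace(rebuilt, outer)
```

In this sketch the work per call is packing the new run plus copying a handful of existing members, rather than re-archiving every file from every earlier run.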

Individual times for each call, in order (t1 through t4):

  • Old
    • 1m 7s
    • 2m 17s
    • 3m 27s
    • 4m 38s
  • New
    • 1m 12s
    • 1m 18s
    • 1m 24s
    • 1m 30s
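As a quick check on the numbers above, this small sketch does arithmetic on the reported times only (the variable names are illustrative). Each old call takes about 70 seconds longer than the one before it, while each new call takes about 6 seconds longer, and the four calls together take 11m 29s with the old approach versus 5m 24s with the new one.

```python
# Per-call times from the lists above, converted to seconds.
old_seconds = [67, 137, 207, 278]  # 1m 7s, 2m 17s, 3m 27s, 4m 38s
new_seconds = [72, 78, 84, 90]     # 1m 12s, 1m 18s, 1m 24s, 1m 30s


def increments(times: list[int]) -> list[int]:
    """Extra seconds each call takes compared with the previous call."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


old_total = sum(old_seconds)  # 689 seconds = 11m 29s
new_total = sum(new_seconds)  # 324 seconds = 5m 24s

print(increments(old_seconds))  # [70, 70, 71]
print(increments(new_seconds))  # [6, 6, 6]
print(f"{old_total / new_total:.1f}x")  # 2.1x
```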

Contributor

@gracecvking gracecvking left a comment

👍

@mikeknep mikeknep merged commit 738361a into main Jun 7, 2023
@mikeknep mikeknep deleted the bettertar branch June 7, 2023 21:21