
TIMX 401 - expect and support v2 parquet dataset Transmogrifier output #84

Merged: 4 commits into main from TIMX-401-expect-v2-datasets on Feb 3, 2025

Conversation

Contributor

@ghukill ghukill commented Jan 24, 2025

Purpose and background context

High level, this PR allows ABDiff to run diffs when Transmogrifier is generating "v2" output of parquet datasets vs the "v1" output of JSON and TXT files. We considered whether ABDiff should support both v1 and v2 output, but decided the added complexity would outweigh the short (possibly zero) window during which we'd need to run v1 vs v2 output.

The primary update of expecting parquet dataset output from each Transmog container can be found in the first commit.

Another commit exists that allows for building a Transmogrifier image from a local instance of Transmog (instead of always cloning from Github).

It is recommended to review code by filtering to these commits.

Notes on transforming vs collating

Looking at the code changes, there is quite a bit of code removal in this update. A big change is that we no longer need to build an intermediate parquet dataset of transformed records from JSON and TXT files before entering our parquet flow of collating, diffing, etc., because we already have parquet output from Transmogrifier.

The high level flow now:

  1. run_ab_transforms still kicks off multiple Transmogrifier containers
  2. they write to two distinct parquet datasets: a/dataset and b/dataset
  3. then for collate_ab_transforms, we're just reading and joining transformed records from these two a and b datasets (and basically ignoring all other columns); a rough sketch of this join follows the list
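
For illustration, here is a minimal sketch of that collate-time join in DuckDB. The column names (timdex_record_id, transformed_record) and the dataset paths are assumptions for the example, not necessarily the app's real schema:

import duckdb

with duckdb.connect(":memory:") as con:
    con.execute(
        """
        CREATE TABLE collated AS
        SELECT
            timdex_record_id,
            a.transformed_record AS record_a,
            b.transformed_record AS record_b
        FROM read_parquet('output/timx-401/runs/<run_timestamp>/transformed/a/dataset/**/*.parquet', hive_partitioning = true) a
        JOIN read_parquet('output/timx-401/runs/<run_timestamp>/transformed/b/dataset/**/*.parquet', hive_partitioning = true) b
            USING (timdex_record_id)
        """
    )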

The discerning code change reader may notice a few things:

  1. some validation has been removed from run_ab_transforms()
  2. some natural memory management has been removed in collate_ab_transforms(), where the chunking of reads influenced the number of records joined in DuckDB
  3. simplification of docstrings

My overall sense from working on these changes was that the code and documentation felt a little brittle and hard to work with given the scale of change. My thinking is that some regression in memory management and validation is worth it to keep this app functional as the dimensions of Transmogrifier and TIMDEX change. I believe the best test of this app will be constant use. This only works because the app is not deployed in a way where we rely on it unsupervised; it's an internal tool.

How can a reviewer manually see the effects of these changes?

Set AWS Dev1 credentials (admin or TIMDEX) in the .env file.

Additionally, set PRESERVE_ARTIFACTS=true in the .env file to observe the output of each step instead of cleaning it up.
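
For reference, a hypothetical .env might look like the following; the AWS variable names are the standard AWS SDK ones, and whether abdiff reads credentials exactly this way is an assumption:

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_SESSION_TOKEN=...
PRESERVE_ARTIFACTS=true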

Create a new job, where both commits are "v2" dataset friendly:

pipenv run abdiff --verbose init-job -d output/timx-401 \
-a 48979ba -b 419a468

Perform a run:

pipenv run abdiff --verbose run-diff -d output/timx-401 \
-i s3://timdex-extract-dev-222053980223/gismit/gismit-2025-01-23-full-extracted-records-to-index.jsonl,s3://timdex-extract-dev-222053980223/alma/alma-2022-09-12-daily-extracted-records-to-index_01.xml

Here is the rough structure of the /transformed directory under the run (the UUIDs will differ), where you can see the two a and b datasets written to by the Transmogrifier containers; a short sketch of reading one of these datasets back follows the tree. This also leans into a strength of the parquet datasets: there is a single dataset per version (a and b), and all the running Transmog containers can write to them in parallel:

output/timx-401/runs/2025-01-24_20-03-09/transformed
├── a
│   └── dataset
│       ├── year=2022
│       │   └── month=09
│       │       └── day=12
│       │           └── 65911cf0-6124-4434-8ab7-aa7589c4c147-0.parquet
│       └── year=2025
│           └── month=01
│               └── day=23
│                   └── 391a67d1-4b8f-46d5-9226-2961b37ce707-0.parquet
└── b
    └── dataset
        ├── year=2022
        │   └── month=09
        │       └── day=12
        │           └── dfe3c14e-900a-4c01-b54b-b82b532af8b9-0.parquet
        └── year=2025
            └── month=01
                └── day=23
                    └── d0775e6e-b5f0-4161-b15b-8ad8b44bd334-0.parquet
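
For reference, a dataset partitioned this way (year=/month=/day= hive directories) can be read back as a single table, for example with pyarrow; whether abdiff itself uses pyarrow or DuckDB for this step is an assumption:

import pyarrow.dataset as ds

dataset_a = ds.dataset(
    "output/timx-401/runs/2025-01-24_20-03-09/transformed/a/dataset",
    format="parquet",
    partitioning="hive",  # year=/month=/day= directories become columns
)
table_a = dataset_a.to_table()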

View the job:

pipenv run abdiff --verbose view-job -d output/timx-401

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

Transmogrifier, and TIMDEX more generally, are getting updated to use
a parquet dataset as the storage architecture for files during ETL.  Transmogrifier
will now start writing to a parquet dataset instead of outputting JSON and TXT
files.  As such, this application needs the ability to read this new Transmog output to
continue providing A/B diff analysis.

Some spike time was spent looking into supporting both the "v1" output from Transmog (JSON/TXT files)
and the new parquet dataset writing, but it was determined that this would be quite a bit of overhead for
functionality that will likely not be used once we go fully to v2 parquet dataset writing in the
coming weeks.

How this addresses that need:

The application has been updated to expect parquet datasets as the output from
each A/B Transmog version.  The changes are mostly localized to the run_ab_transforms()
and collate_ab_transforms() core functions.

In many ways, this was a simplification of the code.  Previously, a considerable amount of effort
was spent getting the records from v1 JSON/TXT files into a parquet dataset for more efficient
processing; with Transmogrifier writing parquet directly, that intermediate step is no longer needed.

Side effects of this change:
* Output of JSON and TXT files from v1 Transmogrifier is no longer supported

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-401

Why these changes are being introduced:

Formerly, only git commit SHAs from the remote Transmogrifier GitHub repository
could be used when initializing a job.  Sometimes during development on Transmogrifier,
it would be helpful to run this application against local commits (potentially ones
that will never see the GitHub light of day).

How this addresses that need:
* Adds new CLI arguments --location-a and --location-b that define where Transmogrifier
should be cloned from, supporting local paths (see the example below)
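
For example, a job could hypothetically be initialized from a local Transmogrifier checkout like this; the local path, and passing it alongside the commit SHAs, are illustrative assumptions rather than confirmed CLI behavior:

pipenv run abdiff --verbose init-job -d output/timx-401 \
-a 48979ba -b 419a468 \
--location-a /path/to/local/transmogrifier \
--location-b /path/to/local/transmogrifier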

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-401
@ghukill ghukill marked this pull request as ready for review January 24, 2025 20:14
@ghukill ghukill force-pushed the TIMX-401-expect-v2-datasets branch from 7fd87ef to d0eeaf8 January 24, 2025 20:25
@coveralls

coveralls commented Jan 24, 2025

Pull Request Test Coverage Report for Build 13115481845

Details

  • 30 of 31 (96.77%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 86.031%

Changes Missing Coverage:
  abdiff/core/run_ab_transforms.py: 9 of 10 changed/added lines covered (90.0%)

Files with Coverage Reduction:
  abdiff/core/run_ab_transforms.py: 1 new missed line (97.94%)

Totals:
  Change from base Build 11898377358: -0.3%
  Covered Lines: 776
  Relevant Lines: 902

💛 - Coveralls

Contributor

@jonavellecuerdo jonavellecuerdo left a comment


Looking good! Just some minor change requests; clarify some docstrings + add FEATURE FLAG comment. 🤔

try:
    run_id = read_run_json(run_directory)["run_timestamp"]
except FileNotFoundError:
    logger.warning("Run JSON not found, probably testing, minting unique run UUID.")
Contributor

Can you clarify what is meant by "probably testing"? 🤔

Contributor Author

This is actually a good example of the non-pure functional nature of these functions. This can't be run unless a Run JSON object exists, as now we need a run_id to pass to Transmogrifier.

Most of the time, we will have a Run JSON file of course, as this function will be run as part of a Run. But to test its pure functional inputs and outputs, that's technically not required.

We could pass the run_id around in the functions... but I'm not sure what we'd achieve by doing so. The Job and Run JSON objects -- I realized through this work -- kind of stand in for an object's "state", thus complicating the functional nature of these functions.

In summary, this felt like the least bad option for the moment: acknowledge via logging that it's unusual for a Run JSON file to not exist, but allow it to continue as all we needed was a minted run_id.
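
For context, the fallback being described is roughly the following; read_run_json and logger come from the surrounding module, and str(uuid.uuid4()) is an assumption of what "minting unique run UUID" means here:

import uuid

try:
    run_id = read_run_json(run_directory)["run_timestamp"]
except FileNotFoundError:
    logger.warning("Run JSON not found, probably testing, minting unique run UUID.")
    run_id = str(uuid.uuid4())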

abdiff/core/run_ab_transforms.py (thread outdated; resolved)
abdiff/core/collate_ab_transforms.py (thread outdated; resolved)

with duckdb.connect(":memory:") as con:
    for i, transformed_file in enumerate(transformed_file_names):
Contributor

Nice ✨

Contributor Author

Well.... kind of! If you recall, this batching by transformed file was a strangely elegant way to manage memory for very large runs.

I think it's quite possible that we may encounter memory issues again, now that we're asking DuckDB to perform joins on effectively all records from all input files that make up the run. Counterpoint: the parquet datasets are structured differently, and perhaps this will be more efficient now. As mentioned in the PR and elsewhere, my vote would be to move forward with the simplifications in code, and handle performance or memory issues if and when they arise.

@jonavellecuerdo jonavellecuerdo self-requested a review February 3, 2025 14:49
@ghukill ghukill removed the request for review from ehanson8 February 3, 2025 14:50
@ghukill ghukill merged commit b1a4438 into main Feb 3, 2025
2 checks passed