
TIMX 401 - expect and support v2 parquet dataset Transmogrifier output #84

Merged: 4 commits into main from TIMX-401-expect-v2-datasets on Feb 3, 2025

Conversation

Contributor

@ghukill ghukill commented Jan 24, 2025

Purpose and background context

High level, this PR allows ABDiff to run diffs when Transmogrifier is generating "v2" output of parquet datasets vs the "v1" output of JSON and TXT files. We considered whether ABDiff should support both v1 and v2 output, but decided the added complexity would outweigh the short (possibly zero) window during which we'd need to run v1 vs v2 output.

The primary update of expecting parquet dataset output from each Transmog container can be found in the first commit.

Another commit exists that allows for building a Transmogrifier image from a local instance of Transmog (instead of always cloning from Github).

It is recommended to review code by filtering to these commits.

Notes on transforming vs collating

Looking at the code changes, there is quite a bit of code removal in this update. A big change is that we no longer need to build an intermediate parquet dataset of transformed records from JSON and TXT files before entering our parquet flow of collating, diffing, etc., because we already have parquet output from Transmogrifier.

The high level flow now:

  1. run_ab_transforms still kicks off multiple Transmogrifier containers
  2. they write to two distinct parquet datasets: a/dataset and b/dataset
  3. then for collate_ab_transforms, we're just reading and joining transformed records from these two a and b datasets (and basically ignoring all other columns); a rough sketch of this join follows the list
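
For illustration, here is a minimal sketch of that collate-time join in DuckDB. The column names (timdex_record_id, transformed_record) and the dataset paths are assumptions for the example, not necessarily the app's real schema:

import duckdb

with duckdb.connect(":memory:") as con:
    con.execute(
        """
        CREATE TABLE collated AS
        SELECT
            timdex_record_id,
            a.transformed_record AS record_a,
            b.transformed_record AS record_b
        FROM read_parquet('output/timx-401/runs/<run_timestamp>/transformed/a/dataset/**/*.parquet', hive_partitioning = true) a
        JOIN read_parquet('output/timx-401/runs/<run_timestamp>/transformed/b/dataset/**/*.parquet', hive_partitioning = true) b
            USING (timdex_record_id)
        """
    )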

The discerning code change reader may notice a few things:

  1. some validation has been removed from run_ab_transforms()
  2. some natural memory management has been removed in collate_ab_transforms(), where the chunking of reads influenced the number of records joined in DuckDB
  3. simplification of docstrings

My overall sense from working on these changes was that the code and documentation felt a little brittle and hard to work with given the scale of change. My thinking is that some regression in memory management and validation is worth it to keep this app functional as the dimensions of Transmogrifier and TIMDEX change. I believe the best test of this app will be constant use. This only works because the app is not deployed in a way where we rely on it unsupervised; it's an internal tool.

How can a reviewer manually see the effects of these changes?

Set AWS Dev1 credentials (admin or TIMDEX) in the .env file.

Additionally, set PRESERVE_ARTIFACTS=true in the .env file to observe the output of each step instead of cleaning it up.
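
For reference, a hypothetical .env might look like the following; the AWS variable names are the standard AWS SDK ones, and whether abdiff reads credentials exactly this way is an assumption:

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_SESSION_TOKEN=...
PRESERVE_ARTIFACTS=true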

Create a new job, where both commits are "v2" dataset friendly:

pipenv run abdiff --verbose init-job -d output/timx-401 \
-a 48979ba -b 419a468

Perform a run:

pipenv run abdiff --verbose run-diff -d output/timx-401 \
-i s3://timdex-extract-dev-222053980223/gismit/gismit-2025-01-23-full-extracted-records-to-index.jsonl,s3://timdex-extract-dev-222053980223/alma/alma-2022-09-12-daily-extracted-records-to-index_01.xml

Here is the rough structure of the /transformed directory under the run (the UUIDs will differ), where you can see the two a and b datasets written to by the Transmogrifier containers; a short sketch of reading one of these datasets back follows the tree. This also leans into a strength of the parquet datasets: there is a single dataset per version (a and b), and all the running Transmog containers can write to them in parallel:

output/timx-401/runs/2025-01-24_20-03-09/transformed
├── a
│   └── dataset
│       ├── year=2022
│       │   └── month=09
│       │       └── day=12
│       │           └── 65911cf0-6124-4434-8ab7-aa7589c4c147-0.parquet
│       └── year=2025
│           └── month=01
│               └── day=23
│                   └── 391a67d1-4b8f-46d5-9226-2961b37ce707-0.parquet
└── b
    └── dataset
        ├── year=2022
        │   └── month=09
        │       └── day=12
        │           └── dfe3c14e-900a-4c01-b54b-b82b532af8b9-0.parquet
        └── year=2025
            └── month=01
                └── day=23
                    └── d0775e6e-b5f0-4161-b15b-8ad8b44bd334-0.parquet
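
For reference, a dataset partitioned this way (year=/month=/day= hive directories) can be read back as a single table, for example with pyarrow; whether abdiff itself uses pyarrow or DuckDB for this step is an assumption:

import pyarrow.dataset as ds

dataset_a = ds.dataset(
    "output/timx-401/runs/2025-01-24_20-03-09/transformed/a/dataset",
    format="parquet",
    partitioning="hive",  # year=/month=/day= directories become columns
)
table_a = dataset_a.to_table()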

View the job:

pipenv run abdiff --verbose view-job -d output/timx-401

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

Transmogrifier, and TIMDEX more generally, are getting updated to use
a parquet dataset as the storage architecture for files during ETL.  Transmogrifier
will now start writing to a parquet dataset instead of outputting JSON and TXT
files.  As such, this application needs the ability to read this new Transmog output to
continue providing A/B diff analysis.

Some spike time was spent looking into supporting both the "v1" output from Transmog (JSON/TXT files)
and the new parquet dataset writing, but it was determined that this would be quite a bit of overhead for
functionality that will likely not be used once we go fully to v2 parquet dataset writing in the
coming weeks.

How this addresses that need:

The application has been updated to expect parquet datasets as the output from
each A/B Transmog version.  The changes are mostly localized to the run_ab_transforms()
and collate_ab_transforms() core functions.

In many ways, this was a simplification of the code.  Previously, a considerable amount of effort
was spent getting the records from v1 JSON/TXT files into a parquet dataset for more efficient
processing; with Transmogrifier writing parquet directly, that intermediate step is no longer needed.

Side effects of this change:
* Output of JSON and TXT files from v1 Transmogrifier is no longer supported

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-401

Why these changes are being introduced:

Formerly, only git commit SHAs from the remote Transmogrifier GitHub repository
could be used when initializing a job.  Sometimes during development on Transmogrifier,
it would be helpful to run this application against local commits (potentially ones
that will never see the GitHub light of day).

How this addresses that need:
* Adds new CLI arguments --location-a and --location-b that define where Transmogrifier
should be cloned from, supporting local paths (see the example below)
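
For example, a job could hypothetically be initialized from a local Transmogrifier checkout like this; the local path, and passing it alongside the commit SHAs, are illustrative assumptions rather than confirmed CLI behavior:

pipenv run abdiff --verbose init-job -d output/timx-401 \
-a 48979ba -b 419a468 \
--location-a /path/to/local/transmogrifier \
--location-b /path/to/local/transmogrifier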

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-401
@ghukill ghukill marked this pull request as ready for review January 24, 2025 20:14
@ghukill ghukill force-pushed the TIMX-401-expect-v2-datasets branch from 7fd87ef to d0eeaf8 January 24, 2025 20:25
@coveralls

coveralls commented Jan 24, 2025

Pull Request Test Coverage Report for Build 13115481845

Details

  • 30 of 31 (96.77%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 86.031%

Changes Missing Coverage:
  abdiff/core/run_ab_transforms.py: 9 of 10 changed/added lines covered (90.0%)

Files with Coverage Reduction:
  abdiff/core/run_ab_transforms.py: 1 new missed line (97.94%)

Totals:
  Change from base Build 11898377358: -0.3%
  Covered Lines: 776
  Relevant Lines: 902

💛 - Coveralls

Contributor

@jonavellecuerdo jonavellecuerdo left a comment


Looking good! Just some minor change requests; clarify some docstrings + add FEATURE FLAG comment. 🤔

try:
    run_id = read_run_json(run_directory)["run_timestamp"]
except FileNotFoundError:
    logger.warning("Run JSON not found, probably testing, minting unique run UUID.")
Contributor

Can you clarify what is meant by "probably testing"? 🤔

Contributor Author

This is actually a good example of the non-pure functional nature of these functions. This can't be run unless a Run JSON object exists, as now we need a run_id to pass to Transmogrifier.

Most of the time, we will have a Run JSON file of course, as this function will be run as part of a Run. But to test its pure functional inputs and outputs, that's technically not required.

We could pass the run_id around in the functions... but I'm not sure what we'd achieve by doing so. The Job and Run JSON objects -- I realized through this work -- kind of stand in for an object's "state", thus complicating the functional nature of these functions.

In summary, this felt like the least bad option for the moment: acknowledge via logging that it's unusual for a Run JSON file to not exist, but allow it to continue as all we needed was a minted run_id.
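
For context, the fallback being described is roughly the following; read_run_json and logger come from the surrounding module, and str(uuid.uuid4()) is an assumption of what "minting unique run UUID" means here:

import uuid

try:
    run_id = read_run_json(run_directory)["run_timestamp"]
except FileNotFoundError:
    logger.warning("Run JSON not found, probably testing, minting unique run UUID.")
    run_id = str(uuid.uuid4())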

abdiff/core/run_ab_transforms.py (thread outdated; resolved)
abdiff/core/collate_ab_transforms.py (thread outdated; resolved)

with duckdb.connect(":memory:") as con:
    for i, transformed_file in enumerate(transformed_file_names):
Contributor

Nice ✨

Contributor Author

Well.... kind of! If you recall, this batching by transformed file was a strangely elegant way to manage memory for very large runs.

I think it's quite possible that we may encounter memory issues again, now that we're asking DuckDB to perform joins on effectively all records from all input files that make up the run. Counterpoint: the parquet datasets are structured differently, and perhaps this will be more efficient now. As mentioned in the PR and elsewhere, my vote would be to move forward with the simplifications in code, and handle performance or memory issues if and when they arise.

@jonavellecuerdo jonavellecuerdo self-requested a review February 3, 2025 14:49
@ghukill ghukill removed the request for review from ehanson8 February 3, 2025 14:50
@ghukill ghukill merged commit b1a4438 into main Feb 3, 2025
2 checks passed