TIMX 401 - expect and support v2 parquet dataset Transmogrifier output #84
Conversation
Why these changes are being introduced:

Transmogrifier, and TIMDEX more generally, are being updated to use a parquet dataset as the storage architecture for files during ETL. Transmogrifier will now start writing to a parquet dataset instead of outputting JSON and TXT files. As such, this application needs the ability to read this new Transmog output to continue providing A/B diff analysis. Some spike time was spent looking into supporting both "v1" output from Transmog (JSON/TXT files) and the new parquet dataset writing, but it was determined that would be quite a bit of overhead for functionality that will likely not be used once we move fully to v2 parquet dataset writing in the coming weeks.

How this addresses that need:

The application has been updated to expect parquet datasets as the output from each A/B Transmog version. The changes are mostly localized to the run_ab_transforms() and collate_ab_transforms() core functions. In many ways, this was a simplification of code: previously, a considerable amount of effort was spent getting the records from v1 JSON/TXT files into a parquet dataset for more efficient processing.

Side effects of this change:
* Output of JSON and TXT files from v1 Transmogrifier is no longer supported

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-401
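For orientation, here is a minimal sketch of what reading v2 Transmogrifier output could look like; the dataset path and column names are assumptions for illustration, not the actual Transmogrifier schema.

```python
# Hypothetical sketch: read transformed records from a v2 parquet dataset.
# The dataset path and column names are assumptions for illustration only.
import pyarrow.dataset as ds

dataset = ds.dataset("output/transformed/a/dataset", format="parquet")

# Load only the columns needed for diffing, rather than whole JSON/TXT files.
table = dataset.to_table(columns=["timdex_record_id", "transformed_record"])
print(f"{table.num_rows} transformed records available for A/B diffing")
```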
Why these changes are being introduced:

Formerly, only git commit SHAs from the remote Transmogrifier GitHub repository could be used when initializing a job. Sometimes during development on Transmogrifier, it would be helpful to be able to run this application on local commits (potentially ones that will never see the GitHub light of day).

How this addresses that need:
* Adds new CLI arguments --location-a and --location-b that define where Transmogrifier should be cloned from, supporting local paths

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-401
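A rough sketch of how a location value could be resolved is below; the helper name, default repository URL, and logic are assumptions for illustration, not the application's actual implementation.

```python
# Hypothetical sketch of resolving a --location-a / --location-b value to a
# clone source. Names, the default repo URL, and the logic are assumptions.
from pathlib import Path

DEFAULT_TRANSMOGRIFIER_REPO = "https://github.com/MITLibraries/transmogrifier.git"

def resolve_transmogrifier_location(location: str | None) -> str:
    """Return a local checkout path if one is given and exists, else a remote repo."""
    if location and Path(location).exists():
        # Build the Transmogrifier image from a local clone (which may include
        # commits that were never pushed to GitHub).
        return str(Path(location).resolve())
    # Fall back to cloning from the remote GitHub repository.
    return location or DEFAULT_TRANSMOGRIFIER_REPO
```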
Looking good! Just some minor change requests; clarify some docstrings + add FEATURE FLAG comment. 🤔
```python
try:
    run_id = read_run_json(run_directory)["run_timestamp"]
except FileNotFoundError:
    logger.warning("Run JSON not found, probably testing, minting unique run UUID.")
```
Can you clarify what is meant by "probably testing"? 🤔
This is actually a good example of the non-pure functional nature of these functions. This can't be run unless a Run JSON object exists, as now we need a `run_id` to pass to Transmogrifier.

Most of the time, we will have a Run JSON file of course, as this function will be run as part of a Run. But to test its pure functional inputs and outputs, that's technically not required.

We could pass the `run_id` around in the functions... but not sure what we're achieving by doing so. The Job and Run JSON objects -- I realized through this work -- kind of stand in for an object "state", thus complicating the functional nature of these functions.

In summary, this felt like the least bad option for the moment: acknowledge via logging that it's unusual for a Run JSON file to not exist, but allow it to continue as all we needed was a minted `run_id`.
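For context, here is a minimal sketch of the fallback being discussed, assuming the minted identifier is simply a UUID string (the actual fallback value in the code may differ):

```python
# Sketch of the fallback described above: prefer the run_id recorded in the
# Run JSON, but mint a throwaway UUID when no Run JSON exists (e.g. in tests).
import uuid

try:
    run_id = read_run_json(run_directory)["run_timestamp"]
except FileNotFoundError:
    logger.warning("Run JSON not found, probably testing, minting unique run UUID.")
    run_id = str(uuid.uuid4())  # assumed fallback; actual implementation may differ
```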
```python
with duckdb.connect(":memory:") as con:
    for i, transformed_file in enumerate(transformed_file_names):
```
Nice ✨
Well.... kind of! If you recall, this batching by transformed file was a strangely elegant way to manage memory for very large runs.

I think it's quite possible that we may encounter memory issues again, now that we're asking DuckDB to perform joins on effectively all records from all input files that make up the run. Counter-point: the parquet datasets are structured differently, and perhaps this will be more efficient now. As mentioned in the PR and elsewhere, my vote would be to move forward with simplifications in the code, and handle performance or memory issues if and when they arise.
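To make the concern concrete, here is a rough sketch of the kind of un-batched join being described, with DuckDB scanning both parquet datasets at once; the dataset paths and column names (`timdex_record_id`, `transformed_record`) are assumptions, not the application's actual schema.

```python
# Hypothetical sketch of collating by joining the a/b parquet datasets in DuckDB.
# Paths and column names are assumptions, not the application's actual schema.
import duckdb

with duckdb.connect(":memory:") as con:
    collated = con.execute(
        """
        SELECT a.timdex_record_id,
               a.transformed_record AS record_a,
               b.transformed_record AS record_b
        FROM read_parquet('transformed/a/dataset/**/*.parquet') AS a
        JOIN read_parquet('transformed/b/dataset/**/*.parquet') AS b
            USING (timdex_record_id)
        """
    ).arrow()
```

Unlike the earlier per-file batching, a query like this asks DuckDB to plan the join across every file in both datasets, which is where memory pressure could reappear for very large runs.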
Purpose and background context
High level, this PR allows ABDiff to run diffs when Transmogrifier is generating "v2" output of parquet datasets vs the "v1" output of JSON and TXT files. It was considered whether ABDiff should support both v1 and v2 output, but it was decided the complexity would outweigh the benefit, given the short (possibly zero) amount of time we'd need to run v1 vs v2 output.
The primary update of expecting parquet dataset output from each Transmog container can be found in the first commit.
Another commit exists that allows for building a Transmogrifier image from a local instance of Transmog (instead of always cloning from GitHub).
It is recommended to review code by filtering to these commits.
Notes on transforming vs collating
When looking at the code changes, there was quite a bit of code removal for this update. A big change was no longer needing to create an intermediate dataset of transformed files from JSON and TXT files, which could then enter our parquet flow of collating, diffing, etc., because we already have parquet output from Transmogrifier.

The high level flow now:

* `run_ab_transforms` still kicks off multiple Transmogrifier containers, which now write to the `a/dataset` and `b/dataset` parquet datasets (a rough sketch of this write pattern follows at the end of this section)
* in `collate_ab_transforms`, we're just reading and joining transformed records from these two `a` and `b` datasets (and basically ignoring all other columns)

The discerning code change reader may notice a couple of things:

* validation of transformed output is no longer performed in `run_ab_transforms()`
* batched reading has been removed from `collate_ab_transforms()`, where the chunking of reads influenced the number of records joined in DuckDB

My overall sense from working on these changes was that the code and documentation felt a little brittle and hard to work with given the scale of change. My thinking is that some regression in terms of memory management and validation is worth it to keep this app functional with the changing dimensions of Transmogrifier and TIMDEX. I believe the best test of this app will be constant use. This only works because this app is not deployed in a way that we rely on it unsupervised; it's an internal tool.
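As a rough illustration of the write side of this flow (referenced in the first bullet above), each Transmogrifier container could append its batch of transformed records to the shared dataset under unique file names; the paths, columns, and uuid-based naming below are assumptions, not Transmogrifier's actual implementation.

```python
# Hypothetical sketch of how each Transmogrifier container could append its
# transformed records to a shared parquet dataset without clobbering other
# writers. Paths, columns, and the uuid-based file naming are assumptions.
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

def write_transformed_batch(records: list[dict], dataset_path: str) -> None:
    table = pa.Table.from_pylist(records)
    ds.write_dataset(
        table,
        dataset_path,  # e.g. "<run>/transformed/a/dataset"
        format="parquet",
        # Unique file names let many containers write to one dataset in parallel.
        basename_template=f"records-{uuid.uuid4()}-{{i}}.parquet",
        existing_data_behavior="overwrite_or_ignore",
    )
```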
How can a reviewer manually see the effects of these changes?
1. Set AWS Dev1 credentials (admin or TIMDEX) in the `.env` file. Additionally, set `PRESERVE_ARTIFACTS=true` in the `.env` file to observe the output of each step instead of cleaning it up.

2. Create a new job, where both commits are "v2" dataset friendly.

3. Perform a run.

4. Observe the rough structure of the `/transformed` directory under the run (where UUIDs will be different), where you can see the two `a` and `b` datasets written to by Transmogrifier containers. This leans into a strength of the parquet datasets as well: we can have one dataset each for `a` and `b`, and all the running Transmog containers can write to them in parallel. (A hypothetical sketch for inspecting these datasets follows after this list.)

5. View the job.
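As referenced in the step about the `/transformed` directory, here is one hypothetical way to inspect the resulting `a` and `b` datasets from Python; the run directory path is a placeholder and the layout is assumed from the description above.

```python
# Hypothetical sketch for inspecting the preserved /transformed output of a run.
# The run directory is a placeholder; file names (UUIDs) will differ per run.
import pyarrow.dataset as ds

run_directory = "output/runs/<run-uuid>"  # placeholder path

for version in ("a", "b"):
    dataset = ds.dataset(f"{run_directory}/transformed/{version}/dataset", format="parquet")
    print(f"version {version}: {len(dataset.files)} parquet file(s)")
    print(f"  columns: {dataset.schema.names}")
```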
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/TIMX-401
Developer
Code Reviewer(s)