-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue #3899] Task to Add new record into OpportunityVersion table #4061
base: main
Are you sure you want to change the base?
Conversation
select(Opportunity).where(Opportunity.opportunity_id == opp.opportunity_id) | ||
).scalar_one() | ||
# Get Json | ||
opportunity_v1 = SCHEMA.dump(opportunity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You made a utility for loading these records into that table, we should be using that as much as possible - I know it's used below, but re-converting something into a schema multiple times is a bit wasteful.
We'd likely need to add logic to that method of "only add if there is a diff", but we should keep that contained in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the logic to check for detecting diffs to save_opportunity_version function. And the task only checks if there is a change in the OpportunityChangeAudit table.
select(JobLog) | ||
.where(JobLog.job_type == self.cls_name()) | ||
.where( | ||
or_(JobLog.job_status == JobStatus.COMPLETED, JobLog.job_status == JobStatus.FAILED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the job failed, we probably don't want to count it.
latest_time = ( | ||
latest_job.created_at | ||
if latest_job | ||
else get_now_us_eastern_datetime() - timedelta(hours=24) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DB values are stored in UTC, use utcnow
whenever you do a comparison with now against something in the DB.
I think if there hasn't been a prior job, we should actually just grab everything (ie. pick a datetime of like 1970-01-01
) since we want to backfill these records anyways.
updated_opportunities = self.db_session.scalars( | ||
select(OpportunityChangeAudit).where(OpportunityChangeAudit.updated_at > latest_time) | ||
).all() | ||
|
||
for opp in updated_opportunities: | ||
# Get Opportunity object | ||
opportunity = self.db_session.execute( | ||
select(Opportunity).where(Opportunity.opportunity_id == opp.opportunity_id) | ||
).scalar_one() | ||
# Get Json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to write a query when the relationship already is that:
updated_opportunities = self.db_session.scalars( | |
select(OpportunityChangeAudit).where(OpportunityChangeAudit.updated_at > latest_time) | |
).all() | |
for opp in updated_opportunities: | |
# Get Opportunity object | |
opportunity = self.db_session.execute( | |
select(Opportunity).where(Opportunity.opportunity_id == opp.opportunity_id) | |
).scalar_one() | |
# Get Json | |
opportunity_change_audits = self.db_session.scalars( | |
select(OpportunityChangeAudit).where(OpportunityChangeAudit.updated_at > latest_time) | |
).all() | |
for opp_change_audit in opportunity_change_audits: | |
opportunity = opp_change_audit.opportunity | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right! Updated!
latest_versioned_opp = self.db_session.execute( | ||
select(OpportunityVersion).where( | ||
OpportunityVersion.opportunity_id == opp.opportunity_id | ||
) | ||
).scalar_one_or_none() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there are multiple versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordered by created_at desc
… 3899/task-add-opp-version-record
latest_opp_version = db_session.execute( | ||
select(OpportunityVersion) | ||
.where(OpportunityVersion.opportunity_id == opportunity.opportunity_id) | ||
.order_by(OpportunityVersion.created_at.desc()) | ||
).scalar_one_or_none() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One additional recommendation:
latest_opp_version = db_session.execute( | |
select(OpportunityVersion) | |
.where(OpportunityVersion.opportunity_id == opportunity.opportunity_id) | |
.order_by(OpportunityVersion.created_at.desc()) | |
).scalar_one_or_none() | |
latest_opp_version = db_session.execute( | |
select(OpportunityVersion) | |
.where(OpportunityVersion.opportunity_id == opportunity.opportunity_id) | |
.order_by(OpportunityVersion.created_at.desc()) | |
.options(selectinload("*")) | |
).scalar_one_or_none() |
Won't have any obvious affect on how the code works, but makes it so SQLAlchemy will select all parts of an opportunity in one set of queries rather than lazy-loading every relationship which each individually fire a select query off to the database. When we have a job that'll iterate over tens of thousands of records, it'll make it much faster (I tested adding it with set-current-opportunities and I want to say it was 70x faster?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed! Updated , thanks
opportunity_existing = latest_opp_version.opportunity_data if latest_opp_version else {} | ||
|
||
diffs = diff_nested_dicts(opportunity_new, opportunity_existing) | ||
if diffs: | ||
# Add new OpportunityVersion instance to the database session | ||
opportunity_version = OpportunityVersion( | ||
opportunity_id=opportunity.opportunity_id, | ||
opportunity_data=opportunity_new, | ||
) | ||
|
||
db_session.add(opportunity_version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can put a small optimization/short circuit in the case where there is no existing since there will always be a diff.
Something like:
opportunity_existing = latest_opp_version.opportunity_data if latest_opp_version else {} | |
diffs = diff_nested_dicts(opportunity_new, opportunity_existing) | |
if diffs: | |
# Add new OpportunityVersion instance to the database session | |
opportunity_version = OpportunityVersion( | |
opportunity_id=opportunity.opportunity_id, | |
opportunity_data=opportunity_new, | |
) | |
db_session.add(opportunity_version) | |
diffs = {} | |
if latest_opp_version: | |
diffs = diff_nested_dicts(opportunity_new, latest_opp_version.opportunity_data) | |
if diffs or latest_opp_version is None: | |
# Add new OpportunityVersion instance to the database session | |
opportunity_version = OpportunityVersion( | |
opportunity_id=opportunity.opportunity_id, | |
opportunity_data=opportunity_new, | |
) | |
db_session.add(opportunity_version) |
@task_blueprint.cli.command( | ||
"store-opportunity-version", | ||
help="Store a new opportunity version if an opportunity has been updated", | ||
) | ||
@flask_db.with_db_session() | ||
def store_opportunity_version(db_session: db.Session) -> None: | ||
StoreOpportunityVersionTask(db_session).run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a way to run this as a task directly, we want to add it as a 4th step of the existing transform jobs: https://github.com/HHS/simpler-grants-gov/blob/main/api/src/data_migration/command/load_transform.py
def __init__(self, db_session: db.Session) -> None: | ||
super().__init__(db_session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not changing the init function, can exclude it and just use the one from task
def __init__(self, db_session: db.Session) -> None: | |
super().__init__(db_session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
updated_opportunities_change_audit = self.db_session.scalars( | ||
select(OpportunityChangeAudit).where(OpportunityChangeAudit.updated_at > latest_time) | ||
).all() | ||
|
||
for oca in updated_opportunities_change_audit: | ||
# Get Opportunity object | ||
opportunity = self.db_session.execute( | ||
select(Opportunity).where(Opportunity.opportunity_id == oca.opportunity_id) | ||
).scalar_one() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to query for the opportunity separately - the change audit object has a relationship to it. Also like I commented elsewhere, performance will be a lot better adding the selectinload bit.
updated_opportunities_change_audit = self.db_session.scalars( | |
select(OpportunityChangeAudit).where(OpportunityChangeAudit.updated_at > latest_time) | |
).all() | |
for oca in updated_opportunities_change_audit: | |
# Get Opportunity object | |
opportunity = self.db_session.execute( | |
select(Opportunity).where(Opportunity.opportunity_id == oca.opportunity_id) | |
).scalar_one() | |
updated_opportunities_change_audit = self.db_session.scalars( | |
select(OpportunityChangeAudit) | |
.where(OpportunityChangeAudit.updated_at > latest_time) | |
.options(selectinload("*")) | |
).all() | |
for oca in updated_opportunities_change_audit: | |
opportunity = oca.opportunity |
StoreOpportunityVersionTask(db_session).run() | ||
|
||
|
||
SCHEMA = OpportunityV1Schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this is used here anymore
SCHEMA = OpportunityV1Schema() |
api/tests/src/db/models/factories.py
Outdated
model = task_models.JobLog | ||
|
||
job_id = Generators.UuidObj | ||
job_type = factory.LazyAttribute( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the sort of field that if you want to use a factory for it, you should pass the value in yourself since a random value is going to be weird with tests (since you usually care a lot about the exact value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed. Can optionally do a validation?
api/tests/src/db/models/factories.py
Outdated
job_type = factory.LazyAttribute( | ||
lambda _: random.choice(["StoreOpportunityVersionTask", "SetCurrentOpportunitiesTask"]) | ||
) | ||
job_status = factory.Iterator(JobStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably say to just make this always the success value since most tests would only care about that (or want to override it to started/failed).
latest_job = self.db_session.scalars( | ||
select(JobLog) | ||
.where(JobLog.job_type == self.cls_name()) | ||
.where(or_(JobLog.job_status == JobStatus.COMPLETED)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.where(or_(JobLog.job_status == JobStatus.COMPLETED)) | |
.where(JobLog.job_status == JobStatus.COMPLETED) |
… 3899/task-add-opp-version-record
@click.option( | ||
"--store-version/--no-store-version", default=True, help="run StoreOpportunityVersionTask" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us default this to not being enabled, I think we should test the behavior out manually first to be safe.
@pytest.fixture(autouse=True) | ||
def clear_db(self, db_session): | ||
opportunities = db_session.query(Opportunity).all() | ||
for opp in opportunities: | ||
db_session.delete(opp) | ||
|
||
db_session.execute(delete(OpportunityVersion)) | ||
db_session.execute(delete(OpportunityChangeAudit)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nitpick, technically those two deletes do nothing because they had to have been deleted for the opportunity deletes to work (the cascade="all, delete-orphan"
on the relationships tells SQLAlchemy to cleanup orphaned records if the opportunity itself is deleted).
@pytest.fixture(autouse=True) | |
def clear_db(self, db_session): | |
opportunities = db_session.query(Opportunity).all() | |
for opp in opportunities: | |
db_session.delete(opp) | |
db_session.execute(delete(OpportunityVersion)) | |
db_session.execute(delete(OpportunityChangeAudit)) | |
@pytest.fixture(autouse=True) | |
def clear_db(self, db_session): | |
opportunities = db_session.query(Opportunity).all() | |
for opp in opportunities: | |
db_session.delete(opp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats right ! Deleted
for opp_change_audit in opportunity_change_audits: | ||
opportunity = opp_change_audit.opportunity | ||
|
||
# Store to OpportunityVersion table | ||
save_opportunity_version(self.db_session, opportunity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I hadn't thought of until now, but has been briefly mentioned before - if an opportunity is a draft, we don't want to store anything for it yet in the version table.
I think it would make more sense to put that check in the save_opportunity_version function so when we use it elsewhere, we don't need to implement that twice, but I'm not 100% on that behavior.
Should be just a simple if statement, but apologies for not mentioning it sooner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NP, added check.
api/src/util/dict_util.py
Outdated
@@ -74,5 +74,7 @@ def diff_nested_dicts(dict1: dict, dict2: dict) -> list: | |||
|
|||
def _convert_iterables_to_set(data: Any) -> Any: | |||
if isinstance(data, (list, tuple)): | |||
if data and isinstance(data[0], (dict, list)): # test with applicant typen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this comment still be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Removed
… 3899/task-add-opp-version-record
… 3899/task-add-opp-version-record
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor nitpicks, LGTM otherwise
@click.option( | ||
"--store-version/--no-store-version", default=False, help="run StoreOpportunityVersionTask" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit - put this option between the set-current and insert-chunk-size options. Since these options are big "turn on/off parts of the job" and the other options are configurational.
|
||
schema_data = SCHEMA.dump(opportunity) | ||
if not opportunity.is_draft: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - flip the logic here so the whole function doesn't need an indentation level
if not opportunity.is_draft: | |
if opportunity.is_draft: | |
return | |
... |
Summary
Fixes #{3899}
Time to review: 15 mins
Changes proposed
StoreOpportunityVersionTask
has been implemented to callsaved_opportunity_version
whenever there’s an update in theOpportunityChangeAudit
table for any opportunity since the last task run. Added as a 4th step to theload_transform
job.The
saved_opportunity_version
utility has been updated to invoke thediff_nested_dicts
function. If there’s a difference between the latest saved opportunity version and the current opportunity, a new record is created in theOpportunityVersion
table.Added
JobLogFactory
Added tests.