Get last non-null value instead of latest XBRL filing. #3545
Conversation
Force-pushed fe95010 → d4807c4
@jdangerx the few logs and/or metrics make this change seem pretty beneficial!

**Before your change**

Log from corrections: `Correcting 95.02% of all non-null reported values (439/462) out of a total of 143284 original records.`

```python
out_ferc1__yearly_rate_base.row_type_xbrl.value_counts()
```
```
row_type_xbrl
reported_value             442993
calculated_value            10204
subdimension_correction       205
correction                     75
```
```python
calculation_metrics = defs.load_asset_value("_core_ferc1__calculation_metric_checks")
errors = calculation_metrics[
    calculation_metrics.filter(like="is_error").any(axis=1)
]
len(errors)
> 36
```

**Post change**

Log from corrections: `Correcting 32.94% of all non-null reported values (28/85) out of a total of 143284 original records.`

```python
out_ferc1__yearly_rate_base.row_type_xbrl.value_counts()
```
```
row_type_xbrl
reported_value             442993
calculated_value            10204
correction                    178
subdimension_correction        15
```
```python
calculation_metrics = defs.load_asset_value("_core_ferc1__calculation_metric_checks")
errors = calculation_metrics[
    calculation_metrics.filter(like="is_error").any(axis=1)
]
len(errors)
> 24
```

**Some comparison**

It looks like it fully cleaned up a bunch of utilities' calced values in the `core_ferc1__yearly_utility_plant_summary_sched200` table:

```python
newly_good_idx = (
    calculation_metrics[calculation_metrics.error_frequency != 0]
    .index.difference(calculation_metrics_new[calculation_metrics_new.error_frequency != 0].index)
)
calculation_metrics.loc[newly_good_idx].reset_index(level="group_value").index.value_counts()
> group             table_name
> utility_id_ferc1  core_ferc1__yearly_utility_plant_summary_sched200    157
```
i love how simple this change is. i have two very minor suggestions.
src/pudl/io_managers.py (outdated)

```python
dupe_mask = original.duplicated(subset=xbrl_context_cols, keep=False)
duped_groups = original.loc[dupe_mask].groupby(
    xbrl_context_cols, as_index=False
)
never_duped = original.loc[~dupe_mask]
return pd.concat([never_duped, duped_groups.last()], ignore_index=True)
```
omigosh this is so nice and simple!

Can you add `dropna=True` into the `groupby` (even though it is set that way as a default) just to make it more clear how that `last` is getting the non-null values?
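For reference, a minimal pandas sketch (illustrative data, not the PUDL code) of why `.last()` ends up with the non-null values: `GroupBy.last()` skips nulls within each column of a group, while `dropna=True` on the `groupby` itself only controls whether null group keys are dropped.

```python
import numpy as np
import pandas as pd

# Illustrative data: entity "A" reports a value, then files again with a null.
df = pd.DataFrame(
    {
        "entity_id": ["A", "A", "B", "B"],
        "filing": [1, 2, 1, 2],
        "value": [10.0, np.nan, 5.0, 7.0],
    }
)

# dropna=True only affects null *group keys*; the non-null selection comes
# from .last(), which by default skips NaNs within each column of a group.
print(df.groupby("entity_id", as_index=False, dropna=True).last())
#   entity_id  filing  value
# 0         A       2   10.0   <- keeps A's earlier non-null 10.0
# 1         B       2    7.0
```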
```
This means that if a utility reports a non-null value, then later
either reports a null value for it or simply omits it from the report,
we keep the old non-null value, which may be erroneous. This appears to
be fairly rare, affecting < 0.005% of reported values.
```
can we add a little defensive check in here to make sure that the expected % of values stays semi-constant? I assume this rate is ~ table dependent but i assume the worst offenders' % would be a good high bar to test against
Hm, this is a little tricky - we'd have to calculate both the best-snapshot and the apply-diffs methodologies, stack them all up, then compare the two - which we can do, but seems like a lot of complexity to add for this check.
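For what it's worth, the comparison itself doesn't have to be huge. A rough sketch (hypothetical helper; `best_snapshot`, `apply_diffs`, `context_cols`, and `value_col` are all stand-ins rather than the real PUDL names) might look like:

```python
import pandas as pd


def compare_dedupe_methods(
    best_snapshot: pd.DataFrame,
    apply_diffs: pd.DataFrame,
    context_cols: list[str],
    value_col: str,
) -> float:
    """Fraction of records where the two deduplication methods disagree.

    Both inputs are assumed to already be deduplicated to one row per XBRL
    context. This is a sketch, not the actual PUDL implementation.
    """
    merged = best_snapshot.merge(
        apply_diffs,
        on=context_cols,
        how="outer",
        suffixes=("_snapshot", "_diffs"),
    )
    # Treat two nulls as agreement; everything else must match exactly.
    both_null = (
        merged[f"{value_col}_snapshot"].isna() & merged[f"{value_col}_diffs"].isna()
    )
    equal = merged[f"{value_col}_snapshot"] == merged[f"{value_col}_diffs"]
    return (~(equal | both_null)).mean()
```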
Yea i didn't realize how complicated this request was going to be. I don't think what you have here is too too complicated but it is a lot more than was here before. I'd be inclined to keep it in but if you feel otherwise I think it would be fine to revert to not checking this.
Force-pushed d4807c4 → 7f32384
```python
):
    """Compare deduplication methodologies.

    By cross-referencing these we can make sure that the apply-diff
```
😓 this turned out to be surprisingly annoying
hm yeah it is surprisingly annoying!! I'm now semi regretting suggesting it but i think this looks okay. i like that it makes it clear that there are a few options.

This suggestion is probably oos for this pr, but i wonder if this type of check that assesses methodological choices we've made belongs in validation tests when it gets a lil too complicated. I've personally struggled with this distinction and the tension around testing with a lot of extra code that is not necessary for the actual thing you want to consistently produce. but i don't think this is tooo much!

cc @e-belfer
Force-pushed 7f32384 → 0f9e95b
Oy, the integration tests failed with a bunch of error thresholds being broken:
What seems a little strange is that the same dang dataframe is getting reported in every failing test - that seems surprising to me. I don't know quite what to make of it but I'll see if I can get this breaking locally for now. @zaneselvans this seems like something you might have some insight into?
I think the reason the same dataframe is coming up repeatedly is that it's not a test failure, it's a defensive check which is erroring out in the ETL phase, and so you're getting identical "errors" raised in every test which depends on the fixture that's generating the errors.
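As a generic illustration of that failure mode (plain pytest, not PUDL's actual test setup): when a shared fixture raises during setup, pytest reports the same underlying error for every test that requests it.

```python
import pytest


@pytest.fixture(scope="session")
def etl_outputs():
    # Stand-in for an ETL fixture: a defensive check failing here surfaces
    # as an identical error in every test that depends on this fixture.
    raise ValueError("defensive check failed during ETL")


def test_rate_base(etl_outputs):
    assert etl_outputs is not None


def test_plant_in_service(etl_outputs):
    assert etl_outputs is not None
```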
Ahhh that makes sense. OK! I'll go look at where those checks are (presumably) hard coded and see if I can make heads or tails of it.
I'd probably try to materialize the exploded tables locally, which I think would exercise this check, and then you could fiddle with the parameters that set the acceptable error levels. However, this could reveal real issues. @cmgosnell will be the most familiar with what the error levels are supposed to look like and why the new data might be setting off these alarm bells.
OK, so I compared the calculation reconciliation metrics across this branch & the main branch for the fast ETL:
So the error frequency is newly a little higher, and the null calcs/reported values are both less frequent. I'm not sure how to convince myself that this is OK vs. continuing to dig deeper - @cmgosnell maybe let's grab some time in the morning tomorrow to chat about this?
Force-pushed 0f9e95b → dccd775
If that's the only thing tripping it up, and it's only happening in the fast and not the full ETL, then upping the allowed error frequency to 0.06 seems fine. Those thresholds are arbitrary, and really intended to detect changes so they can be evaluated for significance / tracked down if they're wacked. But I'm worried there might be a long series of little error tolerance tweaks that need to be made after this one is fixed.
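For anyone skimming: the tolerance check being tuned here is conceptually just a comparison like the hypothetical sketch below; the real metrics and thresholds live in PUDL's calculation reconciliation code.

```python
def check_error_frequency(error_frequency: float, tolerance: float = 0.06) -> None:
    """Raise if the observed calculation error frequency exceeds the tolerance.

    Hypothetical sketch of the kind of defensive check being adjusted in this
    PR, not the actual PUDL implementation.
    """
    if error_frequency > tolerance:
        raise AssertionError(
            f"Calculation error frequency {error_frequency:.3f} exceeds "
            f"the allowed tolerance of {tolerance:.3f}."
        )
```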
I ran the fast ETL through the Dagster UI locally and thankfully it succeeded.

@cmgosnell do you think we should check the other error metrics for the other FERC 1 assets, and ratchet them down if we have significantly lower error rates with the new diff-based extractions? Do we expect this to have broad impacts across the board? I can't remember if we had an easy way of generating all the error tolerance outputs.

@jdangerx I also noticed 2 failing asset checks in the fast ETL, which was a little surprising. If they're failing, should they be causing the fast ETL to fail as well?
Force-pushed dccd775 → f8f2d2b
I'm glad y'all figured out how to tweak the error tolerances! (i don't super love how it ended up, structurally/ergonomically) It makes sense to me that more non-null values in the data would result in at least slightly worse error metrics. It is very standard for the fast ETL to require higher error tolerances, and that is not a big change. Believe it or not, the XBRL data is less calculable with its own calculations than the old data. So it makes sense that adding more XBRL values -> more errors.

I was hoping to spend a lil time after this gets merged in to see if the null frequency tolerance can be turned down a bit on any of the tables. We do have a way to check all of the metrics all at once; that's this asset:
Yes... we have them set as "blocking" and also had a bunch of extra logic in there to stop the fast ETL from failing. Strange...
The change in how we select XBRL records to extract in PR #3545 also changed how many records appear in a few tables, but only by 1 or 2 records. If anything it's surprising how few records were added/lost, so I've just adopted the new row counts in these tables:
* plant in service
* small plants
* all FERC 1 plants
Overview
Closes #3309.
What problem does this address?
A bunch of utilities sent diff-only reports to update their 2021 data, which led to some important XBRL tables being mostly-null.
What did you change?
After much debate and investigation, we decided to change the XBRL deduplication to take the latest non-null value (across multiple filings) instead of taking the value from the latest single filing.
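Roughly, the difference between the two approaches looks like this simplified sketch (made-up columns and data, not the actual PUDL code):

```python
import numpy as np
import pandas as pd

# Two filings for the same XBRL context: the second is a "diff-only"
# filing that leaves most columns null.
filings = pd.DataFrame(
    {
        "entity_id": ["A", "A"],
        "filing_order": [1, 2],
        "plant_value": [100.0, np.nan],
        "depreciation": [20.0, 25.0],
    }
)

# Old behavior: take the latest filing wholesale -> plant_value becomes null.
latest_filing = filings.sort_values("filing_order").drop_duplicates(
    subset=["entity_id"], keep="last"
)

# New behavior: take the latest non-null value per column across filings.
latest_non_null = (
    filings.sort_values("filing_order").groupby("entity_id", as_index=False).last()
)

print(latest_filing[["plant_value", "depreciation"]].to_dict("records"))
# [{'plant_value': nan, 'depreciation': 25.0}]
print(latest_non_null[["plant_value", "depreciation"]].to_dict("records"))
# [{'plant_value': 100.0, 'depreciation': 25.0}]
```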
Testing
How did you make sure this worked? How can a reviewer verify this?
@cmgosnell I'd love it if you could take a look and see if this affects the rate base table in the expected way.
To-do list