why is basically nothing in the rate base table from core_ferc1__yearly_depreciation_by_function_sched219 in 2021? #3309
Comments
@cmgosnell: To see the values reported in this table:
These are the same utilities observed in the transformed table:
There are 0 correction records in the rate base table for this table:
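(For reference, checks along these lines can be reproduced against the PUDL SQLite output. This is a rough sketch rather than the exact queries used here; the rate base table name and the `table_name` / `row_type_xbrl` columns are assumptions about the schema, used only for illustration.)

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")

# Values reported in the source table for 2021, and which utilities show up:
dep = pd.read_sql(
    "SELECT * FROM core_ferc1__yearly_depreciation_by_function_sched219 "
    "WHERE report_year = 2021",
    conn,
)
print(sorted(dep.utility_id_ferc1.unique()))

# How much of that made it into the rate base table, and how many of those
# records are corrections (table/column names are assumed, not verified):
rate_base = pd.read_sql("SELECT * FROM out_ferc1__yearly_rate_base", conn)
from_dep = rate_base[
    (rate_base.report_year == 2021)
    & (rate_base.table_name == "core_ferc1__yearly_depreciation_by_function_sched219")
]
print(len(from_dep), (from_dep.row_type_xbrl == "correction").sum())
```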
That suggests the issue isn't pruning but is actually happening in the original generation of the correction records, which makes sense because we currently aren't adding sub-total-to-total corrections, so we are effectively losing all "total" records in this table.
@zaneselvans i think i would like your help in thinking about the error metrics because of the change i made. We could force the metric checks to behave like they used to (i think) by only checking these values when the reported and/or calculated values are non-null. But I think it makes more sense to adjust the tolerance values, because this new change actually feels more representative of the data itself.
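(To make that concrete, here is a hypothetical sketch of the two options: restrict the metric to rows where both values are non-null, or keep all rows and loosen the tolerance instead. The column names and the 1% threshold are made up for the illustration; this is not the actual PUDL calculation check.)

```python
import numpy as np
import pandas as pd


def error_rate(df: pd.DataFrame, non_null_only: bool = False, rtol: float = 0.01) -> float:
    """Fraction of records where the calculated value disagrees with the reported one."""
    if non_null_only:
        df = df.dropna(subset=["reported_value", "calculated_value"])
    # np.isclose() is False whenever either side is NaN, so null-vs-non-null
    # mismatches count as errors unless they're filtered out above.
    ok = np.isclose(df["calculated_value"], df["reported_value"], rtol=rtol)
    return float(1 - ok.mean())


checks = pd.DataFrame(
    {
        "reported_value": [100.0, np.nan, 50.0],
        "calculated_value": [100.0, 25.0, 49.9],
    }
)
print(error_rate(checks))                      # ~0.33: the null/non-null row counts as an error
print(error_rate(checks, non_null_only=True))  # 0.0: roughly the old behavior
```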
Is this issue still outstanding? I'm working on an update to our net_plant_balance table, which compiles values of original cost, accumulated depreciation, and the net plant balance by FERC technology, then adds an estimate of renewables net plant balance as further detail of the "other" category. I'll then use that renewables breakout as further detail of your rate base table - though I'm not expecting that you'll integrate our estimate of renewables, because it's modeled rather than reported data.
I'm using
I checked the raw xbrl table
@cmgosnell that work seems to be focused on corrections. But I think there's an issue in converting the raw xbrl table into the PUDL table
hey @jrea-rmi sorry for the delay on getting back to this! This is a good find without a super straightforward solution. I'm going to lay out what I understand to be the problem and a possible solution, but i will want @jdangerx to verify my assessment here because you're much more familiar with the raw xbrl data than I am. The high level problem here is that FERC does not replace records when a utility comes back and files an update, so we filter for the most recent updated record. If this loss of detail is a blocker, I could imagine making an exception for this table/year only where we grab the slightly less fresh data, but only if a few things are true:
I'm going to try to answer the first two questions, but I'd love @jdangerx's perspective on this!!
Great. This loss of detail is definitely important for us! If there isn't an update to any of the detailed by-function values, I sure hope the totals are the same. And if they are, then it seems reasonable to use the penultimate record. Would that be easier than combining updated/previous records/filings?
So for each XBRL context (utility name, report year, ... - the primary key, basically) we could have multiple sets of facts from multiple XBRL filings. Currently we treat the most recent filing as completely authoritative - if the most recent filing reports everything as nulls, we just trust that that's more accurate. It would be very easy to say "sort all the facts for this context by publication time, and take the latest non-null reported value for each" (rough sketch at the end of this comment).

The new way is preferable in cases like this, where the newer filings only contain the updated values. But the old way is preferable if a value was erroneously reported as non-null, i.e. someone put a number in the wrong field. I sort of think that cases like this are more common, so would be happy to switch the behavior over - but this might uncover some other weird behavior.

I don't think there's a good blanket decision to make here. To capture this data nuance, we could try to dynamically apply the two different approaches:
This gives us some broadly correct behavior while still allowing us to apply spot fixes if necessary... That refactor wouldn't be too big of a time sink, I think - the code is pretty easy to test and there's a natural place to pull out the "deduplicate a group" logic. If we want to go for that, I bet we could get it done in 5-10h - @cmgosnell I'm happy to crank on it if you have too much stuff on your plate.
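(To make the two behaviors concrete, here's a rough pandas sketch. `context_id` stands in for the full XBRL context primary key and `published_at` for the filing timestamp; all column names here are illustrative, not the actual extractor schema.)

```python
import pandas as pd

facts = pd.DataFrame(
    {
        "context_id": ["c1", "c1"],
        "published_at": [1650068815, 1654217180],  # original filing, then a "diff"-style refiling
        "transmission": [None, 1_000_000.0],
        "distribution": [500_000.0, None],
    }
)

# "Last snapshot" (current behavior): the most recent filing wins wholesale.
last_snapshot = facts.sort_values("published_at").drop_duplicates(
    "context_id", keep="last"
)
# -> transmission = 1,000,000; distribution = NaN (the earlier value is lost)

# "Apply diffs": per fact, take the latest non-null reported value.
# GroupBy.last() skips nulls by default, which is exactly that behavior.
apply_diffs = (
    facts.sort_values("published_at").groupby("context_id", as_index=False).last()
)
# -> transmission = 1,000,000; distribution = 500,000
```

The trade-off called out above is visible here: if the later filing had intentionally nulled out distribution, apply-diffs would keep the stale 500,000.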
@jdangerx if you feel like you have capacity to take this on, that would be swell! I would be a little reluctant to change the default behavior, but it might be a good idea. It sounds like it cooould end up being simpler. And maybe i'm off on this, but could these kinds of nulls perhaps be a result of the sql-ification of the xbrl data? As in... can the utils re-report one specific fact, and we take that fact and slurp it into a db table, and it looks like it has a bunch of nulls in the same record? I'm not sure if there is a way to distinguish a null because it's really null vs a null because the respondent only updated a few values. I think you have more context on how simple and/or appropriate applying either approach would be, so I'm happy to defer to you on this one.
Did some more digging & thinking, screed ahead! Here are the possible ways to deduplicate:
There are a bunch of tables with this data loss issue - if I switch from "last snapshot" to "first snapshot" dataset-wide, I see 3 tables where we have at least 1 completely null column in 2021 due to the "last snapshot" methodology. This is a lower bound on the data loss since I didn't look for more subtle forms.
The "first snapshot" approach means we miss out on all updates whether they are snapshot updates or diff updates. It also means we miss out on all the problems we have dealing with updated data and multiple filings, but I think the value of getting updated data outweighs that benefit. This data loss we've found is due to applying "last snapshot" everywhere, even when the latest filing was a diff. So we shouldn't blanket apply that either. We could apply "apply diffs" everywhere, which gets us the latest / most complete data possible. But...
We could apply "best snapshot" everywhere, which would keep the 1:1 mapping from table row to XBRL filing (rough sketch at the end of this comment). But...
Weighing all these, my conclusions are:
My pitch is this:
To fix this data loss issue in the rate base table, the minimal change is the first bullet. The next two bullets are useful infrastructural investments which will make future XBRL-based work easier. My current plan, then, is to make a small PR for the first bullet and file issues/put TODOs in the code for the other two. If we think that further investments are worthwhile right now I'm happy to take those on too - but that depends on our priorities, project budget, etc. @cmgosnell @jonrea do we have specific expectations for how much more time we want to spend on this project?
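(For comparison with the sketch above, "best snapshot" could look roughly like this: per context, keep the single filing that reports the most non-null facts, breaking ties toward the newest filing. Column names are again illustrative, not the actual schema.)

```python
import pandas as pd


def best_snapshot(facts: pd.DataFrame, value_cols: list[str]) -> pd.DataFrame:
    """Per context, keep the one filing reporting the most non-null facts.

    Unlike apply-diffs, every output row still corresponds to exactly one
    XBRL filing, so provenance stays simple.
    """
    ranked = facts.assign(n_reported=facts[value_cols].notna().sum(axis=1))
    return (
        ranked.sort_values(["n_reported", "published_at"])
        .drop_duplicates("context_id", keep="last")
        .drop(columns="n_reported")
    )
```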
wow, great thinking here. I agree that the best way to do this is the "apply diffs" method, and that it will be important to be able to differentiate between non-reported facts and reported-null facts. In terms of RMI priorities, this is at least near the top. For the next month or two, we want to keep our overall monthly budget within or below what it's been previously. Let's review that in our next check-in on April 5, but it's definitely worth putting some effort in between now and then.
After talking a bit more with @cmgosnell - let's do a little bit more investigation to better understand the impact of changing our deduplication approach:
Then we can think about what the best option for handling this data is.
To find the values that are actually reported as null values, I looked for
What I saw was that these numbers tended not to change much between filings for the most part. I'm not sure if any of these are values that have changed from non-null to null. But it does seem like we should probably take a look at whether there is a nice way to tell the difference between diff filings and snapshot filings, since some of these numbers seem non-trivial.
Here are the filings I saw where the number of null values increased between filings over time (rough sketch of this check after this comment). The timestamps starting with something like 165... are 2021 filings and 168... are 2022 filings.
We see a few fields dominating those null values:
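(A rough sketch of how one might flag the filings where the null count grew, along the lines of the check described above. The `raw_facts` frame is assumed to have one row per reported fact, with `value` null when the fact was explicitly reported as nil; all names are illustrative.)

```python
import pandas as pd


def filings_with_more_nulls(raw_facts: pd.DataFrame) -> pd.DataFrame:
    """Flag filings whose explicit-null count grew relative to the prior filing."""
    nulls = (
        raw_facts.assign(is_null=raw_facts["value"].isna())
        .groupby(["entity_id", "report_year", "filing_name", "published_at"])["is_null"]
        .sum()
        .reset_index()
        .sort_values("published_at")
    )
    # Compare each filing's null count to the previous filing for the same
    # entity and report year.
    nulls["prev_nulls"] = nulls.groupby(["entity_id", "report_year"])["is_null"].shift()
    return nulls[nulls["is_null"] > nulls["prev_nulls"]]
```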
It's kind of wild that there's so much complexity coming from these refilings / partial updates. @campbellpryde @austinmatherne-wk is this a normal thing for XBRL? Is there a canonical way that it's dealt with? Or is FERC somehow making this extra difficult?
To be fair, that’s sort of our fault for not distinguishing between non-reported values and reported null values in our SQLite conversion. Though the FERC filings might be weird in one way or another beyond that…
TL;DR: apply-diffs is the best short term solution. The medium-term step after that is moving the XBRL deduplication logic into our transformation framework so we can apply spot fixes. In the future we should be building better tools to help us quickly identify and investigate suspicious data, because there will always be some level of manual spot fixing needed.

I compared best-snapshot with apply-diffs for 20 tables (rough sketch of the comparison after this comment) and found 55 values which went from

For the first category, I largely relied on tracing the individual numeric values through the raw XBRL files. Most of the changes were legitimate updates. There were a few instances where a non-null value for field

I had one question about some of the values, specifically in the

We see a de-nullification in the

That value shows up as the following facts in Silver Run filings:

Silver_Run_Electric,_Llc_form1_Q4_1650068815.xbrl (2021 filing 1)
Silver_Run_Electric,_Llc_form1_Q4_1654217180.xbrl (2021 filing 2)
Silver_Run_Electric,_Llc_form1_Q4_1681780102.xbrl (2022 filing 1)
Silver_Run_Electric,_Llc_form1_Q4_1687981013.xbrl (2022 filing 2)

So this looks like they initially reported that value as "distribution" but later updated it to "transmission." I suppose either report could be in error there, but maybe @jonrea you have better insight into which one is correct.

I spot checked the following other tables:
I think, before we tackle the FERC 2023 data in earnest, we should make some tools that identify suspicious values and help us quickly investigate them / turn them into spot fixes. But we don't need to do that now.
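(As a rough illustration of the comparison described above, a sketch like this would surface the cells that are null under best-snapshot but non-null under apply-diffs. Both frames are assumed to share a primary key plus identical value columns; nothing here is the actual PUDL schema.)

```python
import pandas as pd


def denullified_cells(
    best_snapshot: pd.DataFrame, apply_diffs: pd.DataFrame, key_cols: list[str]
) -> pd.DataFrame:
    """List cells that are null under best-snapshot but non-null under apply-diffs."""
    merged = best_snapshot.merge(
        apply_diffs, on=key_cols, how="outer", suffixes=("_best", "_diffs")
    )
    value_cols = [c for c in best_snapshot.columns if c not in key_cols]
    found = []
    for col in value_cols:
        mask = merged[f"{col}_best"].isna() & merged[f"{col}_diffs"].notna()
        for _, row in merged.loc[mask].iterrows():
            found.append(
                {**row[key_cols].to_dict(), "column": col, "new_value": row[f"{col}_diffs"]}
            )
    return pd.DataFrame(found)
```

Each hit can then be traced back to the individual filings (as with the Silver Run facts above) to decide whether the later value is a legitimate update.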
Silver Run Electric is a transmission company, so the 2021 filing 2 with the value assigned to transmission is correct. I agree with using apply-diffs short term, and would want to go ahead with that as quickly as we can!
See this comment that id-ed the problem. Track it down & fix it. Presumably it's from the og table transform.
Tasks