Operating generators missing 'generator operating date' value in out_eia__monthly_generators and out_eia__yearly_generators pudl tables #3340
Comments
Off the top of my head I would guess that this is the result of multiple inconsistent operating dates being reported in different years/months of the EIA data, such that none of them was the clear choice for the correct value, but we will look into it Monday!
@mariacastillo21 Thanks for the report, Maria! Just wanted to give you a status update: we're about to merge two PRs that will affect this table and its output (the latest quarter of 860M data and #3331). Once they're merged I'll attempt to reproduce this behavior and debug it, and then we can discuss next steps if any changes are required to fix this bug.
Zane's hunch was correct. Of the generators you sent me, 56 are marked as having inconsistent operating dates in the harvesting process, which means that we've intentionally set the operating date to NA. On average, the most consistent operating date for each record showed up in about 58% of all harvested records for that generator. Some background on the harvesting process: Here's an example of what this looks like for a single generator - two different operational months get reported, and so we wind up with a null value. For most generators the difference is on the order of months, but for a handful the reported dates are many years apart: If the goal is complete data rather than consistent data, there are a few ways to resolve this issue:
I'd be curious to hear from you and @arengel which of these three options would make the most sense for your purposes.
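To make the null result described above concrete, here is a minimal sketch of a consistency check in plain Python. It is not PUDL's actual harvesting code: the function name `harvest_value`, the toy records, and the use of a 70% threshold (the figure mentioned later in this thread) are all assumptions for illustration.

```python
from collections import Counter
from datetime import date
from typing import Optional, Sequence

# Assumed threshold, taken from the "70% consistency check" mentioned in this thread.
CONSISTENCY_THRESHOLD = 0.7


def harvest_value(
    reported: Sequence[Optional[date]],
    threshold: float = CONSISTENCY_THRESHOLD,
) -> Optional[date]:
    """Return the most common non-null value if it is consistent enough, else None."""
    values = [v for v in reported if v is not None]
    if not values:
        return None
    most_common, count = Counter(values).most_common(1)[0]
    # Keep the value only if it appears in at least `threshold` of the records.
    return most_common if count / len(values) >= threshold else None


# Hypothetical generator: two different operating months across 12 harvested records.
records = [date(2010, 6, 1)] * 7 + [date(2010, 8, 1)] * 5
print(harvest_value(records))  # 7/12 is about 58%, below 70%, so this prints None
```

With these toy records the most common date covers roughly 58% of the reports, which is why a generator like this would end up with a null operating date under a 70% rule.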
Concerning! Unfortunately, I'm not able to reproduce this, and I'm seeing that D1, D2, and GT1 all wind up with debug-harvesting-generator-date.zip - Here's the notebook I was using to work through this question - though note that you'll need to have some of the intermediate dagster assets locally to be able to run it. I'm happy to share pickled outputs for any steps you're curious about. (Hopefully this all makes sense! The harvesting is a complex process, so let me know if anything is unclear or you have more questions.)
Thanks @e-belfer for the sleuthing on this! Looking at your chart of the distribution of differences in operating dates, I wonder if we can separate this into two simpler approaches: one for generators where the difference is less than a year or two, and another for generators where it is longer. I think the former can be done simply and systematically by taking either the max operating date or the operating date in the most recent release (the logic being that data quality improves over time) and applying it when there isn't an X% consistent result. I worry that if we did something like this in cases where the difference is large, we'd end up with generators reported as operating (and with net generation) in periods well before their operating date. Given that the set of generators where that could happen is very small, we could figure out what the operating dates should be for those generators and set them as overrides. (The process for fixing dates should also probably check whether plants are listed as operating before their operating date, to catch new data that needs overrides, if something like that doesn't already exist.) One concern, though, with taking the max operating date or the most recently reported one (and one potentially built into the 70% consistency check for generators that began operation well before reporting began) is that you could end up with the date when the reporter decided that some change to the generator warranted a new operating date, not the date when some part of that generator first operated. I don't know the form well enough to know whether this is a real potential issue.
That makes sense to me @arengel. Rescuing operating dates within a year seems like a reasonable first step and a relatively quick fix, and it should address the bulk of the problem that Maria is pointing to. Further rescues beyond that might reflect substantial changes in equipment over time that shouldn't be flattened without manual investigation.
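The rescue rule discussed in the last two comments could be sketched roughly as follows. This is an illustration of the proposal only, not the implementation in any PUDL PR: the function name `rescue_operating_date`, the 365-day cutoff, and the choice of the max date (rather than the most recently reported one) are assumptions.

```python
from datetime import date
from typing import Optional, Sequence

# Assumed cutoff for "within a year"; the thread does not fix an exact number of days.
MAX_SPREAD_DAYS = 365


def rescue_operating_date(
    reported: Sequence[Optional[date]],
    max_spread_days: int = MAX_SPREAD_DAYS,
) -> Optional[date]:
    """Pick the latest reported date when all reported dates fall within the cutoff.

    Returns None when the dates are too far apart, leaving the generator
    for manual investigation or an explicit override.
    """
    values = [v for v in reported if v is not None]
    if not values:
        return None
    spread = (max(values) - min(values)).days
    return max(values) if spread <= max_spread_days else None


# Dates two months apart: rescued with the max date.
print(rescue_operating_date([date(2010, 6, 1), date(2010, 8, 1)]))  # 2010-08-01
# Dates many years apart: left null for manual review.
print(rescue_operating_date([date(1995, 1, 1), date(2010, 8, 1)]))  # None
```

A function like this would only be applied to generators that fail the consistency check; generators with widely separated dates stay null so that any override is a deliberate decision.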
Architecturally, we really need to refactor the harvesting process to cleanly allow different column-specific harvesting functions, since the right methods vary depending on the values being harvested. We've known this for a long time. If you're in there anyway @e-belfer and have thoughts on how we might move in that direction please take some notes! We got some ways toward this with
As a result of PR #3419, the following plant/generators from your original spreadsheet should get rescued, with the following operating dates used. This doesn't capture all of the generators, but it covers a fair proportion of them (44 of 59).
Describe the bug
About 5 GW of generator capacity (59 operating generators across 26 plants) from 2001 to 2023 have missing generator operating dates in both the monthly and annual EIA 860 PUDL tables. All of these generators have an operating date reported for them in the latest (Dec 2023) 860M.
Bug Severity
How badly is this bug affecting you?
To Reproduce
Steps to reproduce the behavior -- ideally including a code snippet that causes the error to appear.
Source: PUDL SQLite downloaded from AWS, build on 2/2/23.
Tables: out_eia__monthly_generators and out_eia__yearly_generators
Comparison source : https://www.eia.gov/electricity/data/eia860m/xls/december_generator2023.xlsx
Here is a spreadsheet containing the plant and generator ids, annual report dates (where gen operating date is missing), and operational status column (for verification that this generator was marked as operating at the time).
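For readers who want to check this themselves, here is a hedged sketch of the kind of query involved, run against a tiny in-memory stand-in for the yearly table. The column names (`plant_id_eia`, `generator_id`, `report_date`, `operational_status`, `generator_operating_date`), the status value `'existing'`, and all rows are assumptions for illustration; check them against the actual PUDL schema before running anything like this on the real database.

```python
import sqlite3

# In-memory stand-in for the real table; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE out_eia__yearly_generators (
        plant_id_eia INTEGER,
        generator_id TEXT,
        report_date TEXT,
        operational_status TEXT,
        generator_operating_date TEXT
    )
    """
)
rows = [
    (1, "GT1", "2022-01-01", "existing", None),          # operating, date missing
    (1, "D1", "2022-01-01", "existing", "2005-03-01"),   # operating, date present
    (2, "D2", "2022-01-01", "retired", None),            # not operating
]
conn.executemany(
    "INSERT INTO out_eia__yearly_generators VALUES (?, ?, ?, ?, ?)", rows
)

# Operating generators whose operating date is null.
QUERY = """
    SELECT plant_id_eia, generator_id, report_date
    FROM out_eia__yearly_generators
    WHERE operational_status = 'existing'
      AND generator_operating_date IS NULL
    ORDER BY plant_id_eia, generator_id, report_date
"""
missing = conn.execute(QUERY).fetchall()
print(missing)  # [(1, 'GT1', '2022-01-01')]
```

Pointing `sqlite3.connect` at a downloaded PUDL SQLite file instead of `:memory:` (and dropping the table setup) would run the same query against real data, assuming the column names match.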
Expected behavior
The generator operating date column should be filled in with the value from the raw EIA 860M.
Software Environment?
Additional context
Three generators with non-expected dates filled in:
cc'ing @arengel for visibility