Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operating generators missing 'generator operating date' value in out_eia__monthly_generators and out_eia__yearly_generators pudl tables #3340

Closed
mariacastillo21 opened this issue Feb 3, 2024 · 7 comments · Fixed by #3419
Assignees
Labels
bug Things that are just plain broken. data-loss data that we expect should exist seem to be missing or dropped in pudl tables eia860 Anything having to do with EIA Form 860 harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies

Comments

@mariacastillo21
Copy link

mariacastillo21 commented Feb 3, 2024

Describe the bug

About 5 GW of generator capacity — 59 operating generators across 26 plants — from 2001 to 2023 have missing generator operating dates in both monthly and annual eia 860 pudl tables. All of these generators have an operating date reported for them per the latest (Dec 2023) 860m.

Bug Severity

How badly is this bug affecting you?

  • Medium: We could manually enter in these operating dates from raw EIA 860m, but would prefer to have them in the PUDL pipeline.

To Reproduce

Steps to reproduce the behavior -- ideally including a code snippet that causes the error to appear.
Source : Pudl_ sqlite downloaded from AWS build on 2/2/23.
Tables :_out_eia__monthly_generators and out_eia__yearly_generators
Comparison source : https://www.eia.gov/electricity/data/eia860m/xls/december_generator2023.xlsx

Here is a spreadsheet containing the plant and generator ids, annual report dates (where gen operating date is missing), and operational status column (for verification that this generator was marked as operating at the time).

Expected behavior

Generator operating date column to be filled in with value from raw EIA 860m

Software Environment?

  • Operating System: MacOS Ventura 13.2.1
  • Python version and distribution: Anaconda Python 3.10.6

Additional context

Three generators with non-expected dates filled in:

  • two generators (plant id 75, generator id D1 and D2) have operating month marked as 0
  • one generator (plant id 1298, generator idGT1) has operating month marked at 88

cc'ing @arengel for visibility

@mariacastillo21 mariacastillo21 added the bug Things that are just plain broken. label Feb 3, 2024
@zaneselvans zaneselvans added eia860 Anything having to do with EIA Form 860 harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies labels Feb 3, 2024
@zaneselvans
Copy link
Member

Off the top of my head I would guess that this is the result of multiple inconsistent operating dates being reported in different years/months of the EIA data, such that none of them was the clear choice for the correct value, but we will look into it Monday!

@zaneselvans zaneselvans added the data-loss data that we expect should exist seem to be missing or dropped in pudl tables label Feb 3, 2024
@e-belfer e-belfer moved this from New to Backlog in Catalyst Megaproject Feb 6, 2024
@e-belfer e-belfer self-assigned this Feb 6, 2024
@e-belfer
Copy link
Member

e-belfer commented Feb 6, 2024

@mariacastillo21 Thanks for the report, Maria! Just wanted to give you a status update, we're about to merge two PRs that will affect this table and its output (the latest quarter of 860M data and #3331), so once they're merged I'll attempt to reproduce this behavior and debug it, and then we can discuss next steps if any changes are required to fix this bug.

@e-belfer e-belfer moved this from Backlog to In progress in Catalyst Megaproject Feb 8, 2024
@e-belfer
Copy link
Member

e-belfer commented Feb 9, 2024

Zane's hunch was correct. Of the generators you sent me, 56 are marked as having inconsistent operating dates in the harvesting process, which means that we've intentionally set the operating date to NA. On average, the most consistent operating date for each record showed up in about 58% of all harvested records for that generator.

Some background on the harvesting process:
We harvest annual 860 data and the last month of 860M generator data, since the final annual data is generally more reliable. By default, if a static value isn't reported at least 70% of the time, we drop it (i.e., if half the time a generator reports an operating date of "1991-01-01" and half the time "2001-01-01", we decide neither is reliable).

Here's an example of what this looks like for a single generator - two different operational months get reported, and so we wind up with a null value.
image

For most generators this will be a change in the order of months, but for a smaller handful the time differences reported are many years apart:
image

If the goal is complete data rather than consistent data, there are a few ways to resolve this issue:

  • turn down the "strictness" threshold of the operating date (e.g., from 70 to 50%) in the harvesting process, which would rescue some of these edge cases
  • take the max value of operating date always when harvesting (e.g., if all years of 860 report "1991-01-01" but the last 860M data reports "1997-06-01", we take that value).
  • take the max value of operating date only when harvesting doesn't return a 70%+ consistent result

I'd be curious to hear from you and @arengel which of these three options would make the most sense for your purposes.

Three generators with non-expected dates filled in:

two generators (plant id 75, generator id D1 and D2) have operating month marked as 0
one generator (plant id 1298, generator idGT1) has operating month marked at 88

Concerning! Unfortunately, I'm not able to reproduce this, and I'm seeing that D1 and D2, and GT1 all wind up with NaT as their harvested operating date. Can you share the code you are using to see this issue, and let me know what table/column these values are observed in?

debug-harvesting-generator-date.zip - Here's the notebook I was using to work through this question - though note that you'll need to have some of the intermediate dagster assets locally to be able to run it. I'm happy to share pickled outputs for any steps you're curious about.

(Hopefully this all makes sense! The harvesting is a complex process, let me know if anything is unclear or you have more questions).

@arengel
Copy link
Collaborator

arengel commented Feb 9, 2024

Thanks @e-belfer for the sleuthing on this!

Looking at your chart about the distribution of differences in operating dates, I wonder if we can separate this into two simpler approaches. One for generators where the difference is less than a year or two and the other where it is longer. I think the former can be done simply and systematically taking either the max operating date or the operating date in the most recent release (logic being that data quality improves over time) and applying it when there isn't a X% consistent result.

I worry that if we did something like this in cases where the difference is large, we'll end up with generators reported as operating (and with net generation) in periods well before their operating date. And given that the set of generators where that could happen is very small, we figure out what the operating dates should be for those generators and set them as overrides. (The process for fixing dates should also probably check if plants are listed as operating before their operating date to catch new data that needs overrides if something like that doesn't already exist).

One concern though with taking the max operating date or most recently reported one (and potentially built into the 70% consistency check for generators that first began operation well before reporting began) is that you could end up with the date when the reporter decided that some change to the generator warranted a new operating date, not the first date when some part of that generator first operated. I don't know the form well enough to know if this is a real potential issue.

@e-belfer
Copy link
Member

That makes sense to me @arengel. Rescuing operation dates within a year seems like a reasonable first step, a relatively quick fix, and should address the bulk of the problem that Maria is pointing to. Further rescues beyond that might reflect substantial changes in equipment over time that shouldn't be flattened without manual investigation.

@zaneselvans
Copy link
Member

Architecturally, we really need to refactor the harvesting process to cleanly allow different column-specific harvesting functions, since the right methods vary depending on the values being harvested. We've known this for a long time. If you're in there anyway @e-belfer and have thoughts on how we might move in that direction please take some notes!

We got some ways toward this with pudl.metadata.classes.FieldHarvest but never finished implementing it, because of the entanglements of that refactor with standardizing the way we deal with report_date and different temporal granularities.

@e-belfer
Copy link
Member

As a result of the PR #3419, the following plant/generators from your original spreadsheet should get rescued, with the following operating dates used. This doesn't capture all of the plants, but a fair proportion of them (44 of 59).

plant_id_eia generator_id generator_operating_date
1336 S2 1973-04-01
S5 1979-07-01
2115 NG2 2000-05-01
SG1 2000-05-01
SG2 2000-05-01
4195 EP 2010-07-01
EP2 2010-07-01
EP3 2010-07-01
ST1 1966-08-01
ST2 1969-08-01
ST3 1977-08-01
6463 GT1 1972-05-01
8906 1 1967-07-01
3 1958-12-01
ST5 1962-12-01
10071 GEN1 1988-06-01
GEN2 1988-06-01
10298 GT3 2015-05-01
54477 GEN1 1990-05-01
GEN2 1990-05-01
GEN3 1990-05-01
GEN4 1990-05-01
GEN5 1990-05-01
GEN6 1990-05-01
GEN7 1990-05-01
55043 GT1 1998-07-01
ST1 1998-07-01
55063 CTG1 2000-08-01
CTG2 2000-08-01
CTG3 2000-08-01
STG1 2000-08-01
STG2 2000-08-01
STG3 2000-08-01
55151 GEN2 2003-03-01
GEN3 2003-03-01
55177 A 2001-06-01
B 2001-06-01
ST1 2001-06-01
55199 GT1 1999-09-01
55364 CT01 2002-07-01
CT02 2002-07-01
55622 U1 2002-06-01
U2 2002-06-01
59967 FCB 2008-09-01

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken. data-loss data that we expect should exist seem to be missing or dropped in pudl tables eia860 Anything having to do with EIA Form 860 harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies
Projects
Archived in project
4 participants