Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take the most recently reported generator operating date when there's no 70%+ consistently reported date #3697

Merged
merged 6 commits into from
Jun 26, 2024

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Jun 25, 2024

Overview

Closes #423.

What problem does this address?
Inconsistently reported generator operating dates have been a persistent concern for downstream users, as they wind up being reported as nulls in our entity table after harvesting (see #423, #3340, e.g.). We attempted earlier to adopt a 1 year threshold for rounding and harvesting these dates, but it still left us with a small number of nulls. This PR imposes a more consistent and complete treatment of generator operating dates when they are inconsistently reported.

What did you change?
When generator operating dates are inconsistently reported in 860 and 860M, we default to taking the most recently reported generator operating date. Of the small number of generator operating dates that we could confirm were incorrectly reported, the most recently reported dates wound up being more factually correct (see #423). Rather than having a mix of methods for filling in generator operating dates, let's just adopt a blanket treatment here and assume that the most recent date is an acceptable substitute for a NAN.

Testing

How did you make sure this worked? How can a reviewer verify this?
Run the debug_harvesting notebook, and compare the outputs with the generators identified in #423.

To-do list

Preview Give feedback

@e-belfer e-belfer added bug Things that are just plain broken. data-loss data that we expect should exist seem to be missing or dropped in pudl tables harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies labels Jun 25, 2024
@e-belfer e-belfer requested a review from zaneselvans June 25, 2024 17:11
@e-belfer e-belfer self-assigned this Jun 25, 2024
@e-belfer e-belfer changed the title Take the most recently reported generator operating date when there's no consistency. Take the most recently reported generator operating date when there's no 70%+ consistently reported date Jun 25, 2024
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking but:

  • I made one small suggestion to simplify the "there's no nulls" test.
  • Rounding to 0.1 degrees (up to 7 miles) feels like it could be too aggressive?

@@ -1826,7 +1826,7 @@
},
"generator_operating_date": {
"type": "date",
"description": "Date the generator began commercial operation.",
"description": "Date the generator began commercial operation. If harvested values are <70 percent consistent, we default to using the most recently reported date.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not want to hard-code the required consistency in the description since it can be changed, and we have changed it in some special cases.

f"{sorted(op_df[entity_idx].apply(lambda row: '_'.join(row.to_numpy().astype(str)), axis=1))}"
)
# add the newly cleaned records
op_clean_df = pd.concat([op_clean_df, op_df])
# assert all generator operating dates are not null
assert len(op_clean_df[op_clean_df.generator_operating_date.isnull()]) == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor non-blocking nit, but it feels more readable to use any()/all() for this kind of test. E.g.

Suggested change
assert len(op_clean_df[op_clean_df.generator_operating_date.isnull()]) == 0
assert op_clean_df.generator_operating_date.notnull().all()

@e-belfer
Copy link
Member Author

Nothing blocking but:

* Rounding to 0.1 degrees (up to 7 miles) feels like it could be too aggressive?

That decision long pre-dates this PR, and the only changes I'm making here are to just change the format of how we're storing the number. So that question feels out of scope here.

@@ -601,7 +601,12 @@
],
"primary_key": ["plant_id_eia", "generator_id"],
"foreign_key_rules": {
"fields": [["plant_id_eia", "generator_id"]],
"fields": [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this snuck in from another PR!

@zaneselvans zaneselvans added this pull request to the merge queue Jun 26, 2024
@zaneselvans
Copy link
Member

The offending version of twine has been yanked so this is working now. I went ahead and added it to the merge queue so it can make it into the builds tonight.

Merged via the queue into main with commit d6c27b1 Jun 26, 2024
12 checks passed
@zaneselvans zaneselvans deleted the gen_op_date_fill branch June 26, 2024 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken. data-loss data that we expect should exist seem to be missing or dropped in pudl tables harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

generators_entity_eia operating_date field blank when op dates disagree
2 participants