-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Take the most recently reported generator operating date when there's no 70%+ consistently reported date #3697
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing blocking but:
- I made one small suggestion to simplify the "there's no nulls" test.
- Rounding to 0.1 degrees (up to 7 miles) feels like it could be too aggressive?
src/pudl/metadata/fields.py
Outdated
@@ -1826,7 +1826,7 @@ | |||
}, | |||
"generator_operating_date": { | |||
"type": "date", | |||
"description": "Date the generator began commercial operation.", | |||
"description": "Date the generator began commercial operation. If harvested values are <70 percent consistent, we default to using the most recently reported date.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might not want to hard-code the required consistency in the description since it can be changed, and we have changed it in some special cases.
src/pudl/transform/eia.py
Outdated
f"{sorted(op_df[entity_idx].apply(lambda row: '_'.join(row.to_numpy().astype(str)), axis=1))}" | ||
) | ||
# add the newly cleaned records | ||
op_clean_df = pd.concat([op_clean_df, op_df]) | ||
# assert all generator operating dates are not null | ||
assert len(op_clean_df[op_clean_df.generator_operating_date.isnull()]) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a minor non-blocking nit, but it feels more readable to use any()
/all()
for this kind of test. E.g.
assert len(op_clean_df[op_clean_df.generator_operating_date.isnull()]) == 0 | |
assert op_clean_df.generator_operating_date.notnull().all() |
That decision long pre-dates this PR, and the only changes I'm making here are to just change the format of how we're storing the number. So that question feels out of scope here. |
src/pudl/metadata/resources/eia.py
Outdated
@@ -601,7 +601,12 @@ | |||
], | |||
"primary_key": ["plant_id_eia", "generator_id"], | |||
"foreign_key_rules": { | |||
"fields": [["plant_id_eia", "generator_id"]], | |||
"fields": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this snuck in from another PR!
The offending version of twine has been yanked so this is working now. I went ahead and added it to the merge queue so it can make it into the builds tonight. |
Overview
Closes #423.
What problem does this address?
Inconsistently reported generator operating dates have been a persistent concern for downstream users, as they wind up being reported as nulls in our entity table after harvesting (see #423, #3340, e.g.). We attempted earlier to adopt a 1 year threshold for rounding and harvesting these dates, but it still left us with a small number of nulls. This PR imposes a more consistent and complete treatment of generator operating dates when they are inconsistently reported.
What did you change?
When generator operating dates are inconsistently reported in 860 and 860M, we default to taking the most recently reported generator operating date. Of the small number of generator operating dates that we could confirm were incorrectly reported, the most recently reported dates wound up being more factually correct (see #423). Rather than having a mix of methods for filling in generator operating dates, let's just adopt a blanket treatment here and assume that the most recent date is an acceptable substitute for a NAN.
Testing
How did you make sure this worked? How can a reviewer verify this?
Run the debug_harvesting notebook, and compare the outputs with the generators identified in #423.
To-do list