Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-gigger backfilling technology_description & make prime_mover_code an annually harvested column #1600

Merged
merged 10 commits into from
May 5, 2022

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Apr 29, 2022

Update: I incorporated a change of prime_mover_code from a "static" -> "annual" harvested/normalized column!!

The previous fill-in methodology was filling in the technology_description by backfilling and then using the unique mapping between ESC and tech type. This PR edits the unique mapping fill-in portion to ALSO include the prime_mover_code.

Because this suggestion includes a two-column:one-column map, I had to convert the map(dict) methodology into a split-apply via merge-combine methodology.

how I got here...

I came across this by looking at the older years of plant_id_eia == 1961. this generator was converted into a gas plant so the backfilling didn't work. And this was an energy_source_code_1 == "BIT" generator. BIT has either mapped to Conventional Steam Coal or Coal Integrated Gasification Combined Cycle tech types. This is not a unique mapping so no BIT's were getting filled in.

Out of 135 potential esc/pm -> tech mappings, only 7 of them had non-unique map:

test = (
    gens.groupby(["energy_source_code_1", "prime_mover_code"])
    [["technology_description"]]
    .nunique()
)
len(test)
> 135

image

@cmgosnell cmgosnell added output Exporting data from PUDL into other platforms or interchange formats. data-repair Interpolating or extrapolating data that we don't actually have. rmi labels Apr 29, 2022
@cmgosnell cmgosnell requested a review from aesharpe April 29, 2022 19:12
@codecov
Copy link

codecov bot commented Apr 29, 2022

Codecov Report

Merging #1600 (0dd285d) into dev (5dbc337) will increase coverage by 0.0%.
The diff coverage is 100.0%.

@@          Coverage Diff          @@
##             dev   #1600   +/-   ##
=====================================
  Coverage   84.0%   84.0%           
=====================================
  Files         65      65           
  Lines       7176    7181    +5     
=====================================
+ Hits        6034    6039    +5     
  Misses      1142    1142           
Impacted Files Coverage Δ
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/output/eia860.py 66.5% <100.0%> (+0.6%) ⬆️
src/pudl/transform/eia860.py 96.4% <100.0%> (+<0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5dbc337...0dd285d. Read the comment docs.

Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks robust to me! Just a few bb comments

@@ -368,13 +367,15 @@ def generators_eia860(


def fill_generator_technology_description(gens_df: pd.DataFrame) -> pd.DataFrame:
"""Fill in missing ``technology_description`` based on generator and energy source.
"""Fill in missing ``technology_description`` based by backfilling & unquie mapping.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny comment, but unique is spelled wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol of course ty!

Comment on lines 380 to 381
As a result, more than 95% of all generator records end up having a
``technology_description`` associated with them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what the coverage is now that you've integrated prime_mover_code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha its 97%... which is a HUGE improvement from 96%.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with moving the PM code.... this is now 98.1% 😎

cmgosnell added 5 commits May 2, 2022 11:19
I tired adding the PM code into the backfilling. this resulted in 
*sliiightly* less tech-types (a grand total of 1609 record) all from 
generators that have no PM code. After testing the staging.. it felt 
better to do the backfilling w/ the completely consistent map between 
PM/ESC:Tech first bc it feels pretty conserative. And then we come back 
in w/ the bfill w/o the pm code
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@cmgosnell
Copy link
Member Author

I added another layer of this PR. Now, ontop of the backfilling of the technology_description, I've also moved the prime_mover_code from a "static" to an "annual" harvested column. See details on why from this comment

@cmgosnell
Copy link
Member Author

ALSO i still need to run tox -e nuke to run all the validation tests before merging.... this feels important given the change in the table structure. (even though - believe it or not - there seem to be NO changes required in the output layer)

@cmgosnell cmgosnell changed the title Fill in tech w/ energy_source_code_1 & prime_mover_code Re-gigger backfilling technology_description & make prime_mover_code an annually harvested column May 3, 2022
@cmgosnell
Copy link
Member Author

Okay @aesharpe I changed a few things and ran all of the validation tests and they check out!!

@aesharpe
Copy link
Member

aesharpe commented May 4, 2022

Update: I incorporated a change of prime_mover_code from a "static" -> "annual" harvested/normalized column!!

Does this mean that before now we were overwriting certain PM values based on whatever we decided the "static" value was?

Are there any tables that might get messed up by the addition of PM to annual? Or new documentation to add?

@aesharpe
Copy link
Member

aesharpe commented May 4, 2022

this feels important given the change in the table structure.

Which table structure changes? The annual table? The output tables would only change insofar as the PM column differs based on the new annual data, right, so no structure changes there?

@cmgosnell
Copy link
Member Author

@aesharpe good questions

Does this mean that before now we were overwriting certain PM values based on whatever we decided the "static" value was?

Yes. 98% of PM's never change ever so we lumped this columns into the "static" (i.e. you never change ever and any change is probably a data entry issue)... but withing that 2% there were some clear instances of actual change.

Are there any tables that might get messed up by the addition of PM to annual? Or new documentation to add?

Fortunately, the only mode of access to these tables that we've baked in is through pudl.output.eia860.generators_eia860. And this output function was always merging all of the columns of the annually harvested tables and the static harvested generator tables together. So weirdly no output methods needed updating. I was vv surprised but that is the niceness of layering our access via the output tables.

On documentation, the docs should be built off of the metadata and that is now updated. This change should definitely be added to release notes when we make a new release. I've already flagged this for @zaneselvans... (should we make a running 0.7.0 (2022-XX-XX) section??)

Which table structure changes? The annual table?

A column (prime_mover_code) got moved from the static table (generators_entity_eia) to the annual table (generators_eia860).

The output tables would only change insofar as the PM column differs based on the new annual data, right, so no structure changes there?

Exactly. The structure of pudl_out.gens_eia80() is exactly the same. Some of the PM codes of those 2% of gens that differ in time will be different.

@zaneselvans
Copy link
Member

Yes, if there isn't already a v0.7.0 section in the release notes definitely add one. We should make edits to the release notes in every significant PR.

@aesharpe aesharpe merged commit 3b9e6bd into dev May 5, 2022
@zaneselvans zaneselvans deleted the non_static_tech branch May 5, 2022 04:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-repair Interpolating or extrapolating data that we don't actually have. output Exporting data from PUDL into other platforms or interchange formats. rmi
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants