Re-gigger backfilling technology_description & make prime_mover_code an annually harvested column #1600

Merged
merged 10 commits into from
May 5, 2022
44 changes: 27 additions & 17 deletions src/pudl/output/eia860.py
@@ -1,7 +1,6 @@
 """Functions for pulling data primarily from the EIA's Form 860."""
 
 import logging
-from collections import defaultdict
 
 import pandas as pd
 import sqlalchemy as sa
@@ -368,13 +367,15 @@ def generators_eia860(


 def fill_generator_technology_description(gens_df: pd.DataFrame) -> pd.DataFrame:
-    """Fill in missing ``technology_description`` based on generator and energy source.
+    """Fill in missing ``technology_description`` based by backfilling & unquie mapping.
Member

Tiny comment, but unique is spelled wrong

Member Author

lol of course ty!


     Prior to 2014, the EIA 860 did not report ``technology_description``. This
     function backfills those early years within groups defined by ``plant_id_eia``,
-    ``generator_id`` and ``energy_source_code_1``. Some remaining missing values are
-    then filled in using the consistent, unique mappings that are observed between
-    ``energy_source_code_1`` and ``technology_type`` across all years and generators.
+    ``generator_id`` and ``energy_source_code_1``.
+
+    Some remaining missing values are then filled in using the consistent,
+    unique mappings that are observed between ``energy_source_code_1``,
+    ``prime_mover_code`` and ``technology_type`` across all years and generators.

     As a result, more than 95% of all generator records end up having a
     ``technology_description`` associated with them.
Comment on lines 381 to 382
Member

Do you know what the coverage is now that you've integrated prime_mover_code?

Member Author

ha its 97%... which is a HUGE improvement from 96%.

Member Author

with moving the PM code.... this is now 98.1% 😎
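
For anyone wanting to check coverage numbers like these locally, here is a minimal sketch of how the share of filled records could be measured before and after a group-wise backfill. The toy data and the `coverage` helper are hypothetical, and the backfill shown is only an assumption about the approach the docstring describes, not the actual PUDL implementation.

```python
import pandas as pd

# Hypothetical toy data: one generator gains a description in 2014,
# another never reports one.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "generator_id": ["a", "a", "a", "b", "b"],
        "energy_source_code_1": ["NG", "NG", "NG", "SUN", "SUN"],
        "report_date": pd.to_datetime(
            ["2012-01-01", "2013-01-01", "2014-01-01", "2012-01-01", "2013-01-01"]
        ),
        "technology_description": [
            None,
            None,
            "Natural Gas Fired Combined Cycle",
            None,
            None,
        ],
    }
)


def coverage(df: pd.DataFrame) -> float:
    """Return the fraction of records with a non-null technology_description."""
    return df["technology_description"].notna().mean()


# Sort chronologically so that bfill() carries later values back to earlier years.
gens = gens.sort_values("report_date")
filled = gens.copy()
filled["technology_description"] = gens.groupby(
    ["plant_id_eia", "generator_id", "energy_source_code_1"]
)["technology_description"].bfill()

print(f"before: {coverage(gens):.1%}, after: {coverage(filled):.1%}")
```

Generators that never report a description in any year (the second one here) stay null after this step, which is the gap the unique-mapping step is meant to narrow.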

@@ -399,20 +400,29 @@ def fill_generator_technology_description(gens_df: pd.DataFrame) -> pd.DataFrame
     )

     # Fill in remaining missing technology_descriptions with unique correspondences
-    # between energy_source_code_1 where possible. Use a default value of pd.NA
-    # for any technology_description that isn't uniquely identified by energy source
-    static_fuels = defaultdict(
-        lambda: pd.NA,
-        gens_df.dropna(subset=["technology_description"])
-        .drop_duplicates(subset=["energy_source_code_1", "technology_description"])
-        .drop_duplicates(subset=["energy_source_code_1"], keep=False)
-        .set_index("energy_source_code_1")["technology_description"]
-        .to_dict(),
+    # between energy_source_code_1 and prime_mover_code where possible.
+    # get a unique map between ESC/PM and technology_description
+    esc_pm_to_tech = (
+        out_df.loc[
+            :, ["energy_source_code_1", "prime_mover_code", "technology_description"]
+        ]
+        .dropna(how="any")
+        .drop_duplicates(keep="first")
+        .drop_duplicates(  # if there are any duplicates w/in esc/pm combo.. it's gotta go
+            subset=["energy_source_code_1", "prime_mover_code"], keep=False
+        )
     )

-    out_df.loc[
-        out_df.technology_description.isna(), "technology_description"
-    ] = out_df.energy_source_code_1.map(static_fuels)
+    no_tech_mask = out_df.technology_description.isnull()
+    has_tech = out_df[~no_tech_mask]
+    no_tech = pd.merge(
+        out_df[no_tech_mask].drop(columns=["technology_description"]),
+        esc_pm_to_tech,
+        on=["energy_source_code_1", "prime_mover_code"],
+        how="left",
+        validate="m:1",
+    )
+    out_df = pd.concat([has_tech, no_tech])

     assert len(out_df) == nrows_orig
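
To make the two-step `drop_duplicates` logic in this hunk easier to follow, here is a small standalone sketch of the same pattern on toy data. The helper name `fill_by_unique_mapping`, the toy frame, and the technology names in it are hypothetical, and `ignore_index=True` is a choice made for this sketch rather than something the PR does.

```python
import pandas as pd


def fill_by_unique_mapping(df: pd.DataFrame, keys: list, target: str) -> pd.DataFrame:
    """Fill nulls in ``target`` where ``keys`` map to exactly one observed value."""
    mapping = (
        df.loc[:, [*keys, target]]
        .dropna(how="any")
        # Collapse repeated (keys, target) rows down to one row each...
        .drop_duplicates()
        # ...then drop every key combo that still appears more than once,
        # because it maps to several different target values.
        .drop_duplicates(subset=keys, keep=False)
    )
    missing = df[target].isna()
    filled = pd.merge(
        df[missing].drop(columns=[target]),
        mapping,
        on=keys,
        how="left",
        validate="m:1",  # raises if the mapping is not unique per key combo
    )
    # ignore_index avoids duplicate index labels, since merge resets the index.
    return pd.concat([df[~missing], filled], ignore_index=True)


toy = pd.DataFrame(
    {
        "energy_source_code_1": ["NG", "NG", "NG", "NG", "NG", "SUN"],
        "prime_mover_code": ["CT", "CT", "ST", "ST", "ST", "PV"],
        "technology_description": [
            "Natural Gas Fired Combustion Turbine",
            None,  # filled: NG/CT maps to a single technology
            "Natural Gas Steam Turbine",
            "Other Natural Gas",
            None,  # stays null: NG/ST is ambiguous
            None,  # stays null: SUN/PV was never observed with a technology
        ],
    }
)

result = fill_by_unique_mapping(
    toy,
    keys=["energy_source_code_1", "prime_mover_code"],
    target="technology_description",
)
print(result)
```

Note that the row order changes, because filled rows are appended after the rows that already had a value. The total row count is preserved, which is what the `assert` above checks.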
