Who's putting NaNs in my exchange tables? #246

dt-woods · 2024-08-02T15:03:34Z

Processes are failing to load in openLCA due, in part, to a large number of NaNs found as amounts in process exchanges.

dt-woods · 2024-08-02T15:05:18Z

Going line by line through the generation process inventory generation, everything seems okay except for HYDRO. The problem comes from a blank line in the source CSV file for facility 1784, Twin Falls (MI).

ElectricityLCI/electricitylci/data/hydropower_plant.csv

Line 314 in 2232c41

312,1784,Twin Falls (MI),RFC,22031.999999999996,,,,

UPDATE:

From G. Zaimes,

If I remember correctly, we trained a predictive model using several variables such as the area of the hydroelectric impoundment, age, latitude, longitude, and other factors. It may be that some of these factors were not defined or available for the plant you noted, although it is difficult for me to comment without looking back in the data/code. Another reason could be that the Hydropower model is calibrated to estimate emissions from conventional Hydropower facilities and not Run-Of-River facilities. It may be that the Twin Falls plant is a RoR Hydropower plant and thus we could not provide an accurate estimate of its GHG emissions intensity using our existing model (this might be the more likely culprit). I believe there was some logic in the Hydropower model to check whether a facility is Conventional vs RoR, it may be through double checking to see if this is causing the NaN entries you observe.

dt-woods · 2024-08-02T15:07:06Z

Found a second culprit in hydro upstream where "water" was misspelled. Now there are only three NaNs caused by the blank line above.

ElectricityLCI/electricitylci/hydro_upstream.py

Line 59 in 2232c41

    
           hydro_df.loc[hydro_df["FlowName"]=="Water, fresh","FlowAmount"]=hydro_df.loc[hydro_df["FlowName"]=="Waster, fresh","FlowAmount"]*1000

dt-woods · 2024-08-02T15:22:09Z

Everything's good with generation processes, but when combined with upstream processes through __init__.write_gen_fuel_database_to_dict --> generation.olcaschema_genprocesses, loads of NaNs appear.

dt-woods · 2024-08-02T17:22:17Z

In main.py, the data frame returned by get_generation_process_df converts the column "FlowAmount" from float to object. A clue!

dt-woods · 2024-08-02T18:20:51Z

In main.py, the data frame returned by get_generation_process_df converts the column "FlowAmount" from float to object. A clue!

It looks like the culprit is generation.py's create_generation_process_df, which returns object (i.e., str) flow amounts.

dt-woods · 2024-08-02T18:55:31Z

FOUND!
Setting the NOx (lbs) column to NoneType makes it string. Remove it!

https://github.com/KeyLogicLCA/ElectricityLCI/blob/2232c41f2cb4fd333ad59c8710aa55906e6a7ed3/electricitylci/ampd_plant_emissions.py#L606

dt-woods · 2024-08-05T19:12:36Z

A couple of new updates for Monday.

ElectricityLCI's get_gen_plus_netl function creates all the renewable fuel data frames (i.e., geo, solar, wind, and hydro). All fuels, except hydro, are matched to their respective EIA generation year (e.g., 2022). Hydro, on the other hand, is always set to 2016. Notably, the year 2016 is not found in the years of interest (from the inventories of interest) for any of the new model configs (i.e., 2020–2022). This becomes important when considering further down in get_get_plus_netl where generation.py's create_generation_process_df function is called. This method scans over all inventories of interest (e.g., 2019–2021) to gather up all the relevant facilities (note how 2016 facilities are not included).
The last little tidbit from get_gen_plus_netl calls combinator.py's concat_clean_upstream_and_plant, which, in turn runs fill_nans, which attempts to fill missing facility IDs based on the current year's generation data (EIA generation year), which may be 2022 and is not a part of any of the years for the inventory (i.e., hydro is 2016, and inventory years of interest are 2019–2021). 😟

More to come from this investigation.

…ses USEPA#204

addresses USEPA#246; still need to test for nans in exchange table after this fix

dt-woods · 2024-08-12T20:56:31Z

After all the fixes above; the nans in the exchange table remain.
Further tracing leads me to generation.py's olcaschema_genprocess method.
Walking through the turn_data_into_dict leads to this interesting bit of logic:

# Define inputs based on compartment label
data["input"] = False
input_filter = (
    (data["Compartment"].str.lower().str.contains("input"))
    | (data["Compartment"].str.lower().str.contains("resource"))
    | (data["Compartment"].str.lower().str.contains("technosphere"))
)
data.loc[input_filter, "input"] = True

# Define products based on compartment label
product_filter=(
    (data["Compartment"].str.lower().str.contains("technosphere"))
    | (data["Compartment"].str.lower().str.contains("valuable"))
)
data.loc[product_filter, "FlowType"] = "PRODUCT_FLOW"

# Define wastes based on compartment label
waste_filter = (
    (data["Compartment"].str.lower().str.contains("technosphere"))
)
data.loc[waste_filter, "FlowType"] = "WASTE_FLOW"

If I'm right, the compartment with "technosphere" in its name is always labeled with a flow type as a waste flow, due to the third clause above, so why have it the product flow query?

ElectricityLCI/electricitylci/generation.py

Line 1091 in e562681

product_filter=(

@m-jamieson, do you remember how this labeling works?

dt-woods · 2024-08-12T21:58:16Z

Okay. The NaNs are coming from aggregate_data in generation.py somewhere. Two possible paths forward:

Keep digging through the aggregation method to see where they coming from and handle appropriately.
Modify the turn_data_to_dict to drop exchanges that have nan as their emission factor.

m-jamieson · 2024-08-13T17:01:29Z

I'm alright with 2 - I mean that's basically what happened in post-processing in the first fed commons database. Would be nice to get some idea of what's getting dropped though and why. It can only either be the sum of the emissions of the aggregated group or the electricity. I would guess the electricity route is more likely.

dt-woods · 2024-08-13T18:26:16Z

The problem is with generation_process_df.
There are NaNs in the Emission_factor column associated with these flows;

'Carbon dioxide',
'Methane',
'Nitrogen oxides',
'Nitrous oxide',
'Sulfur dioxide',
'natural gas, through transmission',
'petroleum fuel, through transportation',
'coal, transported',
'coal, processed, at mine',
'power plant construction',
'Ammonia',
'Benzene',
'Ethylbenzene',
'Hexane',
'Hydrochloric acid',
'Lead',
'Manganese',
'Mercury',
'Naphthalene',
'Nickel',
'Toluene',
'Xylene',
'Polycyclic aromatic compounds',
'Benzo[ghi]perylene',
'nuclear fuel, through transportation',
'Anthracene',
'Phenanthrene',
'Hydrogen sulfide',
'Hydrofluoric acid',
'Chlorine',
'Styrene',
'Cobalt',
'Ethylene glycol'

Across just about all BAAs and stage codes.

Note that power plant construction, coal, natural gas, and petroleum NaNs are the cause for issue #250

m-jamieson · 2024-08-13T18:51:04Z

eh...that's a lot....like basically everything. This also suggests that all data sources are involved too. Is the current Keylogic development branch the most reasonable one to try and run to replicate? And then what data year are you running to get these nans?

dt-woods · 2024-08-13T18:53:40Z

Yes and ELCI_2022_config.yaml

m-jamieson · 2024-08-13T21:17:09Z

Just adding some things here as I debug in case I forget to come back to this later. As best as I can tell (or as pandas tells me) there are no NaN FlowAmounts or Electricity in the dataframe prior to the aggregation. There are however a significant number of NaNs in some of the columns that are included in the groupby (like Balancing Authority Name and FuelCategory). While I think this should mean that those facilities aren't included in the groupby, it's probably still not good.

m-jamieson · 2024-08-13T22:04:18Z

As suspected, when the electricity amounts are merged back with the aggregated flow amounts here, there are a bunch of missing electricity amounts in the resulting database_f3. Which suggests that the electricity amounts aren't being summed correctly in calculate_electricity_by_source. And this occurs for mutliple data source combinations:

NEI_ampd_ap42_eGRID
NEI_ap42_eGRID
ampd_ap42_eGRID
NEI_RCRAInfo_TRI_ampd_ap42_eGRID_eia_netl
NEI_TRI

m-jamieson · 2024-08-13T22:30:13Z

Never mind it's a year mis-match problem. For example:
Arizona Public Service Company - BIOMASS - Carbon dioxide has source string "NEI_ampd_ap42_eGRID" and year 2022.
The electricity dataframe has a matching source string "NEI_ampd_ap42_eGRID" but it's for year 2021.

Just to describe what's supposed to be happening here. There are two biomass plants reporting CO2 emissions in the Arizona Public Service Company BA in the year 2022. We can easily sum these. The electricity that's supposed to be assigned to this aggregation is from all plants that reported to the same emission sources. In this case, these two plants report to NEI, ampd, ap42, and eGRID - or at least that's what the fetched data suggests. But there are very few years where plants will be reporting to all the same data sources in the same year. This very likely should be source string "ampd_ap42" for year 2022 and "NEI" for year 2021 or something like that. calculate_electricity_by_source likely needs to be tweaked to differentiate data reporting by year or fix how it's done now.

m-jamieson · 2024-08-13T23:21:32Z

I think this commit created the current issue: KeyLogicLCA@946463a

We groupby a different set of columns "grouby_cols" than before ["FlowName","Compartment"] because before we were trying to omit flows that were only ever reported by one source...it doesn't really matter what year the data comes from at that point because there will only ever be a single year for a single data source. However, when multiple sources are involved, we need to make sure there's source grouping per year to prevent the missing electricity flows as discussed in the previous comment.

m-jamieson · 2024-08-14T02:09:46Z

After replacing the ["FlowName","Compartment"] in the groupbys below, all the blank electricity amounts for the emissions go away, leaving just the inputs as blank electricity, which are all source_list "NEI_RCRAInfo_TRI_ampd_ap42_eGRID_eia_nei" or in other words all sources. Coming into calculate_electricity_by_source the source for all the fuel inputs is "eia", water inputs are "netl", and heat inputs are "eGRID." I think the original idea here is that we'd sum up electricity from all sources to divide fuel inputs by electricity from all plants in the matching year. I think the drop_duplicates happening here needs to have "Year" added to the list ["eGRID_ID"], so ["eGRID_ID","Year"] to make sure that all years of all sources are calculated and available to join to the fuel inputs later on. I'm not 100% satisfied with this because I think the "source" for the fuel inputs should say "eia" once they make it into openLCA.

            source_df = pd.DataFrame(
                db_multiple_sources.groupby(groupby_cols)[
                    ["Source"]].apply(combine_source_lambda),
                columns=["source_list"],
            )
            source_df[["source_list", "source_string"]] = pd.DataFrame(
                source_df["source_list"].values.tolist(),
                index=source_df.index
            )
            source_df.reset_index(inplace=True)
            db_multiple_sources.drop(
                columns=["source_list", "source_string"], inplace=True
            )
            old_index = db_multiple_sources.index
            db_multiple_sources = db_multiple_sources.merge(
                right=source_df,
                left_on=groupby_cols,
                right_on=groupby_cols,
                how="left",
            )

m-jamieson · 2024-08-14T11:17:02Z

Was able to do a test run with the fix above, and it worked as intended - fuel flows are again getting electricity matches, but the facility counts aren't lining up. From an emissions perspective, this is an expected part of the method - the number of reported emissions of a species can be less than the number of facilities reporting electricity generation. For fuel inputs though, this means that resulting plant processes are being more efficient than they should be. In the test run that I did PJM GAS has electricity summed for 170 facilities, but gas inputs from Appalachian and Arkoma basins only sum to 143. Guess this comes down to not having entries in gas_supply_basin_mapping.csv.

As best as I can tell, there are electricity flows for all inputs coming into aggregate, so maybe the answer here is to not separate between "power" and "nonpower" and just run the electricity summing for all entries.

m-jamieson · 2024-08-14T13:50:11Z

Tried the test mentioned in the previous comment and am still getting large mismatches in facility counts due to the lack of matching basin matching data. Power plant construction amounts are listed as eia data and are automatically generated, so there ends up being more of them.

I guess some one solution here are to make special labels for fuel amounts like "eia_fuel" or change construction or both/all. I think adding different labels for the different data sources - like "netlgas" as the source for the upstream gas inventory makes sense and will be clearer in the resulting openLCA metadata. Hard part will be finding all the holes where maybe source wasn't explicitly defined because there are a couple places where Source is defined across combinations of dataframes

… tables? USEPA#246 New source strings were added to differentiate between sources. This allows better alignment with calculated total electricity values when the data are aggregated and avoids NaNs particularly when some of the internal data haven't been updated to the latest year.

… tables? USEPA#246

dt-woods added the bug label Aug 2, 2024

dt-woods added a commit to KeyLogicLCA/ElectricityLCI that referenced this issue Aug 6, 2024

address USEPA#247; address USEPA#246 (include 2016 for hydro); addres…

313dd5c

…ses USEPA#204

dt-woods added a commit to KeyLogicLCA/ElectricityLCI that referenced this issue Aug 8, 2024

addresses USEPA#246

e8f11b8

dt-woods added a commit to KeyLogicLCA/ElectricityLCI that referenced this issue Aug 9, 2024

appears to fix eGRID_ID nans in get_gen_plus_netl

fa14f32

addresses USEPA#246; still need to test for nans in exchange table after this fix

dt-woods added a commit to KeyLogicLCA/ElectricityLCI that referenced this issue Aug 13, 2024

option 2---remove Nans from exchange table; addresses USEPA#246

563244a

dt-woods added a commit to KeyLogicLCA/ElectricityLCI that referenced this issue Aug 13, 2024

mostly just documentation; addresses USEPA#246 and USEPA#204

43fb567

m-jamieson added a commit to KeyLogicLCA/ElectricityLCI that referenced this issue Aug 19, 2024

missed adding a source for nuclear. Who's putting NaNs in my exchange…

d52a26a

… tables? USEPA#246

dt-woods added the in v2 Issues marked for inclusion in version 2 label Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Who's putting NaNs in my exchange tables? #246

Who's putting NaNs in my exchange tables? #246

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024 •

edited

Loading

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024 •

edited

Loading

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024

dt-woods commented Aug 5, 2024 •

edited

Loading

dt-woods commented Aug 12, 2024

dt-woods commented Aug 12, 2024 •

edited

Loading

m-jamieson commented Aug 13, 2024

dt-woods commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

dt-woods commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 14, 2024

m-jamieson commented Aug 14, 2024

m-jamieson commented Aug 14, 2024 •

edited

Loading

Who's putting NaNs in my exchange tables? #246

Who's putting NaNs in my exchange tables? #246

Comments

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024 • edited Loading

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024 • edited Loading

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024

dt-woods commented Aug 2, 2024

dt-woods commented Aug 5, 2024 • edited Loading

dt-woods commented Aug 12, 2024

dt-woods commented Aug 12, 2024 • edited Loading

m-jamieson commented Aug 13, 2024

dt-woods commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

dt-woods commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 13, 2024

m-jamieson commented Aug 14, 2024

m-jamieson commented Aug 14, 2024

m-jamieson commented Aug 14, 2024 • edited Loading

dt-woods commented Aug 2, 2024 •

edited

Loading

dt-woods commented Aug 2, 2024 •

edited

Loading

dt-woods commented Aug 5, 2024 •

edited

Loading

dt-woods commented Aug 12, 2024 •

edited

Loading

m-jamieson commented Aug 14, 2024 •

edited

Loading