Who's putting NaNs in my exchange tables? #246
Going line by line through the generation process inventory, everything seems okay except for HYDRO. The problem comes from a blank line in the source CSV file for facility 1784, Twin Falls (MI).
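A quick way to spot this kind of row is to load the CSV and list the records with missing fields. This is only a minimal sketch: the column names and the inline CSV text are made up for illustration, and it assumes the "blank line" shows up as a row with empty fields rather than a fully empty line (which `pandas.read_csv` skips by default).

```python
import io

import pandas as pd

# Hypothetical stand-in for the hydro source CSV; the column names and
# values are illustrative only, not the real file layout.
csv_text = (
    "FacilityID,FacilityName,FlowName,FlowAmount\n"
    "1783,Example Dam,water,10.5\n"
    "1784,Twin Falls (MI),,\n"
    "1785,Other Dam,water,3.2\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Keep every row that has at least one missing field.
blank_rows = df[df.isna().any(axis=1)]
blank_ids = blank_rows["FacilityID"].tolist()
print(blank_ids)
```

Any facility ID printed here would carry NaNs into later merges unless the row is fixed or dropped at the source.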
UPDATE: From G. Zaimes,
|
Found a second culprit in hydro upstream where "water" was misspelled. Now there are only three NaNs caused by the blank line above.
|
Everything's good with generation processes, but when combined with upstream processes through |
In main.py, the data frame returned by |
It looks like the culprit is generation.py's |
FOUND! |
A couple of new updates for Monday.
More to come from this investigation. |
addresses USEPA#246; still need to test for nans in exchange table after this fix
After all the fixes above, the NaNs in the exchange table remain.

```python
# Define inputs based on compartment label
data["input"] = False
input_filter = (
    (data["Compartment"].str.lower().str.contains("input"))
    | (data["Compartment"].str.lower().str.contains("resource"))
    | (data["Compartment"].str.lower().str.contains("technosphere"))
)
data.loc[input_filter, "input"] = True
# Define products based on compartment label
product_filter = (
    (data["Compartment"].str.lower().str.contains("technosphere"))
    | (data["Compartment"].str.lower().str.contains("valuable"))
)
data.loc[product_filter, "FlowType"] = "PRODUCT_FLOW"
# Define wastes based on compartment label
waste_filter = (
    (data["Compartment"].str.lower().str.contains("technosphere"))
)
data.loc[waste_filter, "FlowType"] = "WASTE_FLOW"
```

If I'm right, any compartment with "technosphere" in its name always ends up labeled as a waste flow because of the third clause above, so why include it in the product flow query? ElectricityLCI/electricitylci/generation.py Line 1091 in e562681
@m-jamieson, do you remember how this labeling works? |
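To check the overwrite concern, here is a small sketch that applies the same product and waste filters in the same order to a toy data frame. The compartment labels and the "ELEMENTARY_FLOW" default are assumptions made for the example, not values taken from the real inventory.

```python
import pandas as pd

# Toy compartments; the labels are hypothetical.
data = pd.DataFrame(
    {"Compartment": ["input", "Technosphere/3rd party", "valuable substance", "air"]}
)
data["FlowType"] = "ELEMENTARY_FLOW"  # assumed default for the sketch

comp = data["Compartment"].str.lower()

# Same order as the snippet under discussion: products first, then wastes.
product_filter = comp.str.contains("technosphere") | comp.str.contains("valuable")
data.loc[product_filter, "FlowType"] = "PRODUCT_FLOW"

waste_filter = comp.str.contains("technosphere")
data.loc[waste_filter, "FlowType"] = "WASTE_FLOW"

flow_types = dict(zip(data["Compartment"], data["FlowType"]))
print(flow_types)
```

Because the waste assignment runs last, the "technosphere" row ends up as WASTE_FLOW and only the "valuable" row keeps PRODUCT_FLOW, so the "technosphere" clause in the product filter has no lasting effect in this ordering.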
Okay. The NaNs are coming from
|
I'm alright with 2 - I mean that's basically what happened in post-processing in the first fed commons database. Would be nice to get some idea of what's getting dropped though and why. It can only either be the sum of the emissions of the aggregated group or the electricity. I would guess the electricity route is more likely. |
The problem is with 'Carbon dioxide' across just about all BAAs and stage codes. Note that the power plant construction, coal, natural gas, and petroleum NaNs are the cause of issue #250 |
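For anyone trying to reproduce the tally, a rough sketch of counting NaN amounts per flow name follows. The column names ("FlowName", "Subregion", "stage_code", "Emission_factor") and the rows are assumptions for illustration; substitute whatever the exchange data frame actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical exchange table; names and values are illustrative only.
exchanges = pd.DataFrame(
    {
        "FlowName": ["Carbon dioxide", "Carbon dioxide", "Methane", "Carbon dioxide"],
        "Subregion": ["BA1", "BA2", "BA1", "BA3"],
        "stage_code": ["Power plant", "coal", "Power plant", "Power plant"],
        "Emission_factor": [np.nan, np.nan, 0.1, 2.0],
    }
)

# Select the rows with a missing amount, then count them per flow.
nan_rows = exchanges[exchanges["Emission_factor"].isna()]
nan_counts = nan_rows.groupby("FlowName").size().to_dict()
print(nan_counts)
```

Grouping `nan_rows` by "Subregion" or "stage_code" instead gives the breakdown by BAA or stage code.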
eh...that's a lot....like basically everything. This also suggests that all data sources are involved too. Is the current Keylogic development branch the most reasonable one to try and run to replicate? And then what data year are you running to get these nans? |
Yes and ELCI_2022_config.yaml |
Just adding some things here as I debug in case I forget to come back to this later. As best as I can tell (or as pandas tells me) there are no NaN |
As suspected, when the electricity amounts are merged back with the aggregated flow amounts here, there are a bunch of missing electricity amounts in the resulting database_f3, which suggests that the electricity amounts aren't being summed correctly in
|
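One way to see which aggregated rows lose their electricity amount in a merge like this is a left merge with `indicator=True`. This is a sketch under assumed column names ("Subregion", "FuelCategory", "source_string", "electricity_sum") and toy values, not the actual merge in generation.py.

```python
import pandas as pd

# Hypothetical aggregated flows and electricity totals.
flows = pd.DataFrame(
    {
        "Subregion": ["BA1", "BA1"],
        "FuelCategory": ["BIOMASS", "GAS"],
        "source_string": ["NEI_ampd_ap42_eGRID", "eGRID"],
        "FlowAmount": [100.0, 50.0],
    }
)
elec = pd.DataFrame(
    {
        "Subregion": ["BA1"],
        "FuelCategory": ["GAS"],
        "source_string": ["eGRID"],
        "electricity_sum": [10.0],
    }
)

keys = ["Subregion", "FuelCategory", "source_string"]
merged = flows.merge(elec, on=keys, how="left", indicator=True)

# "left_only" rows found no electricity match and carry NaN forward.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[keys])
```

Inspecting the key values of the unmatched rows is usually enough to see whether the mismatch is in the source string, the year, or the region.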
Never mind, it's a year mismatch problem. For example: Just to describe what's supposed to be happening here. There are two biomass plants reporting CO2 emissions in the Arizona Public Service Company BA in the year 2022. We can easily sum these. The electricity that's supposed to be assigned to this aggregation is from all plants that reported to the same emission sources. In this case, these two plants report to NEI, ampd, ap42, and eGRID, or at least that's what the fetched data suggests. But there are very few years where plants will be reporting to all the same data sources in the same year. This very likely should be source string "ampd_ap42" for year 2022 and "NEI" for year 2021, or something like that. |
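The difference between a source string built across all years and one built per year can be sketched as below. The data frame, its column names, and the `join_sources` helper are hypothetical; the real code builds its source strings elsewhere.

```python
import pandas as pd

# Hypothetical reporting records: the same two plants report to
# different sources in different years.
reports = pd.DataFrame(
    {
        "FacilityID": [1, 1, 1, 2, 2, 2],
        "Source": ["ampd", "ap42", "NEI", "ampd", "ap42", "NEI"],
        "Year": [2022, 2022, 2021, 2022, 2022, 2021],
    }
)


def join_sources(sources):
    """Join the unique source names into one underscore-separated string."""
    return "_".join(sorted(set(sources)))


# Ignoring the year lumps every source into one string.
all_years = join_sources(reports["Source"])

# Grouping by year first yields one source string per year.
per_year = reports.groupby("Year")["Source"].agg(join_sources).to_dict()

print(all_years)
print(per_year)
```

With the year ignored, the 2022 emissions look for electricity tagged with a combined string that no single year actually has, which is the kind of mismatch described above.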
I think this commit created the current issue: KeyLogicLCA@946463a. We groupby a different set of columns, "groupby_cols", than before, ["FlowName","Compartment"], because before we were trying to omit flows that were only ever reported by one source. It doesn't really matter what year the data comes from at that point, because there will only ever be a single year for a single data source. However, when multiple sources are involved, we need to make sure there's source grouping per year to prevent the missing electricity flows discussed in the previous comment. |
After replacing the ["FlowName","Compartment"] in the groupbys below, all the blank electricity amounts for the emissions go away, leaving just the inputs as blank electricity, which are all source_list "NEI_RCRAInfo_TRI_ampd_ap42_eGRID_eia_nei" or in other words all sources. Coming into
|
Was able to do a test run with the fix above, and it worked as intended: fuel flows are again getting electricity matches, but the facility counts aren't lining up. From an emissions perspective, this is an expected part of the method; the number of reported emissions of a species can be less than the number of facilities reporting electricity generation. For fuel inputs, though, this means that the resulting plant processes appear more efficient than they should be. In the test run that I did, PJM GAS has electricity summed for 170 facilities, but gas inputs from the Appalachian and Arkoma basins only sum to 143. Guess this comes down to not having entries in
As best as I can tell, there are electricity flows for all inputs coming into aggregate, so maybe the answer here is to not separate between "power" and "nonpower" and just run the electricity summing for all entries. |
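A rough sketch of how the facility-count mismatch could be checked: compare the set of facilities contributing electricity against the set contributing fuel inputs. The data frames, column names, and numbers are toy values, not results from the test run.

```python
import pandas as pd

# Hypothetical per-facility tables for one region and fuel category.
electricity = pd.DataFrame(
    {"FacilityID": [1, 2, 3, 4], "Electricity": [10.0, 20.0, 30.0, 40.0]}
)
fuel_inputs = pd.DataFrame(
    {"FacilityID": [1, 2, 4], "FlowAmount": [5.0, 8.0, 12.0]}
)

# Facilities that generate electricity but have no fuel input entry.
missing = sorted(set(electricity["FacilityID"]) - set(fuel_inputs["FacilityID"]))

# Electricity restricted to facilities that do have a fuel input.
has_fuel = electricity["FacilityID"].isin(fuel_inputs["FacilityID"])
matched_elec = electricity.loc[has_fuel, "Electricity"].sum()
total_elec = electricity["Electricity"].sum()

print(missing, matched_elec, total_elec)
```

Dividing the fuel total by `total_elec` instead of `matched_elec` is what would make a process look more efficient than it should.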
Tried the test mentioned in the previous comment and am still getting large mismatches in facility counts due to the lack of matching basin data. Power plant construction amounts are listed as eia data and are automatically generated, so there end up being more of them. I guess one solution here is to make special labels for fuel amounts like "eia_fuel", or change construction, or both. I think adding different labels for the different data sources, like "netlgas" as the source for the upstream gas inventory, makes sense and will be clearer in the resulting openLCA metadata. The hard part will be finding all the holes where maybe source wasn't explicitly defined, because there are a couple of places where Source is defined across combinations of dataframes.
… tables? USEPA#246 New source strings were added to differentiate between sources. This allows better alignment with calculated total electricity values when the data are aggregated and avoids NaNs particularly when some of the internal data haven't been updated to the latest year.
Processes are failing to load in openLCA due, in part, to a large number of NaNs found as amounts in process exchanges.