-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generators_entity_eia contains incorrect prime mover codes #1585
Comments
Another example is plant with id 50150, the Union Carbide Seadrift Cogen plant in Texas. In the EIA forms, all of its generators are identified as either GT or ST, while in generators_entity_eia, they are all identified as CT or CA. This suggests that perhaps the pudl code might be "correcting" the prime mover code if it thinks that it should actually be a combined cycle. If this is the case, it should be updated in all tables so that records can be matched on prime mover code across tables. |
There is also at least one instance where the prime mover code is missing. In generators_entity_eia, the PM is identified as "None" for all of the generators at plant 50489. However, in the 2020 EIA-860 file, all of the generators have PMs identified for them. |
Based on our conversation @cmgosnell and @zaneselvans it seems that Otherwise, there seems to be no way to programmatically check the assigned prime mover code using the pudl sqlite database. I am currently having to fix this by manually going to the EIA-860 excel files and cross-referencing the code there. |
Hey @grgmiller ! After poking at this a little, it seems worthwhile to change the way we categorize some lingering questionsWe had previously assumed that the PM of a given generator does not change over time. BUT it seems like there are plants that legitimately change (see examples below). There are 778 plant/generator_id's that have more than one PM across all the years. This is about 2% of all plant/gens. While some of them do seem to be actual conversions and change over time.... many of them seem ambiguous and are probably data entry errors. There are 67(!) plant/generator_id/report_date's that have more than one PM within one year. Something else to explore is how this effects the other plant/pm tables that don't have generators (and therefor haven't been getting this "static" treatment). It seems better overall to have data that is inconsistent over time if it is internally consistent between these tables. some plant explorations(plant_id_eia, generator_id)
|
Thanks @cmgosnell . My thought is that the best approach is to ensure that the I think what this means is that if the pm code changes over time, we should make sure that it consistently changes in all EIA tables. If we think that the change is a typo, we can change it, but we need to make sure that the change is made in all tables. With regards to typos, is there an easy way to infer that? For example if a plant reports the pm code separately in 923 and 860, and it only changes in one of the tables, then perhaps that is a typo. However, if the pm code is consistently updated across all tables in a year, perhaps that represents an actual conversion. (however, I suppose there is also the possibility that there was a conversion, but the reporter failed to update the code in one of the tables). |
I think differentiating between data entry errors where someone fails to update something or uses the wrong code, and actual conversions may have to fall back on manual research. I guess there are patterns of discrepancy that would be more or less likely to represent a real conversion. Like if the code changes only once, and stays changed, and eventually shows up everywhere consistently. Right now we only attempt to enforce a consistent value in the entity tables, after gathering together all of the observations from the many data tables. The cases where the code is a primary key and has to be left in the data tables instead of being normalized, and propagating the inferred to be correct value back out to the data tables is not something we currently have infrastructure to do (but it would be good to do! And there are statistical inference methods for doing this kind of thing, explored in by Ihab Ilyas and Xu Chu in Data Cleaning and their subsequent research. |
Here's a specific example: There is a generator in gens_eia860 with a 2020 report date that is completely missing prime mover data.. This is plant 59012 with generator_id "STEAM" (this is the DC Water CHP plant). It looks like in 2015-2018, this generator had data reported in EIA-860, and was listed as a "CA" prime mover, and the other three generators were listed as "CT". However, it looks like starting in 2019, this generator was removed from EIA-860, and the other three generators changed their prime mover to "GT". Should this generator still exist with a 2020 record in gens_eia860? If so, should it be marked as retired instead of just having all missing data associated with it? |
It seems like they might have just removed the HRSG, but kept the gas turbine around? Is it missing a bunch of other attributes, or just the PM? If that's the only generator-year missing a PM, then it seems like we should maybe consider it an invalid record and drop it in the transform step. I don't know what fraction of generators that go out of service end up explicitly getting a retirement date. I bet it's not 100%. I wonder if it'll show up revised in future data with a retirement date (is it in the 2021 eia860m data?) |
It is missing a bunch of other attributes (not just PM) in both the 2019 and 2020 |
The generators_entity_eia table saved in the pudl sqlite database does not seem to always have correct prime mover codes associated with each generator.
Specific example: In generators_entity_eia, the generator with
plant_id_eia == 2465
andgenerator_id == GT1
is associated with a "CT" prime mover code. However, in the 2020 EIA-923 and EIA-860 files, the prime mover for this generator is listed as "GT".This is causing errors in
analysis.allocate_net_gen
because this depends on matching the prime mover codes.Steps to recreate:
I am loading generators_entity_eia from the pudl sqlite file downloaded from https://zenodo.org/record/6349861#.YlsSwtPMJhG
I am cross referencing the raw EIA 923 and 860 files from 2020
The text was updated successfully, but these errors were encountered: