Transform f1 xmssn line #2103

Merged
merged 17 commits into from
Dec 6, 2022

Conversation

aesharpe
Member

@aesharpe aesharpe commented Dec 1, 2022

TLDR

This PR adds the transmission table to the list of transformed FERC1 tables. It is the last remaining non row-literal table, and was relatively straightforward.

I created a new transform class for this table called replace_with_na that allows you to replace certain values with NA without needing to categorize the rest of the values.

The transforms I applied to the table are:

  • replace_with_na
  • drop_invalid_rows
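
As a rough illustration of the replace_with_na idea (swap a short list of junk values for NA and leave everything else alone), here is a minimal pandas sketch. It is not the implementation in this PR: the function signature, the `values_by_column` parameter, and the junk values are assumptions made up for the example.

```python
from __future__ import annotations

import pandas as pd


def replace_with_na(
    df: pd.DataFrame, values_by_column: dict[str, list[str]]
) -> pd.DataFrame:
    """Replace the listed values with NA and leave every other value untouched."""
    out = df.copy()
    for column, unwanted in values_by_column.items():
        # mask() swaps in NA wherever the condition is True.
        out[column] = out[column].mask(out[column].isin(unwanted), pd.NA)
    return out


# "2500 1/C CU" appears in this PR; the junk values are hypothetical.
raw = pd.DataFrame({"conductor_size_and_material": ["2500 1/C CU", "", "none"]})
cleaned = replace_with_na(raw, {"conductor_size_and_material": ["", "none"]})
```

Unlike a full string categorization, nothing outside the listed values needs a category, so unrecognized entries pass through unchanged.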

Thoughts...

same col, different definition

There were a few columns pertaining to cost (capex_land, opex_maintenance etc. ) that are already defined in FIELD_METADATA. In this case, however, instead of pertaining to plants they pertain to transmission lines. I could have:

a) Updated the definitions to say "plant or transmission line"
b) Added these columns to the FIELD_METADATA_BY_RESOURCE dict and given them new, transmission-specific definitions.

I chose option (b) because I thought it would be more specific. However, I'm not sure how to check whether this works in the docs build, because the table is not yet in the PUDL db. If you know of another good way to test, I'll happily do so.
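
One way to reason about option (b) without a docs build is to check the override logic directly. The sketch below assumes a simple "resource-specific entry wins over the generic entry" lookup. The `resolve_field` helper, the `transmission_ferc1` and `plants_steam_ferc1` resource names, and the description strings are all hypothetical, and the real PUDL lookup may work differently.

```python
# Generic definition shared by every table that uses the column (placeholder text).
FIELD_METADATA = {
    "capex_land": {"type": "number", "description": "Cost of land for the plant."},
}

# Per-resource overrides; the resource name here is an assumption.
FIELD_METADATA_BY_RESOURCE = {
    "transmission_ferc1": {
        "capex_land": {"description": "Cost of land for the transmission line."},
    },
}


def resolve_field(name: str, resource_id: str) -> dict:
    """Return the generic metadata, updated with any resource-specific keys."""
    merged = dict(FIELD_METADATA[name])
    merged.update(FIELD_METADATA_BY_RESOURCE.get(resource_id, {}).get(name, {}))
    return merged


transmission_def = resolve_field("capex_land", "transmission_ferc1")
plant_def = resolve_field("capex_land", "plants_steam_ferc1")
```

A small unit test asserting that the two resolved descriptions differ would confirm the override takes effect before the table lands in the database.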

col: supporting_structure_type

I wanted to apply a string categorization on the supporting_structure_type column. The definition in the XBRL clearly states that the field should be one of four things:

Supporting structure can be: (1) single pole wood or steel; (2) H-frame wood, or steel poles; (3) tower; or (4) underground construction If a transmission line has more than one type of supporting structure.

However, the actual content of the field spans way beyond what can reasonably be inferred as one of those four categories. My options here are:

a) Categorize all values regardless, nulling out those that are un-categorizable.
b) Map recognizable values and leave unrecognizable values as they are.
c) Leave all values as they are.

Right now, I've done nothing (c), but I would consider doing option (b). I'm not likely to consider option (a) because I don't want to drop lots of information without knowing more about what it actually means. Granted, I asked my roommate, who works on transmission lines, what those other, un-categorizable values could be, and they didn't know what they were...

Here's an example of the type of values in supporting_structure_type:

'H-FRAME',
 'WOOD POLE',
 'H-Wood',
 'SP-Wood',
 'SP-Steel',
 'Underground',
'HFW & SPS',
 'HFW & ST',
 'HFW & SPW',
 '      T',
 '  W-H Fr.',
 '  DC-CP',
'H-.74',
 'P-.76;H-1.21',
 'Z-.04;H-9.03;',
 'HH-2.45',
 'Z-1.10',
 'P-3.87;H-18.68',
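
For reference, option (b) could look roughly like the sketch below: normalize whitespace and case, map the values that are recognizable, and fall back to the original string for everything else. The mapping is a guess for illustration only (for example, reading "SP" as single pole), and the category names are hypothetical rather than anything defined in this PR.

```python
from __future__ import annotations

import pandas as pd

# Guessed mapping from normalized raw values to hypothetical category names.
STRUCTURE_TYPE_MAP = {
    "h-frame": "h_frame",
    "h-wood": "h_frame",
    "sp-wood": "single_pole",
    "sp-steel": "single_pole",
    "underground": "underground",
}


def map_recognizable(values: pd.Series, mapping: dict[str, str]) -> pd.Series:
    """Map known values to categories; keep unrecognized values as reported."""
    normalized = values.str.strip().str.lower()
    # map() yields NaN for unmapped values, which fillna() restores from the input.
    return normalized.map(mapping).fillna(values)


reported = pd.Series(["H-FRAME", "SP-Wood", "Underground", "HFW & SPS", "P-.76;H-1.21"])
partially_mapped = map_recognizable(reported, STRUCTURE_TYPE_MAP)
```

Because unmapped strings are preserved verbatim, no information is dropped, at the cost of a column that mixes clean categories with free text.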

col: conductor_size_and_material

This column is also very messy and not something I could easily parse. As the title suggests, there are two types of information embedded in one column, and I don't know enough about conductor size or material (and the reporting isn't consistent enough) to separate them out or standardize them. For now, this is just a regular string column with all sorts of information in there...

If we wanted, we could do more research on conductor size and material and at least null out some useless values.

Notes

I have yet to include this table in the output layer.

Add table to TABLE_NAME_MAP in extract module

Add table to RESOURCE_METADATA in metadata module

Add table to Ferc1TableId class and create basic TransmissionFerc1TableTransformer class in the transform module

Add column renames in the params module
Update some of the column names in the transform params

Add final column names to the resource metadata

Add description to the FERC1_STRING_NORM dict
Add transmission table to the transform function in the ferc1 transform module and the main statement at the end of that module.

Add the replace_with_na and drop_invalid_rows to the list of params and the transform_main function for the transmission table
@aesharpe added labels Dec 1, 2022: ferc1 (Anything having to do with FERC Form 1), rmi, xbrl (Related to the FERC XBRL transition), dbf (Data coming from FERC's old Visual FoxPro DBF database file format)
@aesharpe aesharpe self-assigned this Dec 1, 2022
@codecov

codecov bot commented Dec 1, 2022

Codecov Report

Base: 85.3% // Head: 85.3% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (e3e8034) compared to base (bfb3a5b).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2103   +/-   ##
=====================================
  Coverage   85.3%   85.3%           
=====================================
  Files         72      72           
  Lines       8293    8310   +17     
=====================================
+ Hits        7074    7091   +17     
  Misses      1219    1219           
Impacted Files Coverage Δ
src/pudl/extract/ferc1.py 87.6% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/classes.py 93.6% <100.0%> (+0.2%) ⬆️
src/pudl/transform/ferc1.py 95.2% <100.0%> (+<0.1%) ⬆️


@cmgosnell cmgosnell linked an issue Dec 1, 2022 that may be closed by this pull request
Member

@cmgosnell cmgosnell left a comment

You'll need to add these tables into the settings files (both fast and full). For the fast file, you'll need to add the raw tables into the ferc_to_sqlite_settings section as well as the pudl table name in the pudl etl section.

1 small docs suggestion and 1 possible suggestion about moving to use the default transform_main

@aesharpe
Member Author

aesharpe commented Dec 1, 2022

It's worth noting that there are some records with duplicate record_id values. The other information differs slightly. I'm not sure what to do with them. For example:

record_id utility_id_ferc1 report_year start_point end_point operating_voltage_kv designed_voltage_kv supporting_structure_type transmission_line_length_miles transmission_line_and_structures_length_miles num_transmission_circuits conductor_size_and_material capex_land capex_other capex_total opex_operations opex_maintenance opex_rents opex_total
132342 f1_xmssn_line_2000_12_32_17_14 165 2000 11415 NORTHWEST DEVON 138 138 UG 0.3 0 1 2500 1/C CU nan nan nan nan nan nan nan
132343 f1_xmssn_line_2000_12_32_17_14 165 2000 11414 NORTHWEST DEVON 138 138 UG 0.18 0 1 2500 1/C CU nan nan nan nan nan nan nan

and

record_id utility_id_ferc1 report_year start_point end_point operating_voltage_kv designed_voltage_kv supporting_structure_type transmission_line_length_miles transmission_line_and_structures_length_miles num_transmission_circuits conductor_size_and_material capex_land capex_other capex_total opex_operations opex_maintenance opex_rents opex_total
563551 transmission_line_statistics_422_2021_c000317_345_kv_lines__ 178 2021 345 KV Lines nan nan nan nan 1.16458e+06 nan 1.16458e+06 nan nan nan nan
563556 transmission_line_statistics_422_2021_c000317_345_kv_lines__ 178 2021 34.5 KV Lines nan nan 7.31 nan 56036 3.37056e+06 3.4266e+06 nan nan nan nan

@cmgosnell
Member

I think this is failing bc you haven't added the raw tables into the fast settings file. see my previous comment.

also on the non-unique record_id, how many dupes are there overall? and do they look like truly different data (these two examples do look actually different, to me at least)? overall my main question is whether we actually believe this table should have unique record ids. I would start by looking at these records' raw tables to double check whether there is something weird in the raw records or whether we have transformed them weirdly.

@aesharpe
Member Author

aesharpe commented Dec 2, 2022

how many dupes are there overall?

There are 8 DBF dups and 12 XBRL dups
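
A count like this can be produced with a short pandas check. The sketch below is an assumption about how one might do it, not the query actually used here; `summarize_duplicate_ids` is a hypothetical helper, and the third record_id in the sample is invented so the frame contains one non-duplicate.

```python
import pandas as pd


def summarize_duplicate_ids(df: pd.DataFrame, id_col: str = "record_id") -> tuple:
    """Return (number of IDs that repeat, number of rows carrying a repeated ID)."""
    # keep=False flags every member of a duplicated group, not just the repeats.
    in_dup_group = df.duplicated(subset=[id_col], keep=False)
    n_ids = int(df.loc[in_dup_group, id_col].nunique())
    n_rows = int(in_dup_group.sum())
    return n_ids, n_rows


sample = pd.DataFrame(
    {
        "record_id": [
            "f1_xmssn_line_2000_12_32_17_14",
            "f1_xmssn_line_2000_12_32_17_14",
            "hypothetical_unique_id",  # invented for the example
        ],
        "transmission_line_length_miles": [0.3, 0.18, 1.0],
    }
)
n_ids, n_rows = summarize_duplicate_ids(sample)
```

Reporting both numbers avoids ambiguity about whether "dups" means repeated IDs or the rows that share them.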

@aesharpe
Member Author

aesharpe commented Dec 2, 2022

my main question is whether we actually believe this table should have unique record ids. I would start by looking at these records raw tables to just double check to see whether or not there is something weird in the raw records or whether we have transformed them weirdly.

Right now, I think it makes sense to create record ids.

… has_unique_record_ids: bool = False to the transmission table transformer class
… get rid of header rows. Kind of like small gens table, but less useful information.

Also update the logger for drop_invalid_rows to show how many rows were removed, in case the share rounds down to 0.0% and it would otherwise look like nothing was removed
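
The logging change described in the commit note above can be pictured with a small sketch: report the absolute row count next to the percentage, so a tiny share that rounds to 0.0% is still visible. The `describe_dropped` helper and the message wording are hypothetical, not the text of the actual log line.

```python
import logging

logger = logging.getLogger(__name__)


def describe_dropped(n_before: int, n_after: int) -> str:
    """Build a log message with both the count and the share of dropped rows."""
    n_dropped = n_before - n_after
    share = n_dropped / n_before if n_before else 0.0
    return f"Dropped {n_dropped} of {n_before} rows ({share:.1%}) as invalid."


# 12 of 100,000 rows is 0.012%, which formats as 0.0%.
message = describe_dropped(100_000, 99_988)
logger.info(message)
```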
@aesharpe
Member Author

aesharpe commented Dec 2, 2022

I looked into whether there was a primary key for this table, and I'm not convinced that there is. The obvious report_year, utility_id_ferc1, start_point, end_point did not yield a unique dataset. Neither did adding the other distinguishing columns, supporting_structure_type, transmission_line_length_miles.

This seems like the kind of table that could have duplicate rows that are both valid. It would mean that Utility X has identical transmission lines going from point A to point B which, while unlikely, is not impossible. See the one example from the data:

record_id utility_id_ferc1 report_year start_point end_point operating_voltage_kv designed_voltage_kv supporting_structure_type transmission_line_length_miles transmission_line_and_structures_length_miles num_transmission_circuits conductor_size_and_material capex_land capex_other capex_total opex_operations opex_maintenance opex_rents opex_total dup
580357 transmission_line_statistics_422_2021_c001346_pa_chemicals_potter_corten_pole_043_1_17 284 2021 PA Chemicals Potter 138 138 (Corten Pole) 0.43 nan 1 (17) nan nan nan nan nan nan nan False
580375 transmission_line_statistics_422_2021_c001346_pa_chemicals_potter_corten_pole_043_1_17 284 2021 PA Chemicals Potter 138 138 (Corten Pole) 0.43 nan 1 (17) nan nan nan nan nan nan nan True

Because there's not a ton of cost information to compare the two, I don't think we can definitively say whether these rows are different or the same. I also don't think that we can define primary keys.
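
The primary key search described above amounts to testing candidate column sets for uniqueness. A minimal sketch, assuming pandas and using toy rows (the utility ID and the "A"/"B"/"C" endpoints are placeholders, not real records), with a hypothetical `rows_violating_key` helper:

```python
import pandas as pd


def rows_violating_key(df: pd.DataFrame, key_cols: list) -> int:
    """Count rows whose key values are shared with at least one other row."""
    return int(df.duplicated(subset=key_cols, keep=False).sum())


# Toy data: the first two rows describe apparently identical lines.
lines = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2021],
        "utility_id_ferc1": [1, 1, 1],
        "start_point": ["A", "A", "A"],
        "end_point": ["B", "B", "C"],
        "transmission_line_length_miles": [0.43, 0.43, 2.0],
    }
)

base_key = ["report_year", "utility_id_ferc1", "start_point", "end_point"]
extended_key = base_key + ["transmission_line_length_miles"]

base_violations = rows_violating_key(lines, base_key)
extended_violations = rows_violating_key(lines, extended_key)
```

A candidate key is only usable when the count is zero. Here adding the length column does not help, which mirrors the situation with the two identical-looking rows above.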

Member

@cmgosnell cmgosnell left a comment

hey! can you delete the notebook (we talked about this and noted that it wasn't supposed to be added into this pr) plus bb changes below. I'll poke back in here quickly once you address these to get this merged iiiiiin

@aesharpe aesharpe merged commit b1023a0 into dev Dec 6, 2022
@aesharpe aesharpe deleted the transform_f1_xmssn_line branch December 6, 2022 21:50
Development

Successfully merging this pull request may close these issues.

Transform f1_xmssn_line xbrl + dbf
2 participants