Transform f1 xmssn line #2103

aesharpe · 2022-12-01T18:55:10Z

TLDR

This PR adds the transmission table to the list of transformed FERC1 tables. It is the last remaining non row-literal table, and was relatively straightforward.

I created a new transform class for this table called replace_with_na that allows you to replace certain values with NA without needing to categorize the rest of the values.

The transforms I applied to the table are:

replace_with_na
drop_invalid_rows
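
To make the idea behind replace_with_na concrete, here is a minimal sketch, not the actual PUDL implementation. The function signature, the shape of the params dict, and the example values ("" and "n/a") are all assumptions made for illustration.

```python
import pandas as pd


def replace_with_na(df: pd.DataFrame, params: dict[str, list[str]]) -> pd.DataFrame:
    """Replace the listed values with NA, column by column.

    Hypothetical signature: ``params`` maps a column name to the values
    in that column that should become NA. All other values are untouched,
    so no full categorization of the column is required.
    """
    out = df.copy()
    for col, bad_values in params.items():
        # Keep values that are NOT in the bad list; everything else becomes NA.
        out[col] = out[col].where(~out[col].isin(bad_values), pd.NA)
    return out


# Illustrative data only; the values to null out are assumptions.
raw = pd.DataFrame({"conductor_size_and_material": ["2500 1/C CU", "", "n/a"]})
cleaned = replace_with_na(raw, {"conductor_size_and_material": ["", "n/a"]})
```

The point of the design is that only the values you explicitly name are affected, which is different from a categorization step that has to decide what every value means.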

Thoughts...

same col, different definition

There were a few columns pertaining to cost (capex_land, opex_maintenance, etc.) that are already defined in FIELD_METADATA. In this case, however, instead of pertaining to plants they pertain to transmission lines. I could have:

a) Updated the definitions to say "plant or transmission line"
b) Added these columns to the FIELD_METADATA_BY_RESOURCE dict and given them new, transmission-specific definitions.

I chose option (b) because I thought it would be more specific. I'm not sure how to check whether this works in the docs build, however, because the table is not yet in the PUDL db. If you know of another good way to test, I'll happily do so.

col: `supporting_structure_type`

I wanted to apply a string categorization on the supporting_structure_type column. The definition in the XBRL clearly states that the field should be one of four things:

Supporting structure can be: (1) single pole wood or steel; (2) H-frame wood, or steel poles; (3) tower; or (4) underground construction If a transmission line has more than one type of supporting structure.

However, the actual content of the field spans way beyond what can reasonably be inferred as one of those four categories. My options here are:

a) Categorize all values regardless, nulling out those that are un-categorizable.
b) Map recognizable values and leave unrecognizable values as they are.
c) Leave all values as they are.

Right now, I've done nothing (c), but I would consider doing option (b). I'm not likely to consider option (a) because I don't want to drop lots of information without knowing more about what it actually means. Granted, I asked my roommate who works on transmission lines what those other, un-categorizable values could be, and they didn't know what they were...

Here's an example of the type of values in supporting_structure_type:

'H-FRAME',
'WOOD POLE',
'H-Wood',
'SP-Wood',
'SP-Steel',
'Underground',
'HFW & SPS',
'HFW & ST',
'HFW & SPW',
'      T',
'  W-H Fr.',
'  DC-CP',
'H-.74',
'P-.76;H-1.21',
'Z-.04;H-9.03;',
'HH-2.45',
'Z-1.10',
'P-3.87;H-18.68',
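
For reference, option (b) could look roughly like the sketch below. The mapping is a guess at a few of the more legible values and the category names are made up for illustration; neither has been checked against what the reported strings really mean.

```python
import pandas as pd

# Hypothetical mapping from normalized raw strings to categories.
STRUCTURE_MAP = {
    "h-frame": "h_frame",
    "h-wood": "h_frame",
    "sp-wood": "single_pole",
    "sp-steel": "single_pole",
    "underground": "underground",
}


def map_recognizable(values: pd.Series, mapping: dict[str, str]) -> pd.Series:
    """Map values we recognize and leave everything else exactly as reported."""
    normalized = values.str.strip().str.lower()
    # Unmapped values come back as NaN from .map(); fall back to the original.
    return normalized.map(mapping).fillna(values)


sample = pd.Series(["H-FRAME", " SP-Steel", "Underground", "P-.76;H-1.21"])
mapped = map_recognizable(sample, STRUCTURE_MAP)
```

Nothing is nulled out here, so the un-categorizable values are kept for later investigation.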

col: `conductor_size_and_material`

This column is also very messy and not something I could easily parse. As the title suggests, there are two types of information embedded in one column, and I don't know enough about conductor size or material (and the reporting isn't consistent enough) to separate them out or standardize them. For now, this is just a regular string column with all sorts of information in there...

If we wanted, we could do more research on conductor size and material and at least null out some useless values.

Notes

I have yet to include this table in the output layer.

Add table to TABLE_NAME_MAP in extract module
Add table to RESOURCE_METADATA in metadata module
Add table to Ferc1TableId class and create basic TransmissionFerc1TableTransformer class in the transform module
Add column renames in the params module

Update some of the column names in the transform params
Add final column names to the resource metadata
Add description to the FERC1_STRING_NORM dict

Add transmission table to the transform function in the ferc1 transform module and the main statement at the end of that module
Add the replace_with_na and drop_invalid_rows to the list of params and the transform_main function for the transmission table


review-notebook-app · 2022-12-01T18:55:15Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2022-12-01T18:56:36Z

Codecov Report

Base: 85.3% // Head: 85.3% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (e3e8034) compared to base (bfb3a5b).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2103   +/-   ##
=====================================
  Coverage   85.3%   85.3%           
=====================================
  Files         72      72           
  Lines       8293    8310   +17     
=====================================
+ Hits        7074    7091   +17     
  Misses      1219    1219

Impacted Files	Coverage Δ
src/pudl/extract/ferc1.py	`87.6% <ø> (ø)`
src/pudl/metadata/fields.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/ferc1.py	`100.0% <ø> (ø)`
src/pudl/transform/params/ferc1.py	`100.0% <ø> (ø)`
src/pudl/transform/classes.py	`93.6% <100.0%> (+0.2%)`	⬆️
src/pudl/transform/ferc1.py	`95.2% <100.0%> (+<0.1%)`	⬆️


cmgosnell

You'll need to add these tables into the settings files (both fast and full). For the fast file, you'll need to add the raw tables into the ferc_to_sqlite_settings section as well as the PUDL table name in the pudl etl section.

1 small docs suggestion and 1 possible suggestion about moving to use the default transform_main

src/pudl/transform/ferc1.py

aesharpe · 2022-12-01T22:46:20Z

It's worth noting that there are some records with duplicate record_id values. The other information differs slightly. I'm not sure what to do with them. For example:

	record_id	utility_id_ferc1	report_year	start_point	end_point	operating_voltage_kv	designed_voltage_kv	supporting_structure_type	transmission_line_length_miles	transmission_line_and_structures_length_miles	num_transmission_circuits	conductor_size_and_material	capex_land	capex_other	capex_total	opex_operations	opex_maintenance	opex_rents	opex_total
132342	f1_xmssn_line_2000_12_32_17_14	165	2000	11415 NORTHWEST	DEVON	138	138	UG	0.3	0	1	2500 1/C CU	nan	nan	nan	nan	nan	nan	nan
132343	f1_xmssn_line_2000_12_32_17_14	165	2000	11414 NORTHWEST	DEVON	138	138	UG	0.18	0	1	2500 1/C CU	nan	nan	nan	nan	nan	nan	nan

and

	record_id	utility_id_ferc1	report_year	start_point	end_point	operating_voltage_kv	designed_voltage_kv	supporting_structure_type	transmission_line_length_miles	transmission_line_and_structures_length_miles	num_transmission_circuits	conductor_size_and_material	capex_land	capex_other	capex_total	opex_operations	opex_maintenance	opex_rents	opex_total
563551	transmission_line_statistics_422_2021_c000317_345_kv_lines__	178	2021	345 KV Lines		nan	nan		nan	nan			1.16458e+06	nan	1.16458e+06	nan	nan	nan	nan
563556	transmission_line_statistics_422_2021_c000317_345_kv_lines__	178	2021	34.5 KV Lines		nan	nan		7.31	nan			56036	3.37056e+06	3.4266e+06	nan	nan	nan	nan


cmgosnell · 2022-12-02T16:18:42Z

I think this is failing bc you haven't added the raw tables into the fast settings file. see my previous comment.

also on the non-unique record_id, how many dupes are there overall? and do they look like truly different data (these two examples do look actually different, to me at least)? overall my main question is whether we actually believe this table should have unique record ids. I would start by looking at these records' raw tables to double check whether there is something weird in the raw records or whether we have transformed them weirdly.

aesharpe · 2022-12-02T17:12:37Z

> how many dupes are there overall?

There are 8 DBF dups and 12 XBRL dups

aesharpe · 2022-12-02T20:57:14Z

> my main question is whether we actually believe this table should have unique record ids. I would start by looking at these records raw tables to just double check to see whether or not there is something weird in the raw records or whether we have transformed them weirdly.

Right now, I think it makes sense to create record ids.

… get rid of header rows. Kind of like small gens table, but less useful information. Also update the logger for the drop_invalid_rows to show how many rows were removed in case it's less than 0.0% so it doesn't look like nothing was removed
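
The logging change described in that commit message can be illustrated with a small sketch. This is not the real drop_invalid_rows logger; `format_drop_message` is a hypothetical helper that only shows why reporting the absolute count next to the rounded percentage helps.

```python
import logging

logger = logging.getLogger(__name__)


def format_drop_message(n_dropped: int, n_total: int) -> str:
    """Build a log message with both the row count and the rounded percentage.

    Hypothetical helper: with only the percentage, a handful of dropped rows
    in a large table rounds to 0.0% and looks like nothing was removed.
    """
    pct = 100 * n_dropped / n_total if n_total else 0.0
    return f"Dropped {n_dropped} of {n_total} rows ({pct:.1f}%)"


message = format_drop_message(1, 5000)
logger.info(message)
```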

aesharpe · 2022-12-02T23:49:09Z

I looked into whether there was a primary key for this table, and I'm not convinced that there is. The obvious report_year, utility_id_ferc1, start_point, end_point did not yield a unique dataset. Neither did adding the other distinguishing columns, supporting_structure_type, transmission_line_length_miles.

This seems like the kind of table that could have duplicate rows that are both valid. It would mean that Utility X has identical transmission lines going from point A to point B which, while unlikely, is not impossible. See the one example from the data:

	record_id	utility_id_ferc1	report_year	start_point	end_point	operating_voltage_kv	designed_voltage_kv	supporting_structure_type	transmission_line_length_miles	transmission_line_and_structures_length_miles	num_transmission_circuits	conductor_size_and_material	capex_land	capex_other	capex_total	opex_operations	opex_maintenance	opex_rents	opex_total	dup
580357	transmission_line_statistics_422_2021_c001346_pa_chemicals_potter_corten_pole_043_1_17	284	2021	PA Chemicals	Potter	138	138	(Corten Pole)	0.43	nan	1	(17)	nan	nan	nan	nan	nan	nan	nan	False
580375	transmission_line_statistics_422_2021_c001346_pa_chemicals_potter_corten_pole_043_1_17	284	2021	PA Chemicals	Potter	138	138	(Corten Pole)	0.43	nan	1	(17)	nan	nan	nan	nan	nan	nan	nan	True

Because there's not a ton of cost information to compare the two, I don't think we can definitively say that these rows are different or the same. I also don't think that we can define primary keys.
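
For anyone who wants to repeat the primary key check, a minimal sketch follows. The `count_key_duplicates` helper is hypothetical, and the small frame is hand-built from the example rows quoted in this thread rather than loaded from the real table.

```python
import pandas as pd


def count_key_duplicates(df: pd.DataFrame, key_cols: list[str]) -> int:
    """Count the rows that share their candidate-key values with another row."""
    return int(df.duplicated(subset=key_cols, keep=False).sum())


# Four rows copied from the examples above (only the relevant columns).
lines = pd.DataFrame(
    {
        "record_id": [
            "f1_xmssn_line_2000_12_32_17_14",
            "f1_xmssn_line_2000_12_32_17_14",
            "transmission_line_statistics_422_2021_c001346_pa_chemicals_potter_corten_pole_043_1_17",
            "transmission_line_statistics_422_2021_c001346_pa_chemicals_potter_corten_pole_043_1_17",
        ],
        "utility_id_ferc1": [165, 165, 284, 284],
        "report_year": [2000, 2000, 2021, 2021],
        "start_point": ["11415 NORTHWEST", "11414 NORTHWEST", "PA Chemicals", "PA Chemicals"],
        "end_point": ["DEVON", "DEVON", "Potter", "Potter"],
    }
)

candidate_key = ["report_year", "utility_id_ferc1", "start_point", "end_point"]
n_dup_record_id = count_key_duplicates(lines, ["record_id"])
n_dup_candidate = count_key_duplicates(lines, candidate_key)
```

In this tiny sample the candidate key separates the two DEVON lines but still cannot separate the two PA Chemicals rows, which is the situation described above.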

cmgosnell

hey! can you delete the notebook (we talked about this and noted that it wasn't supposed to be added into this pr) plus bb changes below. I'll poke back in here quickly once you address these to get this merged iiiiiin

src/pudl/transform/ferc1.py

src/pudl/metadata/fields.py


aesharpe added 6 commits November 28, 2022 15:15

Add new column names and descriptions to the metadata

2ec754c

Update some of the column names in the transform params Add final column names to the resource metadata Add description to the FERC1_STRING_NORM dict

Add supporting_structure_type column to transmission table replace_with_na params

09d4312

Merge with dev

af30e3c

Add return statement to transmission table transform_main function

343dfb4

aesharpe added the ferc1, rmi, xbrl, and dbf labels Dec 1, 2022

aesharpe requested review from zaneselvans and cmgosnell December 1, 2022 18:55

aesharpe self-assigned this Dec 1, 2022

cmgosnell linked an issue Dec 1, 2022 that may be closed by this pull request

Transform f1_xmssn_line xbrl + dbf #1822

Closed

cmgosnell requested changes Dec 1, 2022

src/pudl/transform/ferc1.py (outdated, resolved)

src/pudl/transform/ferc1.py (outdated, resolved)

aesharpe added 2 commits December 1, 2022 15:22

Add FERC1 transmission table to ETL settings files

468a546

Fix docstring in transmission table transform

7e87425

Move replace_with_na function to the Ferc1AbstractTableTransformer.transform_main function

3a4adce

Add raw transmission inst/duration tables to the etl_fast.yml file

3026618

aesharpe added 4 commits December 2, 2022 14:55

Add raw transmission dbf file to fast_etl.yml

7bc0750

Add transmission table to list of non_unique_record_id_tables and add has_unique_record_ids: bool = False to the transmission table transformer class

940a7a5

Merge branch 'dev' into transform_f1_xmssn_line

3503b52

cmgosnell requested changes Dec 6, 2022

src/pudl/transform/ferc1.py (outdated, resolved)

src/pudl/metadata/fields.py (resolved)

aesharpe added 3 commits December 6, 2022 13:26

Merge branch 'dev' into transform_f1_xmssn_line

bee2b2f

Remove ferc1-etl-debug notebook

e2814dc

Remove transform_main table specific function from the transmission table transformer class because it's just calling .super ()

e3e8034

cmgosnell approved these changes Dec 6, 2022


aesharpe merged commit b1023a0 into dev Dec 6, 2022

aesharpe deleted the transform_f1_xmssn_line branch December 6, 2022 21:50

aesharpe mentioned this pull request Dec 8, 2022

Transform f1_xmssn_line xbrl + dbf #1822

Closed
