Transform f1 xmssn line #2103
Add table to TABLE_NAME_MAP in extract module
Add table to RESOURCE_METADATA in metadata module
Add table to Ferc1TableId class and create basic TransmissionFerc1TableTransformer class in the transform module
Add column renames in the params module
Update some of the column names in the transform params
Add final column names to the resource metadata
Add description to the FERC1_STRING_NORM dict
Add transmission table to the transform function in the ferc1 transform module and the main statement at the end of that module
Add the replace_with_na and drop_invalid_rows to the list of params and the transform_main function for the transmission table
Codecov Report
Base: 85.3% // Head: 85.3% // Increases project coverage by
Additional details and impacted files
@@          Coverage Diff          @@
##            dev   #2103   +/-   ##
=====================================
  Coverage   85.3%   85.3%
=====================================
  Files         72      72
  Lines       8293    8310    +17
=====================================
+ Hits        7074    7091    +17
  Misses      1219    1219
You'll need to add these tables to the settings files (both fast and full). For the fast file, you'll need to add the raw tables to the ferc_to_sqlite_settings section as well as the PUDL table name in the PUDL ETL section.
1 small docs suggestion and 1 possible suggestion about moving to the default transform_main
It's worth noting that there are some records with duplicate
and
…ansform_main function
I think this is failing because you haven't added the raw tables to the fast settings file; see my previous comment. Also, on the non-unique record_id: how many dupes are there overall, and do they look like truly different data? (These two examples do look different to me, at least.) Overall, my main question is whether we actually believe this table should have unique record IDs. I would start by looking at these records in the raw tables, to double check whether there is something weird in the raw records or whether we have transformed them weirdly.
There are 8 DBF dups and 12 XBRL dups.
Right now, I think it makes sense to create record ids.
… has_unique_record_ids: bool = False to the transmission table transformer class
… get rid of header rows. Kind of like the small gens table, but with less useful information. Also update the logger for drop_invalid_rows to show how many rows were removed, in case the percentage rounds to 0.0%, so it doesn't look like nothing was removed
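The logging change described in that commit can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name describe_dropped_rows and the message wording are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def describe_dropped_rows(n_before: int, n_after: int) -> str:
    """Build a log message reporting both the count and the share of dropped rows.

    Hypothetical helper: reporting the raw count alongside the percentage
    keeps a small removal from reading as "nothing was removed".
    """
    n_dropped = n_before - n_after
    share = 0.0 if n_before == 0 else n_dropped / n_before
    return f"Dropped {n_dropped} of {n_before} rows ({share:.1%})."


# Three rows out of 10,000 is 0.03%, which the percentage alone would hide.
logger.info(describe_dropped_rows(10_000, 9_997))
```

With only the percentage in the message, dropping three header rows from a large table would be indistinguishable from dropping none.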
I looked into whether there was a primary key for this table, and I'm not convinced that there is. The obvious This seems like the kind of table that could have duplicate rows that are both valid. That would mean Utility X has identical transmission lines going from point A to point B, which is unlikely but not impossible. See the one example from the data:
Because there's not a ton of cost information to compare the two, I don't think we can definitively say whether these rows are different or the same. I also don't think that we can define primary keys.
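One way to approach the "are these dupes truly different data?" question is to separate record IDs whose rows are identical in every column from those whose rows disagree somewhere. A rough pandas sketch on toy data; the column designed_voltage_kv and all values here are made up for illustration and are not taken from the table:

```python
import pandas as pd

# Toy stand-in for the transformed table (values are invented).
df = pd.DataFrame(
    {
        "record_id": ["a_1", "a_1", "b_1", "b_1", "c_1"],
        "designed_voltage_kv": [115.0, 115.0, 69.0, 138.0, 230.0],
    }
)

# Every row whose record_id appears more than once.
dupes = df[df.duplicated(subset="record_id", keep=False)]

# Of those, the rows that match another row across *all* columns.
identical = dupes[dupes.duplicated(keep=False)]

# Record IDs that are shared by rows which differ in at least one column.
n_conflicting = dupes["record_id"].nunique() - identical["record_id"].nunique()
print(f"{len(dupes)} duplicated rows, {n_conflicting} record_id(s) with conflicting data")
```

Fully identical rows can't be told apart from the data alone, which is the situation described above; conflicting rows at least suggest the record ID is too coarse rather than the data being repeated.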
hey! can you delete the notebook (we talked about this and noted that it wasn't supposed to be added into this pr) plus bb changes below. I'll poke back in here quickly once you address these to get this merged iiiiiin
…able transformer class because it's just calling super()
TLDR
This PR adds the transmission table to the list of transformed FERC1 tables. It is the last remaining non row-literal table, and was relatively straightforward.
I created a new transform class for this table called replace_with_na that allows you to replace certain values with NA without needing to categorize the rest of the values.
The transforms I applied to the table are:
Thoughts...
same col, different definition
There were a few columns pertaining to cost (capex_land, opex_maintenance, etc.) that are already defined in FIELD_METADATA. In this case, however, instead of pertaining to plants they pertain to transmission lines. I could have:
a) Updated the definitions to say "plant or transmission line"
b) Added these columns to the FIELD_METADATA_BY_RESOURCE dict and given them new, transmission-specific definitions.
I chose option (b) because I thought it would be more specific. However, I'm not sure how to check whether this works in the docs build, because the table is not yet in the PUDL db. If you know of another good way to test, I'll happily do so.
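For what option (b) amounts to, here is a self-contained sketch of a per-resource override being layered on top of a generic field definition. The dictionaries below are stand-ins with invented descriptions, the resource name transmission_ferc1 is assumed, and field_metadata_for is a hypothetical helper rather than anything in the metadata module:

```python
from typing import Any

# Stand-ins for the real dictionaries; the descriptions are invented.
FIELD_METADATA: dict[str, dict[str, Any]] = {
    "capex_land": {"type": "number", "description": "Cost of plant land."},
}
FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, dict[str, Any]]] = {
    "transmission_ferc1": {  # assumed resource name
        "capex_land": {"description": "Cost of land for the transmission line."},
    },
}


def field_metadata_for(resource_id: str, field: str) -> dict[str, Any]:
    """Merge the generic field definition with any resource-specific override."""
    overrides = FIELD_METADATA_BY_RESOURCE.get(resource_id, {}).get(field, {})
    # Later keys win, so the override replaces only what it redefines.
    return {**FIELD_METADATA[field], **overrides}


print(field_metadata_for("transmission_ferc1", "capex_land")["description"])
```

A check along these lines only exercises the dictionaries, so it could stand in for the docs build until the table is in the database, assuming the real lookup merges the two sources in a similar way.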
col: supporting_structure_type
I wanted to apply a string categorization on the supporting_structure_type column. The definition in the XBRL clearly states that the field should be one of four things:
However, the actual content of the field spans way beyond what can reasonably be inferred as one of those four categories. My options here are:
a) Categorize all values regardless, nulling out those that are un-categorizable.
b) Map recognizable values and leave unrecognizable values as they are.
c) Leave all values as they are.
Right now, I've done nothing (c), but I would consider doing option (b). I'm not likely to consider option (a) because I don't want to drop lots of information without knowing more about what it actually means. Granted, I asked my roommate who works on transmission lines what those other, un-categorizable values could be, and they didn't know what they were...
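Option (b) could look something like the sketch below: normalize the string, map it if it's recognized, and otherwise hand back the original value untouched. The category names and raw spellings in STRUCTURE_TYPE_MAP are placeholders I made up for illustration; the real ones would have to come from the XBRL definition and the data.

```python
from typing import Optional

# Hypothetical mapping from raw spellings to categories (placeholders only).
STRUCTURE_TYPE_MAP = {
    "tower": "tower",
    "steel tower": "tower",
    "h-frame": "h_frame",
    "wood h-frame": "h_frame",
    "single pole": "pole",
    "underground": "underground",
}


def categorize_structure_type(value: Optional[str]) -> Optional[str]:
    """Return the mapped category, or the original value if it isn't recognized."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    # Fall back to the raw value so no information is dropped.
    return STRUCTURE_TYPE_MAP.get(cleaned, value)


print(categorize_structure_type(" Steel Tower "))  # recognized
print(categorize_structure_type("SPW"))  # unrecognized, left as-is
```

The downside is a column that mixes clean categories with raw strings, so anything downstream would still need to handle the unmapped values.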
Here's an example of the type of values in supporting_structure_type:
col: conductor_size_and_material
This column is also very messy and not something I could easily parse. As the title suggests, there are two types of information embedded in one column, and I don't know enough about conductor size or material (and the reporting isn't consistent enough) to separate them out or standardize them. For now, this is just a regular string column with all sorts of information in there...
If we wanted, we could do more research on conductor size and material and at least null out some useless values.
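Nulling out useless values without touching the rest could be sketched like this. It illustrates the general idea only and is not the replace_with_na code from this PR; the function name null_out_values, the junk list, and the sample values are all assumptions.

```python
import pandas as pd


def null_out_values(col: pd.Series, junk: list[str]) -> pd.Series:
    """Replace listed junk strings with NA, leaving all other values untouched."""
    # Compare on a normalized copy so the surviving values keep their raw form.
    cleaned = col.str.strip().str.lower()
    return col.mask(cleaned.isin(junk), pd.NA)


# Invented sample values for illustration.
sizes = pd.Series(["795 ACSR", "n/a", " None ", "", "4/0 CU"])
result = null_out_values(sizes, junk=["", "n/a", "none"])
print(result)
```

Because only the listed values are replaced, the messy but possibly meaningful entries stay exactly as reported.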
Notes
I have yet to include this table in the output layer.