Add spot fix function/class #2254
Conversation
Codecov Report: Base 86.0% // Head 86.1% // Increases project coverage.

```
@@           Coverage Diff           @@
##              dev   #2254    +/-  ##
=======================================
  Coverage    86.0%   86.1%
=======================================
  Files          74      74
  Lines        9273    9304     +31
=======================================
+ Hits         7983    8014     +31
  Misses       1290    1290
```

☔ View full report at Codecov.
Does this function need a unit test?
Yes, it's good practice to write unit tests for these functions! You can add one in the `test/unit/transform/classes_test.py` module. It should be relatively straightforward to mimic the structure of the other tests there. Just make sure your test function name starts with the `test_` prefix.
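For orientation, a minimal skeleton of such a test might look like the sketch below (the dataframe and assertion are placeholders, not code from this PR; the real test should exercise the new spot-fix method and mirror the structure of the neighboring tests):

```python
# Hypothetical skeleton for test/unit/transform/classes_test.py.
# pytest only collects functions whose names start with the test_ prefix.
import pandas as pd


def test_spot_fix_values():
    df = pd.DataFrame({"plant_id_eia": [3], "plant_name": ["barry"]})
    # Placeholder assertion; the real test should apply a spot fix via the
    # new method and assert on the fixed values.
    assert not df.empty
```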
This currently breaks the validation test when I run it locally, presumably because the number of expected rows changes, as rows are getting rescued. How should I go about fixing this?
When you run the validation tests it should say something along the lines of "found X rows, expected Y rows." You can take that new number of found rows and put it in the set of expected rows for each of the FERC1 tables near the top of the `test/validate/ferc1_test.py` module.
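As a purely hypothetical illustration (the table name and row count below are placeholders, not real values), the constants being edited look something like this:

```python
# Near the top of test/validate/ferc1_test.py (hypothetical values): bump
# the expected count for any table whose "found" row count changed.
EXPECTED_ROWS_FERC1 = {
    "plants_steam_ferc1": 12345,  # update to the new "found" number
}
```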
Where in the main transform process should this function live? Putting it towards the end presumes the input will be normalized/well-behaved, while putting it earlier means the fixed values then run through the rest of the transformation pipeline (making it harder to spot fix particular numeric values, since they get transformed downstream). The function currently lives at the very start of the main transform, presuming it will mostly be used to spot fix non-categorical string values (e.g. plant names), but data type restrictions allow int and float inputs as well.
I think this logic makes sense. If a user wanted to move it, they could simply recreate the `transform_main` function in their table transformer class and update the order.
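A rough sketch of what that override could look like (a stub base class stands in for `AbstractTableTransformer` so the snippet is self-contained; the method names are modeled on this PR, not copied from it):

```python
import pandas as pd


class AbstractTableTransformer:
    """Stub standing in for pudl.transform.classes.AbstractTableTransformer."""

    def normalize_strings(self, df: pd.DataFrame) -> pd.DataFrame:
        return df  # the real method normalizes string columns

    def spot_fix_values(self, df: pd.DataFrame) -> pd.DataFrame:
        return df  # the real method applies the manually compiled fixes


class MyTableTransformer(AbstractTableTransformer):
    """Hypothetical transformer that moves spot fixing later in the chain."""

    def transform_main(self, df: pd.DataFrame) -> pd.DataFrame:
        # Spot fix after string normalization instead of before it; a second
        # spot_fix_values call could be appended if fixes turn out to be
        # needed at two points in the transform process.
        return df.pipe(self.normalize_strings).pipe(self.spot_fix_values)
```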
This looks really good overall! See my reply to Austen's comment about generalizing the identifying columns/values, and ditto to Austen's responses to your three questions (the specific place to tweak the expected values is in `test.validate.ferc1_test.test_minmax_rows`).
The need to apply manually compiled spot-fixes is really widespread, and we've done it in a bunch of different ad-hoc ways in different modules / datasets. Rather than creating another narrowly focused solution here (which will become just another isolated method of doing spot-fixes) I was imagining that we would try and come up with something more generally applicable that we could use to standardize this operation across the codebase.
If we want to do that we should compile a list of the places where we're currently applying spot fixes so we can at least understand what cases we can / want to cover, and maybe translate them into unit-tests.
I don't think it's much more work to vectorize the application of groups of fixes by setting indexes / constructing dataframes of the fixes, and it'll make the fixes we compile more compact, and probably much faster (though this is probably not a time intensive step).
```python
self.normalize_strings(df)
self.spot_fix_values(df)
    .pipe(self.normalize_strings)
```
In the FERC1 context where do we think the spot fixes should be applied? There's some overlap between the kinds of things that can be fixed by spot fixes, and the string normalization / categorization etc. The spot fixes feel like a fix of last resort, when the more general approaches can't work. Does it make sense for them to be the very first thing that's done? Do we foresee needing to apply different sets of spot fixes in more than one step?
This is hard to answer and might end up varying. There are some cases where you might want to spot fix things so that they get recognized and treated aptly by other functions, and there are some cases where you might want to spot fix things after they have gone through all possible transforms as a final "buffing" of the data.
Wherever we choose to put it, there will undoubtedly be some tables where we'll have to override the transform and move the location of the spot fix function. I also think it's worth considering a scenario where we need to call spot fix at two different points of the transform process. I'm not sure our current setup can handle that.
Made some major updates in response to the great feedback above. Here's what's changed: […]

Some questions still to resolve: […]
A small-scale demonstration of the actual spot-fixer logic:

```python
from datetime import datetime

import numpy as np
import pandas as pd

# A few records from our favorite test plant...
barry = pd.DataFrame(
    np.rec.array(
        [
            (3, "1", "2020-01-01T00:00:00.000000000", 153.1),
            (3, "2", "2020-01-01T00:00:00.000000000", 153.1),
            (3, "3", "2020-01-01T00:00:00.000000000", 272.0),
            (3, "1", "2021-01-01T00:00:00.000000000", 153.1),
            (3, "2", "2021-01-01T00:00:00.000000000", 153.1),
            (3, "3", "2021-01-01T00:00:00.000000000", 272.0),
            (3, "1", "2022-01-01T00:00:00.000000000", 153.1),
            (3, "2", "2022-01-01T00:00:00.000000000", 153.1),
            (3, "3", "2022-01-01T00:00:00.000000000", 272.0),
        ],
        dtype=[
            ("plant_id_eia", "int64"),
            ("generator_id", "O"),
            ("report_date", "<M8[ns]"),
            ("capacity_mw", "<f8"),
        ],
    )
)

spot_dict = [
    {
        "idx_cols": ["plant_id_eia", "generator_id"],
        "fix_cols": ["capacity_mw"],
        "expect_unique": False,
        "spot_fixes": [
            (3, "1", 1000.0),
            (3, "2", 999.0),
        ],
    },
    {
        "idx_cols": ["generator_id", "report_date"],
        "fix_cols": ["capacity_mw"],
        "expect_unique": True,
        "spot_fixes": [
            ("1", datetime.strptime("2022-01-01", "%Y-%m-%d"), 200.1),
            ("1", datetime.strptime("2021-01-01", "%Y-%m-%d"), 100.1),
        ],
    },
]

for spot_fix in spot_dict:
    spot_fixes_df = pd.DataFrame(
        spot_fix["spot_fixes"], columns=spot_fix["idx_cols"] + spot_fix["fix_cols"]
    )
    # Check that the dtypes of the spot fixed values match the existing dtypes.
    assert (
        spot_fixes_df.dtypes
        == barry[spot_fix["idx_cols"] + spot_fix["fix_cols"]].dtypes
    ).all(), "Spot fix data types do not match existing dataframe datatypes."
    spot_fixes_df = spot_fixes_df.set_index(spot_fix["idx_cols"])
    barry = barry.set_index(spot_fix["idx_cols"])
    if spot_fix["expect_unique"] is True:
        cols_list = ", ".join(spot_fix["idx_cols"])
        assert barry.index.is_unique, (
            "This spot fix expects a unique set of idx_cols, but the idx_cols "
            f"provided are not uniquely identifying: {cols_list}."
        )
    barry.loc[spot_fixes_df.index, spot_fix["fix_cols"]] = spot_fixes_df
    barry = barry.reset_index()

barry
```
src/pudl/transform/classes.py (Outdated)

```python
df.loc[df[params.record_col] == params.record_id, key] = params.fixes[
    key
]  # Manually update value
```

```python
if not (spot_fixes_df.dtypes == df[params.idx_cols + params.fix_cols].dtypes).all():
```
I think this may be an overly strict test. E.g. what if the type of the specified fix is `int` but the column being fixed is a nullable pandas `Int64` dtype? Some unit tests will help ferret out this kind of thing too.
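A minimal illustration of that strictness (plain pandas, not code from this PR): a fix value inferred as `int64` fails a strict dtype equality check against a nullable `Int64` column even though the value itself is perfectly usable.

```python
import pandas as pd

fixes = pd.DataFrame({"capacity_mw": [100]})     # inferred as int64
target = pd.Series([100, pd.NA], dtype="Int64")  # nullable pandas integer

# Strict equality between the dtypes fails despite the values being
# compatible, so this fix would be rejected.
print(fixes["capacity_mw"].dtype == target.dtype)  # False
```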
Excellent point. Retooling to use a simpler solution here: we convert the dtypes of the fix values to match the corresponding named column dtypes in the target dataframe. That way the spot fix can be run before or after dtype conversions in the main transform function, and we get more flexible handling of string -> datetime and float/integer conversions. By default this will raise a ValueError if an input can't be converted into the correct dtype, pointing to the source of the issue, which I think is ideal behavior in this case.
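A sketch of that coercion approach (the `df` and `spot_fixes_df` names follow the demo above, but the values here are illustrative, not the literal PR code):

```python
import pandas as pd

df = pd.DataFrame(
    {"report_date": pd.to_datetime(["2020-01-01"]), "capacity_mw": [153.1]}
)
# Fix values may arrive as strings; coerce them to the dtypes of the
# matching columns in the target dataframe before applying them.
spot_fixes_df = pd.DataFrame(
    {"report_date": ["2021-01-01"], "capacity_mw": ["123.456"]}
)
spot_fixes_df = spot_fixes_df.astype(df.dtypes.to_dict())
print(spot_fixes_df.dtypes)  # report_date: datetime64[ns], capacity_mw: float64
# A non-convertible fix value (e.g. "n/a" in capacity_mw) makes .astype()
# raise a ValueError that points at the offending input.
```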
Yeah that sounds like a good solution -- just making sure the fix is compatible with (convertible to) the type that it'll find in the context of the dataframe it's part of.
Let's Define Spot Fixing...

I would love to settle on a clear definition for "spot fix" (and maybe change the function name to be more specific). It can mean many different things, and I'm not sure how many of those things we actually want this function to cover. @zaneselvans you mentioned wanting to use this for a lot of different types of spot fixes, but I don't want this to become a black box. I think it's worth asking ourselves if we want more functions with overlapping capacity, or fewer, less specific functions with generic purposes. I'm slightly in favor of the former for clarity's sake. For instance, we have the function […]. Right now the […]. And the […].

In my mind, the purpose of this spot fixer is to enable quick, one-off fixes for those gnarly little errors that slip through the cracks. Say someone is looking through the data and notices a tiny but obvious mistake like a misspelling or year error -- instead of getting lost in an issue thread or going to die in the icebox, it can get taken care of in the spot fixer. In other words: […]

Other, similar types of spot fixing include: […]

I think at most this function could do all three of these things. But I want to define clear boundaries so it doesn't overlap with or get confused with any of the other extant functions.

Current ad-hoc spot fixes:

Here's a list of some spot fixes I found in the FERC1 and EIA data that can help inform how we think about spot fixes: […]

...will continue to update as I find more
Thanks for this @aesharpe. In my mind this function, as initially described in the issue and as I've written it, is well suited to addressing relatively contained errors that one spots while poking around in the data and would like to manually fix. This could include fixes like: correcting typos, fixing an inputted variable for a single plant or year, or fixing IDs as in the […]. Other functions like […]. I'm inclined to keep the form of this spot_fixer() general enough to accept many kinds of input data, but its intended purpose relatively limited. I'm happy to update the function documentation to make the intended use of this function clearer.
@aesharpe Most of the data cleaning we have to do can be done in much more general ways, but it seems like there often end up being a few dregs that we know are wrong, and can identify fixes for, but that don't fit into any of the more generalized cleaning processes, so we've ended up imposing manually compiled edits. Spot fixing should be a last resort. When other more general fixes are available, we should use them first. Also, it only makes sense to use this kind of fix when it's a hard-coded edit. It can't include regexes. It can't accommodate things like converting percentages to proportions. We already have the coding tables and encoders to fix bad codes and document the meanings of the correct codes. So I think this should only be used on things like the one-off errors in your list. The other fixes you listed are more general / programmatic. Some examples that seem appropriate: […]
I think the only things this PR lacks are the data-type compatibility check/enforcement step that @e-belfer suggested, and some unit tests. We can create a separate issue enumerating existing manually compiled spot-fixes that can be converted to use this function and the format of fix that it requires.
@zaneselvans ok, I'm glad we're on the same page about this. I was thinking you meant something more broad when you said "generic". The list of spot-fixes I made was more to show the scope of little tweaks we do and to help us consider what we would want to call a spot fix.
Looks good overall! I suggested a few additional test cases.
```python
(1, 1776, "2020-01-01T00:00:00.000000000", 132.1, "planty"),
(2, 1976, "2020-01-02", 2000.0, "genny"),
(3, 1976, date.today(), 123.1, "big burner"),
(4, 1976, "1985-02-01", 213.1, "sparky"),
(5, pd.NA, pd.NA, -5.0, "barry"),
(6, 2000, "2000-01-01", 231.1, "replace me"),
(7, pd.NA, pd.NA, 101.10, pd.NA),
(8, 2012, "01-01-2020", 899.98, "another plant name"),
(9, 1850, "2022-03-01T00:00:00.000000000", 543.21, np.nan),
(10, date.today().year, date.today(), 8.1, "cat corp"),
```
Some other test cases that come to mind:

- Check that the type enforcement works by giving it e.g. a numeric string "123.456" as a fix to apply in a float column, and a number to apply in a string column (e.g. 123.456 should get turned into "123.456").
- Check that when an applied spot-fix cannot be cast to the data type of the input column it applies to, the appropriate exception is raised (e.g. give it a non-numeric string to apply in a float column).
- Check that when a given set of `idx_cols` criteria select more than one row, the spot fix is applied to all of the selected rows (e.g. try using a single spot fix to set all plant names with year=1976 to "Bicentennial"). A sketch of this case and the next appears after this list.
- Check what happens when the spot fixes you give it select zero rows -- should result in no change to the input dataframe at all, right?

Are there any other edge cases you can think of that need to be exercised?
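A sketch of the multi-row and zero-row cases (the `_apply_spot_fixes` helper below is a hypothetical stand-in mirroring the vectorized set_index/.loc approach from the demo above, with an explicit guard so unmatched fixes are a no-op):

```python
import pandas as pd


def _apply_spot_fixes(df, idx_cols, fix_cols, spot_fixes):
    # Hypothetical stand-in for the PR's spot-fix method.
    fixes = pd.DataFrame(spot_fixes, columns=idx_cols + fix_cols).set_index(idx_cols)
    out = df.set_index(idx_cols)
    # Only apply fixes whose labels exist, so zero matches are a no-op.
    present = fixes.index.intersection(out.index)
    out.loc[present, fix_cols] = fixes.loc[present]
    return out.reset_index()


def test_multi_row_fix():
    df = pd.DataFrame({"year": [1976, 1976, 2000], "plant_name": ["a", "b", "c"]})
    out = _apply_spot_fixes(df, ["year"], ["plant_name"], [(1976, "Bicentennial")])
    # One fix updates every row matching the idx_cols criteria.
    assert (out.loc[out.year == 1976, "plant_name"] == "Bicentennial").all()


def test_zero_row_fix():
    df = pd.DataFrame({"year": [2000], "plant_name": ["c"]})
    out = _apply_spot_fixes(df, ["year"], ["plant_name"], [(1850, "ghost")])
    # A fix that selects zero rows leaves the dataframe unchanged.
    pd.testing.assert_frame_equal(out, df)
```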
> Check that the type enforcement works by giving it e.g. a numeric string "123.456" as a fix to apply in a float column, and a number to apply in a string column (e.g. 123.456 should get turned into "123.456")

Added.

> Check that when an applied spot-fix cannot be cast to the data type of the input column it applies to, the appropriate exception is raised (e.g. give it a non-numeric string to apply in a float column).

Added this and a check that the non-unique error is also raised as expected.

> Check that when a given set of idx_cols criteria select more than one row, the spot fix is applied to all of the selected rows (e.g. try using a single spot fix to set all plant names with year=1976 to "Bicentennial")

This should already be happening to convert all 1976 plants to have capacity_mw 999.9. One of these values is subsequently overwritten in a later spot fix, but two should remain changed.

> Check what happens when the spot fixes you give it select zero rows -- should result in no change to the input dataframe at all, right?

Added.
I'm not seeing the string-to-numeric or numeric-to-string type casting tests. What am I missing?
Maybe I lost the thread here on this, but I kinda thought these name fixes were for plant records that had only one record for each […]
@cmgosnell Got it! I think there was some miscommunication about whether or not […]
🎉 🟢
PR Overview
We noticed during the FERC1 transform process that some plant records with missing name values can be manually rescued, with plant names fairly easily found by manual inspection of rows with data (#1980). In order to preserve these records from getting dropped, this PR adds a new general transform method to the `AbstractTableTransformer`. Built to be flexible for different datasets and sources, this function is designed to be used during the main stage of transformation, before `drop_invalid_rows()` is called.

The parameters should be stored as a list of dictionaries, where each dictionary is a set of transformations containing the following parameters:

- `idx_cols`: the column names used to identify records to be spot-fixed
- `fix_cols`: the columns to be spot-fixed
- `expect_unique`: is the user expecting that each spot fix will only be applied to one row in the dataframe? A boolean.
- `spot_fixes`: a list of tuples containing the values for the `idx_cols` and `fix_cols` for each fix.

As defensive measures, the code includes the following checks:

- the fix values must be convertible to the dtypes of the columns they target, raising a `ValueError` otherwise
- the `idx_cols` must uniquely identify rows in the dataframe when `expect_unique` is `True`
Particular forms of feedback that would be useful, in addition to general review:

- You can pass `datetime` objects as strings and they'll get converted using the new dtype checker.
- Where should this function be called within `transform_main`?

PR Checklist

- Merged the most recent version of the branch you are merging into (probably `dev`).