
Add spot fix function/class #2254

Merged: 19 commits merged into dev on Feb 20, 2023

Conversation

@e-belfer (Member) commented Jan 31, 2023

PR Overview

We noticed during the FERC1 transform process that some plant records with missing name values can be manually rescued: plant names are fairly easy to find by manual inspection for rows that contain data (#1980). To keep these records from being dropped, this PR adds a new general transform method to the AbstractTableTransformer. Built to be flexible across datasets and sources, this function is designed to be used during the main stage of transformation, before drop_invalid_rows() is called. Its parameters are stored as a list of dictionaries, where each dictionary describes one set of fixes with the following parameters (see the illustrative sketch after this list):

  • idx_cols: the column names used to identify records to be spot-fixed
  • fix_cols: the columns to be spot-fixed
  • expect_unique: a boolean indicating whether each spot fix is expected to apply to exactly one row in the dataframe
  • spot_fixes: a list of tuples containing the values for the idx_cols and fix_cols for each fix.
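
For illustration, here is a minimal sketch of what one of these parameter sets could look like (the column names and record_id value are hypothetical, chosen to mirror the FERC1 missing-plant-name use case; the real fixes live in src/pudl/transform/params/ferc1.py):

spot_fix_params = [
    {
        "idx_cols": ["record_id"],         # column(s) used to locate the rows to fix
        "fix_cols": ["plant_name_ferc1"],  # column(s) whose values get overwritten
        "expect_unique": True,             # each fix should match exactly one row
        # One tuple per fix: idx_cols values first, then fix_cols values.
        "spot_fixes": [
            ("f1_steam_2020_12_2_0_1", "rescued plant name"),
        ],
    },
]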

As defensive measures, the code includes the following checks:

  • checks that the input data can be converted to the data type of the dataframe
  • checks that the index columns provided act as a unique index when expect_unique is True
  • checks that the records identified to be fixed are actually in the dataset / data subset.

Particular forms of feedback that would be useful, in addition to general review:

  • @cmgosnell Is the original set of manual fixes you provided still working as intended, even though record_id isn't a primary key anymore?
  • Are there any missing input type hints? We can provide datetime objects as strings and they'll get converted using the new dtype checker.
  • Should the spot fixer live at the very start or very end of transform_main?
  • Is the documentation for this transformation clear enough?

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer added the ferc1 (Anything having to do with FERC Form 1), data-repair (Interpolating or extrapolating data that we don't actually have), rmi, and xbrl (Related to the FERC XBRL transition) labels on Jan 31, 2023
@e-belfer requested a review from aesharpe on January 31, 2023 21:53
@e-belfer self-assigned this on Jan 31, 2023

@codecov bot commented Jan 31, 2023

Codecov Report

Base: 86.0% // Head: 86.1% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (04883d3) compared to base (31b074e).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2254   +/-   ##
=====================================
  Coverage   86.0%   86.1%           
=====================================
  Files         74      74           
  Lines       9273    9304   +31     
=====================================
+ Hits        7983    8014   +31     
  Misses      1290    1290           
Impacted Files Coverage Δ
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/classes.py 94.2% <100.0%> (+0.4%) ⬆️
src/pudl/transform/ferc1.py 95.9% <100.0%> (ø)


@e-belfer added the data-cleaning (Tasks related to cleaning & regularizing data during ETL) label and removed the data-repair (Interpolating or extrapolating data that we don't actually have) label on Feb 1, 2023
@aesharpe (Member) left a comment

Does this function need a unit test?

Yes, it's good practice to make unit tests for these functions! You can make one in the test/unit/transform/classes_test.py module. It should be relatively straightforward to mimic the structure of the other tests there. Just make sure your test function name starts with the test_ prefix.
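
To make this concrete, here is a self-contained sketch of the behavior such a test should pin down. The _apply_spot_fixes helper below is a stand-in that mirrors the vectorized logic discussed in this PR; a real test in classes_test.py would call the actual spot_fix_values() transform instead (its exact signature isn't assumed here).

import pandas as pd


def _apply_spot_fixes(df, idx_cols, fix_cols, spot_fixes, expect_unique=True):
    """Stand-in helper: overwrite fix_cols for rows identified by idx_cols."""
    fixes = pd.DataFrame(spot_fixes, columns=idx_cols + fix_cols).set_index(idx_cols)
    df = df.set_index(idx_cols)
    if expect_unique:
        assert df.index.is_unique, "idx_cols do not uniquely identify rows"
    df.loc[fixes.index, fix_cols] = fixes
    return df.reset_index()


def test_spot_fix_values():
    df = pd.DataFrame(
        {
            "record_id": ["a", "b", "c"],
            "plant_name_ferc1": ["barry", pd.NA, "typo plant"],
        }
    )
    out = _apply_spot_fixes(
        df,
        idx_cols=["record_id"],
        fix_cols=["plant_name_ferc1"],
        spot_fixes=[("b", "rescued plant"), ("c", "fixed name")],
    )
    # Targeted rows are overwritten; untouched rows are left alone.
    assert out.loc[out.record_id == "b", "plant_name_ferc1"].item() == "rescued plant"
    assert out.loc[out.record_id == "c", "plant_name_ferc1"].item() == "fixed name"
    assert out.loc[out.record_id == "a", "plant_name_ferc1"].item() == "barry"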

This currently breaks the validation test when I run it locally, presumably because the number of expected rows changes, as rows are getting rescued. How should I go about fixing this?

When you run the validation tests it should say something along the lines of "found X rows, expected Y rows." You can take that new number of found rows and put it in the set of expected rows for each of the ferc1 tables near the top of the test/validate/ferc1_test.py module.

Where in the main transform process should this function live? Putting it towards the end presumes the input will be normalized/well-behaved, while putting it earlier would run the input through the rest of the transformation pipeline (making it harder to spot fix particular numeric values that get transformed through the pipeline). The function currently lives at the very start of the main transform, presuming it will mostly be used to spot fix non-categorical string values (e.g. plant names), but data type restrictions allow int and float inputs as well.

I think this logic makes sense. If a user wanted to move it, they could simply recreate the transform_main function in their table transformer class and update the order.

@cmgosnell (Member) left a comment

This looks really good overall! See my reply to Austen's comment about generalizing the identifying columns/values, and ditto to Austen's responses to your three questions (the specific place to tweak the expected values is in test.validate.ferc1_test.test_minmax_rows).

@zaneselvans (Member) left a comment

The need to apply manually compiled spot-fixes is really widespread, and we've done it in a bunch of different ad-hoc ways in different modules / datasets. Rather than creating another narrowly focused solution here (which will become just another isolated method of doing spot-fixes) I was imagining that we would try and come up with something more generally applicable that we could use to standardize this operation across the codebase.

If we want to do that we should compile a list of the places where we're currently applying spot fixes so we can at least understand what cases we can / want to cover, and maybe translate them into unit-tests.

I don't think it's much more work to vectorize the application of groups of fixes by setting indexes / constructing dataframes of the fixes, and it'll make the fixes we compile more compact, and probably much faster (though this is probably not a time intensive step).

Comment on lines -834 to +835
self.normalize_strings(df)
self.spot_fix_values(df)
.pipe(self.normalize_strings)

Member:

In the FERC1 context where do we think the spot fixes should be applied? There's some overlap between the kinds of things that can be fixed by spot fixes, and the string normalization / categorization etc. The spot fixes feel like a fix of last resort, when the more general approaches can't work. Does it make sense for them to be the very first thing that's done? Do we foresee needing to apply different sets of spot fixes in more than one step?

@aesharpe (Member) commented Feb 13, 2023

This is hard to answer and might end up varying. There are some cases where you might want to spot fix things so that they get recognized and treated aptly by other functions, and there are some cases where you might want to spot fix things after they have gone through all possible transforms, as a final "buffing" of the data.

Wherever we choose to put it there will undoubtedly be some tables where we'll have to override the transform and move the location of the spot fix function. I also think it's worth considering a scenario where we need to call spot fix at two different points of the transform process. I'm not sure our current setup can handle that.

@e-belfer (Member, Author) commented Feb 8, 2023

Made some major updates in response to the great feedback above. Here's what's changed:

  • Building on @zaneselvans's MultiIndex code, the function now iterates through "sets" of fixes (e.g., name changes for the steam table, a manual capacity fix), rather than through each individual fix. Users can specify multiple index columns for each set of fixes.
  • Added the expect_unique boolean parameter based on @aesharpe's suggestion. This will raise a ValueError if users think they've supplied a uniquely identifying set of indexing columns but they haven't. Conversely, we can also use the function to apply the same fix to multiple rows, if this is the desired behavior.
  • Expects that the dtypes of the input data will match the dtypes of the spot fixed records, and throws a ValueError if not. Note that this might cause a headache if dtypes are changing throughout the transform_main step, or might change where in the process we want to add this function?

Some questions still to resolve:

  • @cmgosnell Is the original set of manual fixes you provided still working as intended, even though record_id isn't a primary key anymore? I'll wait to update test.validate.ferc1_test.test_minmax_rows until I know what we're changing immediately, if anything.
  • Tests still need to be written. What's a good way to scope out potential use/edge cases to test this function on, especially as someone who is still new to the codebase?
  • What is the entire list of datatypes found in extracted dataframes that we might want to update? (And how do I get pydantic to play nicely with them?)

A small-scale demonstration of the actual spot-fixer function with barry:

from datetime import datetime

import numpy as np
import pandas as pd

# A few records from our favorite test plant...
barry = pd.DataFrame(
    np.rec.array(
        [
            (3, "1", "2020-01-01T00:00:00.000000000", 153.1),
            (3, "2", "2020-01-01T00:00:00.000000000", 153.1),
            (3, "3", "2020-01-01T00:00:00.000000000", 272.0),
            (3, "1", "2021-01-01T00:00:00.000000000", 153.1),
            (3, "2", "2021-01-01T00:00:00.000000000", 153.1),
            (3, "3", "2021-01-01T00:00:00.000000000", 272.0),
            (3, "1", "2022-01-01T00:00:00.000000000", 153.1),
            (3, "2", "2022-01-01T00:00:00.000000000", 153.1),
            (3, "3", "2022-01-01T00:00:00.000000000", 272.0),
        ],
        dtype=[
            ("plant_id_eia", "int64"),
            ("generator_id", "O"),
            ("report_date", "<M8[ns]"),
            ("capacity_mw", "<f8"),
        ],
    )
)

spot_dict = [
    {
        "idx_cols": ["plant_id_eia", "generator_id"],
        "fix_cols": ["capacity_mw"],
        "expect_unique": False,
        "spot_fixes": [
            (3, "1", 1000.0),
            (3, "2", 999.0),
        ],
    },
    {
        "idx_cols": ["generator_id", "report_date"],
        "fix_cols": ["capacity_mw"],
        "expect_unique": True,
        "spot_fixes": [
            ("1", datetime.strptime("2022-01-01", "%Y-%m-%d"), 200.1),
            ("1", datetime.strptime("2021-01-01", "%Y-%m-%d"), 100.1),
        ],
    },
]


for spot_fix in spot_dict:

    spot_fixes_df = pd.DataFrame(
        spot_fix["spot_fixes"], columns=spot_fix["idx_cols"] + spot_fix["fix_cols"]
    )

    # Check that the datatypes of the spot-fixed values match the existing data types.
    assert (
        spot_fixes_df.dtypes
        == barry[spot_fix["idx_cols"] + spot_fix["fix_cols"]].dtypes
    ).all(), "Spot fix data types do not match existing dataframe datatypes."

    spot_fixes_df = spot_fixes_df.set_index(spot_fix["idx_cols"])
    barry = barry.set_index(spot_fix["idx_cols"])

    if spot_fix["expect_unique"] is True:
        cols_list = ", ".join(spot_fix["idx_cols"])
        assert (
            barry.index.is_unique
        ), f"This spot fix expects a unique set of idx_col, but the idx_cols provided are not uniquely identifying: {cols_list}."

    barry.loc[spot_fixes_df.index, spot_fix["fix_cols"]] = spot_fixes_df
    barry = barry.reset_index()

barry

df.loc[df[params.record_col] == params.record_id, key] = params.fixes[key]  # Manually update value
if not (spot_fixes_df.dtypes == df[params.idx_cols + params.fix_cols].dtypes).all():

Member:

I think this may be an overly strict test. E.g. what if the type of the specified fix is int but the column being fixed is a nullable pandas Int64 dtype? Some unit tests will help ferret out this kind of thing too.

Member Author:

Excellent point. Retooling to use a simpler solution here: we convert the dtypes of the input to match the corresponding named column dtypes in the target dataframe. That way the spot fix can be run before or after dtype conversions in the main transform function, and we get more flexible handling of string -> datetime and float/integer conversions. By default this will raise a ValueError if the input can't be converted into the correct dtype, and point to the source of the issue, which I think is ideal behavior in this case.
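
A minimal sketch of that coercion step, assuming the fixes have already been assembled into their own dataframe (variable names here are illustrative, not the PR's actual code):

import pandas as pd

# Target dataframe whose column dtypes the fixes must match:
df = pd.DataFrame({"plant_id_eia": [3], "capacity_mw": [153.1]})

# Spot fixes supplied as strings get coerced to the target column dtypes;
# astype() raises a ValueError if a value can't be converted, which points
# directly at the offending fix.
spot_fixes_df = pd.DataFrame({"plant_id_eia": ["3"], "capacity_mw": ["999.9"]})
spot_fixes_df = spot_fixes_df.astype(df[spot_fixes_df.columns].dtypes.to_dict())

assert (spot_fixes_df.dtypes == df.dtypes).all()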

Member:
Yeah, that sounds like a good solution -- just making sure the fix is compatible with (convertible to) the type that it'll find in the context of the dataframe it's part of.

@e-belfer changed the title from "Added spot fix function/class" to "Add spot fix function/class" on Feb 14, 2023
@aesharpe (Member) commented:

Let's Define Spot Fixing...

I would love to settle on a clear definition for "spot fix" (and maybe change the function name to be more specific). It can mean many different things, and I'm not sure how many of those things we actually want this function to cover. @zaneselvans you mentioned wanting to use this for a lot of different types of spot fixes, but I don't want this to become a black box.

I think it's worth asking ourselves if we want more functions with overlapping capacity or fewer, less specific functions with generic purposes. I'm slightly in favor of the former for clarity's sake. For instance, we have the function replace_with_na which we could probably merge into the spot-fixer if we wanted, but it's nice to have a specific logger for replacing things with NA (though we could also have more nuanced logging outputs for the spot-fixer). Another example is whether we should define a separate function for something like bad +/- signs. Technically that could be done by a spot-fixer, but do we want that?

Right now the SpotFixer class in classes.py says:

Parameters that replace certain values with a manual corrected value

And the spot_fix_values() function says:

Manually fix one-off singular missing values across a DataFrame.

In my mind, the purpose of this spot fixer is to enable quick, one-off fixes for those gnarly little errors that slip through the cracks. Say someone is looking through the data and notices a tiny but obvious mistake like a misspelling or year error -- instead of getting lost in an issue thread or going to die in the icebox, it can get taken care of in the spot fixer. In other words:

  • Cherry pick singular bad values and replace them with specific new values.
    • Ex: I found a typo in this one plant name that I want to fix, or this row is blank but should actually be plant name XYZ.

Other, similar types of spot fixing include:

  • Bulk replace certain bad values with certain other values.
    • Ex: All 0s should be NA or all cases of "Plant4" should be "Plant"
    • Something like categorize_strings but less strict...(maybe categorize_strings itself doesn't have to be strict, and we just rely on the enum defined in fields.py enforced in enforce_schema function, or we add another parameter about strictness).
  • Use regex to replace bad values.
    • Ex: Anytime you see the phrase "X Plant" replace it with "Y Plant" or in this specific row, replace "--" with "-"

I think at most this function could do all three of these things. But I want to define clear boundaries so it doesn't overlap with or get confused with any of the other extant functions.

Current Ad-hoc spot fixes:

Here's a list of some spot fixes I found in the FERC1 and EIA data that can help inform how we think about spot fixes:

  • Combine data from certain adjacent rows. I made a function called spot_fix_rows() for the small plants table that does this. (We might want to consider renaming one or both functions to distinguish them).
  • Alter part of the data with a constant. The ownership_eia860 table reports pre-2012 values as percentages and post-2012 values as proportions. To sync the data, we divide all pre-2012 values by 100.
  • Fix bad +/- signage throughout the data. We created the function apply_sign_conventions for the plant_in_service table, and we might want to flesh that out instead of just using a spot-fixer. We could use a generic spot-fixer for one-off bad negatives, but I wonder if we'd be better off creating a function with a more specific purpose (fixing bad signs), if just for transparency about what's happening.
  • Fix spelling of should-be categoricals.
    • The ownership_eia860 table creates a new column called owner_country for instances of CN (Canada) in the owner_state column, changes the spelling to CAN, and nullifies all CN values in owner_state.
    • The generators_eia860 and plants_eia860 tables have a few should-be boolean columns with values N, Y, X. We determined that X means N and replaced them accordingly so the boolean converter would properly recognize the values.
    • The generators_eia860 table also has some U values that are converted to unknown
    • The eia861 tables also do this with bad NERC region values and balancing authority codes
  • Fillna(). For the generators_eia860 table, we use fillna("RE") on the operational_status_code because we concatenated several tables and one of them didn't have values for operational_status_code.
  • One-off Errors.
    • The balancing_authority_eia861 table has a case of a single balancing authority id reported as 13047 when it should be 13407.
    • The balancing_authority_eia861 table

...will continue to update as I find more

@e-belfer (Member, Author) commented Feb 14, 2023

Thanks for this @aesharpe.

In my mind, this function as initially described in the issue and as I've written it is well suited to addressing relatively contained errors that one spots while poking around in the data and would like to manually fix. This could include fixes like correcting typos, fixing an input value for a single plant or year, or fixing IDs as in the balancing_authority_eia861 table. This is similar to how you've described cherry picking, plus a more limited subset of bulk replacing a small handful of specific values. Ideally, one would use this for instances where no more than several dozen values are being replaced in any one set of spot fixes, not where one wishes to update hundreds or more values. For instance, we could presumably replace the function spot_fix_rows with transformation parameters passed to spot_fix_values.

Other functions like pd.DataFrame.replace() would be better suited to tasks where the value used to find a record is also the value to be replaced, such as replacing all instances of a value or regex match (e.g. CN -> CAN, big--plant -> big_plant).
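
For contrast, a tiny sketch of the kind of value-for-value and regex replacement that pandas' built-in replace already handles well (the example values are illustrative):

import pandas as pd

df = pd.DataFrame(
    {"owner_state": ["CN", "TX"], "plant_name": ["big--plant", "sparky"]}
)

# Value-for-value replacement: every CN becomes CAN.
df["owner_state"] = df["owner_state"].replace({"CN": "CAN"})

# Regex replacement: collapse doubled dashes wherever they appear.
df["plant_name"] = df["plant_name"].replace("--", "-", regex=True)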

I'm inclined to keep the form of this spot_fixer() general enough to accept many kinds of input data, but its intended purpose relatively limited. I'm happy to update the function documentation to make the intended use of this function clearer.

@zaneselvans (Member) commented:

@aesharpe Most of the data cleaning we have to do can be done in much more general ways, but it seems like there often end up being a few dregs that we know are wrong, and can identify fixes for, but that don't fit into any of the more generalized cleaning processes, so we've ended up imposing manually compiled edits. Spot fixing should be a last resort. When other more general fixes are available, we should use them first. Also, it only makes sense to use this kind of fix when it's a hard-coded edit. It can't include regexes. It can't accommodate things like converting percentages to proportions. We already have the coding tables and encoders to fix bad codes and document the meanings of the correct codes.

So I think this should only be used on things like the one-off errors in your list. The other fixes you listed are more general / programmatic. Some examples that seem appropriate:

  • Timezone offset code fixes enumerated at the top of pudl.transform.ferc714
  • Balancing authority name & ID fixes enumerated at the top of pudl.transform.eia861
  • County name fixes enumerated at the top of pudl.transform.eia861

I think the only things this PR lacks are the data-type compatibility check/enforcement step that @e-belfer suggested, and some unit tests.

We can create a separate issue enumerating existing manually compiled spot-fixes that can be converted to use this function and the format of fix that it requires.

@aesharpe (Member) commented:

@zaneselvans ok, I'm glad we're on the same page about this. I was thinking you meant something more broad when you said "generic". The list of spot fixes I made was more to show the scope of little tweaks we do and to help us consider what we would want to call a spot fix.

@e-belfer marked this pull request as ready for review on February 15, 2023 15:46
@zaneselvans (Member) left a comment

Looks good overall! I suggested a few additional test cases.

Comment on lines +341 to +350
(1, 1776, "2020-01-01T00:00:00.000000000", 132.1, "planty"),
(2, 1976, "2020-01-02", 2000.0, "genny"),
(3, 1976, date.today(), 123.1, "big burner"),
(4, 1976, "1985-02-01", 213.1, "sparky"),
(5, pd.NA, pd.NA, -5.0, "barry"),
(6, 2000, "2000-01-01", 231.1, "replace me"),
(7, pd.NA, pd.NA, 101.10, pd.NA),
(8, 2012, "01-01-2020", 899.98, "another plant name"),
(9, 1850, "2022-03-01T00:00:00.000000000", 543.21, np.nan),
(10, date.today().year, date.today(), 8.1, "cat corp"),

Member:

Some other test cases that come to mind:

  • Check that the type enforcement works by giving it e.g. a numeric string "123.456" as a fix to apply in a float column, and a number to apply in a string column (e.g. 123.456 should get turned into "123.456")
  • Check that when an applied spot-fix cannot be cast to the data type of the input column it applies to, the appropriate exception is raised (e.g. give it a non-numeric string to apply in a float column).
  • Check that when a given set of idx_cols criteria select more than one row, the spot fix is applied to all of the selected rows (e.g. try using a single spot fix to set all plant names with year=1976 to "Bicentennial")
  • Check what happens when the spot fixes you give it select zero rows -- should result in no change to the input dataframe at all, right?

Are there any other edge cases you can think of that need to be exercised?
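
As a small illustration of the multi-row case using plain pandas (stand-in logic, not the PR's actual function), a single fix whose idx_cols values match several rows should update all of them, while the zero-match case should simply leave the dataframe unchanged:

import pandas as pd

df = pd.DataFrame(
    {"year": [1976, 1976, 2000], "plant_name": ["plant a", "plant b", "plant c"]}
).set_index("year")

# One fix keyed only on year=1976 hits both 1976 rows.
fixes = pd.DataFrame({"year": [1976], "plant_name": ["Bicentennial"]}).set_index("year")
df.loc[fixes.index, ["plant_name"]] = fixes

assert (df.loc[1976, "plant_name"] == "Bicentennial").all()
assert df.loc[2000, "plant_name"] == "plant c"  # unmatched row is untouched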

@e-belfer (Member, Author) commented Feb 15, 2023

Check that the type enforcement works by giving it e.g. a numeric string "123.456" as a fix to apply in a float column, and a number to apply in a string column (e.g. 123.456 should get turned into "123.456")

Added.

Check that when an applied spot-fix cannot be cast to the data type of the input column it applies to, the appropriate exception is raised (e.g. give it a non-numeric string to apply in a float column).

Added this and a check that the non-unique error is also raised as expected.

Check that when a given set of idx_cols criteria select more than one row, the spot fix is applied to all of the selected rows (e.g. try using a single spot fix to set all plant names with year=1976 to "Bicentennial")

This should already be happening to convert all 1976 plants to have capacity_mw 999.9. One of these values is subsequently overwritten in a later spot fix, but two should remain changed.

Check what happens when the spot fixes you give it select zero rows -- should result in no change to the input dataframe at all, right?

Added.

Member:

I'm not seeing the string-to-numeric or numeric-to-string type casting tests. What am I missing?

@cmgosnell (Member) commented:

@cmgosnell Is the original set of manual fixes you provided still working as intended, even though record_id isn't a primary key anymore? I'll wait to update test.validate.ferc1_test.test_minmax_rows until I know what we're changing immediately, if anything.

Maybe I lost the thread here on this, but I kinda thought these name fixes were for plant records that had only one record for each record_id. But I'm assuming this wasn't the case because you have expect_unique set to False? I think for this one in particular I would have thought it would be unique, but just fixing the name even on multiple rows feels relatively harmless!

@e-belfer (Member, Author) commented:

@cmgosnell Got it! I think there was some miscommunication about whether or not record_id is uniquely identifying, and it should be in this case as far as I understand. I updated the expectations and they haven't thrown any errors so I guess that's a secondary confirmation :)

@e-belfer removed the request for review from aesharpe on February 16, 2023 20:08
@cmgosnell (Member) left a comment

🎉 🟢

Labels
data-cleaning (Tasks related to cleaning & regularizing data during ETL), ferc1 (Anything having to do with FERC Form 1), rmi, xbrl (Related to the FERC XBRL transition)
Development

Successfully merging this pull request may close these issues.

create spot fixer as generic transformer & apply to missing plant names
4 participants