402 incorrect flags #32

Merged
merged 6 commits into from
Jun 25, 2024
22 changes: 18 additions & 4 deletions src/imputation_flags.py
@@ -5,6 +5,7 @@
 def create_impute_flags(
     df: pd.DataFrame,
     target: str,
+    period: str,
     reference: str,
     strata: str,
     auxiliary: str,
@@ -25,9 +26,10 @@ def create_impute_flags(
         DataFrame containing forward, backward predictive period columns (
         These columns are created by calling flag_matched_pair_merge forward
         and backwards)
-
     target : str
         Column name containing target variable.
+    period: str
+        Column name containing date variable.
     reference : str
         Column name containing business reference id.
     strata : str
@@ -59,11 +61,20 @@ def create_impute_flags(
     backward_target_roll = "b_predictive_" + target + "_roll"
     forward_aux_roll = "f_predictive_" + auxiliary + "_roll"

-    df[forward_target_roll] = df.groupby([reference, strata])[
+    df.sort_values([reference, strata, period], inplace=True)
+
+    # TODO : similar conditions at cum imputation links
+    df["fill_group"] = (
+        (df[period] - pd.DateOffset(months=1) != df.shift(1)[period])
Collaborator:

Is the month offset hard-coded? The data frequency can vary; if data were supplied every 4 months, this would need to be updated.

+        | (df[strata].diff(1) != 0)
+        | (df[reference].diff(1) != 0)
+    ).cumsum()
Collaborator:

TODO: cumsum() does not work when the date offset is -1.
This would be nice to have but is probably not essential.
cumsum() works from the top to the bottom of a column, so we might need to see whether that can be reversed.
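
On the reversal question, here is a minimal sketch rather than the PR's code. The toy data and the helper name `boundary_flags` are assumptions made for illustration. It suggests that when rows are compared with the next row instead of the previous one, the flag lands on the last row of each run, so the cumulative sum has to run bottom-up to keep each run together.

```python
import pandas as pd

# Toy data (made up), already sorted by reference, strata, period.
df = pd.DataFrame(
    {
        "reference": [1, 1, 1, 2, 2, 2],
        "strata": [100] * 6,
        "period": pd.to_datetime(
            [
                "2020-01-01", "2020-02-01", "2020-03-01",
                "2020-01-01", "2020-02-01", "2020-04-01",
            ]
        ),
    }
)


def boundary_flags(frame: pd.DataFrame, step: int) -> pd.Series:
    """Hypothetical helper: True where a row does not continue its neighbour.

    step=1 compares each row with the previous one, step=-1 with the next one.
    """
    neighbour = frame.shift(step)
    offset = pd.DateOffset(months=1)  # monthly data assumed
    expected = frame["period"] - offset if step == 1 else frame["period"] + offset
    return (
        (expected != neighbour["period"])
        | (frame["strata"] != neighbour["strata"])
        | (frame["reference"] != neighbour["reference"])
    )


# step=1 flags the first row of each run, so a top-down cumsum works.
forward_ids = boundary_flags(df, 1).cumsum()

# step=-1 flags the last row of each run, so the cumsum runs bottom-up.
backward_ids = boundary_flags(df, -1)[::-1].cumsum()[::-1]

print(forward_ids.tolist())   # [1, 1, 1, 2, 2, 3]
print(backward_ids.tolist())  # [3, 3, 3, 2, 2, 1]

# The labels differ, but both columns split the rows into the same groups.
same_groups = (pd.factorize(forward_ids)[0] == pd.factorize(backward_ids)[0]).all()
print(same_groups)  # True
```

Because only group membership matters to a later `groupby`, the differing integer labels should not affect the fill.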

@AntonZogk (Collaborator, Author), Jun 25, 2024:

This code assigns a unique id to each run of consecutive rows that meet the following criteria (the values are sorted first):

  1. The period of a row equals the period of the previous row plus the date offset
  2. No difference in strata
  3. No difference in reference

The hard-coded 1 in pd.DateOffset refers to the data frequency, so this approach will fail if non-monthly data are supplied. The hard-coded 1 in shift and diff, however, refers to the previous row (this is why we sorted the data), so there is no reason to change those values. If we did change them to -1, we would need to apply cumsum() in reverse order (we can use [::-1].cumsum()). Both 1 and -1 would then give the same result: rows get the same integer if they share the same strata and reference and the date difference is as expected. There is a TODO for now, and we can revisit this in the future. Below is the result without cumsum() for the values 1 and -1.

    reference  strata  period               target_variable  fill_group_1  fill_group_-1
0   1          100     2020-01-01 00:00:00  8444             True          False
1   1          100     2020-02-01 00:00:00  nan              False         False
2   1          100     2020-03-01 00:00:00  2003             False         False
3   1          100     2020-04-01 00:00:00  1003             False         True
4   2          100     2020-01-01 00:00:00  nan              True          False
5   2          100     2020-02-01 00:00:00  nan              False         False
6   2          100     2020-03-01 00:00:00  nan              False         False
7   2          100     2020-04-01 00:00:00  3251             False         True
8   3          100     2020-01-01 00:00:00  nan              True          False
9   3          100     2020-02-01 00:00:00  7511             False         False
10  3          100     2020-03-01 00:00:00  1234             False         False
11  3          100     2020-04-01 00:00:00  1214             False         True
12  4          100     2020-01-01 00:00:00  64               True          False
13  4          100     2020-02-01 00:00:00  nan              False         False
14  4          100     2020-03-01 00:00:00  nan              False         False
15  4          100     2020-04-01 00:00:00  254              False         True
16  5          100     2020-01-01 00:00:00  65               True          False
17  5          100     2020-02-01 00:00:00  342              False         False
18  5          100     2020-03-01 00:00:00  634              False         False
19  5          100     2020-04-01 00:00:00  254              False         True
20  6          100     2020-01-01 00:00:00  64               True          False
21  6          100     2020-02-01 00:00:00  nan              False         False
22  6          100     2020-03-01 00:00:00  654              False         False
23  6          100     2020-04-01 00:00:00  nan              False         True
24  7          100     2020-01-01 00:00:00  nan              True          False
25  7          100     2020-02-01 00:00:00  nan              False         False
26  7          100     2020-03-01 00:00:00  nan              False         True
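
The three criteria in this reply, together with the reviewer's concern about data frequency, can be sketched as a small helper. This is an illustration under stated assumptions, not the merged implementation: the function name `add_fill_group` and its `months` parameter are hypothetical (the PR hard-codes a one-month offset), and the sample frames are made up.

```python
import pandas as pd


def add_fill_group(
    df: pd.DataFrame, reference: str, strata: str, period: str, months: int = 1
) -> pd.DataFrame:
    """Return a sorted copy of df with a 'fill_group' id column.

    `months` is the expected gap between consecutive periods. It is a
    hypothetical parameter; the merged code hard-codes 1.
    """
    out = df.sort_values([reference, strata, period]).reset_index(drop=True)
    previous = out.shift(1)
    # A new group starts when the period is not exactly `months` after the
    # previous row's period, or when strata or reference changes.
    starts_new_group = (
        (out[period] - pd.DateOffset(months=months) != previous[period])
        | (out[strata] != previous[strata])
        | (out[reference] != previous[reference])
    )
    out["fill_group"] = starts_new_group.cumsum()
    return out


# Monthly data with a missing March: the gap splits the rows into two groups.
monthly = pd.DataFrame(
    {
        "reference": [8, 8, 8, 8],
        "strata": [100] * 4,
        "period": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-04-01", "2020-05-01"]
        ),
    }
)
print(add_fill_group(monthly, "reference", "strata", "period")["fill_group"].tolist())
# [1, 1, 2, 2]

# Quarterly data: a 1-month offset treats every row as its own group,
# while a 3-month offset keeps the rows together.
quarterly = pd.DataFrame(
    {
        "reference": [9, 9, 9],
        "strata": [100] * 3,
        "period": pd.to_datetime(["2020-01-01", "2020-04-01", "2020-07-01"]),
    }
)
print(add_fill_group(quarterly, "reference", "strata", "period", months=1)["fill_group"].tolist())
# [1, 2, 3]
print(add_fill_group(quarterly, "reference", "strata", "period", months=3)["fill_group"].tolist())
# [1, 1, 1]
```

Comparing with `shift(1)` instead of `diff(1) != 0` is a choice made here so that the sketch also works for non-numeric reference ids.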


+    df[forward_target_roll] = df.groupby([reference, strata, "fill_group"])[
         "f_predictive_" + target
     ].ffill()

-    df[backward_target_roll] = df.groupby([reference, strata])[
+    df[backward_target_roll] = df.groupby([reference, strata, "fill_group"])[
         "b_predictive_" + target
     ].bfill()

@@ -80,7 +91,9 @@ def create_impute_flags(
     construction_conditions = df[target].isna() & df[auxiliary].notna()
     df["c_flag"] = np.where(construction_conditions, True, False)

-    df[forward_aux_roll] = df.groupby([reference, strata])[predictive_auxiliary].ffill()
+    df[forward_aux_roll] = df.groupby([reference, strata, "fill_group"])[
+        predictive_auxiliary
+    ].ffill()

     fic_conditions = df[target].isna() & df[forward_aux_roll].notna()
     df["fic_flag"] = np.where(fic_conditions, True, False)
@@ -91,6 +104,7 @@ def create_impute_flags(
             backward_target_roll,
             forward_aux_roll,
             predictive_auxiliary,
+            "fill_group",
         ],
         axis=1,
         inplace=True,
4 changes: 4 additions & 0 deletions tests/imputation_flag_data.csv
@@ -26,3 +26,7 @@ reference,strata,period,target_variable,auxiliary,f_predictive_target_variable,b
 7,100,202001,,40.0,,,False,False,False,True,False,,c
 7,100,202002,,,,,False,False,False,False,True,40.0,fic
 7,100,202003,,,,,False,False,False,False,True,,fic
+8,100,202001,789,55,,,TRUE,FALSE,FALSE,FALSE,FALSE,,r
+8,100,202002,,66,789,,FALSE,TRUE,FALSE,TRUE,TRUE,55,fir
+8,100,202004,,77,,987,FALSE,FALSE,TRUE,TRUE,FALSE,,bir
+8,100,202005,987,88,,,TRUE,FALSE,FALSE,FALSE,FALSE,77,r
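
The rows added for reference 8 skip period 202003, which appears to be the case this PR targets. The sketch below is a rough illustration, not the test itself. It assumes the periods have been parsed to dates and copies only the `f_predictive_target_variable` values from those four rows. It compares a forward fill grouped the old way with one that also groups on `fill_group`.

```python
import numpy as np
import pandas as pd

# Reference 8 from the test data, with periods parsed to dates (assumption).
df = pd.DataFrame(
    {
        "reference": [8, 8, 8, 8],
        "strata": [100, 100, 100, 100],
        "period": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-04-01", "2020-05-01"]
        ),
        "f_predictive_target_variable": [np.nan, 789.0, np.nan, np.nan],
    }
)

# Same idea as the PR: a new group starts wherever the monthly run breaks.
previous = df.shift(1)
df["fill_group"] = (
    (df["period"] - pd.DateOffset(months=1) != previous["period"])
    | (df["strata"] != previous["strata"])
    | (df["reference"] != previous["reference"])
).cumsum()

# Old grouping: the value 789 is carried across the missing March.
old_roll = df.groupby(["reference", "strata"])[
    "f_predictive_target_variable"
].ffill()

# New grouping: the fill stops at the gap.
new_roll = df.groupby(["reference", "strata", "fill_group"])[
    "f_predictive_target_variable"
].ffill()

print(old_roll.tolist())  # [nan, 789.0, 789.0, 789.0]
print(new_roll.tolist())  # [nan, 789.0, nan, nan]
```

With the gap respected, 202004 no longer has a forward value to roll from, which is consistent with the `bir` marker expected for that row in the CSV.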
1 change: 1 addition & 0 deletions tests/test_imputation_flags.py
@@ -32,6 +32,7 @@ def test_create_impute_flags(self, imputation_flag_test_data):
         df_output = create_impute_flags(
             df=df_input,
             target="target_variable",
+            period="period",
             reference="reference",
             strata="strata",
             auxiliary="auxiliary",