-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
402 incorrect flags #32
Conversation
* Add fill group for filling * Fill group check if there is a missing date too * Add period argument to function, to check for missing dates
Add scenario with missing date Adjust test function to use period column
In order to check if strata, reference difference is not 0 we need to sort the values.
* Fill group works only with diff(1) since we are seperating the values by strata and if missing date exists. * astype(int) is redundant, when summing True/False same as 1/0 * Add kwargs and update docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work catching and fixing this issue Anton.
Left some minor reviews with possible suggestions.
|
||
# TODO : similar conditions at cum imputation links | ||
df["fill_group"] = ( | ||
(df[period] - pd.DateOffset(months=1) != df.shift(1)[period]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard coding the month offset? variable frequency of data, if supplied every 4 months this would need to be updated.
(df[period] - pd.DateOffset(months=1) != df.shift(1)[period]) | ||
| (df[strata].diff(1) != 0) | ||
| (df[reference].diff(1) != 0) | ||
).cumsum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToDo: cumsum() does not work when date offset is -1.
Might be nice to have but maybe not essential.
Cumsum works top to bottom of a column? Might need to see if that can be reversed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this code is trying to do is assign a unique id for the values which meets these criteria (values were sorted):
- Date difference between 2 rows is the same if you offset them
- no difference in strata
- no difference in refence
So hard coded 1
in pd.DateOffset
refers to the data frequency and this approach should fail if not monthly data are supplied. The hard coded 1
in the shift
and diff
refers to the previous row though ( this is why we sorted the data) consequently no reason changing these hard coded values. But if we do we need to change the cumsum to do a reverse order for -1, bellow is the result without cumsum() for values 1
and -1
. So for -1
we need to apply cumsum()
in reverse order (we can use [::-1].cumsum()
), but both 1
and -1
will give the same result ( values will have same integer if same strata, reference and date diffenence is ok) . There is a TODO
for now and we can revisit this in the future.
reference | strata | period | target_variable | fill_group_1 | fill_group_-1 | |
---|---|---|---|---|---|---|
0 | 1 | 100 | 2020-01-01 00:00:00 | 8444 | True | False |
1 | 1 | 100 | 2020-02-01 00:00:00 | nan | False | False |
2 | 1 | 100 | 2020-03-01 00:00:00 | 2003 | False | False |
3 | 1 | 100 | 2020-04-01 00:00:00 | 1003 | False | True |
4 | 2 | 100 | 2020-01-01 00:00:00 | nan | True | False |
5 | 2 | 100 | 2020-02-01 00:00:00 | nan | False | False |
6 | 2 | 100 | 2020-03-01 00:00:00 | nan | False | False |
7 | 2 | 100 | 2020-04-01 00:00:00 | 3251 | False | True |
8 | 3 | 100 | 2020-01-01 00:00:00 | nan | True | False |
9 | 3 | 100 | 2020-02-01 00:00:00 | 7511 | False | False |
10 | 3 | 100 | 2020-03-01 00:00:00 | 1234 | False | False |
11 | 3 | 100 | 2020-04-01 00:00:00 | 1214 | False | True |
12 | 4 | 100 | 2020-01-01 00:00:00 | 64 | True | False |
13 | 4 | 100 | 2020-02-01 00:00:00 | nan | False | False |
14 | 4 | 100 | 2020-03-01 00:00:00 | nan | False | False |
15 | 4 | 100 | 2020-04-01 00:00:00 | 254 | False | True |
16 | 5 | 100 | 2020-01-01 00:00:00 | 65 | True | False |
17 | 5 | 100 | 2020-02-01 00:00:00 | 342 | False | False |
18 | 5 | 100 | 2020-03-01 00:00:00 | 634 | False | False |
19 | 5 | 100 | 2020-04-01 00:00:00 | 254 | False | True |
20 | 6 | 100 | 2020-01-01 00:00:00 | 64 | True | False |
21 | 6 | 100 | 2020-02-01 00:00:00 | nan | False | False |
22 | 6 | 100 | 2020-03-01 00:00:00 | 654 | False | False |
23 | 6 | 100 | 2020-04-01 00:00:00 | nan | False | True |
24 | 7 | 100 | 2020-01-01 00:00:00 | nan | True | False |
25 | 7 | 100 | 2020-02-01 00:00:00 | nan | False | False |
26 | 7 | 100 | 2020-03-01 00:00:00 | nan | False | True |
Summary
The aim is fixing a bug when a target value was not supplied
Checklists
This pull request meets the following requirements:
If you feel some of these conditions do not apply for this pull request, please
add a comment to explain why.
Closes #28