-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise cps_stage4/extrapolation.py script #242
Revise cps_stage4/extrapolation.py script #242
Conversation
cps_stage4/extrapolation.py
Outdated
def _extrapolate(WT, I, benefits, prob, target, | ||
benefit_name, benefit_year, tol=0.01, J=15): | ||
def _extrapolate(WT, III, benefits, prob, target, | ||
benefit_name, benefit_year, tol, JJJ=15): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martinholmer why did you change the variable names here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the pucodestyle warnings about ambiguous variable name.
See this list of warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, but III
doesn't seem to add any more information about what the matrix is than the information given by I
. Could we use something like indicator
or ind
, given that this is an indicator matrix. JJJ
could be num_members_in_unit
or something else like that. Does that seem sensible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hdoupe said:
I see, but
III
doesn't seem to add any more information about what the matrix is than the information given byI
.
Yes, you are absolutely right about that.
Could we use something like
indicator
orind
, given that this is an indicator matrix.JJJ
could benum_members_in_unit
or something else like that. Does that seem sensible?
I like your proposed rename of JJJ
because the variable name says what it is.
But changing III
to indicator
still leaves readers of the script wondering what is being indicated.
What does III
indicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. By "indicator" I mean: indicator of whether the person is participating or not. How about participating_switch
or participating_indicator
? These are still pretty long though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hdoupe said:
JJJ
could benum_members_in_unit
or something else like that.
Do you mean number of people in a filing unit? If so, that doesn't seem right given the code.
For example, what doesJJJ=15
mean? Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that J=15
means that there are at a maximum 15 members in each unit. That means that the maximum width of the participation indicator matrix is 15.
cps_stage4/extrapolation.py
Outdated
prob_col = [col for col in list(cps_benefit) | ||
if col.startswith('{0}_PROB'.format(benefit.upper()))] | ||
if col.startswith('{}_PROB'.format(benefit.upper()))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martinholmer what's the benefit of using '{}'.format(...)
over '{0}'.format(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's one character less.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for the explanation.
@martinholmer said:
Is there a way that we can examine how the data that is used as an input to this script has changed over time? Perhaps, something has changed in this script, but if the inputs data is different then, that would explain why the tolerance needed to be raised and why the results are so different. |
@martinholmer said:
I'm looking into this now. |
@hdoupe said:
Yes, I'm sure the inputs have changed. The |
@martinholmer said:
OK, that sounds good to me. |
The line of code that causes the pandas warning is:
This is taking two dataframes:
This line of code produces one dataframe:
Thus, the two dataframes are simply stacked on top of each other. They are aligned by their "columns". That is the Pandas issue pandas-dev/pandas#4588 discusses whether pandas should sort these columns (or rows if In our case, both dataframes ought to have the same columns. I am unfamiliar with the inner workings of pandas and thus, cannot say whether the columns are guaranteed to have the same ordering or not. So, my recommendation is to pass This explanation is based on my reading of the docs. If anyone has a different interpretation of them, then I'm interested in hearing it. Also, please let me know if this explanation is unclear or could be improved. I hope this explanation clears things up. @martinholmer thanks for laying out the problem clearly and directing me to the relevant portion of the pandas docs. |
@hdoupe, We need to revisit the issue of the Pandas warning. The warning (the text of which I post below) arises when this line is executed:
So, you are merging the The complete warning is this:
I have no idea what you're doing here, so you're going to have to give me more direction. |
@martinholmer asked:
This isn't quite correct. Consider this toy example:
There are two dataframes with the same columns "a" and "b". Both columns in both data frames have two elements in them. The dataframes are stacked on top of each other. The result is one dataframe with two columns "a" and "b" and four elements in each column. This is the same thing that occurs in
I'm not sure. Both dataframes have the same columns. There could be an internal pandas bug or something in the script that swaps the order of the columns. If the column order is changed, then yes, they would need to be sorted. However, I'm not sure why the order would change. I hope this helps. I'm happy to continue going over how this script works and working to make it easier to understand. |
@hdoupe and @andersonfrailey, I think PR #242 is ready for review. All the changes today are either cosmetic style changes or eliminate the Pandas concat warning. Today's changes produce exactly the same results as were being produced before today's changes. |
Thanks for working on this @martinholmer and @hdoupe. LGTM, but I'd defer to @hdoupe's final judgement. |
Yes, this looks good to me. Thanks for making these updates @martinholmer. I'll take a look at the failing tests soon. It looks like the path names were never moved into fixtures in addition to some other maintenance work that they will need. |
This pull request makes changes that allow the
extrapolation.py
script to run to completion and to pass the taxdata repo's code-style test. The first commit in this PR increases the relative tolerance (for an acceptable extrapolation) from 0.01 to 0.05, which allows the script to run to completion. The second commit eliminates several pycodestyle (nee PEP8) warnings such as line too long, too many spaces, and ambiguous variable name. The changes in the second commit in this PR produce exactly the samecps_benefits.csv.gz
file as the first commit produced, confirming that the changes in the second commit are purely cosmetic.However, the contents of the new
cps_benefits.csv.gz
file produced with this code are substantially different from the contents of thecps_benefits.csv.gz
file on the taxdata master branch (which is currently being used in Tax-Calculator). The next step is to determine if this newcps_benefits.csv.gz
file has sensible content. We plan to do that by using it (and other more current taxdata made files) in the kind of "test" described in taxdata issue #241.@hdoupe and @andersonfrailey, I'd appreciate it if you could review these changes.
Also, @hdoupe, I need your direction on how to handle the pandas warning described at the bottom of this 232 comment.