
Revise cps_stage4/extrapolation.py script #242

Merged
merged 10 commits into from
Jul 12, 2018
Conversation

martinholmer (Contributor)

This pull request makes changes that allow the extrapolation.py script to run to completion and to pass the taxdata repo's code-style test. The first commit in this PR increases the relative tolerance (for an acceptable extrapolation) from 0.01 to 0.05, which allows the script to run to completion. The second commit eliminates several pycodestyle (née PEP8) warnings such as line too long, too many spaces, and ambiguous variable name. The changes in the second commit produce exactly the same cps_benefits.csv.gz file as the first commit, confirming that the second commit's changes are purely cosmetic.
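For context, a relative-tolerance check of the kind described above can be sketched as follows (the function name and call pattern here are hypothetical illustrations, not the script's actual code):

```python
# Hypothetical sketch of a relative-tolerance acceptance check;
# within_tol and its call pattern are illustrative, not the script's API.
def within_tol(actual, target, tol=0.05):
    """Return True when actual is within tol (as a fraction) of target."""
    return abs(actual - target) <= tol * abs(target)

# A 3% extrapolation error passes at the new tol=0.05 ...
print(within_tol(103.0, 100.0))        # True
# ... but would have failed at the old tol=0.01.
print(within_tol(103.0, 100.0, 0.01))  # False
```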

However, the contents of the new cps_benefits.csv.gz file produced with this code are substantially different from the contents of the cps_benefits.csv.gz file on the taxdata master branch (which is currently being used in Tax-Calculator). The next step is to determine if this new cps_benefits.csv.gz file has sensible content. We plan to do that by using it (and other more current taxdata-made files) in the kind of "test" described in taxdata issue #241.

@hdoupe and @andersonfrailey, I'd appreciate it if you could review these changes.

Also, @hdoupe, I need your direction on how to handle the pandas warning described at the bottom of this #232 comment.

-def _extrapolate(WT, I, benefits, prob, target,
-                 benefit_name, benefit_year, tol=0.01, J=15):
+def _extrapolate(WT, III, benefits, prob, target,
+                 benefit_name, benefit_year, tol, JJJ=15):
Collaborator

@martinholmer why did you change the variable names here?

Contributor Author

To avoid the pycodestyle warnings about ambiguous variable names.
See this list of warnings.

Collaborator

I see, but III doesn't seem to add any more information about what the matrix is than the information given by I. Could we use something like indicator or ind, given that this is an indicator matrix? JJJ could be num_members_in_unit or something else like that. Does that seem sensible?

Contributor Author

@hdoupe said:

I see, but III doesn't seem to add any more information about what the matrix is than the information given by I.

Yes, you are absolutely right about that.

Could we use something like indicator or ind, given that this is an indicator matrix? JJJ could be num_members_in_unit or something else like that. Does that seem sensible?

I like your proposed rename of JJJ because the variable name says what it is.
But changing III to indicator still leaves readers of the script wondering what is being indicated.
What does III indicate?

Collaborator

Ah, right. By "indicator" I mean: an indicator of whether the person is participating or not. How about participating_switch or participating_indicator? These are still pretty long, though.

Contributor Author

@hdoupe said:

JJJ could be num_members_in_unit or something else like that.

Do you mean the number of people in a filing unit? If so, that doesn't seem right given the code.
For example, what does JJJ=15 mean? Can you clarify?

Collaborator

My understanding is that J=15 means that there are at most 15 members in each unit. That means the maximum width of the participation indicator matrix is 15.
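As a hypothetical illustration of that layout (plain Python, not the script's actual data structures), each unit's participation indicators are padded out to J member slots:

```python
J = 15  # assumed maximum number of members per unit, per the discussion above

# Hypothetical rows of a participation indicator matrix:
# a 1 marks a member slot whose person participates in the program,
# and every row is padded with zeros out to width J.
units = [
    [1, 0, 1] + [0] * 12,  # three-person unit, two participants
    [1] + [0] * 14,        # single-member unit, one participant
]

# Every row has the same fixed width J, regardless of actual unit size.
assert all(len(row) == J for row in units)
```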

 prob_col = [col for col in list(cps_benefit)
-            if col.startswith('{0}_PROB'.format(benefit.upper()))]
+            if col.startswith('{}_PROB'.format(benefit.upper()))]
Collaborator

@martinholmer what's the benefit of using '{}'.format(...) over '{0}'.format(...)?

Contributor Author

It's one character shorter.
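The two forms are otherwise equivalent; a quick check (using a hypothetical benefit name) confirms they produce the same string:

```python
benefit = "snap"  # hypothetical benefit name, for illustration only

# Explicit positional index vs. implicit (auto-numbered) field:
explicit = '{0}_PROB'.format(benefit.upper())
implicit = '{}_PROB'.format(benefit.upper())
assert explicit == implicit == 'SNAP_PROB'
```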

Collaborator

OK, thanks for the explanation.


hdoupe commented Jul 5, 2018

@martinholmer said:

However, the contents of the new cps_benefits.csv.gz file produced with this code are substantially different from the contents of the cps_benefits.csv.gz file on the taxdata master branch (which is currently being used in Tax-Calculator). The next step is to determine if this new cps_benefits.csv.gz file has sensible content. We plan to do that by using it (and other more current taxdata-made files) in the kind of "test" described in taxdata issue #241.

Is there a way that we can examine how the data used as input to this script has changed over time? Perhaps something has changed in this script, but if the input data are different, that would explain why the tolerance needed to be raised and why the results are so different.


hdoupe commented Jul 5, 2018

@martinholmer said:

Also, @hdoupe, I need your direction on how to handle the pandas warning described at the bottom of this #232 comment.

I'm looking into this now.

@martinholmer (Contributor Author)

@hdoupe said:

if the input data are different, then that would explain why the tolerance needed to be raised and why the results are so different.

Yes, I'm sure the inputs have changed. The extrapolation.py script hasn't been run in months, and when I tried to run it, it crashed because it was reading a file that did not exist in the taxdata repository. I think things were way too messed up to do a backward-looking investigation. My focus is on looking forward and getting sensible (that is, tested) benefits extrapolation results. I think the preliminary testing work in issue #241 already shows previously undetected problems with the extrapolated TANF data.


hdoupe commented Jul 6, 2018

@martinholmer said:

My focus is on looking forward and getting sensible (that is, tested) benefits extrapolation results. I think the preliminary testing work in issue #241 already shows previously undetected problems with the extrapolated TANF data.

OK, that sounds good to me.


hdoupe commented Jul 6, 2018

The line of code that causes the pandas warning is:

result = pd.concat([noncandidates, candidates], axis=0, ignore_index=False)

This is taking two dataframes:

  • noncandidates -- an (N_n * J) by 4 matrix (why 4? see the Columns bullet)
    • N_n: the number of people who are not candidates for a change in participation status
    • J: the maximum number of people in each unit (15 for our data set)
    • Columns: 'wt', 'I', 'benefits', 'prob' (weight, participation status indicator, benefit amount received, probability of participating in the program)
  • candidates -- an (N_c * J) by 4 matrix
    • N_c: the number of people who are candidates for a change in participation status
    • same definitions for the other dimensions

This line of code produces one dataframe:

  • result -- a (J * (N_n + N_c)) by 4 matrix

Thus, the two dataframes are simply stacked on top of each other. They are aligned by their "columns". That is, the I column of candidates is matched to the I column of noncandidates, and so on with the other columns. Both dataframes have exactly the same columns, and those columns are in exactly the same order.

Pandas issue pandas-dev/pandas#4588 discusses whether pandas should sort these columns (or rows if axis=1) by default if they are not already aligned. If you pass sort=True, then pandas will sort both dataframes by their columns (or rows) using alphanumeric ordering. If you pass sort=False, then pandas will not sort the dataframes no matter whether the columns (or rows) are aligned or not.

In our case, both dataframes ought to have the same columns. I am unfamiliar with the inner workings of pandas and thus, cannot say whether the columns are guaranteed to have the same ordering or not. So, my recommendation is to pass sort=True. join can be set to inner or outer. Both dataframes should have the same columns--no more and no less. So, it doesn't matter whether the join is inner or outer.
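A small sketch of how the sort flag behaves when the columns of the two frames are not in the same order (toy column names, not the script's):

```python
import pandas as pd

# Toy frames whose columns appear in different orders.
df1 = pd.DataFrame({"b": [1], "a": [2]})
df2 = pd.DataFrame({"a": [3], "b": [4]})

# sort=True sorts the non-concatenation axis alphabetically ...
print(pd.concat([df1, df2], sort=True).columns.tolist())   # ['a', 'b']
# ... while sort=False keeps the first frame's column order.
print(pd.concat([df1, df2], sort=False).columns.tolist())  # ['b', 'a']
```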

This explanation is based on my reading of the docs. If anyone has a different interpretation of them, then I'm interested in hearing it. Also, please let me know if this explanation is unclear or could be improved. I hope this explanation clears things up.

@martinholmer thanks for laying out the problem clearly and directing me to the relevant portion of the pandas docs.


martinholmer commented Jul 10, 2018

@hdoupe, We need to revisit the issue of the Pandas warning.

The warning (the text of which I post below) arises when this line is executed:

        result = pd.concat([noncandidates, candidates], axis=0,
                           ignore_index=False)

So, you are merging the noncandidates and the candidates dataframes on axis=0, which is the rows.
Is that correct? Is that what you want to do? I don't see how the rows match up since the names of the two dataframes suggest they are two distinct groups.

The complete warning is this:

extrapolation.py:159: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.
  ignore_index=False)

I have no idea what you're doing here, so you're going to have to give me more direction.
To me, the warning says the order of the columns is not the same in the two dataframes. Is that true or false?


hdoupe commented Jul 10, 2018

@martinholmer asked:

So, you are merging the noncandidates and the candidates dataframes on axis=0, which is the rows.
Is that correct? Is that what you want to do? I don't see how the rows match up since the names of the two dataframes suggest they are two distinct groups.

This isn't quite correct. Consider this toy example:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> df
   a  b
0  1  3
1  2  4
>>> df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
>>> df2
   a  b
0  5  7
1  6  8
>>> r = pd.concat([df, df2], axis=0)
>>> r
   a  b
0  1  3
1  2  4
0  5  7
1  6  8
>>> r.a
0    1
1    2
0    5
1    6
Name: a, dtype: int64
>>> r.b
0    3
1    4
0    7
1    8
Name: b, dtype: int64

There are two dataframes with the same columns "a" and "b". Both columns in both data frames have two elements in them. The dataframes are stacked on top of each other. The result is one dataframe with two columns "a" and "b" and four elements in each column.

This is the same thing that occurs in extrapolation.py. Does this make sense?

I have no idea what you're doing here, so you're going to have to give me more direction.
To me, the warning says the order of the columns is not the same in the two dataframes. Is that true or false?

I'm not sure. Both dataframes have the same columns. There could be an internal pandas bug or something in the script that swaps the order of the columns. If the column order is changed, then yes, they would need to be sorted. However, I'm not sure why the order would change.

I hope this helps. I'm happy to continue going over how this script works and working to make it easier to understand.

@martinholmer (Contributor Author)

@hdoupe, Thanks for the detailed explanation of pd.concat logic in pull request #242.

I've added sort=False to the pd.concat call and the extrapolation.py results are unchanged. Does that seem like the correct way to suppress the warning message?


martinholmer commented Jul 10, 2018

@hdoupe and @andersonfrailey, I think PR #242 is ready for review. All the changes today are either cosmetic style changes or eliminate the Pandas concat warning. Today's changes produce exactly the same results as were being produced before today's changes.

@andersonfrailey (Collaborator)

Thanks for working on this @martinholmer and @hdoupe. LGTM, but I'd defer to @hdoupe's final judgement.


hdoupe commented Jul 11, 2018

Yes, this looks good to me. Thanks for making these updates @martinholmer.

I'll take a look at the failing tests soon. It looks like the path names were never moved into fixtures, in addition to some other maintenance work that the tests will need.
