
Revise cps_stage4/extrapolation.py script #242

Merged
merged 10 commits into from
Jul 12, 2018
Conversation

martinholmer (Contributor)

This pull request makes changes that allow the extrapolation.py script to run to completion and to pass the taxdata repo's code-style test. The first commit in this PR increases the relative tolerance (for an acceptable extrapolation) from 0.01 to 0.05, which allows the script to run to completion. The second commit eliminates several pycodestyle (née PEP8) warnings such as line too long, too many spaces, and ambiguous variable name. The changes in the second commit produce exactly the same cps_benefits.csv.gz file as the first commit, confirming that the second commit's changes are purely cosmetic.
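For context, a relative-tolerance check of the kind described above can be sketched as follows (the function name and call pattern here are hypothetical illustrations, not the script's actual code):

```python
# Hypothetical sketch of a relative-tolerance acceptance check;
# within_tol and its call pattern are illustrative, not the script's API.
def within_tol(actual, target, tol=0.05):
    """Return True when actual is within tol (as a fraction) of target."""
    return abs(actual - target) <= tol * abs(target)

# A 3% extrapolation error passes at the new tol=0.05 ...
print(within_tol(103.0, 100.0))        # True
# ... but would have failed at the old tol=0.01.
print(within_tol(103.0, 100.0, 0.01))  # False
```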

However, the contents of the new cps_benefits.csv.gz file produced with this code are substantially different from the contents of the cps_benefits.csv.gz file on the taxdata master branch (which is currently being used in Tax-Calculator). The next step is to determine if this new cps_benefits.csv.gz file has sensible content. We plan to do that by using it (and other more current taxdata-made files) in the kind of "test" described in taxdata issue #241.

@hdoupe and @andersonfrailey, I'd appreciate it if you could review these changes.

Also, @hdoupe, I need your direction on how to handle the pandas warning described at the bottom of this #232 comment.

-def _extrapolate(WT, I, benefits, prob, target,
-                 benefit_name, benefit_year, tol=0.01, J=15):
+def _extrapolate(WT, III, benefits, prob, target,
+                 benefit_name, benefit_year, tol, JJJ=15):
Collaborator

@martinholmer why did you change the variable names here?

Contributor Author

To avoid the pycodestyle warnings about ambiguous variable names.
See this list of warnings.

Collaborator

I see, but III doesn't seem to add any more information about what the matrix is than the information given by I. Could we use something like indicator or ind, given that this is an indicator matrix? JJJ could be num_members_in_unit or something else like that. Does that seem sensible?

Contributor Author

@hdoupe said:

I see, but III doesn't seem to add any more information about what the matrix is than the information given by I.

Yes, you are absolutely right about that.

Could we use something like indicator or ind, given that this is an indicator matrix? JJJ could be num_members_in_unit or something else like that. Does that seem sensible?

I like your proposed rename of JJJ because the variable name says what it is.
But changing III to indicator still leaves readers of the script wondering what is being indicated.
What does III indicate?

Collaborator

Ah, right. By "indicator" I mean: an indicator of whether the person is participating or not. How about participating_switch or participating_indicator? These are still pretty long, though.

Contributor Author

@hdoupe said:

JJJ could be num_members_in_unit or something else like that.

Do you mean the number of people in a filing unit? If so, that doesn't seem right given the code.
For example, what does JJJ=15 mean? Can you clarify?

Collaborator

My understanding is that J=15 means that there are at most 15 members in each unit. That means the maximum width of the participation indicator matrix is 15.
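As a hypothetical illustration of that layout (plain Python, not the script's actual data structures), each unit's participation indicators are padded out to J member slots:

```python
J = 15  # assumed maximum number of members per unit, per the discussion above

# Hypothetical rows of a participation indicator matrix:
# a 1 marks a member slot whose person participates in the program,
# and every row is padded with zeros out to width J.
units = [
    [1, 0, 1] + [0] * 12,  # three-person unit, two participants
    [1] + [0] * 14,        # single-member unit, one participant
]

# Every row has the same fixed width J, regardless of actual unit size.
assert all(len(row) == J for row in units)
```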

 prob_col = [col for col in list(cps_benefit)
-            if col.startswith('{0}_PROB'.format(benefit.upper()))]
+            if col.startswith('{}_PROB'.format(benefit.upper()))]
Collaborator

@martinholmer what's the benefit of using '{}'.format(...) over '{0}'.format(...)?

Contributor Author

It's one character shorter.
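The two forms are otherwise equivalent; a quick check (using a hypothetical benefit name) confirms they produce the same string:

```python
benefit = "snap"  # hypothetical benefit name, for illustration only

# Explicit positional index vs. implicit (auto-numbered) field:
explicit = '{0}_PROB'.format(benefit.upper())
implicit = '{}_PROB'.format(benefit.upper())
assert explicit == implicit == 'SNAP_PROB'
```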

Collaborator

OK, thanks for the explanation.


hdoupe commented Jul 5, 2018

@martinholmer said:

However, the contents of the new cps_benefits.csv.gz file produced with this code are substantially different from the contents of the cps_benefits.csv.gz file on the taxdata master branch (which is currently being used in Tax-Calculator). The next step is to determine if this new cps_benefits.csv.gz file has sensible content. We plan to do that by using it (and other more current taxdata-made files) in the kind of "test" described in taxdata issue #241.

Is there a way that we can examine how the data used as input to this script has changed over time? Perhaps something has changed in this script, but if the input data are different, that would explain why the tolerance needed to be raised and why the results are so different.


hdoupe commented Jul 5, 2018

@martinholmer said:

Also, @hdoupe, I need your direction on how to handle the pandas warning described at the bottom of this #232 comment.

I'm looking into this now.

@martinholmer (Contributor Author)

@hdoupe said:

if the input data are different, then that would explain why the tolerance needed to be raised and why the results are so different.

Yes, I'm sure the inputs have changed. The extrapolation.py script hasn't been run in months, and when I tried to run it, it crashed because it was reading a file that did not exist in the taxdata repository. I think things were way too messed up to do a backward-looking investigation. My focus is on looking forward and getting sensible (that is, tested) benefits extrapolation results. I think the preliminary testing work in issue #241 already shows previously undetected problems with the extrapolated TANF data.


hdoupe commented Jul 6, 2018

@martinholmer said:

My focus is on looking forward and getting sensible (that is, tested) benefits extrapolation results. I think the preliminary testing work in issue #241 already shows previously undetected problems with the extrapolated TANF data.

OK, that sounds good to me.


hdoupe commented Jul 6, 2018

The line of code that causes the pandas warning is:

result = pd.concat([noncandidates, candidates], axis=0, ignore_index=False)

This is taking two dataframes:

  • noncandidates -- an (N_n * J) by 4 matrix (why 4? see the Columns bullet)
    • N_n: the number of people who are not candidates for a change in participation status
    • J: the maximum number of people in each unit (15 for our data set)
    • Columns: 'wt', 'I', 'benefits', 'prob' (weight, participation status indicator, benefit amount received, probability of participating in the program)
  • candidates -- an (N_c * J) by 4 matrix
    • N_c: the number of people who are candidates for a change in participation status
    • same definitions for the other dimensions

This line of code produces one dataframe:

  • result -- a (J * (N_n + N_c)) by 4 matrix

Thus, the two dataframes are simply stacked on top of each other. They are aligned by their "columns". That is, the I column of candidates is matched to the I column of noncandidates, and so on with the other columns. Both dataframes have exactly the same columns, and those columns are in exactly the same order.

Pandas issue pandas-dev/pandas#4588 discusses whether pandas should sort these columns (or rows if axis=1) by default if they are not already aligned. If you pass sort=True, then pandas will sort both dataframes by their columns (or rows) using alphanumeric ordering. If you pass sort=False, then pandas will not sort the dataframes no matter whether the columns (or rows) are aligned or not.

In our case, both dataframes ought to have the same columns. I am unfamiliar with the inner workings of pandas and thus, cannot say whether the columns are guaranteed to have the same ordering or not. So, my recommendation is to pass sort=True. join can be set to inner or outer. Both dataframes should have the same columns--no more and no less. So, it doesn't matter whether the join is inner or outer.
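A small sketch of how the sort flag behaves when the columns of the two frames are not in the same order (toy column names, not the script's):

```python
import pandas as pd

# Toy frames whose columns appear in different orders.
df1 = pd.DataFrame({"b": [1], "a": [2]})
df2 = pd.DataFrame({"a": [3], "b": [4]})

# sort=True sorts the non-concatenation axis alphabetically ...
print(pd.concat([df1, df2], sort=True).columns.tolist())   # ['a', 'b']
# ... while sort=False keeps the first frame's column order.
print(pd.concat([df1, df2], sort=False).columns.tolist())  # ['b', 'a']
```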

This explanation is based on my reading of the docs. If anyone has a different interpretation of them, then I'm interested in hearing it. Also, please let me know if this explanation is unclear or could be improved. I hope this explanation clears things up.

@martinholmer thanks for laying out the problem clearly and directing me to the relevant portion of the pandas docs.


martinholmer commented Jul 10, 2018

@hdoupe, We need to revisit the issue of the Pandas warning.

The warning (the text of which I post below) arises when this line is executed:

        result = pd.concat([noncandidates, candidates], axis=0,
                           ignore_index=False)

So, you are merging the noncandidates and the candidates dataframes on axis=0, which is the rows.
Is that correct? Is that what you want to do? I don't see how the rows match up since the names of the two dataframes suggest they are two distinct groups.

The complete warning is this:

extrapolation.py:159: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.
  ignore_index=False)

I have no idea what you're doing here, so you're going to have to give me more direction.
To me, the warning says the order of the columns is not the same in the two dataframes. Is that true or false?


hdoupe commented Jul 10, 2018

@martinholmer asked:

So, you are merging the noncandidates and the candidates dataframes on axis=0, which is the rows.
Is that correct? Is that what you want to do? I don't see how the rows match up since the names of the two dataframes suggest they are two distinct groups.

This isn't quite correct. Consider this toy example:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> df
   a  b
0  1  3
1  2  4
>>> df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
>>> df2
   a  b
0  5  7
1  6  8
>>> r = pd.concat([df, df2], axis=0)
>>> r
   a  b
0  1  3
1  2  4
0  5  7
1  6  8
>>> r.a
0    1
1    2
0    5
1    6
Name: a, dtype: int64
>>> r.b
0    3
1    4
0    7
1    8
Name: b, dtype: int64

There are two dataframes with the same columns "a" and "b". Both columns in both data frames have two elements in them. The dataframes are stacked on top of each other. The result is one dataframe with two columns "a" and "b" and four elements in each column.

This is the same thing that occurs in extrapolation.py. Does this make sense?

I have no idea what you're doing here, so you're going to have to give me more direction.
To me, the warning says the order of the columns is not the same in the two dataframes. Is that true or false?

I'm not sure. Both dataframes have the same columns. There could be an internal pandas bug or something in the script that swaps the order of the columns. If the column order is changed, then yes, they would need to be sorted. However, I'm not sure why the order would change.

I hope this helps. I'm happy to continue going over how this script works and working to make it easier to understand.

@martinholmer (Contributor Author)

@hdoupe, Thanks for the detailed explanation of pd.concat logic in pull request #242.

I've added sort=False to the pd.concat call and the extrapolation.py results are unchanged. Does that seem like the correct way to suppress the warning message?


martinholmer commented Jul 10, 2018

@hdoupe and @andersonfrailey, I think PR #242 is ready for review. All the changes today are either cosmetic style changes or eliminate the Pandas concat warning. Today's changes produce exactly the same results as were being produced before today's changes.

@andersonfrailey (Collaborator)

Thanks for working on this @martinholmer and @hdoupe. LGTM, but I'd defer to @hdoupe's final judgement.


hdoupe commented Jul 11, 2018

Yes, this looks good to me. Thanks for making these updates @martinholmer.

I'll take a look at the failing tests soon. It looks like the path names were never moved into fixtures, in addition to some other maintenance work that the tests will need.
