Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistently use baseline expanded_income to fuzz reform results in dropq tables #1537

Merged
merged 46 commits into from
Sep 6, 2017
Merged

Consistently use baseline expanded_income to fuzz reform results in dropq tables #1537

merged 46 commits into from
Sep 6, 2017

Conversation

martinholmer
Copy link
Collaborator

@martinholmer martinholmer commented Aug 30, 2017

Motivation

This pull request was developed after looking at the problems faced by @andersonfrailey in pull request #1521. The goal of that pull request was simple enough and his skills are substantial, but the complexity of the dropq difference table logic made the task in #1521 quite challenging. I looked at the issue and couldn't figure out how to reach the goal of #1521. My conclusion was that the dropq difference table requirements were not that complicated, but that the code was unnecessarily complex, and therefore, difficult to follow.

What's Been Done?

This pull request is an attempt to simplify the code that generates dropq difference tables. The main strategy is to revise the create_difference_table utility function code to be general enough to support difference table creation in the dropq_utils.py file. This allows the removal of the dropq_diff_table function.

In the course of doing that, it was easy to add two features missing in the old dropq logic.

  • FIrst, it is now possible to create difference tables with an income measure that the TaxBrain difference table calls "Adjusted Income", which I assume is AGI. The "Adjusted Income" in TaxBrain button says "Not yet implemented" when the mouse is over the button. The high-level dropq logic in this pull request does not generate difference tables with AGI as the income measure, but doing so would now be straightforward.

  • Second, the dropq difference tables now include the tax difference expressed as a percent of baseline after-tax expanded income. However, TaxBrain would have to be revised to display that new difference statistic. This was the objective of pull request [WIP] Add After-Tax Income Percent Change column to dropq results #1521.

After doing this difference-table work, it seemed natural to continue and simplify the code that generates dropq distribution tables. The same basic strategy was pursued: generalize the create_distribution_table utility function and use it to replace the dropq_dist_table function.

Note this pull request includes all the changes in pull request #1534.

Consequences

After this pull request is merged, there will be fewer lines of code and hopefully less confusing and better documented dropq code.

The new dump option in dropq test_run_tax_calc_model has been used to check that all these code changes leave the results returned by the dropq run_nth_year_tax_calc_model function unchanged. The aggregate tax results are unchanged and all the baseline distribution table results are unchanged. However, there are some changes in the other table results. These changes occur in reforms where the policy reform causes the value of expanded_income to change for some filing units. The old code did not use baseline expanded_income to assign filing units to bins in reform-distribution-table or difference-table construction, while the new code does do that per the discussion in issue #1540.

Also, the new code standardizes DataFrame column names. So, now all six of the difference tables have the same column names and all four of the distribution tables have the same columns names. That was not always the case in the old code.

And finally, the aggregate table has been constructed by fuzzing just 3 filing unit records (instead of the 30 records).

So, to summarize the consequences of this pull request:

  • First, difference table results may differ from the results generated by the old code.

  • Second, distribution and difference table column names may differ from what they were in the old code.

  • Third, the aggregate tax totals for the reform and reform-baseline difference will be slightly more accurate.

These changes suggest that this pull request would trigger an API change, meaning the the next release would be the 0.11.0 version.

Subsequent Work

During the course of this work, it became clear that it wold be far less risky to return a dictionary of named results (rather than a unnamed tuple of results that depends on the order of the thirteen results). That change will be made in a subsequent pull request. That change would also constitute an API change, but one that would be easy for TaxBrain developers to handle according to this comment by @hdoupe.

Also, we are still not generating any distribution or difference tables using AGI as the income measure to assign filing units to decile or income bins. The TaxBrain GUI implies that this is a forthcoming feature. If that is still desirable, we can add AGI binning in a subsequent pull request.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun @brittainhard

@codecov-io
Copy link

codecov-io commented Aug 30, 2017

Codecov Report

Merging #1537 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff           @@
##           master   #1537   +/-   ##
======================================
  Coverage     100%    100%           
======================================
  Files          37      37           
  Lines        2565    2558    -7     
======================================
- Hits         2565    2558    -7
Impacted Files Coverage Δ
taxcalc/taxcalcio.py 100% <100%> (ø) ⬆️
taxcalc/utilsprvt.py 100% <100%> (ø) ⬆️
taxcalc/utils.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53c129a...f6b3e8a. Read the comment docs.

income_measure='expanded_income',
result_type='weighted_sum')
dist1_bin = create_distribution_table(df1, groupby='webapp_income_bins',
for col in [c for c in list(df2) if c.endswith('_xdec')]:
df2[col[:-5]] = df2[col]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a bit too concise and contains a magic number (why -5?). I think a more maintainable solution would be to:

  • make the list comprehension a local variable with a descriptive name
  • give "5" a descriptive variable name and add a comment on why the last five elements of "col" are chopped off.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely correct. Commit 542cedb tries to make things more clear.

@martinholmer
Copy link
Collaborator Author

Pull request #1537 is now complete and an overview of the pull request is now available.

If there are no concerns or objections, pull request #1537 (which includes #1534) will be merged into the master branch at the end of the day on Wednesday, September 6th.

@@ -1,6 +1,7 @@
"""
The dropq functions are used by TaxBrain to call Tax-Calculator in order
to maintain the privacy of the micro data being used by TaxBrain.
to maintain the privacy of the micro data being used by TaxBrain. This
is done by adding random "fuzz" to the results in each table cell.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly: "This is done by adding random "fuzz" to the sample from which the results in each table cell are drawn."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The description of what "fuzzing" means needs to be improved.
Commit 537b1c0 is an attempt to improve the documentation.

@feenberg
Copy link
Contributor

feenberg commented Sep 5, 2017 via email

@martinholmer
Copy link
Collaborator Author

@feenberg said why can't we just say the dropq logic is

dropping 3 randomly selected taxable returns from each table row

Because the dropq logic does not drop any returns from a table row.

@martinholmer martinholmer changed the title Use revised create_distribution_table instead of dropq_dist_table Consistently use baseline expanded_income to fuzz reform results in dropq tables Sep 6, 2017
@martinholmer
Copy link
Collaborator Author

@MattHJensen and @feenberg, Both of you have raised concerns about the wording of the top docstring in the dropq.py file. Here is the latest wording:

The dropq functions are used by TaxBrain to call Tax-Calculator in order
to maintain the privacy of the IRS-SOI PUF data being used by TaxBrain.
This is done by "fuzzing" reform results for several randomly selected
filing units in each table cell.  The filing units randomly selected
differ for each policy reform and the "fuzzing" involves replacing the
post-reform tax results for the selected units with their pre-reform
tax results.

If you have any remaining concerns, please raise them now.
If part of the wording is incorrect or vague, I would appreciate your suggestions for alternative wording.

@martinholmer
Copy link
Collaborator Author

After one more review of pull request #1537, I have found myself wondering about one more question.

As I understand it, the basic approach in the dropq logic is to fuzz (or obscure) reform results for three randomly selected filing units in each table row. That means that when constructing decile tables, 30 units (three in each decile) are selected for fuzzing. And when constructing WEBAPP_INCOME_BINS tables, 36 units (three in each of the twelve WEBAPP_INCOME_BINS) are selected for fuzzing. We can confirm this with the results from the following adhoc dump of a test using a PUF subsample.

bin_type=dec
True     10746
False       30
Name: nofuzz, dtype: int64

bin_type=bin
True     10740
False       36
Name: nofuzz, dtype: int64

Notice that the totals in the decile tables will not be exactly the same as the totals in the BINS tables, but that is to be expected.

So far, I have no questions.

But what about the aggregate (or fiscal totals) table, which is shown at the top of the TaxBrain "Static Results" page? One way to view that table is that it is a single bin table. If we adopt that view, then shouldn't we be fuzzing just 3 units?

From a broader perspective, there appears to be four ways to construct the aggregate table amounts:

  1. no fuzzing of reform results
  2. fuzz reform results for just 3 filing units
  3. fuzz reform results for the 30 filing units fuzzed in decile tables
  4. fuzz reform results for the 36 filing units fuzzed in BINS tables

Looking at the code on the master branch, it uses the way described in 3.
And looking at the code in this pull request, it also uses the way described in 3.

But why not use the way described in 2? That approach is a consistent application of the rule of fuzzing three records in each table row. Or, is fuzzing the aggregate results not required from a privacy perspective? If not required, then the way described in 1 would be just fine.

So, my question is which of the four ways should be used in the aggregate table.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun

@feenberg
Copy link
Contributor

feenberg commented Sep 6, 2017 via email

@MattHJensen
Copy link
Contributor

@martinholmer, I like the new description of what 'fuzzing' is.

With regard to the liabilities table, I think we should fuzz 3 records from each cell rather than 30.

@martinholmer
Copy link
Collaborator Author

@MattHJensen said in #1537:

With regard to the [aggregate] liabilities table, I think we should fuzz 3 records from each cell rather than 30.

OK, I'll make one final change to #1537 to fuzz just three records when computing the aggregate tables.

@feenberg

@martinholmer martinholmer merged commit 7e3458f into PSLmodels:master Sep 6, 2017
@martinholmer martinholmer deleted the revise-dist-table branch September 6, 2017 22:26
@feenberg
Copy link
Contributor

feenberg commented Sep 7, 2017 via email

@MattHJensen
Copy link
Contributor

@feenberg asked:

Where did 30 come from? Is there something I should be aware of?

We were dropping the same records that were dropped from the decile table, 3 for each cell. Here is @martinholmer's initial comment about the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants