-
-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistently use baseline expanded_income to fuzz reform results in dropq tables #1537
Consistently use baseline expanded_income to fuzz reform results in dropq tables #1537
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1537 +/- ##
======================================
Coverage 100% 100%
======================================
Files 37 37
Lines 2565 2558 -7
======================================
- Hits 2565 2558 -7
Continue to review full report at Codecov.
|
taxcalc/dropq/dropq_utils.py
Outdated
income_measure='expanded_income', | ||
result_type='weighted_sum') | ||
dist1_bin = create_distribution_table(df1, groupby='webapp_income_bins', | ||
for col in [c for c in list(df2) if c.endswith('_xdec')]: | ||
df2[col[:-5]] = df2[col] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a bit too concise and contains a magic number (why -5?). I think a more maintainable solution would be to:
- make the list comprehension a local variable with a descriptive name
- give "5" a descriptive variable name and add a comment on why the last five elements of "col" are chopped off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely correct. Commit 542cedb tries to make things more clear.
Pull request #1537 is now complete and an overview of the pull request is now available. If there are no concerns or objections, pull request #1537 (which includes #1534) will be merged into the master branch at the end of the day on Wednesday, September 6th. |
taxcalc/dropq/dropq.py
Outdated
@@ -1,6 +1,7 @@ | |||
""" | |||
The dropq functions are used by TaxBrain to call Tax-Calculator in order | |||
to maintain the privacy of the micro data being used by TaxBrain. | |||
to maintain the privacy of the micro data being used by TaxBrain. This | |||
is done by adding random "fuzz" to the results in each table cell. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly: "This is done by adding random "fuzz" to the sample from which the results in each table cell are drawn."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. The description of what "fuzzing" means needs to be improved.
Commit 537b1c0 is an attempt to improve the documentation.
On Tue, 5 Sep 2017, Martin Holmer wrote:
@martinholmer commented on this pull request.
____________________________________________________________________________
In taxcalc/dropq/dropq.py:
> @@ -1,6 +1,7 @@
"""
The dropq functions are used by TaxBrain to call Tax-Calculator in order
-to maintain the privacy of the micro data being used by TaxBrain.
+to maintain the privacy of the micro data being used by TaxBrain. This
+is done by adding random "fuzz" to the results in each table cell.
That isn't the best description of the method. can't we do better? Is it
any harder to say "by dropping 3 randomly selected taxable returns from
each table row"?
dan
…
You're right. The description of what "fuzzing" means needs to be improved.
Commit 537b1c0 is an attempt to improve the documentation.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the
thread.[AHvQVdXCIL9ULSVJJMlY4RR2H5m-zbtOks5sfZ_0gaJpZM4PIBfE.gif]
|
@feenberg said why can't we just say the dropq logic is
Because the dropq logic does not drop any returns from a table row. |
@MattHJensen and @feenberg, Both of you have raised concerns about the wording of the top docstring in the
If you have any remaining concerns, please raise them now. |
After one more review of pull request #1537, I have found myself wondering about one more question. As I understand it, the basic approach in the dropq logic is to fuzz (or obscure) reform results for three randomly selected filing units in each table row. That means that when constructing decile tables, 30 units (three in each decile) are selected for fuzzing. And when constructing WEBAPP_INCOME_BINS tables, 36 units (three in each of the twelve WEBAPP_INCOME_BINS) are selected for fuzzing. We can confirm this with the results from the following adhoc dump of a test using a PUF subsample.
Notice that the totals in the decile tables will not be exactly the same as the totals in the BINS tables, but that is to be expected. So far, I have no questions. But what about the aggregate (or fiscal totals) table, which is shown at the top of the TaxBrain "Static Results" page? One way to view that table is that it is a single bin table. If we adopt that view, then shouldn't we be fuzzing just 3 units? From a broader perspective, there appears to be four ways to construct the aggregate table amounts:
Looking at the code on the master branch, it uses the way described in 3. But why not use the way described in 2? That approach is a consistent application of the rule of fuzzing three records in each table row. Or, is fuzzing the aggregate results not required from a privacy perspective? If not required, then the way described in 1 would be just fine. So, my question is which of the four ways should be used in the aggregate table. @MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun |
Excellent.
…On Wed, 6 Sep 2017, Martin Holmer wrote:
@MattHJensen and @feenberg, Both of you have raised concerns about the
wording of the top docstring in the dropq.py file. Here is the latest
wording:
The dropq functions are used by TaxBrain to call Tax-Calculator in order
to maintain the privacy of the IRS-SOI PUF data being used by TaxBrain.
This is done by "fuzzing" reform results for several randomly selected
filing units in each table cell. The filing units randomly selected
differ for each policy reform and the "fuzzing" involves replacing the
post-reform tax results for the selected units with their pre-reform
tax results.
If you have any remaining concerns, please raise them now.
If part of the wording is incorrect or vague, I would appreciate your
suggestions for alternative wording.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the
thread.[AHvQVX7ykiYYw_vOtWP-RrBzPh2hVznAks5sfquegaJpZM4PIBfE.gif]
|
@martinholmer, I like the new description of what 'fuzzing' is. With regard to the liabilities table, I think we should fuzz 3 records from each cell rather than 30. |
@MattHJensen said in #1537:
OK, I'll make one final change to #1537 to fuzz just three records when computing the aggregate tables. |
On Wed, 6 Sep 2017, Martin Holmer wrote:
@MattHJensen said in #1537:
With regard to the [aggregate] liabilities table, I think we should fuzz 3 records from
each cell rather than 30.
OK, I'll make one final change to #1537 to fuzz just three records when computing the aggregate
tables.
Where did 30 come from? Is there something I should be aware of?
dan
…
@feenberg
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the
thread.[AHvQVeii6zbhsyHR33pyPrcvK11b15qJks5sfvMQgaJpZM4PIBfE.gif]
|
@feenberg asked:
We were dropping the same records that were dropped from the decile table, 3 for each cell. Here is @martinholmer's initial comment about the issue. |
Motivation
This pull request was developed after looking at the problems faced by @andersonfrailey in pull request #1521. The goal of that pull request was simple enough and his skills are substantial, but the complexity of the dropq difference table logic made the task in #1521 quite challenging. I looked at the issue and couldn't figure out how to reach the goal of #1521. My conclusion was that the dropq difference table requirements were not that complicated, but that the code was unnecessarily complex, and therefore, difficult to follow.
What's Been Done?
This pull request is an attempt to simplify the code that generates dropq difference tables. The main strategy is to revise the
create_difference_table
utility function code to be general enough to support difference table creation in thedropq_utils.py
file. This allows the removal of thedropq_diff_table
function.In the course of doing that, it was easy to add two features missing in the old dropq logic.
FIrst, it is now possible to create difference tables with an income measure that the TaxBrain difference table calls "Adjusted Income", which I assume is AGI. The "Adjusted Income" in TaxBrain button says "Not yet implemented" when the mouse is over the button. The high-level dropq logic in this pull request does not generate difference tables with AGI as the income measure, but doing so would now be straightforward.
Second, the dropq difference tables now include the tax difference expressed as a percent of baseline after-tax expanded income. However, TaxBrain would have to be revised to display that new difference statistic. This was the objective of pull request [WIP] Add After-Tax Income Percent Change column to dropq results #1521.
After doing this difference-table work, it seemed natural to continue and simplify the code that generates dropq distribution tables. The same basic strategy was pursued: generalize the
create_distribution_table
utility function and use it to replace thedropq_dist_table
function.Note this pull request includes all the changes in pull request #1534.
Consequences
After this pull request is merged, there will be fewer lines of code and hopefully less confusing and better documented dropq code.
The new dump option in dropq
test_run_tax_calc_model
has been used to check that all these code changes leave the results returned by the dropqrun_nth_year_tax_calc_model
function unchanged. The aggregate tax results are unchanged and all the baseline distribution table results are unchanged. However, there are some changes in the other table results. These changes occur in reforms where the policy reform causes the value ofexpanded_income
to change for some filing units. The old code did not use baselineexpanded_income
to assign filing units to bins in reform-distribution-table or difference-table construction, while the new code does do that per the discussion in issue #1540.Also, the new code standardizes DataFrame column names. So, now all six of the difference tables have the same column names and all four of the distribution tables have the same columns names. That was not always the case in the old code.
And finally, the aggregate table has been constructed by fuzzing just 3 filing unit records (instead of the 30 records).
So, to summarize the consequences of this pull request:
First, difference table results may differ from the results generated by the old code.
Second, distribution and difference table column names may differ from what they were in the old code.
Third, the aggregate tax totals for the reform and reform-baseline difference will be slightly more accurate.
These changes suggest that this pull request would trigger an API change, meaning the the next release would be the 0.11.0 version.
Subsequent Work
During the course of this work, it became clear that it wold be far less risky to return a dictionary of named results (rather than a unnamed tuple of results that depends on the order of the thirteen results). That change will be made in a subsequent pull request. That change would also constitute an API change, but one that would be easy for TaxBrain developers to handle according to this comment by @hdoupe.
Also, we are still not generating any distribution or difference tables using AGI as the income measure to assign filing units to decile or income bins. The TaxBrain GUI implies that this is a forthcoming feature. If that is still desirable, we can add AGI binning in a subsequent pull request.
@MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun @brittainhard