Switch to use of new Python-generated puf.csv file #1429
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #1429      +/-  ##
==========================================
- Coverage   99.64%   99.64%   -0.01%
==========================================
  Files          40       40
  Lines        2788     2785       -3
==========================================
- Hits         2778     2775       -3
  Misses         10       10

Continue to review full report at Codecov.
Tax-Calculator pull request #1429 is now being merged into the master branch. You will need a new version of the puf.csv input file. @MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun @codykallen
                                   random_state=rn_seed)
     rec_subsample = Records(data=subsample)
     calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
     adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
     # compare combined tax liability from full and sub samples for each year
     taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
-    reltol = 0.01  # maximum allowed relative difference in tax liability
+    reltol = 0.04  # maximum allowed relative difference in tax liability
@martinholmer flagged this line change on a call today.
The normal differences between the 2% sample and the full sample rose sharply when we introduced the new file. This could be symptomatic of a problem, and it is certainly intriguing. Weights may be a source of interest -- Martin noticed that they are not used to define the 2% sample.
Could you investigate when you want a break from your other work?
@MattHJensen, I briefly looked into it. There's a weights parameter in the pandas sample function. I set it equal to full_sample.s006 and got worse results: I had to raise reltol to 0.1 in order to pass the test.
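For concreteness, here is a minimal sketch of the weighted draw being described (the variable names mirror the test code quoted below; the file location and seed value are arbitrary assumptions):

import pandas as pd

full_sample = pd.read_csv('puf.csv')  # assumes puf.csv is available locally
rn_seed = 80  # arbitrary illustrative seed
# passing a weights Series makes pandas draw rows with probability
# proportional to s006 instead of uniformly
subsample = full_sample.sample(frac=0.02,
                               random_state=rn_seed,
                               weights=full_sample.s006)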
What do you think caused us to need to increase reltol moving from the 2% sample to the full sample?
@MattHJensen, I'm not sure. I hoped that there was something going on in the test function. I looked at using the sample weights for the sub-sample selection and at running the test with different seeds. However, neither changed the results. With different seeds, the same 2% sample rate, the same 0.04 tolerance, and no weights, this test fails about 20% of the time. If I use the weights, this test fails every time, with a relative difference ranging from about 0.07 to 0.15. I haven't been able to figure out why using weights causes the difference to increase.
I still think that there could be something going on with the random seed. Perhaps we had a very small relative difference last time because we got lucky and drew a favorable sample.
Also, why do we expect the results from the full sample and 2% of the sample to be within 1% of each other?
I altered the function some to get these results:
@pytest.mark.requires_pufcsv
def test_agg(tests_path, puf_path): # pylint: disable=redefined-outer-name
"""
Test Tax-Calculator aggregate taxes with no policy reform using
the full-sample puf.csv and a two-percent sub-sample of puf.csv
"""
use_weights = False
rn_seed = 0
msgs = []
n_tests = 20
for i in range(0, n_tests):
msg = ""
        rn_seed = np.random.randint(0, 10000)  # draw a new sub-sample seed on each iteration
# pylint: disable=too-many-locals,too-many-statements
nyrs = 10
# create a Policy object (clp) containing current-law policy parameters
clp = Policy()
# create a Records object (rec) containing all puf.csv input records
rec = Records(data=puf_path)
# create a Calculator object using clp policy and puf records
calc = Calculator(policy=clp, records=rec)
calc_start_year = calc.current_year
# create aggregate diagnostic table (adt) as a Pandas DataFrame object
adt = multiyear_diagnostic_table(calc, nyrs)
taxes_fullsample = adt.loc["Combined Liability ($b)"]
# convert adt results to a string with a trailing EOL character
adtstr = adt.to_string() + '\n'
# generate differences between actual and expected results
actual = adtstr.splitlines(True)
aggres_path = os.path.join(tests_path, 'pufcsv_agg_expect.txt')
with open(aggres_path, 'r') as expected_file:
txt = expected_file.read()
expected_results = txt.rstrip('\n\t ') + '\n' # cleanup end of file txt
expected = expected_results.splitlines(True)
diff = difflib.unified_diff(expected, actual,
fromfile='expected', tofile='actual', n=0)
# convert diff generator into a list of lines:
diff_lines = list()
for line in diff:
diff_lines.append(line)
# test failure if there are any diff_lines
if len(diff_lines) > 0:
new_filename = '{}{}'.format(aggres_path[:-10], 'actual.txt')
with open(new_filename, 'w') as new_file:
new_file.write(adtstr)
msg = 'PUFCSV AGG RESULTS DIFFER FOR FULL-SAMPLE\n'
msg += '-------------------------------------------------\n'
msg += '--- NEW RESULTS IN pufcsv_agg_actual.txt FILE ---\n'
msg += '--- if new OK, copy pufcsv_agg_actual.txt to ---\n'
msg += '--- pufcsv_agg_expect.txt ---\n'
msg += '--- and rerun test. ---\n'
msg += '-------------------------------------------------\n'
raise ValueError(msg)
# create aggregate diagnostic table using sub sample of records
fullsample = pd.read_csv(puf_path)
subfrac = 0.02 # sub-sample fraction
if use_weights: # normalize weights for sample
fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
weights = 'weights'
        else:  # with weights=None, pandas uses a uniform distribution to select the sample
weights = None
subsample = fullsample.sample(frac=subfrac, # pylint: disable=no-member
random_state=rn_seed,
weights=weights)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
# compare combined tax liability from full and sub samples for each year
taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
reltol = 0.04 # maximum allowed relative difference in tax liability
if not np.allclose(taxes_subsample, taxes_fullsample,
atol=0.0, rtol=reltol):
msg = 'PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE\n'
msg += 'WHEN subfrac = {:.3f} and reltol = {:.4f} '
msg += 'and weight prop = {:.4f} and seed = {:.0f}\n'
msg = msg.format(subfrac, reltol, weight_prop, rn_seed)
it_sub = np.nditer(taxes_subsample, flags=['f_index'])
it_all = np.nditer(taxes_fullsample, flags=['f_index'])
while not it_sub.finished:
cyr = it_sub.index + calc_start_year
tax_sub = float(it_sub[0])
tax_all = float(it_all[0])
reldiff = abs(tax_sub - tax_all) / abs(tax_all)
if reldiff > reltol:
msgstr = ' year,sub,full,reldif= {}\t{:.2f}\t{:.2f}\t{:.4f}\n'
msg += msgstr.format(cyr, tax_sub, tax_all, reldiff)
it_sub.iternext()
it_all.iternext()
if len(msg) > 0:
msgs.append(msg)
if len(msgs) > 0:
acc = (len(msgs)/float(n_tests))*100
acc_msg = "Failure rate: {:.2f}%".format(acc)
raise ValueError('\n'.join([acc_msg] + msgs))
@MattHJensen I ran the test function from my previous comment on the previous data set and taxcalc 0.8.5. I got similar results to the ones produced with the newest data set and taxcalc 0.9.0:
With different seeds, the same 2% sample rate, the same 0.04 tolerance and no weights, this test fails about 20% of the time. If I use the weights, this test fails every time with a relative difference ranging from about 0.07 to 0.15.
It seems like the seed that was chosen in the first test function yielded a sample that produced results very similar to those produced from the full data set. I'm still not sure why using a weighted sample results in a larger relative difference than an unweighted sample.
Thanks @hdoupe.
Also, why do we expect the results from the full sample and 2% of the sample to be within 1% of each other?
We do not expect it, but it was nice that they were so close, and it was interesting when the gap expanded so much.
It seems like the seed that was chosen in the first test function yielded a sample that produced results very similar to those produced from the full data set.
Got it. That makes sense. It sounds like we were lucky before. Do you think we should specify a seed that generally produces small differences?
Thanks @MattHJensen, that makes sense.
It does seem like we got lucky before. For the purposes of this test, setting a random seed that generally produces small differences works well enough. Then, every time a new data set is released, we would update the tolerance and the seed.
If we chose a different random seed every time, there would be cases where an unlucky sample is drawn and the test fails even though nothing is wrong. This could be solved by running several iterations and passing the test as long as x% of the iterations pass. But that may be overkill for this problem. I rewrote the test function to do this, and it takes about a minute and a half to run versus 45 seconds for the original test function:
@pytest.mark.requires_pufcsv
def test_agg(tests_path, puf_path): # pylint: disable=redefined-outer-name
"""
Test Tax-Calculator aggregate taxes with no policy reform using
the full-sample puf.csv and a two-percent sub-sample of puf.csv
"""
# pylint: disable=too-many-locals,too-many-statements
puf = pd.read_csv(puf_path)
...
# create aggregate diagnostic table using sub sample of records
fullsample = puf.copy()
use_weights = False
rn_seed = 0
msgs = []
n_tests = 50
for i in range(0, n_tests):
msg = ""
        rn_seed = np.random.randint(0, 10000)  # draw a new sub-sample seed on each iteration
subfrac = 0.02 # sub-sample fraction
if use_weights: # normalize weights for sample
fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
weights = 'weights'
else: # set weights to None and pandas uses uniform dist to select samp
weights = None
subsample = fullsample.sample(frac=subfrac, # pylint: disable=no-member
random_state=rn_seed,
weights=weights,
axis=0)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
# compare combined tax liability from full and sub samples for each year
taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
reltol = 0.04 # maximum allowed relative difference in tax liability
if not np.allclose(taxes_subsample, taxes_fullsample,
atol=0.0, rtol=reltol):
msg = 'PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE\n'
msg += 'WHEN subfrac = {:.3f} and reltol = {:.4f} '
msg += 'and weight prop = {:.4f} and seed = {:.0f}\n'
msg = msg.format(subfrac, reltol, weight_prop, rn_seed)
it_sub = np.nditer(taxes_subsample, flags=['f_index'])
it_all = np.nditer(taxes_fullsample, flags=['f_index'])
while not it_sub.finished:
cyr = it_sub.index + calc_start_year
tax_sub = float(it_sub[0])
tax_all = float(it_all[0])
reldiff = abs(tax_sub - tax_all) / abs(tax_all)
if reldiff > reltol:
msgstr = ' year,sub,full,reldif= {}\t{:.2f}\t{:.2f}\t{:.4f}\n'
msg += msgstr.format(cyr, tax_sub, tax_all, reldiff)
it_sub.iternext()
it_all.iternext()
if len(msg) > 0:
msgs.append(msg)
    fail_rate = (len(msgs) / float(n_tests)) * 100
    if fail_rate > 30.0:  # fail_rate is a percentage, so allow up to a 30% failure rate
        fail_msg = "Failure rate: {:.2f}%".format(fail_rate)
        raise ValueError('\n'.join([fail_msg] + msgs))
@hdoupe, that was fun reading, but probably overkill as you say. The purpose of having a "generally friendly" seed for a particular dataset would be to tell TaxBrain to always use that seed when doing a "Quick Calculation".
It's not a huge deal, though, as our users should expect a mismatch between the 2% and full samples anyway, and it sounds like identifying the seed for every new taxpuf could be burdensome. Unless @hdoupe or someone else objects, I think we should close this issue and accept the wider tolerance. Thanks to @hdoupe's detective work, we know that nothing has fundamentally changed with the new dataset; we were just luckier before with the old seed.
@MattHJensen said
The purpose of having a "generally friendly" seed for a particular dataset would be to tell TaxBrain to always use that seed when doing a "Quick Calculation".
Ok, that makes sense. Thanks for reviewing.
I'm in agreement with what I've read in the #1429 conversation between @hdoupe and @MattHJensen about sampling the puf.csv input file. But several more things have to happen to implement what appears to be the consensus sampling plan.
(1) Need to find a random number seed that allows the 2% sample to produce results within 1 percent of the full-sample results. The only way I know to do that is hunt and peck for the "magic" seed. If a broad search doesn't produce a "magic" seed, then rethink the whole sampling scheme. For example, where did the 2% sampling rate come from? Is there any scientific basis for 2% (rather than a bigger percent)?
(2) Once a suitable seed is found in (1), then it must be used in TaxBrain. Does anybody know where the webapp-public repo code sets the value of the sampling seed?
(3) And finally, the TaxBrain web interface probably needs to say something about how the sampling results compare with the full-sample results. Otherwise, people have no idea about what to expect.
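A scripted version of the hunt-and-peck search in (1) might look like the following sketch (not in the repo); it assumes the fullsample DataFrame, the taxes_fullsample series, and the taxcalc objects from the test function quoted earlier in this thread:

import numpy as np

good_seeds = []
for seed in range(1000):  # the search range is arbitrary
    subsample = fullsample.sample(frac=0.02, random_state=seed)
    calc = Calculator(policy=Policy(), records=Records(data=subsample))
    adt = multiyear_diagnostic_table(calc, num_years=10)
    taxes_sub = adt.loc["Combined Liability ($b)"]
    # keep seeds whose sub-sample stays within 1% of full-sample results every year
    if np.allclose(taxes_sub, taxes_fullsample, atol=0.0, rtol=0.01):
        good_seeds.append(seed)
print(good_seeds)  # candidate "magic" seeds, if any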
On Mon, 19 Jun 2017, Martin Holmer wrote:
I'm in agreement with what I've read in the #1429 conversation between @hdoupe and
@MattHJensen about sampling the puf.csv input file. But several more things have to
happen to implement what appears to be the consensus sampling plan.
(1) Need to find a random number seed that allows the 2% sample to produce results
within 1 percent of the full-sample results. The only way I know to do that is hunt
and peck for the "magic" seed. If a broad search doesn't produce a "magic" seed, then
rethink the whole sampling scheme. For example, where did the 2% sampling rate come
from? Is there any scientific basis for 2% (rather than a bigger percent)?
You could probably match one aggregate, or even two, but probably not more
than that. The 2% was chosen to get near-instant results in Fortran. Going
to an 8% sample would halve the typical error but quadruple the time.
dan
(2) Once a suitable seed is found in (1), then it must be used in TaxBrain. Does
anybody know where the webapp-public repo code sets the value of the sampling seed?
(3) And finally, the TaxBrain web interface probably needs to say something about how
the sampling results compare with the full-sample results. Otherwise, people have no
idea about what to expect.
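dan's scaling claim above, that quadrupling the sample from 2% to 8% roughly halves the typical error, follows from sampling error shrinking as 1/sqrt(n). A quick simulation (not from the thread; a made-up skewed population stands in for tax liabilities) illustrates the point:

import numpy as np

rng = np.random.RandomState(0)
population = rng.lognormal(mean=10.0, sigma=1.5, size=200_000)
true_total = population.sum()

def typical_relative_error(frac, n_draws=100):
    """Mean |estimate - true total| / true total over repeated sub-samples."""
    n = int(frac * population.size)
    errs = []
    for _ in range(n_draws):
        sample = rng.choice(population, size=n, replace=False)
        estimate = sample.sum() / frac  # scale sub-sample total up to the population
        errs.append(abs(estimate - true_total) / true_total)
    return np.mean(errs)

print(typical_relative_error(0.02))  # typical error of a 2% sample
print(typical_relative_error(0.08))  # roughly half as large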
@hdoupe and @MattHJensen, I think there may be a bug in the subsample test being discussed in #1429. Consider this code (presumably the sub-sample construction from the test function above):
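# (reconstructed from the test function quoted earlier in this thread)
subsample = fullsample.sample(frac=subfrac,  # subfrac = 0.02
                              random_state=rn_seed,
                              weights=weights)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)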
The Records constructor is called with data=subsample. Looking at the Records.__init__ code suggests that the weights will be read in from the puf_weights.csv file, which does not reflect the weight_prop value. Maybe I'm confused, but it seems to me this issue should be explored more thoroughly in its own issue and pull request. Unless I'm missing something, it seems as if the sampling-adjusted weights need to somehow be put into the Records object before the Calculator object is constructed and used.
@martinholmer said
Looking at the Records.__init__ code suggests that the weights will be read in from the puf_weights.csv file, which does not reflect the weight_prop value.
I created weight_prop in an effort to figure out why the weighted sample produced results with a larger relative difference than the unweighted sample.
I'm looking at this section from Records.__init__:
# weights must be same size as tax record data
if not self.WT.empty and self.dim != len(self.WT):
    frac = float(self.dim) / len(self.WT)
    self.WT = self.WT.iloc[self.index]
    self.WT = self.WT / frac
where self.dim is the sub-sample size and len(self.WT) is the full-sample size.
To me, it looks like the constructor handles this scenario just fine. Does this clear things up for you?
@martinholmer replied:
Above code would seem to work only if doing unweighted sampling. Or am I missing something else?
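One way to see @martinholmer's concern, as a toy illustration (not from the thread): weight-proportional sampling over-represents high-weight records, so rescaling every selected weight by 1/frac, as the constructor snippet above does, double-counts them and overstates weighted totals.

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'s006': rng.uniform(1, 100, size=100_000)})
frac = 0.02

uniform = df.sample(frac=frac, random_state=0)
weighted = df.sample(frac=frac, random_state=0, weights=df.s006)

# ratio of the 1/frac-rescaled sample weight total to the true weight total
print(uniform.s006.sum() / frac / df.s006.sum())   # close to 1.0
print(weighted.s006.sum() / frac / df.s006.sum())  # noticeably above 1.0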
This pull request makes the changes in Tax-Calculator caused by the use of the new Python-generated puf.csv input file described in taxdata pull request 100.