
Switch to use of new Python-generated puf.csv file #1429

Merged: 9 commits merged into PSLmodels:master on Jun 14, 2017
Conversation

@martinholmer (Collaborator) commented Jun 14, 2017

This pull request makes the changes in Tax-Calculator caused by the use of the new Python-generated puf.csv input file described in taxdata pull request 100.

@codecov-io commented Jun 14, 2017

Codecov Report

Merging #1429 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #1429      +/-   ##
==========================================
- Coverage   99.64%   99.64%   -0.01%     
==========================================
  Files          40       40              
  Lines        2788     2785       -3     
==========================================
- Hits         2778     2775       -3     
  Misses         10       10
Impacted Files Coverage Δ
taxcalc/records.py 98.78% <100%> (-0.02%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@martinholmer changed the title from "Use new Python-generated puf.csv file" to "Switch to use of new Python-generated puf.csv file" on Jun 14, 2017
@martinholmer (Collaborator, Author) commented

Tax-Calculator pull request #1429 is now being merged into the master branch.

You will need a new version of the puf.csv file in order to pass the requires_pufcsv tests and get the same taxcalc/comparison/reform_results.txt results.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun @codykallen
@PeterDSteinberg @brittainhard

@martinholmer martinholmer merged commit 0d30e13 into PSLmodels:master Jun 14, 2017
@martinholmer martinholmer deleted the new-pufcsv branch June 14, 2017 20:13
     random_state=rn_seed)
 rec_subsample = Records(data=subsample)
 calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
 adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
 # compare combined tax liability from full and sub samples for each year
 taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
-reltol = 0.01  # maximum allowed relative difference in tax liability
+reltol = 0.04  # maximum allowed relative difference in tax liability
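For context, this tolerance feeds np.allclose with atol=0.0, so it bounds the relative difference measured against the full-sample value. A minimal sketch with made-up numbers:

```python
import numpy as np

# with atol=0.0, np.allclose(a, b, rtol=r) passes iff |a - b| <= r * |b| elementwise
full = np.array([100.0, 200.0])   # full-sample liabilities (toy values)
sub = np.array([103.0, 206.0])    # sub-sample estimates, 3% high
print(np.allclose(sub, full, atol=0.0, rtol=0.04))  # True:  3% < 4%
print(np.allclose(sub, full, atol=0.0, rtol=0.01))  # False: 3% > 1%
```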
@MattHJensen (Contributor) commented Jun 15, 2017

@andersonfrailey, @hdoupe,

@martinholmer flagged this line change on a call today.

The normal differences between the 2% sample and the full sample rose sharply when we introduced the new file. This could be symptomatic of a problem, and it is certainly intriguing. Weights may be a source of interest -- Martin noticed that they are not used to define the 2% sample.

Could you investigate when you want a break from your other work?
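For reference, pandas DataFrame.sample accepts a weights argument (a column name or array of selection weights), which is what "using weights to define the sample" would mean here. A toy sketch, with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({"s006": [1.0, 2.0, 3.0, 4.0],
                   "tax":  [10.0, 20.0, 30.0, 40.0]})
# unweighted: every record is equally likely to be drawn
sub_uniform = df.sample(frac=0.5, random_state=0)
# weighted: selection probability proportional to the s006 weight
sub_weighted = df.sample(frac=0.5, random_state=0, weights=df["s006"])
print(len(sub_uniform), len(sub_weighted))  # 2 2
```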

@hdoupe (Collaborator)

@MattHJensen, I briefly looked into it. There's a weights parameter in the pandas sample function. I set that equal to full_sample.s006 and got worse results. I had to raise reltol to 0.1 in order to pass the test.

@MattHJensen (Contributor) commented Jun 16, 2017

What do you think caused us to need to increase reltol moving from the 2% sample to the full sample?

@hdoupe (Collaborator)

@MattHJensen, I'm not sure. I hoped that there was something going on in the test function. I looked at using the sample weights for the sub-sample selection and at running the test with different seeds. However, neither changed the results. With different seeds, the same 2% sample rate, the same 0.04 tolerance, and no weights, this test fails about 20% of the time. If I use the weights, this test fails every time, with a relative difference ranging from about 0.07 to 0.15. I haven't been able to figure out why using weights causes the difference to increase.

I still think that there could be something going on with the random seed. Perhaps we had a very small relative difference last time because we got lucky and drew a favorable sample.

Also, why do we expect the results from the full sample and 2% of the sample to be within 1% of each other?

I altered the function some to get these results:

# imports assumed from the test module (tests_path and puf_path are
# pytest fixtures defined in the repo's conftest.py)
import os
import difflib
import numpy as np
import pandas as pd
import pytest
from taxcalc import Policy, Records, Calculator
from taxcalc.utils import multiyear_diagnostic_table

@pytest.mark.requires_pufcsv
def test_agg(tests_path, puf_path):  # pylint: disable=redefined-outer-name
    """
    Test Tax-Calculator aggregate taxes with no policy reform using
    the full-sample puf.csv and a two-percent sub-sample of puf.csv
    """
    use_weights = False
    rn_seed = 0
    msgs = []
    n_tests = 20
    for _ in range(n_tests):
        msg = ""
        rn_seed = np.random.randint(0, 10000)  # draw a new sub-sample seed each iteration
        # pylint: disable=too-many-locals,too-many-statements
        nyrs = 10
        # create a Policy object (clp) containing current-law policy parameters
        clp = Policy()
        # create a Records object (rec) containing all puf.csv input records
        rec = Records(data=puf_path)
        # create a Calculator object using clp policy and puf records
        calc = Calculator(policy=clp, records=rec)
        calc_start_year = calc.current_year
        # create aggregate diagnostic table (adt) as a Pandas DataFrame object
        adt = multiyear_diagnostic_table(calc, nyrs)
        taxes_fullsample = adt.loc["Combined Liability ($b)"]
        # convert adt results to a string with a trailing EOL character
        adtstr = adt.to_string() + '\n'
        # generate differences between actual and expected results
        actual = adtstr.splitlines(True)
        aggres_path = os.path.join(tests_path, 'pufcsv_agg_expect.txt')
        with open(aggres_path, 'r') as expected_file:
            txt = expected_file.read()
        expected_results = txt.rstrip('\n\t ') + '\n'  # cleanup end of file txt
        expected = expected_results.splitlines(True)
        diff = difflib.unified_diff(expected, actual,
                                    fromfile='expected', tofile='actual', n=0)
        # convert diff generator into a list of lines:
        diff_lines = list()
        for line in diff:
            diff_lines.append(line)
        # test failure if there are any diff_lines
        if len(diff_lines) > 0:
            new_filename = '{}{}'.format(aggres_path[:-10], 'actual.txt')
            with open(new_filename, 'w') as new_file:
                new_file.write(adtstr)
            msg = 'PUFCSV AGG RESULTS DIFFER FOR FULL-SAMPLE\n'
            msg += '-------------------------------------------------\n'
            msg += '--- NEW RESULTS IN pufcsv_agg_actual.txt FILE ---\n'
            msg += '--- if new OK, copy pufcsv_agg_actual.txt to  ---\n'
            msg += '---                 pufcsv_agg_expect.txt     ---\n'
            msg += '---            and rerun test.                ---\n'
            msg += '-------------------------------------------------\n'
            raise ValueError(msg)
        # create aggregate diagnostic table using sub sample of records
        fullsample = pd.read_csv(puf_path)
        subfrac = 0.02  # sub-sample fraction
        if use_weights:  # normalize weights for sample
            fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
            weights = 'weights'
        else: # set weights to None and pandas uses uniform dist to select samp
            weights = None
        subsample = fullsample.sample(frac=subfrac,  # pylint: disable=no-member
                                      random_state=rn_seed,
                                      weights=weights)
        weight_prop = subsample.s006.sum()/fullsample.s006.sum()
        rec_subsample = Records(data=subsample)
        calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
        adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
        # compare combined tax liability from full and sub samples for each year
        taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
        reltol = 0.04  # maximum allowed relative difference in tax liability
        if not np.allclose(taxes_subsample, taxes_fullsample,
                           atol=0.0, rtol=reltol):
            msg = 'PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE\n'
            msg += 'WHEN subfrac = {:.3f} and reltol = {:.4f} '
            msg += 'and weight prop = {:.4f} and seed = {:.0f}\n'
            msg = msg.format(subfrac, reltol, weight_prop, rn_seed)
            it_sub = np.nditer(taxes_subsample, flags=['f_index'])
            it_all = np.nditer(taxes_fullsample, flags=['f_index'])
            while not it_sub.finished:
                cyr = it_sub.index + calc_start_year
                tax_sub = float(it_sub[0])
                tax_all = float(it_all[0])
                reldiff = abs(tax_sub - tax_all) / abs(tax_all)
                if reldiff > reltol:
                    msgstr = ' year,sub,full,reldif= {}\t{:.2f}\t{:.2f}\t{:.4f}\n'
                    msg += msgstr.format(cyr, tax_sub, tax_all, reldiff)
                it_sub.iternext()
                it_all.iternext()

        if len(msg) > 0:
            msgs.append(msg)
    if len(msgs) > 0:
        fail_rate = (len(msgs) / float(n_tests)) * 100
        fail_msg = "Failure rate: {:.2f}%".format(fail_rate)
        raise ValueError('\n'.join([fail_msg] + msgs))

@hdoupe (Collaborator)

@MattHJensen I ran the test function from my previous comment on the previous data set and taxcalc 0.8.5. I got similar results to the ones produced with the newest data set and taxcalc 0.9.0:

With different seeds, the same 2% sample rate, the same 0.04 tolerance and no weights, this test fails about 20% of the time. If I use the weights, this test fails every time with a relative difference ranging from about 0.07 to 0.15.

It seems like the seed that was chosen in the first test function yielded a sample that produced results very similar to those produced from the full data set. I'm still not sure why using a weighted sample results in a larger relative difference than an unweighted sample.
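One possible explanation (a hypothesis, not something established in this thread): drawing records with probability proportional to s006 while still scaling every retained weight by 1/frac over-counts heavy records, because the inverse-selection-probability correction is missing. A toy Monte Carlo on synthetic data (unrelated to the actual puf) sketches the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n, frac = 20000, 0.02
m = int(n * frac)
s006 = rng.lognormal(sigma=1.5, size=n)  # skewed synthetic weights
tax = rng.lognormal(sigma=1.0, size=n)   # synthetic per-record tax
true_total = float(np.sum(s006 * tax))

def naive_estimate(weighted):
    # draw a sub-sample, then scale retained weights by 1/frac
    # (mirroring the uniform rescaling in Records.__init__)
    p = s006 / s006.sum() if weighted else None
    idx = rng.choice(n, size=m, replace=False, p=p)
    return float(np.sum(s006[idx] / frac * tax[idx]))

uni = np.mean([naive_estimate(False) for _ in range(100)]) / true_total
wtd = np.mean([naive_estimate(True) for _ in range(100)]) / true_total
# uni lands near 1.0 (unbiased); wtd lands well above 1.0 (biased high)
```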

@MattHJensen (Contributor) commented Jun 19, 2017

Thanks @hdoupe.

Also, why do we expect the results from the full sample and 2% of the sample to be within 1% of each other?

We do not expect it, but it was nice that they were so close and interesting when the gap expanded so much.

It seems like the seed that was chosen in the first test function yielded a sample that produced results very similar to those produced from the full data set.

Got it. That makes sense. It sounds like we were lucky before. Do you think we should specify a seed that generally produces small differences?

@hdoupe (Collaborator)

Thanks @MattHJensen, that makes sense.

It does seem like we got lucky before. For the purpose of this test, it seems like setting a random seed that generally produces small differences works well enough. Then, every time a new data set is released we would update the tolerance and the seed.

If we chose a different random seed every time, then there would be cases where an unlucky sample is drawn and the test would fail even though nothing is wrong. This problem could be solved by running several iterations and passing the test as long as x% of the iterations pass, but that may be overkill for this problem. I rewrote the test function to do this, and it takes about a minute and a half to run versus 45 seconds for the original test function.

@pytest.mark.requires_pufcsv
def test_agg(tests_path, puf_path):  # pylint: disable=redefined-outer-name
    """
    Test Tax-Calculator aggregate taxes with no policy reform using
    the full-sample puf.csv and a two-percent sub-sample of puf.csv
    """
    # pylint: disable=too-many-locals,too-many-statements
    puf = pd.read_csv(puf_path)
    
    ...    
    
    # create aggregate diagnostic table using sub sample of records
    fullsample = puf.copy()
    use_weights = False
    rn_seed = 0
    msgs = []
    n_tests = 50
    for _ in range(n_tests):
        msg = ""
        rn_seed = np.random.randint(0, 10000)  # draw a new sub-sample seed each iteration
        subfrac = 0.02  # sub-sample fraction
        if use_weights:  # normalize weights for sample
            fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
            weights = 'weights'
        else: # set weights to None and pandas uses uniform dist to select samp
            weights = None
        subsample = fullsample.sample(frac=subfrac,  # pylint: disable=no-member
                                      random_state=rn_seed,
                                      weights=weights,
                                      axis=0)
        weight_prop = subsample.s006.sum()/fullsample.s006.sum()
        rec_subsample = Records(data=subsample)
        calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
        adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
        # compare combined tax liability from full and sub samples for each year
        taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
        reltol = 0.04  # maximum allowed relative difference in tax liability
        if not np.allclose(taxes_subsample, taxes_fullsample,
                           atol=0.0, rtol=reltol):
            msg = 'PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE\n'
            msg += 'WHEN subfrac = {:.3f} and reltol = {:.4f} '
            msg += 'and weight prop = {:.4f} and seed = {:.0f}\n'
            msg = msg.format(subfrac, reltol, weight_prop, rn_seed)
            it_sub = np.nditer(taxes_subsample, flags=['f_index'])
            it_all = np.nditer(taxes_fullsample, flags=['f_index'])
            while not it_sub.finished:
                cyr = it_sub.index + calc_start_year
                tax_sub = float(it_sub[0])
                tax_all = float(it_all[0])
                reldiff = abs(tax_sub - tax_all) / abs(tax_all)
                if reldiff > reltol:
                    msgstr = ' year,sub,full,reldif= {}\t{:.2f}\t{:.2f}\t{:.4f}\n'
                    msg += msgstr.format(cyr, tax_sub, tax_all, reldiff)
                it_sub.iternext()
                it_all.iternext()

        if len(msg) > 0:
            msgs.append(msg)
    fail_rate = (len(msgs)/float(n_tests))*100
    if fail_rate > 30.0:  # fail_rate is a percentage; tolerate up to a 30% failure rate
        fail_msg = "Failure rate: {:.2f}%".format(fail_rate)
        raise ValueError('\n'.join([fail_msg] + msgs))

@MattHJensen (Contributor)

@hdoupe, that was fun reading, but probably overkill as you say. The purpose of having a "generally friendly" seed for a particular dataset would be to tell TaxBrain to always use that seed when doing a "Quick Calculation".


It's not a huge deal, though, as our users should expect a mismatch between the 2% and full sample anyway, and it sounds like identifying the seed for every new taxpuf could be burdensome. Unless @hdoupe or someone else objects, I think we should close this issue and accept the wider tolerance. Thanks to @hdoupe's detective work, we know that nothing has fundamentally changed with the new dataset; we were just luckier before with the old seed.

@hdoupe (Collaborator)

@MattHJensen said

The purpose of having a "generally friendly" seed for a particular dataset would be to tell TaxBrain to always use that seed when doing a "Quick Calculation".

Ok, that makes sense. Thanks for reviewing.

@martinholmer (Collaborator, Author) commented

I'm in agreement with what I've read in the #1429 conversation between @hdoupe and @MattHJensen about sampling the puf.csv input file. But several more things have to happen to implement what appears to be the consensus sampling plan.

(1) Need to find a random-number seed that allows the 2% sample to produce results within 1 percent of the full-sample results. The only way I know to do that is hunt and peck for the "magic" seed. If a broad search doesn't produce a "magic" seed, then rethink the whole sampling scheme. For example, where did the 2% sampling rate come from? Is there any scientific basis for 2% (rather than a bigger percent)?

(2) Once a suitable seed is found in (1), then it must be used in TaxBrain. Does anybody know where the webapp-public repo code sets the value of the sampling seed?

(3) And finally, the TaxBrain web interface probably needs to say something about how the sampling results compare with the full-sample results. Otherwise, people have no idea about what to expect.
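The hunt-and-peck search in (1) could be sketched as follows; find_friendly_seed and make_taxes are hypothetical names invented for illustration, not part of Tax-Calculator:

```python
import numpy as np
import pandas as pd

def find_friendly_seed(fullsample, taxes_fullsample, make_taxes,
                       subfrac=0.02, reltol=0.01, max_tries=1000):
    """Search for a seed whose sub-sample reproduces the full-sample
    liabilities within reltol; make_taxes is assumed to map a sub-sample
    DataFrame to a per-year tax Series (hypothetical helper)."""
    for seed in range(max_tries):
        subsample = fullsample.sample(frac=subfrac, random_state=seed)
        if np.allclose(make_taxes(subsample), taxes_fullsample,
                       atol=0.0, rtol=reltol):
            return seed
    return None  # no "magic" seed found; rethink the sampling scheme
```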

@feenberg (Contributor) commented Jun 20, 2017 via email

@martinholmer (Collaborator, Author) commented

@hdoupe and @MattHJensen, I think there may be a bug in the subsample test being discussed in #1429. Consider this code:

# create aggregate diagnostic table using sub sample of records
fullsample = pd.read_csv(puf_path)
subfrac = 0.02  # sub-sample fraction
if use_weights:  # normalize weights for sample
    fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
    weights = 'weights'
else:  # set weights to None and pandas uses uniform dist to select samp
    weights = None
subsample = fullsample.sample(frac=subfrac,  # pylint: disable=no-member
                              random_state=rn_seed,
                              weights=weights)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)

The Records constructor is called with data=subsample as its only argument.
Looking at the Records.__init__ code suggests that the weights will be read in from the puf_weights.csv file, which does not reflect the weight_prop value.

Maybe I'm confused, but it seems to me this issue should be explored more thoroughly in its own issue and pull request.

Unless I'm missing something, it seems as if the sampling-adjusted weights need to somehow be put into the Records object before the Calculator object is constructed and used.

@hdoupe (Collaborator) commented Jun 20, 2017

@martinholmer said

Looking at the Records.__init__ code suggests that the weights will be read in from the puf_weights.csv file, which does not reflect the weight_prop value.

I created weight_prop in an effort to figure out why the weighted sample produced more different results in terms of relative difference than the unweighted sample.

I'm looking at this section from Records.__init__

# weights must be same size as tax record data
if not self.WT.empty and self.dim != len(self.WT):
    frac = float(self.dim) / len(self.WT)
    self.WT = self.WT.iloc[self.index]
    self.WT = self.WT / frac

where self.dim = subsample size and len(self.WT) = full sample size

To me, it looks like the constructor handles this scenario just fine. Does this clear things up for you?
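The quoted rescaling can be checked on toy numbers (the column name and values here are invented for illustration): each retained weight is divided by the sampling fraction, so a uniform 2% sample preserves the weighted total in expectation.

```python
import pandas as pd

WT = pd.DataFrame({"WT2017": [100.0, 200.0, 300.0, 400.0]})  # toy full-sample weights
index = [1, 3]                   # positions of the retained sub-sample records
dim = len(index)                 # sub-sample size, as in self.dim
frac = float(dim) / len(WT)      # fraction retained: 0.5
WT_sub = WT.iloc[index] / frac   # each retained weight scaled up by 1/frac
print(WT_sub["WT2017"].tolist())  # [400.0, 800.0]
```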

@martinholmer (Collaborator, Author) commented Jun 20, 2017 via email
