Switch to use of new Python-generated puf.csv file #1429
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #1429      +/-  ##
==========================================
- Coverage   99.64%   99.64%   -0.01%
==========================================
  Files          40       40
  Lines        2788     2785       -3
==========================================
- Hits         2778     2775       -3
  Misses         10       10

Continue to review full report at Codecov.
Tax-Calculator pull request #1429 is now being merged into the master branch. You will need a new version of the puf.csv input file. @MattHJensen @feenberg @Amy-Xu @andersonfrailey @hdoupe @GoFroggyRun @codykallen
                                   random_state=rn_seed)
     rec_subsample = Records(data=subsample)
     calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
     adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
     # compare combined tax liability from full and sub samples for each year
     taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
-    reltol = 0.01  # maximum allowed relative difference in tax liability
+    reltol = 0.04  # maximum allowed relative difference in tax liability
@martinholmer flagged this line change on a call today.
The normal differences between the 2% sample and the full sample rose sharply when we introduced the new file. This could be symptomatic of a problem, and it is certainly intriguing. Weights may be a source of interest -- Martin noticed that they are not used to define the 2% sample.
Could you investigate when you want a break from your other work?
@MattHJensen, I briefly looked into it. There's a weights parameter in the pandas sample function. I set it equal to full_sample.s006 and got worse results: I had to raise reltol to 0.1 in order to pass the test.
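For concreteness, here is a minimal sketch of the weighted draw being described (the variable names mirror the test code quoted below; the file location and seed value are arbitrary assumptions):

import pandas as pd

full_sample = pd.read_csv('puf.csv')  # assumes puf.csv is available locally
rn_seed = 80  # arbitrary illustrative seed
# passing a weights Series makes pandas draw rows with probability
# proportional to s006 instead of uniformly
subsample = full_sample.sample(frac=0.02,
                               random_state=rn_seed,
                               weights=full_sample.s006)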
What do you think caused us to need to increase reltol moving from the 2% sample to the full sample?
@MattHJensen, I'm not sure. I hoped that there was something going on in the test function. I looked at using the sample weights for the sub-sample selection and at running the test with different seeds. However, neither changed the results. With different seeds, the same 2% sample rate, the same 0.04 tolerance, and no weights, this test fails about 20% of the time. If I use the weights, this test fails every time, with a relative difference ranging from about 0.07 to 0.15. I haven't been able to figure out why using weights causes the difference to increase.
I still think that there could be something going on with the random seed. Perhaps we had a very small relative difference last time because we got lucky and drew a favorable sample.
Also, why do we expect the results from the full sample and 2% of the sample to be within 1% of each other?
I altered the function some to get these results:
@pytest.mark.requires_pufcsv
def test_agg(tests_path, puf_path): # pylint: disable=redefined-outer-name
"""
Test Tax-Calculator aggregate taxes with no policy reform using
the full-sample puf.csv and a two-percent sub-sample of puf.csv
"""
use_weights = False
rn_seed = 0
msgs = []
n_tests = 20
for i in range(0, n_tests):
msg = ""
        rn_seed = np.random.randint(0, 10000)  # draw a new sub-sample seed on each iteration
# pylint: disable=too-many-locals,too-many-statements
nyrs = 10
# create a Policy object (clp) containing current-law policy parameters
clp = Policy()
# create a Records object (rec) containing all puf.csv input records
rec = Records(data=puf_path)
# create a Calculator object using clp policy and puf records
calc = Calculator(policy=clp, records=rec)
calc_start_year = calc.current_year
# create aggregate diagnostic table (adt) as a Pandas DataFrame object
adt = multiyear_diagnostic_table(calc, nyrs)
taxes_fullsample = adt.loc["Combined Liability ($b)"]
# convert adt results to a string with a trailing EOL character
adtstr = adt.to_string() + '\n'
# generate differences between actual and expected results
actual = adtstr.splitlines(True)
aggres_path = os.path.join(tests_path, 'pufcsv_agg_expect.txt')
with open(aggres_path, 'r') as expected_file:
txt = expected_file.read()
expected_results = txt.rstrip('\n\t ') + '\n' # cleanup end of file txt
expected = expected_results.splitlines(True)
diff = difflib.unified_diff(expected, actual,
fromfile='expected', tofile='actual', n=0)
# convert diff generator into a list of lines:
diff_lines = list()
for line in diff:
diff_lines.append(line)
# test failure if there are any diff_lines
if len(diff_lines) > 0:
new_filename = '{}{}'.format(aggres_path[:-10], 'actual.txt')
with open(new_filename, 'w') as new_file:
new_file.write(adtstr)
msg = 'PUFCSV AGG RESULTS DIFFER FOR FULL-SAMPLE\n'
msg += '-------------------------------------------------\n'
msg += '--- NEW RESULTS IN pufcsv_agg_actual.txt FILE ---\n'
msg += '--- if new OK, copy pufcsv_agg_actual.txt to ---\n'
msg += '--- pufcsv_agg_expect.txt ---\n'
msg += '--- and rerun test. ---\n'
msg += '-------------------------------------------------\n'
raise ValueError(msg)
# create aggregate diagnostic table using sub sample of records
fullsample = pd.read_csv(puf_path)
subfrac = 0.02 # sub-sample fraction
if use_weights: # normalize weights for sample
fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
weights = 'weights'
        else:  # with weights=None, pandas uses a uniform distribution to select the sample
weights = None
subsample = fullsample.sample(frac=subfrac, # pylint: disable=no-member
random_state=rn_seed,
weights=weights)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
# compare combined tax liability from full and sub samples for each year
taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
reltol = 0.04 # maximum allowed relative difference in tax liability
if not np.allclose(taxes_subsample, taxes_fullsample,
atol=0.0, rtol=reltol):
msg = 'PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE\n'
msg += 'WHEN subfrac = {:.3f} and reltol = {:.4f} '
msg += 'and weight prop = {:.4f} and seed = {:.0f}\n'
msg = msg.format(subfrac, reltol, weight_prop, rn_seed)
it_sub = np.nditer(taxes_subsample, flags=['f_index'])
it_all = np.nditer(taxes_fullsample, flags=['f_index'])
while not it_sub.finished:
cyr = it_sub.index + calc_start_year
tax_sub = float(it_sub[0])
tax_all = float(it_all[0])
reldiff = abs(tax_sub - tax_all) / abs(tax_all)
if reldiff > reltol:
msgstr = ' year,sub,full,reldif= {}\t{:.2f}\t{:.2f}\t{:.4f}\n'
msg += msgstr.format(cyr, tax_sub, tax_all, reldiff)
it_sub.iternext()
it_all.iternext()
if len(msg) > 0:
msgs.append(msg)
if len(msgs) > 0:
acc = (len(msgs)/float(n_tests))*100
acc_msg = "Failure rate: {:.2f}%".format(acc)
raise ValueError('\n'.join([acc_msg] + msgs))
@MattHJensen I ran the test function from my previous comment on the previous data set and taxcalc 0.8.5. I got similar results to the ones produced with the newest data set and taxcalc 0.9.0:
With different seeds, the same 2% sample rate, the same 0.04 tolerance and no weights, this test fails about 20% of the time. If I use the weights, this test fails every time with a relative difference ranging from about 0.07 to 0.15.
It seems like the seed that was chosen in the first test function yielded a sample that produced results very similar to those produced from the full data set. I'm still not sure why using a weighted sample results in a larger relative difference than an unweighted sample.
Thanks @hdoupe.
Also, why do we expect the results from the full sample and 2% of the sample to be within 1% of each other?
We do not expect it, but it was nice that they were so close, and it was interesting when the gap expanded so much.
It seems like the seed that was chosen in the first test function yielded a sample that produced results very similar to those produced from the full data set.
Got it. That makes sense. It sounds like we were lucky before. Do you think we should specify a seed that generally produces small differences?
Thanks @MattHJensen, that makes sense.
It does seem like we got lucky before. For the purposes of this test, setting a random seed that generally produces small differences works well enough. Then, every time a new data set is released, we would update the tolerance and the seed.
If we chose a different random seed every time, there would be cases where an unlucky sample is drawn and the test fails even though nothing is wrong. This could be solved by running several iterations and passing the test as long as x% of the iterations pass. But that may be overkill for this problem. I rewrote the test function to do this, and it takes about a minute and a half to run versus 45 seconds for the original test function:
@pytest.mark.requires_pufcsv
def test_agg(tests_path, puf_path): # pylint: disable=redefined-outer-name
"""
Test Tax-Calculator aggregate taxes with no policy reform using
the full-sample puf.csv and a two-percent sub-sample of puf.csv
"""
# pylint: disable=too-many-locals,too-many-statements
puf = pd.read_csv(puf_path)
...
# create aggregate diagnostic table using sub sample of records
fullsample = puf.copy()
use_weights = False
rn_seed = 0
msgs = []
n_tests = 50
for i in range(0, n_tests):
msg = ""
        rn_seed = np.random.randint(0, 10000)  # draw a new sub-sample seed on each iteration
subfrac = 0.02 # sub-sample fraction
if use_weights: # normalize weights for sample
fullsample['weights'] = fullsample.s006/fullsample.s006.sum()
weights = 'weights'
else: # set weights to None and pandas uses uniform dist to select samp
weights = None
subsample = fullsample.sample(frac=subfrac, # pylint: disable=no-member
random_state=rn_seed,
weights=weights,
axis=0)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)
adt_subsample = multiyear_diagnostic_table(calc_subsample, num_years=nyrs)
# compare combined tax liability from full and sub samples for each year
taxes_subsample = adt_subsample.loc["Combined Liability ($b)"]
reltol = 0.04 # maximum allowed relative difference in tax liability
if not np.allclose(taxes_subsample, taxes_fullsample,
atol=0.0, rtol=reltol):
msg = 'PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE\n'
msg += 'WHEN subfrac = {:.3f} and reltol = {:.4f} '
msg += 'and weight prop = {:.4f} and seed = {:.0f}\n'
msg = msg.format(subfrac, reltol, weight_prop, rn_seed)
it_sub = np.nditer(taxes_subsample, flags=['f_index'])
it_all = np.nditer(taxes_fullsample, flags=['f_index'])
while not it_sub.finished:
cyr = it_sub.index + calc_start_year
tax_sub = float(it_sub[0])
tax_all = float(it_all[0])
reldiff = abs(tax_sub - tax_all) / abs(tax_all)
if reldiff > reltol:
msgstr = ' year,sub,full,reldif= {}\t{:.2f}\t{:.2f}\t{:.4f}\n'
msg += msgstr.format(cyr, tax_sub, tax_all, reldiff)
it_sub.iternext()
it_all.iternext()
if len(msg) > 0:
msgs.append(msg)
    fail_rate = (len(msgs) / float(n_tests)) * 100
    if fail_rate > 30.0:  # fail_rate is a percentage, so allow up to a 30% failure rate
        fail_msg = "Failure rate: {:.2f}%".format(fail_rate)
        raise ValueError('\n'.join([fail_msg] + msgs))
@hdoupe, that was fun reading, but probably overkill as you say. The purpose of having a "generally friendly" seed for a particular dataset would be to tell TaxBrain to always use that seed when doing a "Quick Calculation".
It's not a huge deal, though, as our users should expect a mismatch between the 2% and full samples anyway, and it sounds like identifying the seed for every new taxpuf could be burdensome. Unless @hdoupe or someone else objects, I think we should close this issue and accept the wider tolerance. Thanks to @hdoupe's detective work, we know that nothing has fundamentally changed with the new dataset; we were just luckier before with the old seed.
@MattHJensen said
The purpose of having a "generally friendly" seed for a particular dataset would be to tell TaxBrain to always use that seed when doing a "Quick Calculation".
Ok, that makes sense. Thanks for reviewing.
I'm in agreement with what I've read in the #1429 conversation between @hdoupe and @MattHJensen about sampling the puf.csv input file. But several more things have to happen to implement what appears to be the consensus sampling plan.
(1) Need to find a random number seed that allows the 2% sample to produce results within 1 percent of the full-sample results. The only way I know to do that is hunt and peck for the "magic" seed. If a broad search doesn't produce a "magic" seed, then rethink the whole sampling scheme. For example, where did the 2% sampling rate come from? Is there any scientific basis for 2% (rather than a bigger percent)?
(2) Once a suitable seed is found in (1), then it must be used in TaxBrain. Does anybody know where the webapp-public repo code sets the value of the sampling seed?
(3) And finally, the TaxBrain web interface probably needs to say something about how the sampling results compare with the full-sample results. Otherwise, people have no idea about what to expect.
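A scripted version of the hunt-and-peck search in (1) might look like the following sketch (not in the repo); it assumes the fullsample DataFrame, the taxes_fullsample series, and the taxcalc objects from the test function quoted earlier in this thread:

import numpy as np

good_seeds = []
for seed in range(1000):  # the search range is arbitrary
    subsample = fullsample.sample(frac=0.02, random_state=seed)
    calc = Calculator(policy=Policy(), records=Records(data=subsample))
    adt = multiyear_diagnostic_table(calc, num_years=10)
    taxes_sub = adt.loc["Combined Liability ($b)"]
    # keep seeds whose sub-sample stays within 1% of full-sample results every year
    if np.allclose(taxes_sub, taxes_fullsample, atol=0.0, rtol=0.01):
        good_seeds.append(seed)
print(good_seeds)  # candidate "magic" seeds, if any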
On Mon, 19 Jun 2017, Martin Holmer wrote:
I'm in agreement with what I've read in the #1429 conversation between @hdoupe and
@MattHJensen about sampling the puf.csv input file. But several more things have to
happen to implement what appears to be the consensus sampling plan.
(1) Need to find a random number seed that allows the 2% sample to produce results
within 1 percent of the full-sample results. The only way I know to do that is hunt
and peck for the "magic" seed. If a broad search doesn't produce a "magic" seed, then
rethink the whole sampling scheme. For example, where did the 2% sampling rate come
from? Is there any scientific basis for 2% (rather than a bigger percent)?
You could probably match one aggregate, or even two, but probably not more
than that. The 2% was chosen to get near-instant results in Fortran. Going
to an 8% sample would halve the typical error but quadruple the time.
dan
(2) Once a suitable seed is found in (1), then it must be used in TaxBrain. Does
anybody know where the webapp-public repo code sets the value of the sampling seed?
(3) And finally, the TaxBrain web interface probably needs to say something about how
the sampling results compare with the full-sample results. Otherwise, people have no
idea about what to expect.
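dan's scaling claim above, that quadrupling the sample from 2% to 8% roughly halves the typical error, follows from sampling error shrinking as 1/sqrt(n). A quick simulation (not from the thread; a made-up skewed population stands in for tax liabilities) illustrates the point:

import numpy as np

rng = np.random.RandomState(0)
population = rng.lognormal(mean=10.0, sigma=1.5, size=200_000)
true_total = population.sum()

def typical_relative_error(frac, n_draws=100):
    """Mean |estimate - true total| / true total over repeated sub-samples."""
    n = int(frac * population.size)
    errs = []
    for _ in range(n_draws):
        sample = rng.choice(population, size=n, replace=False)
        estimate = sample.sum() / frac  # scale sub-sample total up to the population
        errs.append(abs(estimate - true_total) / true_total)
    return np.mean(errs)

print(typical_relative_error(0.02))  # typical error of a 2% sample
print(typical_relative_error(0.08))  # roughly half as large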
@hdoupe and @MattHJensen, I think there may be a bug in the subsample test being discussed in #1429. Consider this code (presumably the sub-sample construction from the test function above):
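# (reconstructed from the test function quoted earlier in this thread)
subsample = fullsample.sample(frac=subfrac,  # subfrac = 0.02
                              random_state=rn_seed,
                              weights=weights)
weight_prop = subsample.s006.sum()/fullsample.s006.sum()
rec_subsample = Records(data=subsample)
calc_subsample = Calculator(policy=Policy(), records=rec_subsample)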
The Records constructor is called with data=subsample. Looking at the Records.__init__ code suggests that the weights will be read in from the puf_weights.csv file, which does not reflect the weight_prop value. Maybe I'm confused, but it seems to me this issue should be explored more thoroughly in its own issue and pull request. Unless I'm missing something, it seems as if the sampling-adjusted weights need to somehow be put into the Records object before the Calculator object is constructed and used.
@martinholmer said
Looking at the Records.__init__ code suggests that the weights will be read in from the puf_weights.csv file, which does not reflect the weight_prop value.
I created weight_prop in an effort to figure out why the weighted sample produced results with a larger relative difference than the unweighted sample.
I'm looking at this section from Records.__init__:
# weights must be same size as tax record data
if not self.WT.empty and self.dim != len(self.WT):
    frac = float(self.dim) / len(self.WT)
    self.WT = self.WT.iloc[self.index]
    self.WT = self.WT / frac
where self.dim is the sub-sample size and len(self.WT) is the full-sample size.
To me, it looks like the constructor handles this scenario just fine. Does this clear things up for you?
@martinholmer replied:
Above code would seem to work only if doing unweighted sampling. Or am I missing something else?
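One way to see @martinholmer's concern, as a toy illustration (not from the thread): weight-proportional sampling over-represents high-weight records, so rescaling every selected weight by 1/frac, as the constructor snippet above does, double-counts them and overstates weighted totals.

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'s006': rng.uniform(1, 100, size=100_000)})
frac = 0.02

uniform = df.sample(frac=frac, random_state=0)
weighted = df.sample(frac=frac, random_state=0, weights=df.s006)

# ratio of the 1/frac-rescaled sample weight total to the true weight total
print(uniform.s006.sum() / frac / df.s006.sum())   # close to 1.0
print(weighted.s006.sum() / frac / df.s006.sum())  # noticeably above 1.0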
This pull request makes the changes in Tax-Calculator caused by the use of the new Python-generated puf.csv input file described in taxdata pull request 100.