Incorrect income percentiles #45

Closed
codykallen opened this issue Nov 10, 2016 · 7 comments

Comments

@codykallen

The distribution of income quintiles in the PUF appears incorrect.

Using the Tax-Calculator, I found the following values for median tax unit income. The numbers in parentheses are the nominal median household incomes from the Census Bureau.
2013: $27,250 ($52,250)
2014: $28,022 ($53,657)
2015: $28,877 ($55,775)

The numbers calculated with TC are more consistent with median individual incomes, but AGI includes income from the primary and secondary earner. I think the only major discrepancy should come from married couples filing separately.

The Tax Policy Center also has some percentile distributions. Their numbers are in parentheses next to those estimated using TC (for 2013).
20th: $5,894 ($21,000)
40th: $18,537 ($41,035)
60th: $38,659 ($67,200)
80th: $77,975 ($110,232)

If our percentile distributions are too far off, we can't do a distributional analysis of a tax plan.
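
For readers checking figures like these, here is a minimal sketch of a weighted percentile, assuming each record carries an income and a sampling weight. The function name `weighted_percentile` and the toy numbers are illustrative assumptions, not Tax-Calculator code or PUF data.

```python
def weighted_percentile(values, weights, pct):
    """Return the smallest value at which the cumulative weight
    share reaches pct (given on a 0-100 scale)."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal length")
    pairs = sorted(zip(values, weights))        # sort records by income
    total = sum(weight for _, weight in pairs)  # total weighted population
    threshold = total * pct / 100.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]


# Toy data (made up): low-income records carry more weight.
incomes = [5000, 20000, 30000, 60000, 150000]
weights = [300, 200, 200, 200, 100]

unweighted_median = weighted_percentile(incomes, [1] * len(incomes), 50)
weighted_median = weighted_percentile(incomes, weights, 50)
print(unweighted_median, weighted_median)
```

In this toy case the weighted median falls below the unweighted one because the low-income records represent more tax units, which is why the sampling weights have to be part of any percentile comparison.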

Link to my workbook

cc @martinholmer @MattHJensen @Amy-Xu @andersonfrailey

@martinholmer
Contributor

martinholmer commented Nov 11, 2016

@codykallen computed some AGI values at different points of the AGI distribution and was troubled by his results.

I've written a short Python script that tabulates the mean AGI in each AGI percentile.
Here is that script and below the script are the results it generates with some thoughts about what the results mean.

import numpy as np
import pandas as pd
from taxcalc import *
calc = Calculator(policy=Policy(), records=Records())
calc.calc_all()
# extract sampling weight and calculated AGI into a Pandas DataFrame, df
record_columns = ['s006', 'c00100']
output = [getattr(calc.records, col) for col in record_columns]
df = pd.DataFrame(data=np.column_stack(output), columns=record_columns)
# create 'bins' column in df
df = add_weighted_income_bins(df, num_bins=100, income_measure='c00100')
# split df into groups specified by 'bins' column
gdf = df.groupby('bins', as_index=False)
# apply weighted_mean function to percentile-grouped AGI values
avg_agi_series = gdf.apply(weighted_mean, 'c00100')
# print mean AGI for every tenth percentile bin
for pctile in range(10, 100, 10):
    print('{} {:.0f}'.format(pctile, avg_agi_series[pctile]))

Here is what this script produces.

$ cp ../tax-calculator-data/puf.csv .
$ python agi_values.py
You loaded data for 2009.
Your data include the following unused variables that will be ignored:
  filer
Your data have been extrapolated to 2013.
10 936
20 6187
30 12357
40 18939
50 27699
60 39262
70 55703
80 79704
90 122618
$ 

These results are close to what @codykallen generated in his notebook, so it would seem that the AGI values are very low. What about Expanded Income?

@andersonfrailey
Collaborator

@martinholmer @codykallen

Using both of y'all's code with expanded income, the TC numbers are still low. Here is the income distribution:

20th: 11,649
40th: 25,280
60th: 45,194
80th: 83,796

And mean expanded income:

10 5237
20 11998
30 17960
40 25682
50 34451
60 45922
70 61408
80 85599
90 131142

Based on some digging into the distribution of the different sources of income and conversations with @codykallen and others, I believe part of the problem might be in stage II of our extrapolation process where the weights for each target are adjusted to ensure aggregate totals for things like interest income, dividends, and wages are hit.

We currently only target the distribution of wages and salaries. For everything else it is just the aggregate total. Adding more targets for other sources of income might help solve some of the problems.
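
To illustrate why an aggregate-only target leaves the distribution loose, here is a toy sketch. It is not the actual stage II procedure; the amounts and both weight vectors are made up. Two sets of weights hit the same weighted total of interest income and the same population count, yet imply very different shares of units with no interest income.

```python
# Toy records (hypothetical): interest income reported on four records.
interest = [0, 100, 1000, 10000]

# Two candidate weight vectors (made up for illustration).
weights_a = [10, 10, 10, 10]
weights_b = [19, 10, 0, 11]


def weighted_total(amounts, weights):
    """Aggregate weighted amount, the kind of total an aggregate target pins down."""
    return sum(a * w for a, w in zip(amounts, weights))


def share_with_zero(amounts, weights):
    """Weighted share of units whose amount is exactly zero."""
    zero_weight = sum(w for a, w in zip(amounts, weights) if a == 0)
    return zero_weight / sum(weights)


total_a = weighted_total(interest, weights_a)
total_b = weighted_total(interest, weights_b)
zero_share_a = share_with_zero(interest, weights_a)
zero_share_b = share_with_zero(interest, weights_b)

print(total_a, total_b)            # same aggregate
print(zero_share_a, zero_share_b)  # different distributions
```

Both weightings satisfy the aggregate target, so only an additional distributional target (for example, amounts by income class) would distinguish between them.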

After I finish working on the CPS file I can try adding the additional targets and report back on how they affected the income distributions we're seeing.

@MattHJensen
Contributor

MattHJensen commented Nov 12, 2016

@andersonfrailey said:

After I finish working on the CPS file I can try adding the additional targets and report back on how they affected the income distributions we're seeing.

I'm in full agreement that adding the additional targets will be useful and might improve this situation.

But I'm not sure yet that our current distributional accuracy is as bad as any of the above analysis might make it appear.

First note that both of @codykallen's comps appear to be Census Bureau data, as the Census Bureau is the source of the TPC tables.

If we compare TD's AGI in 2014 against SOI AGI data in 2014 -- the latest year for which comparable administrative data is available -- the numbers look a little more reassuring: eyeballing from the "accumulated" section of SOI table 1.1, median AGI is around $35k.

Now that is still higher than the $28k that @codykallen generates from TD, but keep in mind that SOI's $35k only includes tax filers, while @codykallen's script appears to also include non-filers, so we should expect his number to be lower.

I think fruitful next steps would be to (1) look at TC/TD's median w/o non-filers and compare that to the $35k and (2) dig into the differences between Census' income measure and AGI.
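
As a rough illustration of step (1), the sketch below compares the median over all units with the median over filers only. The records are made-up toy values, not PUF data, and repeating each record by an integer weight is a shortcut that only works for small toy examples.

```python
from statistics import median

# Hypothetical toy records: (AGI, integer weight, is_filer).
records = [
    (0, 3, False),
    (8000, 2, False),
    (15000, 2, True),
    (35000, 3, True),
    (90000, 2, True),
]


def repeat_by_weight(rows):
    """Repeat each AGI value `weight` times so a plain median honors the weights."""
    return [agi for agi, weight, _ in rows for _ in range(weight)]


median_all = median(repeat_by_weight(records))
median_filers = median(repeat_by_weight([r for r in records if r[2]]))
print(median_all, median_filers)
```

Dropping the low-income non-filers shifts the whole distribution upward, so a filers-only median is the fairer comparison against SOI figures.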

As for (1), any remaining discrepancy between SOI AGI and TC/TD AGI w/o nonfilers would hopefully be narrowed by @andersonfrailey's upcoming work on additional stage 2 targets. Perhaps the primary value of that stage 2 work is that it will mean that our distribution of income items like capital gains, dividends, interest will be more accurate and we could more accurately estimate revenue for reforms that target those income items. Those improvements won't necessarily be apparent in these aggregate AGI statistics, though.

As for (2), it could be that Census includes income that is excluded from AGI, like transfer/welfare payments and employer-provided health, as well as the other items that are included in our expanded income measure but not in AGI. If that turns out to be the case, then the resulting differences are food for thought regarding our tab variable for distributional tables (as @codykallen notes) rather than relevant to our ability to do revenue analysis. They won't be improved through stage 2 extrapolation, but rather by imputing major excluded income items.

For background on imputations that would be helpful, see #35 for a list of open projects for expanding TD/TC's expanded income measure.

@martinholmer
Contributor

martinholmer commented Nov 13, 2016

This comment on taxdata issue #45 follows up on some suggestions made by @MattHJensen. Here we tabulate mean AGI and mean expanded income (using the newest version of expanded income in Tax-Calculator pull request #1057) for each of several income percentiles, using the puf.csv file aged to 2014. These results confirm Matt's notion that, among filers, the mean AGI in the 50th AGI percentile is about the same as that reported by IRS-SOI (probably somewhere in the $34,000 to $35,000 range). Below are four sets of distributions: filers-only vs. all puf units, crossed with AGI vs. expanded income as the income measure used to define the percentiles. The script used to generate these tabulations follows the results.

-----------------------------------------------------------------
NOTE: filers_only=False implies unweighted count = 219814
2014 mean income by income percentiles [income = c00100]
10      864
20     6090
30    12444
40    19282
50    28494
60    40550
70    57512
80    82260
90   127343
-----------------------------------------------------------------
NOTE: filers_only=True implies unweighted count = 214121
2014 mean income by income percentiles [income = c00100]
10     4538
20    10482
30    16592
40    24523
50    34078
60    46646
70    63226
80    88946
90   133871
-----------------------------------------------------------------
NOTE: filers_only=False implies unweighted count = 219814
2014 mean income by income percentiles [income = _expanded_income]
10     5288
20    12163
30    18336
40    26365
50    35403
60    47332
70    63582
80    88636
90   136925
-----------------------------------------------------------------
NOTE: filers_only=True implies unweighted count = 214121
2014 mean income by income percentiles [income = _expanded_income]
10     6518
20    14131
30    21529
40    30162
50    40274
60    52721
70    69242
80    95697
90   143790
-----------------------------------------------------------------

Now the script.

import numpy as np
import pandas as pd
from taxcalc import *
# settings: restrict to filers, and choose the income measure for the bins
filers_only = True
income = '_expanded_income'
calc = Calculator(policy=Policy(), records=Records(), verbose=False)
calc.advance_to_year(2014)
calc.calc_all()
# extract variables into a Pandas DataFrame, df
record_columns = ['s006', 'c00100', '_expanded_income', 'filer']
output = [getattr(calc.records, col) for col in record_columns]
df = pd.DataFrame(data=np.column_stack(output), columns=record_columns)
if filers_only:
    df = df[df['filer'] == 1]
print('NOTE: filers_only={} implies '
      'unweighted count = {}'.format(filers_only, len(df.index)))
# create 'bins' column in df
df = add_weighted_income_bins(df, num_bins=100, income_measure=income)
# split df into groups specified by 'bins' column
gdf = df.groupby('bins', as_index=False)
# apply weighted_mean function to percentile-grouped income values
mean_income = gdf.apply(weighted_mean, income)
print('2014 mean income by income percentiles [income = {}]'.format(income))
for pctile in range(10, 100, 10):
    print('{} {:8.0f}'.format(pctile, mean_income[pctile]))

@feenberg @Amy-Xu @GoFroggyRun @andersonfrailey @codykallen

@martinholmer
Contributor

@codykallen, is taxdata issue #45, which you raised in Nov 2016, still unresolved from your point of view?

@codykallen
Author

@martinholmer, I believe this can be considered complete. Closing now.

@martinholmer
Contributor

Thanks @codykallen
