diff --git a/README.md b/README.md index 80e4d7c..b573bac 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,14 @@ # PSID.jl + +This package produces a labeled panel of individuals with a consistent individual ID across time. You provide a JSON file describing the variables you want. An example input file can be found at [examples/user_input.json.](https://github.com/aaowens/PSID.jl/blob/master/examples/user_input.json). Currently only variables in the family files can be added, but in the future it should be possible to support variables in the individual files or the supplements. + +# Instructions + To add this package, use ``` (v1.2) pkg> add https://github.com/aaowens/PSID.jl ``` -This package produces a labeled panel of individuals with a consistent individual ID across time. You provide a JSON file describing the variables you want. An example input file can be found at [examples/user_input.json.](https://github.com/aaowens/PSID.jl/blob/master/examples/user_input.json). Currently only variables in the family files can be added, but in the future it should be possible to support variables in the individual files or the supplements. - Requirements: A list of data files required to be in the current directory can be found [here](https://github.com/aaowens/PSID.jl/blob/master/src/allfiles_hash.json). These files are 1. The PSID codebook in XML format. You can download this from me here https://drive.google.com/open?id=1nz1UaVGcj0ur2Bp3ev7a8agJbj0A5JTF . In the future there will be a way to download this from the PSID directly. @@ -20,11 +23,25 @@ julia> makePSID("user_input.json") ``` It will verify the required files exist and then construct the data. If successful, it will print `Finished constructing individual data, saved to output/allinds.csv` after about 5 minutes. +## The input JSON file +The file passed to `makePSID` describes the variables you want. +``` +{ + "name_user": "hours", + "varID": "V465", + "unit": "head" + }, + ``` + There are three fields, `name_user`, `varID`, and `unit`. `name_user` is a name chosen by you. `varID` is one of the codes assigned by the PSID to this variable. These can be looked up in the PSID [cross-year index](https://simba.isr.umich.edu/VS/i.aspx). For example, hours above can be found in the crosswalk at ` Family Public Data Index 01>WORK 02>Hours and Weeks 03>annual in prior year 04>head 05>total:`. Clicking on the variable info will show the the list of years and associated IDs when that variable is available. Choose any of the IDs for `varID`, it does not matter. `PSID.jl` will look up all available years for that variable in the crosswalk. You must also indicate the unit, which can be `head`, `spouse`, or `family`. This makes sure the variable is assigned to the correct individual. + + +# Features + This package provides the following features: 1. Automatically labels missing values by searching the value labels from the codebook for strings like "NA", "Inap.", or "Missing". 2. Tries to produce consistent value labels across years for categotical variables. This is difficult because the labels in the PSID sometimes change between years. This package uses an algorithm to try to harmonize the labels when possible by removing common subsets. For example, in one year race is labeled as "Asian" but in the next year it is "Asian, Pacific Islander". The first is a subset of the second, so the final label will be "Asian, Pacific Islander". When this is not possible, the final label will be "A or B or C" for however many incomparable labels were found. 3. Matches the individuals across time to produce a panel with consistent (ID, year) keys and their associated variables. -4. Produces consistent individual or spouse variables for individuals. In the input JSON file, you must indicate whether a variable is family level, household head level, or household spouse level. The final output will have variables of the form VAR_family, VAR_ind, or VAR_spouse. When the individual is a household head, VAR_ind will come from the household head version of that variable, and VAR_spouse will come from the household spouse version. If the individual is a household spouse, it is the reverse. +4. Produces consistent individual or spouse variables for individuals. In the input JSON file, you must indicate whether a variable is family level, household head level, or household spouse level. The final output will have variables of the form `VAR_family`, `VAR_ind`, or `VAR_spouse`. When the individual is a household head, `VAR_ind` will come from the household head version of that variable, and `VAR_spouse` will come from the household spouse version. If the individual is a household spouse, it is the reverse. Both individuals will get all family level variables. 5. It's easiest to track individuals, but this package also produces a consistent family ID by treating a family as a combination of head and spouse (if spouse exists). If you keep only household heads and drop years before 1970, (famid, year) should be an ID. This package is new and not well tested, please file issues if you find a bug.