Allow text specification of states, counties #4

Open
beechnut opened this issue Jul 12, 2013 · 10 comments

@beechnut
Contributor

Let users indicate states and counties by name instead of numerical code, using hash syntax.

@client.find('P0010001', county: 'Suffolk', state: 'MA')
@client.find('P0010001', county: 'Suffolk County', state: 'Massachusetts')
@client.find('P0010001', county: 25, state: 25)

It should also accept a symbol as a wildcard field name, plural field names, and multiple 'level' values as an array:

@client.find('P0010001', :county)
@client.find('P0010001', :states)
@client.find('P0010001', states:[25,26])

This means the keys will be upcased to become API URL parameter names, and the values will be looked up in a hash and converted to digits for the URL parameter values.
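A minimal sketch of that conversion, for illustration only. The lookup tables, the helper names (`state_id`, `county_id`, `geo_params`) and the zero-padded widths are assumptions, not the gem's actual implementation; a real dictionary would be generated from Census data.

```ruby
# Hypothetical lookup tables with a single state and county for illustration.
STATE_IDS  = { 'MA' => '25', 'MASSACHUSETTS' => '25' }.freeze
COUNTY_IDS = { '25' => { 'SUFFOLK' => '025', 'SUFFOLK COUNTY' => '025' } }.freeze

# Integers pass through as zero-padded digits; names are looked up in the hash.
def state_id(value)
  return format('%02d', value) if value.is_a?(Integer)

  STATE_IDS.fetch(value.to_s.upcase) { raise ArgumentError, "unknown state: #{value}" }
end

def county_id(value, state)
  return format('%03d', value) if value.is_a?(Integer)

  COUNTY_IDS.fetch(state, {}).fetch(value.to_s.upcase) do
    raise ArgumentError, "unknown county: #{value}"
  end
end

# Keys are upcased to parameter names, values converted to digit strings.
def geo_params(county:, state:)
  sid = state_id(state)
  { 'COUNTY' => county_id(county, sid), 'STATE' => sid }
end
```

With these tables, all three call styles from the examples above resolve to the same pair of digit strings, and an unknown name raises an `ArgumentError` instead of producing a bad request.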

When multiple geometry parameters need to be specified for 'in', I imagine the following:

@client.find('P0010001', :submcd, {state: 72, county: 127, cousub: 79693})
@tyrauber
Owner

This is good. I like it.

A rake task would be required to query each of the summary levels on a state-by-state basis and build out the dictionary with the responses. But this feature would also allow the gem to return human-readable results:

[{"P0010001"=>"722023", "P0390001"=>"140412", "county"=>"Suffolk County", "state"=>"Massachusetts",  "state_id"=>"25", "county_id"=>"025"}] 

Just a couple issues to think through:

  • Dictionary Size. There are 19 summary levels. Some of the summary level names are manufactured from their number - TABBLOCK, TRACT, BG, CD - but most would require a translation: STATE, COUNTY, COUSUB, SUBMCD, PLACE, ANRC, AIANNH, AITS, CBSA, METDIV, CSA, SLDL, SLDU.
  • Some of the summary levels can only be looked up in relation to their parent level. Perhaps we return all matches across all parents? This seems like it would be ok, because results return the name, or id, of the resulting area.
  • It might be nice to query the dictionary directly: @client.locate('Suffolk'). This would let people look up the right location before querying for data.
  • It also might be nice to provide fuzzy search, perhaps using the amatch gem.

@beechnut
Contributor Author

I like how you structured the return object. You're right - gives the user more to work with.

  • What would @client.locate 'Suffolk' return? (Not quite following yet.)
  • +1 to fuzzy matching.
  • Regarding dictionary size: noted. There's also the question of how much of the name to include. Do we call it "American Indian Area/Alaska Native Area/Hawaiian Home Land" or truncate it to "American Indian Area"? I suppose we could have something like:
{AIANNH: {short_name: 'American Indian Area', long_name: 'American Indian Area/Alaska Native Area/Hawaiian Home Land'}}

and return the short_name in the results object.

Parent-Independent Querying

I definitely want to be able to query objects independent of nesting, and it would be fantastic if the gem allowed this. However, I think that until the API itself can handle parent-independent querying, we should enable it only for objects nested one level deep.

Those levels would be:

  • CBSA
  • CSA
  • CD
  • SLDU
  • SLDL
  • ZCTA

To enable parent-independent queries for fields that are nested one level deep, e.g. ZCTAs (nested in STATE), we only have to know the 52 IDs for the states, and can run something like:

(1..52).each { |id| @client.find("P0010001", "ZCTA5", "STATE:#{id}") }

However, for multiple nesting levels, we would need to know all of the ids of every level above it, and querying, say, all block groups would return tens of thousands of objects.

One more question here is how to look up, for example, a single state-independent ZCTA, as in:

@client.find('P0010001', zcta: '02139')

@beechnut
Contributor Author

FYI I have the hash syntax (not the text lookup) for the level parameter in a good spot. Will jam on within tonight/tomorrow and push those changes soon. Still using numerical IDs.

@tyrauber
Owner

Good stuff.

I created another gem around the same time as this one, census_shapes, which imports the census summary level boundaries into a postgis database.

There are a couple of files in there that might be of use here:

Additionally:

  • Regarding @client.locate 'Suffolk': I was just suggesting a process to look up the ID for any given geography. @client.locate would do a dictionary lookup to match Suffolk and return all matches, regardless of summary level.

  • AIANNH is a summary level, not geography. All the geographies only have a name and an id. The census api doesn't even return state abbreviation, hence the abbreviations in the US states yaml file.

  • STATE ids are - unfortunately - not sequential. They range from 1-72. In no particular order.

  • ZCTA5 is independent of STATE, unfortunately. A handful of ZCTA5s cross state boundaries.

  • The main hierarchy structure is :

    STATE > COUNTY > TRACT > BLOCKGROUP > BLOCK

  • Also, worth noting, their id at each level indicates their parents

state (2 digits), county (3 digits), census tract (6 digits), block group (1 digit), block (4 digits, beginning with the block group digit)

Additionally, SD, CD, SLDU, SLDL and PLACE are under STATE.
And, VD and COUSUB are under COUNTY.

See page 16 of the Census SF1 PDF.
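Because each id embeds its parents, a concatenated id can be split back into its levels. The sketch below covers STATE through BLOCKGROUP only; the names `GEOID_FIELDS` and `split_geoid` and the sample id are illustrative assumptions, not part of the gem.

```ruby
# Field widths for the main hierarchy, in nesting order.
GEOID_FIELDS = { state: 2, county: 3, tract: 6, block_group: 1 }.freeze

# Cumulative lengths of a complete id at each level: 2, 5, 11, 12.
VALID_LENGTHS = [2, 5, 11, 12].freeze

# Splits a concatenated id into the levels it contains.
def split_geoid(geoid)
  unless VALID_LENGTHS.include?(geoid.length)
    raise ArgumentError, "unexpected id length: #{geoid.length}"
  end

  offset = 0
  GEOID_FIELDS.each_with_object({}) do |(level, width), parts|
    break parts if offset >= geoid.length

    parts[level] = geoid[offset, width]
    offset += width
  end
end

# A county-level id yields only its state and county parts.
split_geoid('25025')
```

Keeping the ids as strings preserves leading zeros such as the `025` county code.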

With regard to creating the geography dictionary, I would probably write a script to create yaml, like us_states, for every summary level. The only additional data I would add would be parental hierarchy.

In fact, if I remember correctly, for the TIGER dataset every state has an SF1 file which serves as an index. That SF1 file contains a list of every geography in the state at every level. It's unfortunately not a csv, and not easily parsed, but I have some code somewhere that will do it.

With that being said, it might just be easier to write a rake task that queries the census api to build the index with the results.

@tyrauber
Owner

Got the following message via email from github / @beechnut, but it didn't show up even though the email link brought me here. Pasting it in and commenting for posterity.

Thanks for the clarification on summary level vs geography. (Still learning!) Looks like geography is the descending center column on p.16 of the SF1 PDF, and the sumlevs are in the wings as well as the center column -- right?

I spent a month trying to grok that damn sf1 doc. Don't worry about it. When I write 'geography' I mean an actual physical geographic entity. Summary levels are types of geography as determined by the US Census. So California is a 'geography', and the sumlevel is 040, STATE. On page 16, that diagram shows the relationship between all sumlevels from a hierarchical point of view.

@client.locate
Thanks for clarifying. So if I understand, @client.locate 'Suffolk' would return the ids for geographies that match (or amatch) 'Suffolk'. Ideally we'd see parent info here too, when relevant:

Suffolk County: [id: 25, state: 25, state_name: "Massachusetts"]
Suffolk County: [id: 103, state: 99, state_name: "State"]
Suffolk city: [id: 800, state: 99, state_name: "State"] # didn't know where the other Suffolks were. If only we had a function to return this! :D

Correct. Basically, I was just suggesting a way to do quick geographical look up - especially if we get fuzzy search in there - so you can find the proper geography before querying the census api.
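A rough sketch of what such a lookup could look like, using a plain case-insensitive substring match rather than fuzzy matching. The `GEOGRAPHIES` table, its field names and the `locate` signature are assumptions for illustration; the entries are a tiny hand-written sample, not a generated dictionary.

```ruby
# Hypothetical flat dictionary covering a few geographies.
GEOGRAPHIES = [
  { name: 'Suffolk County',   sumlevel: 'COUNTY', id: '025', state: '25', state_name: 'Massachusetts' },
  { name: 'Suffolk County',   sumlevel: 'COUNTY', id: '103', state: '36', state_name: 'New York' },
  { name: 'Suffolk city',     sumlevel: 'COUNTY', id: '800', state: '51', state_name: 'Virginia' },
  { name: 'Worcester County', sumlevel: 'COUNTY', id: '027', state: '25', state_name: 'Massachusetts' }
].freeze

# Returns every entry whose name contains the query, regardless of summary level.
def locate(query, dictionary = GEOGRAPHIES)
  needle = query.downcase
  dictionary.select { |geo| geo[:name].downcase.include?(needle) }
end

matches = locate('Suffolk')
matches.each { |geo| puts "#{geo[:name]}: id #{geo[:id]}, state #{geo[:state]} (#{geo[:state_name]})" }
```

Returning the parent state alongside each match is what lets the caller pick the right geography before querying the API.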

Parent Relations
"Some of the summary levels can only be looked up in relation to their parent level. Perhaps we return all matches across all parents? This seems like it would be ok, because results return the name, or id, of the resulting area."

I think the gem should work in a manner consistent with the API: raise an error if the parameters are off, and return strictly the correct results. I do not want to force a user to sift through the results afterwards, and would much rather require them to get the parameters right before the request.

Agreed.

I'm thinking of this specifically for states and counties, but

Pseudocode for the workflow
if within
else
end

Perhaps, the post didn't come through because of the above code?

ZCTA sidenote
I'm lobbying the Census Bureau to open up 860 so we don't have to work around this limitation.

860 is the sumlevel ID for ZCTA5 - Zip Code Tabulation Area. My understanding is that the way the Post Office assigns zip codes to new addresses is fairly organic, and therefore the boundaries for zip codes are not very well structured and always in flux. ZCTA5s attempt to solve this by determining the majority zip code for any given block, and then grouping blocks with the same zip code into larger geographies, ZCTA5s.

P.S. Thing I wish I'd known a few days ago, so I'm writing it explicitly for any future devs who come through here: census_shapes.yml contains all the queryable summary levels, as defined by the Census Bureau in these two docs:
ACS5
SF1

Yeah, sorry, wish I would have remembered sooner. Been a while since I worked on this stuff.

@beechnut
Contributor Author

Just posted the actual comment -- I'd accidentally hit 'Comment' before I was finished. Thank you for the seemingly precognizant feedback!

EDIT: Annnd now it looks like the actual comment didn't get posted. Ugh.

@beechnut
Contributor Author

Anyway, what didn't come through was, with a YAML file containing states and counties, it's not hard to search for nested geographies.

---
- name: Massachusetts
  id: 25
  counties:
  - name: Plymouth County
    id: 23
  - name: Suffolk County
    id: 25
  - name: Worcester County
    id: 27
...

And to get the right ids (pseudocode):

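# The call being resolved:
#   @client.find('P0010001', county: 'Suffolk County', state: 'MA')
require 'yaml'

y = YAML.load(File.read('lib/yml/states_test.yml'))

# `within` and `level` stand for the parsed state and county key/value pairs; `pluralize` is from ActiveSupport.
state = y.find { |e| e['abbr'] == within.value || e['name'] == within.value } #=> object for Massachusetts
county = state[level.key.to_s.pluralize].find { |c| c['name'] == level.value }['id'] #=> 25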
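For anyone who wants to run the idea, here is a self-contained sketch of the same nested lookup. It inlines the YAML instead of reading a file; the `abbr` field, the `lookup_ids` name and the error messages are assumptions added for illustration.

```ruby
require 'yaml'

# Inline stand-in for the states YAML file; 'abbr' is an assumed extra field.
STATES_SOURCE = <<~DOC
  ---
  - name: Massachusetts
    abbr: MA
    id: 25
    counties:
    - name: Plymouth County
      id: 23
    - name: Suffolk County
      id: 25
    - name: Worcester County
      id: 27
DOC

STATES = YAML.safe_load(STATES_SOURCE)

# Resolves a state (by name or abbreviation) and a county nested under it.
# Raises rather than guessing when either name does not match.
def lookup_ids(state:, county:)
  state_entry = STATES.find { |s| s['abbr'] == state || s['name'] == state }
  raise ArgumentError, "unknown state: #{state}" unless state_entry

  county_entry = state_entry['counties'].find { |c| c['name'] == county }
  raise ArgumentError, "unknown county: #{county}" unless county_entry

  { state: state_entry['id'], county: county_entry['id'] }
end

lookup_ids(state: 'MA', county: 'Suffolk County')
```

Searching for the county only inside the matched state keeps the lookup unambiguous when several states share a county name.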

@tyrauber
Owner

Yeah, I thought of this. My only concern was that not all sumlevels are hierarchical under states. But perhaps those sumlevels are just at the root of the document like state? If that is the case then we would want to add a type field:

type: 'STATE'

To differentiate between the different root document sumlevels. But in that way you could have all the geographies in one document, which I am fine with.
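As a sketch of how a type field could separate root-level entries in a single document (the sample entries, the `GEO_SOURCE` constant and the `roots_of_type` helper are illustrative assumptions):

```ruby
require 'yaml'

# Hypothetical single document mixing root-level sumlevels, tagged by type.
GEO_SOURCE = <<~DOC
  ---
  - type: STATE
    name: Massachusetts
    id: 25
  - type: ZCTA5
    name: '02139'
    id: '02139'
DOC

ENTRIES = YAML.safe_load(GEO_SOURCE)

# Selects the root entries belonging to one summary level.
def roots_of_type(entries, type)
  entries.select { |entry| entry['type'] == type }
end

roots_of_type(ENTRIES, 'STATE').each { |entry| puts entry['name'] }
```

Quoting the ZCTA5 id in the YAML keeps it a string, so the leading zero survives parsing.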

@beechnut
Contributor Author

Correct -- right now, all the API's sumlevels nest under state, or are top-level. If you take a look at these SF1 and ACS5 API docs, every field is prefixed with state-, or is top-level.

And as the API gets more complex, we'll be able to add other sumlevel documents for other roots.

@beechnut
Contributor Author

beechnut commented Jul 3, 2014

One year later, I'm returning to this, as I'm going to need this gem for a project at work in the near future. In that year I've gotten much better at Ruby, so I'm looking forward to contributing again.
