[FEATURE REQUEST] Persistent administrative entity identifiers #3672

Open
jacobwhall opened this issue Mar 1, 2024 · 2 comments

Comments

jacobwhall commented Mar 1, 2024

TL;DR: I would like to associate geoBoundaries data with other datasets. This is difficult to do because geoBoundaries does not persist boundary identifiers across versions. I suggest that geoBoundaries introduce persistent identifiers for administrative entities.

The 2020 geoBoundaries paper states in its opening paragraph:

> The database is standardized using ISO 3166-1 alpha-3 encoding, and every boundary has a globally unique ID, allowing for integration with large-scale computational workflows.

The "globally unique ID" for each shape is described in this table:

> The boundary ID, followed by the letter ‘B’ and a unique integer for each shape which is a member of that boundary.

...which glosses over the volatility of shape identifiers:

| geoBoundaries version | shapeID for Richmond, VA |
| --- | --- |
| v3.0.0 | USA-ADM2-3_0_0-B672 |
| v4.0.0 | USA-ADM2-92793851B43358342 |
| v5.0.0 | 52423323B78509502983349 |
| v6.0.0 | 52423323B61032845323419 |

There is no (documented) way to reliably link geoBoundaries entities with data from other sources, or even with those from previous versions of geoBoundaries. Matching boundaries based on shapeName is bound to run into difficulties with regard to formatting, language differences, and formal name changes.

I will continue using Richmond as an example. Many databases catalog administrative entities; here are a few:

  • The National Archives has assigned the NAID 10045969 to Richmond
  • The International Standard Name Identifier (ISNI) for Richmond is 0000000405094755
  • Richmond has a WorldCat entity ID of E39PBJtCCxFgd9Kg99KkHbYgKd
  • In OpenStreetMap, the node representing Richmond has an ID of 2592301390
  • Who's On First, a database of stable place identifiers, has assigned Richmond the ID 101728675

Many more are listed on Richmond's Wikidata item page, which itself has the permanent reference Q43421.

If Richmond annexes more of Chesterfield, its identifier in the above databases is unlikely to change. I understand that many boundaries tracked in geoBoundaries are ever-changing, yet there remains a need for persistent identifiers. Such identifiers would make it much easier to associate shapes in geoBoundaries with the corresponding entities in the above databases.

I believe there are two options for accomplishing this:

  1. Create new, persistent identifiers for each administrative boundary that geoBoundaries tracks, which external datasets can then reference
  2. Reference an external dataset's identifiers in the metadata of each boundary in geoBoundaries

The first option might be the easiest to implement. The persistent identifiers could be added to Wikidata, for example, enabling cross-dataset queries. This would allow complex metadata to be associated with geoBoundaries.
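To make the idea concrete, here is a minimal sketch of what a crosswalk record could look like. It is only an illustration: the field names, the `find_by_shape_id` helper, and the `EXAMPLE-0001` persistent identifier are hypothetical and not part of geoBoundaries. The shapeIDs and external identifiers are the Richmond values quoted above.

```python
# Hypothetical crosswalk: one record per administrative entity.
# Field names and the persistent ID value are illustrative assumptions.
CROSSWALK = [
    {
        "persistent_id": "EXAMPLE-0001",  # placeholder, not a real geoBoundaries ID
        "shape_ids": {
            "v3.0.0": "USA-ADM2-3_0_0-B672",
            "v4.0.0": "USA-ADM2-92793851B43358342",
            "v5.0.0": "52423323B78509502983349",
            "v6.0.0": "52423323B61032845323419",
        },
        "external_ids": {
            "wikidata": "Q43421",
            "whosonfirst": "101728675",
        },
    },
]


def find_by_shape_id(shape_id):
    """Return the record that lists this versioned shapeID, or None."""
    for record in CROSSWALK:
        if shape_id in record["shape_ids"].values():
            return record
    return None


record = find_by_shape_id("52423323B61032845323419")
print(record["persistent_id"], record["external_ids"]["wikidata"])
```

With a record like this, a shapeID from any release resolves to the same entity, and from there to Wikidata or any other listed database.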

Thank you for your consideration!

Member

DanRunfola commented Mar 5, 2024

This is a really hard problem, because we want to ensure that unique geometries have unique codes - i.e., if you have the same geoBoundaries ID, then you should be able to assume that the underlying geometry has not changed. Today, we actually hash the geometry itself to make the code, which is why you see changes - a change in our ID means that the geometry has changed. The problem here is that, of course, most changes are fairly small, which results in an ID system that is highly unstable (which is also undesirable).
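As a rough illustration of why a hash-based code changes with every edit, here is a minimal sketch. It is not the actual geoBoundaries implementation: the `geometry_hash_id` function, the coordinate rounding, the use of SHA-256, and the coordinates themselves are all assumptions made for the example.

```python
import hashlib


def geometry_hash_id(ring, precision=6):
    """Hypothetical: derive an ID from a polygon ring's vertex coordinates."""
    # Serialize the vertices at a fixed precision so the hash is deterministic.
    text = ";".join(f"{x:.{precision}f},{y:.{precision}f}" for x, y in ring)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# Made-up rectangle, not a real boundary.
original = [(-77.35, 37.65), (-77.15, 37.65), (-77.15, 37.85), (-77.35, 37.85)]
# Same shape with one vertex moved by a thousandth of a degree.
edited = [(-77.35, 37.65), (-77.15, 37.65), (-77.15, 37.851), (-77.35, 37.85)]

print(geometry_hash_id(original) == geometry_hash_id(original))  # same geometry
print(geometry_hash_id(original) == geometry_hash_id(edited))    # tiny edit
```

Any change to any vertex produces an unrelated code, which is the instability described above.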

We've discussed this a bunch with a range of actors, and what we're currently thinking is something like (lots of details that need to be figured out):

  1. Create a grid across the globe for each administrative level, at a resolution fine enough to guarantee that no two administrative units at that level have centroids falling into the same grid cell (possibly a dynamic resolution implementation, where we start coarse and split as needed).
  2. Identify what grid cell the centroid of a given unit falls into.
  3. Create a persistent ID based on the combination of (A) the ISO code, (B) the ADM level, and (C) the grid ID. So an identifier would be something like "USA-ADM2-209948". The only case in which that would change is if the geometry changes enough that the grid cell its centroid falls into changes, which would hopefully be a valid reason to change things up.
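The three steps above could be sketched roughly as follows. This is only a sketch under simplifying assumptions: it uses a fixed 0.5-degree longitude/latitude grid rather than a dynamic resolution, it does nothing to guarantee one unit per cell, and the function names, cell numbering, and coordinates are all hypothetical.

```python
import math


def polygon_centroid(ring):
    """Area-weighted centroid of a simple, non-degenerate polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def grid_cell_id(lon, lat, cell_deg=0.5):
    """Hypothetical numbering: cells of a fixed lon/lat grid, counted row by row."""
    cols = math.ceil(360 / cell_deg)
    rows = math.ceil(180 / cell_deg)
    col = min(int((lon + 180) // cell_deg), cols - 1)
    row = min(int((lat + 90) // cell_deg), rows - 1)
    return row * cols + col


def persistent_id(iso3, adm_level, ring, cell_deg=0.5):
    """Combine (A) ISO code, (B) ADM level, and (C) the centroid's grid cell."""
    lon, lat = polygon_centroid(ring)
    return f"{iso3}-ADM{adm_level}-{grid_cell_id(lon, lat, cell_deg)}"


# Made-up rectangle, not a real boundary.
base = [(-77.35, 37.65), (-77.15, 37.65), (-77.15, 37.85), (-77.35, 37.85)]
# One vertex nudged slightly: the centroid stays in the same cell.
edited = [(-77.35, 37.65), (-77.15, 37.65), (-77.15, 37.851), (-77.35, 37.85)]
# Whole shape moved one degree east: the centroid lands in another cell.
shifted = [(x + 1.0, y) for x, y in base]

print(persistent_id("USA", 2, base) == persistent_id("USA", 2, edited))
print(persistent_id("USA", 2, base) == persistent_id("USA", 2, shifted))
```

A real implementation would also need to handle centroids that sit near a cell edge, where a small edit could still flip the ID, and geometries that cross the antimeridian.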

This would also allow us to provide a geometric-based join to other cases (e.g., UN SALB or P-Codes from OCHA) through a similar matching process to their datasets.

Basically: a "coarser resolution" version of what we do now, which would result in more stability at the cost of IDs not changing with every geometric shift.

Edit: Also, keep in mind that for much of the world we do not have place-names (or they are highly uncertain / unstable). So the ID has to be generated without text-based metadata, which is where the challenge comes in.

This sounds like a good dissertation chapter, by the way :)

Author

jacobwhall commented Mar 7, 2024

Thanks for your response @DanRunfola

> we want to ensure that unique geometries have unique codes - i.e., if you have the same geoBoundaries ID, then you should be able to assume that the underlying geometry has not changed

This is an excellent idea, and geoBoundaries should continue to create geometry-specific identifiers. If a shape changes even a little bit, I think it is valuable for data consumers to see this change reflected in that shape's identifier.

> Create a persistent ID based on the combination of (A) the ISO code, (B) the ADM level, and (C) the grid ID. So an identifier would be something like "USA-ADM2-209948". The only case in which that would change is if the geometry changes enough that the grid cell its centroid falls into changes, which would hopefully be a valid reason to change things up.
>
> This would also allow us to provide a geometric-based join to other cases (e.g., UN SALB or P-Codes from OCHA) through a similar matching process to their datasets.
>
> Basically: a "coarser resolution" version of what we do now, which would result in more stability at the cost of IDs not changing with every geometric shift.

I think this approach could work as a supplement to the geometry hashes (or UUIDs) you established a need for above. This would provide sufficient persistence, making it worth the time to join geoBoundaries with a dataset like Wikidata. However, I wonder why a geometry-based approach is the best solution here. Why would geoBoundaries avoid direct relations with well-known administrative entity identifiers such as those I listed above?

> Also, keep in mind that for much of the world we do not have place-names (or they are highly uncertain / unstable). So the ID has to be generated without text-based metadata, which is where the challenge comes in.

I understand that it may not always be possible to provide an administrative entity identifier. However, I suspect a vast majority of geoBoundaries data could be directly linked to existing entries in the databases I listed above. Am I underestimating how many places have uncertain names?

> This sounds like a good dissertation chapter, by the way :)

Haha, I'd be happy to write a paper on this.

2 participants