Quality Assurance #4

Open
nlehuby opened this issue Jan 18, 2018 · 6 comments

Comments

@nlehuby
Contributor

nlehuby commented Jan 18, 2018

Some ideas to test the quality of our dataset:

Non-closed boundaries
We need to log the list of boundaries that could not be imported because they are not valid polygons or multipolygons.

Hierarchy coherence

  • an object should never be bigger than its parent
  • child zones of a region should not overlap
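A minimal sketch of how the two hierarchy rules above could be checked, assuming each zone carries an area and a parent id. The `Zone` record, its fields and both function names are hypothetical and not taken from the cosmogony schema. The second check is only a cheap proxy for overlap: children that sit inside their parent without overlapping cannot sum to more than the parent's area, so a larger sum hints at a problem. A real overlap test would need the geometries.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Zone:
    # Hypothetical, simplified zone record (not the actual cosmogony schema).
    id: str
    parent_id: Optional[str]
    area_km2: float


def zones_bigger_than_parent(zones: List[Zone]) -> List[str]:
    """Return ids of zones whose area exceeds the area of their parent."""
    by_id: Dict[str, Zone] = {z.id: z for z in zones}
    offenders = []
    for zone in zones:
        parent = by_id.get(zone.parent_id) if zone.parent_id else None
        if parent is not None and zone.area_km2 > parent.area_km2:
            offenders.append(zone.id)
    return offenders


def overfilled_parents(zones: List[Zone], tolerance: float = 1e-6) -> List[str]:
    """Return ids of parents whose children's areas sum to more than their own area.

    This is a proxy for "children overlap", not a geometric overlap test.
    """
    by_id: Dict[str, Zone] = {z.id: z for z in zones}
    totals: Dict[str, float] = {}
    for zone in zones:
        if zone.parent_id in by_id:
            totals[zone.parent_id] = totals.get(zone.parent_id, 0.0) + zone.area_km2
    return [
        parent_id
        for parent_id, total in totals.items()
        if total > by_id[parent_id].area_km2 * (1 + tolerance)
    ]


# Made-up example data.
zones = [
    Zone("country", None, 1000.0),
    Zone("state_a", "country", 600.0),
    Zone("city_a", "state_a", 700.0),  # bigger than its parent state
]
print(zones_bigger_than_parent(zones))
print(overfilled_parents(zones))
```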

Coverage stat and tests
Per-country statistics:
Compute the geographical coverage for states, cities, etc. (example: 88% city coverage, which means that 88% of the country's territory is inside a city)

Persist expected values and test them in CI:
for example:

  • Each country must have a 100% state coverage
  • France country must have at least 99% city coverage
  • etc
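As a rough illustration of the coverage tests above, here is a sketch that computes a coverage ratio from areas and compares it to a persisted minimum. It assumes the child areas are already clipped to the country and do not overlap; the `EXPECTED_MIN_COVERAGE` table, its keys and the function names are made up for the example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical expected minimums, keyed by (country, zone_type).
EXPECTED_MIN_COVERAGE: Dict[Tuple[str, str], float] = {
    ("France", "state"): 1.0,
    ("France", "city"): 0.99,
}


def coverage_ratio(country_area_km2: float, child_areas_km2: Iterable[float]) -> float:
    """Share of the country's area covered by the given child zones.

    Assumes the child zones are clipped to the country and do not overlap.
    """
    if country_area_km2 <= 0:
        raise ValueError("country area must be positive")
    return sum(child_areas_km2) / country_area_km2


def coverage_is_sufficient(country: str, zone_type: str, ratio: float,
                           tolerance: float = 1e-9) -> bool:
    """True when the ratio reaches the expected minimum for this country and zone type."""
    return ratio + tolerance >= EXPECTED_MIN_COVERAGE[(country, zone_type)]


# Made-up areas: cities cover 880 km2 of a 1000 km2 country.
ratio = coverage_ratio(1000.0, [500.0, 380.0])
print(round(ratio, 2), coverage_is_sufficient("France", "city", ratio))
```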

Volumetric stat and tests
Stat:
same as above, but only raw counts, without geographical concerns (example: Australia has 17 states)

Test:

  • Australia should have 17 states
  • Australia should have between 1600 and 1700 cities
  • etc
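The raw counts could be computed with something as small as the sketch below. It assumes zones are available as dictionaries with `country` and `zone_type` keys, which is a hypothetical shape chosen for the example and not the actual cosmogony output format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_zones_by_type(zones: Iterable[Dict[str, str]]) -> Counter:
    """Count zones per (country, zone_type) pair."""
    counts: Counter = Counter()
    for zone in zones:
        counts[(zone["country"], zone["zone_type"])] += 1
    return counts


# Made-up sample; "country" and "zone_type" are assumed field names.
sample_zones = [
    {"country": "Q142", "zone_type": "state"},
    {"country": "Q142", "zone_type": "state"},
    {"country": "Q142", "zone_type": "city"},
    {"country": "Q142", "zone_type": "city"},
    {"country": "Q142", "zone_type": "city"},
]
volumetric_stats = count_zones_by_type(sample_zones)
print(volumetric_stats[("Q142", "state")], volumetric_stats[("Q142", "city")])
```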

Expected values for each country must live in a config file (CSV, YAML?) and not in the source code, so that anybody can update them if needed.

@Tristramg
Contributor

Maybe the tests could be split in three:

  • what the tools might break: "I made a typo and confused country and county"
  • what might break in OSM: "whoops, the relation is not closed anymore"
  • what has to be done: "Antarctica still has no matching to an administrative region"

I’m not sure exactly why you want to have it in a separate repository. If someone wants to suggest a fix or have an alternative reality, it would still be simple, no? Or did you just mean that the configuration should be in a .yaml and not in a .rs, but still in the same repository?

@antoine-de
Contributor

Nice categories, they seem fine to me.

I also think it's OK to put the tests in the same repository (and @nlehuby too 😉 ), we just want quality tests that are easy to maintain (so no .rs)

@nlehuby
Contributor Author

nlehuby commented Feb 8, 2018

Here is a proposal for a first step, dealing only with volumetric stats.
We may enrich this in the future to compute other stats and add other tests (such as the geographical ones listed in this issue) or create another dedicated tool.

Todo:

  • compute volumetric stats for each country
  • test the stats against expected values (this may be a py.test module)
  • provide an output format suitable for building a cool web dashboard
  • hosted in a dedicated new repo: cosmogony data dashboard (we can discuss the name ;) )

In:

  • a cosmogony file
  • a file with reference statistic values by country

For instance, a CSV file:

| wikidata_id | zone_type | expected_min | expected_max | is_known_failure |
| --- | --- | --- | --- | --- |
| Q142 | state | 18 | 18 | |
| Q142 | state_district | 96 | 96 | |
| Q142 | city | 35000 | 36000 | |
| Q142 | city_district | 35000 | 36000 | yes |
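A possible way to load such a reference file with the standard `csv` module, as a sketch only. It assumes the file is comma-separated with the header shown above and that an empty `is_known_failure` cell means "no"; `load_references` and the in-memory `REFERENCE_CSV` string are names made up for the example.

```python
import csv
import io
from typing import Dict, Tuple

# Same rows as the example above, written as comma-separated text (assumed layout).
REFERENCE_CSV = """wikidata_id,zone_type,expected_min,expected_max,is_known_failure
Q142,state,18,18,
Q142,state_district,96,96,
Q142,city,35000,36000,
Q142,city_district,35000,36000,yes
"""


def load_references(text: str) -> Dict[Tuple[str, str], dict]:
    """Parse reference values into a dict keyed by (wikidata_id, zone_type)."""
    references = {}
    for row in csv.DictReader(io.StringIO(text)):
        references[(row["wikidata_id"], row["zone_type"])] = {
            "expected_min": int(row["expected_min"]),
            "expected_max": int(row["expected_max"]),
            # An empty cell is treated as "not a known failure" (assumption).
            "is_known_failure": row["is_known_failure"].strip().lower() == "yes",
        }
    return references


references = load_references(REFERENCE_CSV)
print(references[("Q142", "city")])
```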

Out:

  • a stat file with statistic values by country
  • the results of the tests

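This could be a single CSV file:

| wikidata_id | zone_type | expected_min | expected_max | is_known_failure | obtained | test_status |
| --- | --- | --- | --- | --- | --- | --- |
| Q142 | state | 18 | 18 | | 18 | ok |
| Q142 | state_district | 96 | 96 | | 96 | ok |
| Q142 | city | 35000 | 36000 | | 36678 | ko |
| Q142 | city_district | 35000 | 36000 | yes | 4560 | ok |
| Q142 | suburb | | | | 345 | skip |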
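The `test_status` column could be derived along these lines. This is a sketch inferred from the example rows: no expected values gives `skip`, a value inside the range gives `ok`, and an out-of-range value gives `ko` unless the row is flagged as a known failure. How a known failure that unexpectedly passes should be reported is not specified here, so the sketch simply returns `ok`. The function name `compute_status` is made up.

```python
from typing import Optional


def compute_status(obtained: int,
                   expected_min: Optional[int] = None,
                   expected_max: Optional[int] = None,
                   is_known_failure: bool = False) -> str:
    """Return "ok", "ko" or "skip" for one (country, zone_type) row."""
    if expected_min is None or expected_max is None:
        return "skip"  # nothing to compare against
    if expected_min <= obtained <= expected_max:
        return "ok"
    # Out of range: tolerated only for rows flagged as known failures.
    return "ok" if is_known_failure else "ko"


# The rows from the example above.
print(compute_status(18, 18, 18))
print(compute_status(36678, 35000, 36000))
print(compute_status(4560, 35000, 36000, is_known_failure=True))
print(compute_status(345))
```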

@Tristramg
Contributor

I like the general idea.

Where the data is hosted and what it is tested against doesn’t matter much to me (but I have a slight preference for large mono-repos).

What do you mean by the wikidata_id? The property of that level? That might become a problem as cities can be of different types (think of the German Kreisfreie Stadt).

However, we could maybe add extra tests, like having 4 state_district Q202216 (département d’outre-mer) in France, as those might break easily with bad country shapes.

If we want no specific constraint, we can leave the wikidata_id empty.

Is that clear?

@nlehuby
Contributor Author

nlehuby commented Feb 8, 2018

For now, wikidata_id stands for a country's Wikidata id (Q142 is France). It may be extended to any zone's Wikidata id in the future.

We could definitely use the Wikidata ontology to check the quality of our data. But I think your proposal adds a lot of complexity:
we would need to explore Wikidata to map each of our zones to its Wikidata properties (to know that the Guadeloupe relation from our PBF is actually a Q202216 (overseas department of France))

and we may also need to map the Wikidata ontology to libpostal zone types, country by country, the same way it has been done for OSM ...
For instance, we would need to make explicit that what we call

  • a state in France is actually an instance of Q36784 (regions of France) but not of Q22670030 (former regions of France)
  • a state_district in France, is actually an instance of Q6465 (department in France) or of Q202216 (overseas department of France), but with no P576 or P582 statement, or qualifier.
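To give an idea of what such a mapping might look like, here is a purely illustrative sketch. The rule table layout, the `matches_zone_type` function and the way an entity is represented (a set of "instance of" class ids plus a set of property ids used on it) are all assumptions; nothing here queries Wikidata.

```python
from typing import Dict, Set, Tuple

# Hypothetical rules, keyed by (country wikidata id, zone type).
ZONE_TYPE_RULES: Dict[Tuple[str, str], Dict[str, Set[str]]] = {
    ("Q142", "state"): {
        "instance_of": {"Q36784"},       # regions of France
        "not_instance_of": {"Q22670030"},  # former regions of France
        "forbidden_properties": set(),
    },
    ("Q142", "state_district"): {
        "instance_of": {"Q6465", "Q202216"},  # department / overseas department
        "not_instance_of": set(),
        "forbidden_properties": {"P576", "P582"},
    },
}


def matches_zone_type(country: str, zone_type: str,
                      instance_of: Set[str], properties: Set[str]) -> bool:
    """Check an entity's classes and properties against the rules for a zone type."""
    rule = ZONE_TYPE_RULES[(country, zone_type)]
    return (
        bool(instance_of & rule["instance_of"])
        and not (instance_of & rule["not_instance_of"])
        and not (properties & rule["forbidden_properties"])
    )


# Guadeloupe-like entity: an overseas department with no end-date statements.
print(matches_zone_type("Q142", "state_district", {"Q202216"}, set()))
# A former region must not count as a state.
print(matches_zone_type("Q142", "state", {"Q36784", "Q22670030"}, set()))
```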

This seems possible and would add very valuable quality tests, but I really think we should start with a smaller task with no dependency on a Wikidata dump ;)

@nlehuby
Contributor Author

nlehuby commented Feb 9, 2018

init of reference values for countries stat: osm-without-borders/cosmogony-data-dashboard#1
