This repository is part of the Pelias project. Pelias is an open-source, open-data geocoder originally sponsored by Mapzen. Our official user documentation is here.
pelias-whosonfirst is a tool used for importing data from the Who's On First project from local files into a Pelias ElasticSearch store.
Node.js is required.
See Pelias software requirements for required and recommended versions.
To install the required Node.js module dependencies, download data for the entire planet (20GB+) and execute the importer, run:
npm install
npm run download
npm start
This importer is configured using the pelias-config module.
The following configuration options are supported by this importer.
imports.whosonfirst.datapath
- Required: yes
- Default: ``
Full path to where Who's on First data is located (note: the included downloader script will automatically place the WOF data here, and is the recommended way to obtain WOF data)
imports.whosonfirst.importPlace
- Required: no
- Default: ``
Set to a WOF ID or array of IDs to import data only for descendants of those records, rather than the entire planet.
You can use the Who's on First Spelunker or the source_id field from any WOF result of a Pelias query to determine these values.
Specifying a value for importPlace will download the full planet SQLite database (27GB). Support for individual country downloads may be added in the future.
- Required: no
- Default: false
Set to true to enable importing venue records. There are over 15 million venues so this option will add substantial download and disk usage requirements.
It is currently not recommended to import venues.
- Required: no
- Default: true
Set to true to enable importing postalcode records. There are over 3 million postal code records.
- Required: no
- Default: false
Set to true to stop the import process when files are missing from Who's on First bundles.
This flag is useful if you consider it vital that all Who's on First data is successfully imported, and can be helpful to guard against incomplete downloads or other types of failure.
- Required: no
- Default: 4
The maximum number of files to download simultaneously. Higher values can be faster, but can also cause download errors.
- Required: no
- Default: https://dist.whosonfirst.org/
The location to download Who's on First data from. Changing this can be useful to use custom data, pin data to a specific date, etc.
- Required: no
- Default: false
Set to true to use Who's on First SQLite databases instead of GeoJSON bundles.
SQLite databases take up less space on disk and can be much more efficient to download and extract.
This option may become the default in the near future.
However, both the Who's on First processes that generate these files and the Pelias code that uses them are new and not yet considered production ready.
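As a rough illustration of how these options fit together, the sketch below builds the `imports.whosonfirst` section of a Pelias config and prints it as JSON. Only the `datapath` and `importPlace` keys named in this document are used. The directory path and the WOF ID are placeholder values chosen for the example, and the assumption that the result is pasted into the JSON file read by pelias-config is ours.

```javascript
// Sketch of the `imports.whosonfirst` section of a Pelias config.
// The path and the WOF ID below are placeholders, not recommendations.
const config = {
  imports: {
    whosonfirst: {
      // Required: directory holding the Who's on First data.
      datapath: '/data/whosonfirst',
      // Optional: import only descendants of these WOF records.
      // A single ID also works; 85633793 is used purely as an example.
      importPlace: [85633793]
    }
  }
};

// Serialize to the JSON form that a config file would contain.
const json = JSON.stringify(config, null, 2);
console.log(json);
```

Leaving out `importPlace` corresponds to importing the entire planet, as described above.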
The download script will download the required bundles/sqlite databases into the datapath configured in imports.whosonfirst.datapath.
To install the required node module dependencies and run the download script:
npm install
npm run download
## or
npm run download -- --admin-only # to only download hierarchy data, without venues or postalcodes
Note: The download script will always download data for the entire planet. Support for downloading data for specific countries is a possible future enhancement.
When using imports.whosonfirst.importPlace, a new SQLite database will only be downloaded if new data is available. Otherwise, the existing download will be reused.
Warning: Who's on First data is big. Just the hierarchy data is tens of GB, and the full dataset is over 100GB on disk.
Additionally, Who's on First uses one file per record. In addition to lots of disk space, you need lots of free inodes. On Linux/Mac, df -ih can show you how many free inodes you have.
Expect to use a few million inodes for Who's on First. You probably don't want to store multiple copies of the Who's on First data due to its disk requirements.
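If you would rather check free inodes from Node.js before downloading, something like the following sketch may help. It is not part of this importer. The `REQUIRED_INODES` threshold is a hypothetical number picked for the example, and `fs.statfsSync` only exists in newer Node.js releases, so the call is guarded.

```javascript
const fs = require('fs');

// Hypothetical threshold; the text above only says "a few million".
const REQUIRED_INODES = 5000000;

// Pure check: `stats.ffree` is the number of free inodes reported by statfs.
function hasEnoughInodes(stats, required) {
  return Number(stats.ffree) >= required;
}

// Returns true/false, or null when this Node.js version lacks fs.statfsSync.
function checkPath(dir) {
  if (typeof fs.statfsSync !== 'function') {
    return null;
  }
  return hasEnoughInodes(fs.statfsSync(dir), REQUIRED_INODES);
}

console.log('enough free inodes:', checkPath('.'));
```

Pass the directory you plan to use as the datapath instead of `'.'` to check the right filesystem.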
There are two major categories of Who's on First data supported: hierarchy (or admin) data, and venues.
Hierarchy data represents things like cities, countries, counties, boroughs, etc.
Venues represent individual places like the Statue of Liberty, a gas station, etc. Venues are subdivided by country, and sometimes regions within a country.
Currently, the supported hierarchy types are:
- borough
- continent
- country
- county
- dependency
- disputed
- empire
- localadmin
- locality
- macrocounty
- macrohood
- macroregion
- marinearea
- neighbourhood
- ocean
- region
- postalcodes (optional, see configuration)
Other types may be included in the future.
The Who's on First documentation has a description of all the types supported by Who's on First.
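For readers who want to pre-filter records themselves, the list above can be expressed as a simple lookup. This is a sketch rather than the importer's own code. It assumes each record stores its type under the `wof:placetype` property, and `isSupportedHierarchyRecord` is a hypothetical helper name. Postal codes are left out because they are optional.

```javascript
// Hierarchy placetypes copied from the list above (optional postal codes omitted).
const HIERARCHY_PLACETYPES = new Set([
  'borough', 'continent', 'country', 'county', 'dependency', 'disputed',
  'empire', 'localadmin', 'locality', 'macrocounty', 'macrohood',
  'macroregion', 'marinearea', 'neighbourhood', 'ocean', 'region'
]);

// Assumption: the placetype lives in `properties['wof:placetype']`.
function isSupportedHierarchyRecord(record) {
  const props = (record && record.properties) || {};
  return HIERARCHY_PLACETYPES.has(props['wof:placetype']);
}

console.log(isSupportedHierarchyRecord({ properties: { 'wof:placetype': 'locality' } }));
console.log(isSupportedHierarchyRecord({ properties: { 'wof:placetype': 'venue' } }));
```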
This project exposes a number of node streams for dealing with Who's on First data and metadata files:
- metadataStream: streams rows from a Who's on First metadata file
- parseMetaFiles: CSV parse stream configured for metadata file contents
- loadJSON: parallel stream that asynchronously loads GeoJSON files
- recordHasIdAndProperties: rejects Who's on First records missing id or properties
- isActiveRecord: rejects records that are superseded, deprecated, or otherwise inactive
- isNotNullIslandRelated: rejects Null Island and other records that intersect it (currently just postal codes at 0/0)
- recordHasName: rejects records without names
- conformsTo: filter Who's on First records on a predicate (see lodash's conformsTo for more information)
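To give a feel for how a filtering stream of this kind can work, here is a minimal sketch built only on Node's built-in `stream` module. It does not use or reproduce this project's code. `filterStream` and `hasIdAndProperties` are hypothetical names, and the predicate is only a guess at what "missing id or properties" means.

```javascript
const { Transform, Readable } = require('stream');

// Hypothetical predicate: keep records that have both an id and a properties object.
function hasIdAndProperties(record) {
  return record != null &&
    record.id !== undefined &&
    record.properties != null &&
    typeof record.properties === 'object';
}

// Build an object-mode Transform that forwards only records matching `predicate`.
function filterStream(predicate) {
  return new Transform({
    objectMode: true,
    transform(record, _encoding, callback) {
      if (predicate(record)) {
        callback(null, record); // pass the record downstream
      } else {
        callback(); // drop the record
      }
    }
  });
}

// Usage: only the first record has both an id and properties.
const records = [
  { id: 1, properties: { 'wof:name': 'Example' } },
  { id: 2 },
  { properties: {} }
];

Readable.from(records)
  .pipe(filterStream(hasIdAndProperties))
  .on('data', (record) => console.log('kept record', record.id));
```

Keeping the predicate separate from the stream plumbing makes each filter easy to test on its own and to combine with other filters by chaining `.pipe()` calls.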