
Merge pull request #115 from pelias/installation-tweaks

Installation tweaks
pull/121/head
Julian Simioni 9 years ago
commit d762bf4a91
  1. 124
      installing.md


@ -1,13 +1,27 @@
# Installing Pelias # Installing Pelias
Mapzen offers the Mapzen Search service in hopes that as many people as possible will use it, Mapzen offers the Mapzen Search service in hopes that as many people as possible will use it,
but we also encourage people to set up their own Pelias instance. Whether it's to import their own data, but we also encourage people to set up their own Pelias instance.
make their own tweaks to Pelias code, or to help with Pelias development, its important that we
document how this can be done. Similarly, while there are ways this process can be
[automated](https://github.com/pelias/vagrant), these instructions are written as if the setup is
manual, to illustrate all the moving pieces of Pelias.
## Gather the Ingredients For most cases, it's useful to have much of the installation process automated, so we suggest
looking at the [Pelias Vagrant image](https://github.com/pelias/vagrant).
However, for more in-depth usage, to learn more about the working of Pelias, or to contribute back,
manual setup is useful. These instructions will help you install Pelias from scratch manually.
## Installation Overview
The steps for fully installing Pelias look like this:
1. Decide which datasets and settings will be used
2. Download appropriate data
3. Download Pelias code, using the appropriate branches
4. Set up Elasticsearch
5. Install the Elasticsearch schema using pelias-schema
6. Use one or more importers to load data into Elasticsearch
7. Start the API server to begin handling queries
## System Requirements
In general, Pelias will require: In general, Pelias will require:
@ -15,26 +29,8 @@ In general, Pelias will require:
a single machine or across several a single machine or across several
* [Node.js](https://nodejs.org/) 0.12 or newer (Node 4 or 5 is recommended) * [Node.js](https://nodejs.org/) 0.12 or newer (Node 4 or 5 is recommended)
* Up to 100GB disk space to download and extract data * Up to 100GB disk space to download and extract data
* Lots of RAM. At least 2-4GB. A full North America OSM import just barely fits on a machine with 16GB RAM * Lots of RAM, 8GB is a good minimum. A full North America OSM import just fits in 16GB RAM
## Choose your branch
As part of the setup instructions below, you'll be downloading several Pelias packages from source
on Github. All of these packages offer 3 branches for various use cases. Based on your needs, you
should pick one of these branches and use the same one across all of the Pelias packages.
`production`: contains only code that has been tested against a full-planet build and is live on
Mapzen Search. This is the "safest" branch and it will change the least frequently, although we
generally release new code at least once a week.
`staging`: these branches contain the code that is currently being tested against a full planet
build for imminent release to Mapzen Search. It's useful to track what code will be going out in the
next release, but not much else.
`master`: master branches contain the latest code that has passed code review, unit/integration
tests, and is ready to be included in the next release. While we try to avoid it, the nature of the
master branch is that it will sometimes be broken. That said, these are the branches to use for
development of new features.
## Choose your datasets ## Choose your datasets
@ -42,13 +38,13 @@ Pelias can currently import data from four different sources. The contents and d
sources are available on our [data sources page](./data_sources). Here we'll just focus on what to sources are available on our [data sources page](./data_sources). Here we'll just focus on what to
download for each one. download for each one.
### Whosonfirst ### Who's on First
There are two ways to download Whosonfirst data. The first is to use the pre-created There are two ways to download Who's on First data. The first is to use the pre-created
[bundles](https://whosonfirst.mapzen.com/bundles/). These consist of a series of archives that can [bundles](https://whosonfirst.mapzen.com/bundles/). These consist of a series of archives that can
be easily extracted (instructions are on the page). be easily extracted (instructions are on the page).
For more advanced uses, or to contribute back to Whosonfirst, use the For more advanced uses, or to contribute back to Who's on First, use the
[whosonfirst-data](https://github.com/whosonfirst/whosonfirst-data) Github repository. Again, there [whosonfirst-data](https://github.com/whosonfirst/whosonfirst-data) Github repository. Again, there
are [instructions](https://github.com/whosonfirst/whosonfirst-data#git-and-github). Note that this are [instructions](https://github.com/whosonfirst/whosonfirst-data#git-and-github). Note that this
repo requires [git-lfs](https://git-lfs.github.com/), a lot of bandwidth, and 27GB (currently) of repo requires [git-lfs](https://git-lfs.github.com/), a lot of bandwidth, and 27GB (currently) of
@ -58,25 +54,25 @@ disk space.
The [pelias/geonames](https://github.com/pelias/geonames/#importing-data) importer contains code and The [pelias/geonames](https://github.com/pelias/geonames/#importing-data) importer contains code and
instructions for downloading Geonames data automatically. Individual countries, or the entire planet instructions for downloading Geonames data automatically. Individual countries, or the entire planet
(1.3GB) can be specified. (1.3GB compressed) can be specified.
### Openaddresses ### OpenAddresses
The Openaddresses project includes [numerous download options](https://results.openaddresses.io/), The OpenAddresses project includes [numerous download options](https://results.openaddresses.io/),
all of which are `.zip` downloads. The full dataset is several gigabytes, but there are numerous all of which are `.zip` downloads. The full dataset is just over 3 gigabytes compressed, but there
subdivision options. In any case, the `.zip` files simply need to be extracted to a directory of are numerous subdivision options. In any case, the `.zip` files simply need to be extracted to a
your choice, and Pelias can be configured to either import every `.csv` in that directory, or only directory of your choice, and Pelias can be configured to either import every `.csv` in that
selected files. directory, or only selected files.
### Openstreetmap ### OpenStreetMap
Openstreetmap has a nearly limitless array of download options, and any of them should work as long as OpenStreetMap has a nearly limitless array of download options, and any of them should work as long as
they're in [PBF](http://wiki.openstreetmap.org/wiki/PBF_Format) format. Generally the files will they're in [PBF](http://wiki.openstreetmap.org/wiki/PBF_Format) format. Generally the files will
have the extension `.osm.pbf`. Good sources include the [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/) have the extension `.osm.pbf`. Good sources include the [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/)
(feel free to submit pull requests for additional cities or regions if needed), and planet files (feel free to submit pull requests for additional cities or regions if needed), and planet files
listed on the [OSM wiki](http://wiki.openstreetmap.org/wiki/Planet.osm). listed on the [OSM wiki](http://wiki.openstreetmap.org/wiki/Planet.osm). A full planet PBF is about
36GB.
## Choose your import settings
## Choose your import options
There are several options that should be discussed before starting any data imports, as they require There are several options that should be discussed before starting any data imports, as they require
a compromise between import speed and resulting data quality and richness. a compromise between import speed and resulting data quality and richness.
@ -85,10 +81,10 @@ a compromise between import speed and resulting data quality and richness.
Most data that is imported by Pelias comes to us incomplete: many data sources don't supply what we Most data that is imported by Pelias comes to us incomplete: many data sources don't supply what we
call admin hierarchy information: the neighbourhood, city, country, or other region that contains call admin hierarchy information: the neighbourhood, city, country, or other region that contains
the record. In Openaddresses, for example, many records contain only a housenumber, street name, and the record. In OpenAddresses, for example, many records contain only a housenumber, street name, and
coordinates. coordinates.
Fortunately, Whosonfirst contains a well-developed set of geometries for all admin regions from the Fortunately, Who's on First contains a well-developed set of geometries for all admin regions from the
neighbourhood to continent level. Through neighbourhood to continent level. Through
[point-in-polygon](https://en.wikipedia.org/wiki/Point_in_polygon) lookup, our importers can [point-in-polygon](https://en.wikipedia.org/wiki/Point_in_polygon) lookup, our importers can
[derive](https://github.com/pelias/wof-admin-lookup) this information! [derive](https://github.com/pelias/wof-admin-lookup) this information!
@ -98,13 +94,13 @@ Because geometry data is quite large, expect to use about 6GB of RAM (not disk)
for this geometry data. And because of the complexity of the required calculations, imports with for this geometry data. And because of the complexity of the required calculations, imports with
admin lookup are up to 10 times slower than without. admin lookup are up to 10 times slower than without.
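
To make the point-in-polygon idea concrete, here is a minimal sketch of the classic ray-casting test in Python. It is only an illustration of the general technique, not the code used by wof-admin-lookup; the function names, the simple rectangular "regions", and their labels are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


def lookup_admin(point: Point, regions: Dict[str, List[Point]]) -> List[str]:
    """Return the names of all (hypothetical) regions that contain the point."""
    return [name for name, ring in regions.items() if point_in_polygon(point, ring)]


# Hypothetical regions: a large "city" square containing a smaller "neighbourhood".
REGIONS = {
    "city": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "neighbourhood": [(1, 1), (3, 1), (3, 3), (1, 3)],
}

print(lookup_admin((2, 2), REGIONS))
print(lookup_admin((3.5, 3.5), REGIONS))
```

Every imported record needs a test like this against many large, detailed geometries, which gives some intuition for why admin lookup costs both memory and import time.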
Whosonfirst, of course, always includes full hierarchy information because it's built into the Who's on First, of course, always includes full hierarchy information because it's built into the
dataset itself, so there's no tradeoff to be made. Whosonfirst data will always import quite fast dataset itself, so there's no tradeoff to be made. Who's on First data will always import quite fast
and with full hierarchy information. and with full hierarchy information.
### Address Deduplication ### Address Deduplication
Openaddresses data contains lots of addresses, but it also contains lots of duplicate data. To help OpenAddresses data contains lots of addresses, but it also contains lots of duplicate data. To help
reduce this problem we've built an [address-deduplicator](https://github.com/pelias/address-deduplicator) reduce this problem we've built an [address-deduplicator](https://github.com/pelias/address-deduplicator)
that can be run at import. It uses the [OpenVenues deduplicator](https://github.com/openvenues/address_deduper) that can be run at import. It uses the [OpenVenues deduplicator](https://github.com/openvenues/address_deduper)
to remove records that are near each other and have names that are likely to be duplicates. Note to remove records that are near each other and have names that are likely to be duplicates. Note
@ -131,6 +127,25 @@ are possible without too much extra effort. To set expectations, a cluster of 4
[r3.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS instances running Elasticsearch, and one [r3.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS instances running Elasticsearch, and one
c4.8xlarge instance running the importers can complete a full planet build in about two days. c4.8xlarge instance running the importers can complete a full planet build in about two days.
## Choose your Pelias code branch
As part of the setup instructions below, you'll be downloading several Pelias packages from source
on Github. All of these packages offer 3 branches for various use cases. Based on your needs, you
should pick one of these branches and use the same one across all of the Pelias packages.
`production`: contains only code that has been tested against a full-planet build and is live on
Mapzen Search. This is the "safest" branch and it will change the least frequently, although we
generally release new code at least once a week.
`staging`: these branches contain the code that is currently being tested against a full planet
build for imminent release to Mapzen Search. It's useful to track what code will be going out in the
next release, but not much else.
`master`: master branches contain the latest code that has passed code review, unit/integration
tests, and is ready to be included in the next release. While we try to avoid it, the nature of the
master branch is that it will sometimes be broken. That said, these are the branches to use for
development of new features.
## Installation ## Installation
### Download the Pelias repositories ### Download the Pelias repositories
@ -166,7 +181,7 @@ Elasticsearch port).
By taking a look at the [default config](https://github.com/pelias/config/blob/master/config/defaults.json#L2), By taking a look at the [default config](https://github.com/pelias/config/blob/master/config/defaults.json#L2),
you can see the Elasticsearch configuration looks something like this: you can see the Elasticsearch configuration looks something like this:
```json ```js
{ {
"esclient": { "esclient": {
"hosts": [{ "hosts": [{
@ -184,7 +199,7 @@ config is sent along to the [elasticsearch-js](https://github.com/elastic/elasti
any of its [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/configuration.html) any of its [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/configuration.html)
are valid. are valid.
The other major section, `imports`, defiens settings for each importer. The defaults look like this: The other major section, `imports`, defines settings for each importer. The defaults look like this:
```json ```json
{ {
@ -208,21 +223,22 @@ The other major section, `imports`, defiens settings for each importer. The defa
"whosonfirst": { "whosonfirst": {
"datapath": "/mnt/pelias/whosonfirst" "datapath": "/mnt/pelias/whosonfirst"
} }
}
} }
``` ```
As you can see, the default datapaths are meant to be changed. This is also where you can enable As you can see, the default datapaths are meant to be changed. This is also where you can enable
admin lookup by overriding the default value. admin lookup by overriding the default value.
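
As a rough illustration of what "overriding the default value" means, the Python sketch below deep-merges a user config over a set of defaults. This is an assumption about the general behaviour, not the actual pelias-config implementation. Only the `whosonfirst` datapath default is taken from the config shown above; the `openstreetmap.adminLookup` default and the `/data/whosonfirst` path are hypothetical values chosen for the example.

```python
import copy
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which values from overrides replace those in defaults, key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge recursively so untouched keys survive.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


DEFAULTS = {
    "imports": {
        # Hypothetical default, assumed for illustration only.
        "openstreetmap": {"adminLookup": False},
        # Default datapath as shown in the config above.
        "whosonfirst": {"datapath": "/mnt/pelias/whosonfirst"},
    }
}

# Hypothetical user settings: enable admin lookup and point at a local data directory.
USER_CONFIG = {
    "imports": {
        "openstreetmap": {"adminLookup": True},
        "whosonfirst": {"datapath": "/data/whosonfirst"},
    }
}

merged = deep_merge(DEFAULTS, USER_CONFIG)
print(merged["imports"])
```

With a merge like this, a user config only needs to list the keys that differ from the defaults.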
Two caveats to this config section. First, the array structure of the Openstreetmap `import` section Two caveats to this config section. First, the array structure of the OpenStreetMap `import` section
suggests you can specify multiple files to import. Unfortunately, you can't, although we'd like to suggests you can specify multiple files to import. Unfortunately, you can't, although we'd like to
[support that in the future](https://github.com/pelias/openstreetmap/issues/55). [support that in the future](https://github.com/pelias/openstreetmap/issues/55).
Second, note that the Openaddresses section does _not_ have an `adminLookup` flag. The Openaddresses Second, note that the OpenAddresses section does _not_ have an `adminLookup` flag. The OpenAddresses
importer only supports controlling this option by a command line flag currently. Again this is importer only supports controlling this option by a command line flag currently. Again this is
something [we'd like to fix](https://github.com/pelias/openaddresses/issues/51). See the importer something [we'd like to fix](https://github.com/pelias/openaddresses/issues/51). See the importer
[readme](https://github.com/pelias/openaddresses/blob/master/README.md) for details on how to [readme](https://github.com/pelias/openaddresses/blob/master/README.md) for details on how to
configure admin lookup and deduplication for Openaddresses. configure admin lookup and deduplication for OpenAddresses.
### Install Elasticsearch ### Install Elasticsearch
@ -267,11 +283,11 @@ reindex all your data after making schema changes.
Now that the schema is set up, you're ready to begin importing data! Now that the schema is set up, you're ready to begin importing data!
Our [goal](https://github.com/pelias/pelias/issues/255) is that eventually you'll be able to run all Our [goal](https://github.com/pelias/pelias/issues/255) is that eventually you'll be able to run all
the importers with simply `cd $importer_directory; npm start`. Unfortunately only the Whosonfirst the importers with simply `cd $importer_directory; npm start`. Unfortunately only the Who's on First
and Openstreetmap importers work that way right now. and OpenStreetMap importers work that way right now.
For [Geonames](https://github.com/pelias/geonames/) and [Openaddresses](https://github.com/pelias/openaddresses), For [Geonames](https://github.com/pelias/geonames/) and [OpenAddresses](https://github.com/pelias/openaddresses),
please see their respective READMEs, which detail the process of running them. By the way, ~we'd please see their respective READMEs, which detail the process of running them. By the way, we'd
love to see pull requests that allow them to read configuration from `pelias.json` like the other love to see pull requests that allow them to read configuration from `pelias.json` like the other
importers. importers.
