
Merge pull request #223 from pelias/install-refresh

Refresh installation docs
Julian Simioni, 6 years ago, committed by GitHub
commit 7443c2271f
Files changed:

1. README.md (6)
2. full_planet_considerations.md (122)
3. getting_started_install.md (35)
4. pelias_from_scratch.md (317)
5. requirements.md (41)
6. services.md (62)

README.md (6 changed lines)

@@ -27,7 +27,11 @@ _Not sure which Endpoint to use? We have a [page](search-workflows.md) for that_
- [Pelias data sources](data-sources.md)
### Running your own Pelias
- [Pelias installation guide](https://github.com/pelias/pelias/blob/master/INSTALL.md)
- [Getting started](getting_started_install.md) Start here if you're looking to install Pelias
- [Pelias from scratch](pelias_from_scratch.md) More in-depth instructions for installing Pelias
- [Full planet build considerations](full_planet_considerations.md) Special information on running a full planet Pelias build
- [Service descriptions](services.md) A description of all the Pelias services, and when they are used
- [Software Requirements](requirements.md) A list of all software requirements for Pelias
### Pelias project development
- [Release notes](release-notes.md). See notable changes in Pelias over time

full_planet_considerations.md (122 changed lines)

@@ -0,0 +1,122 @@
# Considerations for full-planet builds
Pelias is designed to work with data ranging from a small city to the entire planet. Small cities do
not require particularly significant resources and should be easy. However, full planet builds
present many of their own challenges.
Current [full planet builds](https://pelias-dashboard.geocode.earth) weigh in at around 550 million
documents, and require about 375GB total storage in Elasticsearch.
Fortunately, because of services like AWS and the scalability of Elasticsearch, full planet builds
are possible without too much extra effort. The process is no different, it just requires more
hardware and takes longer.
To set expectations, a cluster of 4 [r4.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS
instances (30GB RAM each) running Elasticsearch, and one m4.4xlarge instance running the importers
and PIP service can complete a full planet build in about two days.
## Recommended processes
### Use Docker containers and orchestration
We strongly recommend using Docker to run Pelias. All our services include Dockerfiles and the
resulting images are pushed to [Docker Hub](https://hub.docker.com/r/pelias/) by our CI. Using these
images will drastically reduce the amount of work it takes to set up Pelias and will ensure you are
on a known good configuration, minimizing the number of issues you will encounter.
Additionally, there are many great tools for managing container workloads. Simple ones like
[docker-compose](https://github.com/pelias/docker/) can be used for small installations, and more
complex tools like [Kubernetes](https://github.com/pelias/kubernetes) can be great for larger
installations. Pelias is extensively tested on both.
### Use separate Pelias installations for indexing and production traffic
The requirements for performant and reliable Elasticsearch clusters are very different for importing
new data compared to serving queries. It is _highly_ recommended to use one cluster to do imports,
save the resulting Elasticsearch index into a snapshot, and then load that snapshot into the cluster
used to perform actual geocoding.
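One way to implement this split is with Elasticsearch's snapshot API. Here is a minimal sketch, assuming a filesystem snapshot repository reachable by both clusters; the host, repository name, and path are illustrative:

```shell
# Sketch of the import/query split using the Elasticsearch snapshot API.
# BUILD_ES, the repository name, and the filesystem path are illustrative.
BUILD_ES="http://localhost:9200"
REPO="pelias_snapshots"

# 1. on the build cluster, register a filesystem snapshot repository
curl -s -XPUT "${BUILD_ES}/_snapshot/${REPO}" \
  -d '{"type": "fs", "settings": {"location": "/mnt/es-snapshots"}}'

# 2. snapshot the pelias index once the importers have finished
curl -s -XPUT "${BUILD_ES}/_snapshot/${REPO}/build-1?wait_for_completion=true" \
  -d '{"indices": "pelias"}'

# 3. on the query cluster, register the same repository, then restore
curl -s -XPOST "${BUILD_ES}/_snapshot/${REPO}/build-1/_restore"
```

The query cluster never has to absorb indexing load this way; it only ever loads finished, read-only data.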
### Shard count
Historically, Mapzen Search has used 24 Elasticsearch shards for its builds. However, our latest
guidance from the Elasticsearch team is that shards should be no larger than 50GB, but otherwise
having as few shards as possible is best. At [geocode.earth](https://geocode.earth) we are
experimenting with 12 shard builds, and may eventually move to 6. We would appreciate performance
feedback from anyone doing large builds.
The `elasticsearch` section of `pelias.json` can be used to configure the shard count.
```js
{
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_shards": "5"
      }
    }
  }
}
```
### Force merge your Elasticsearch indices
Pelias Elasticsearch indices are generally static, as we do not recommend querying from and
importing to an Elasticsearch cluster simultaneously. In such cases, the highest levels of
performance can be achieved by [force-merging](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html) the Elasticsearch index.
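A force-merge is a single API call. This sketch assumes Elasticsearch on `localhost:9200` and the default index name; `max_num_segments=1` merges each shard down to one segment for maximum read performance on a static index:

```shell
# Force-merge the (static) pelias index down to one segment per shard.
# Assumes Elasticsearch on localhost:9200 and the default index name.
ES="http://localhost:9200"
INDEX="pelias"
curl -s -XPOST "${ES}/${INDEX}/_forcemerge?max_num_segments=1"
```

Only do this after all imports have finished; force-merging an index that is still being written to wastes I/O.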
## Recommended hardware
For a production ready instance of Pelias, capable of supporting a few hundred queries per second
across a full planet build, a setup like the following should be sufficient.
### Elasticsearch cluster for importing
The main requirement of Elasticsearch is that it has lots of disk. 400GB across the
cluster is a good minimum. Increased CPU power is useful to achieve a higher throughput for queries,
but not as important as RAM.
### Elasticsearch cluster for querying
For queries, essentially the only bottleneck is CPU, although more RAM is helpful so Elasticsearch
data can be cached. On AWS, `c5` instances are significantly more performant than even the `c4`
instances, and should be used if high performance is needed.
_Example configuration:_ 4 `c5.4xlarge` (16 CPU, 32GB RAM) to serve 250 RPS
### Importer machine
The importers are each single-threaded Node.js processes, which require around 8GB of RAM
each with admin lookup enabled. Faster CPUs will help increase the import speed. Running multiple
importers in parallel is recommended if the importer machine has enough RAM and CPU to support them.
_Example configuration:_ 1 `c4.4xlarge` (16 CPU, 30GB RAM), running two parallel importers
### Pelias services
Each Pelias service has different memory and CPU requirements. Here are some rough guidelines:
#### API
* RAM: 200MB per instance
* CPU: single threaded; one instance can serve around 500 RPS
* Disk: none
#### Placeholder
* RAM: 200MB per instance
* CPU: single threaded; supports [clustering](https://nodejs.org/api/cluster.html)
* Disk: requires about 2GB for a full planet index
#### Libpostal
* RAM: 3GB per instance
* CPU: multi-threaded, but extremely fast. A single core can serve 8000+ RPS
* Disk: about 2-3GB of data storage required
#### PIP
* RAM: ~6GB
* CPU: 2 cores per instance recommended, which is enough to serve 5000-7000 RPS
#### Interpolation
* RAM: 3GB per instance currently (please follow our efforts to [un-bundle libpostal](https://github.com/pelias/interpolation/issues/106) from the interpolation service)
* CPU: single core. One instance can serve around 200 RPS
* Disk: 40GB needed for a full planet interpolation dataset

getting_started_install.md (35 changed lines)

@@ -0,0 +1,35 @@
## Getting started with Pelias
Looking to install and set up Pelias? You've come to the right place. We have several different
tools and pieces of documentation to help you.
### Installing for the first time?
We _strongly_ recommend using our [Docker](http://github.com/pelias/docker/) based installation for
your first install. It removes the need to deal with most of the complexity and dependencies of
Pelias. On a fast internet connection you should be able to get a small city like Portland, Oregon
installed in under 30 minutes.
### Want to go more in depth?
The Pelias docker installation should work great for any small area, and is great for managing the
different Pelias services during development. However, we understand not everyone can or wants to
use Docker, and that people want more details on how things work.
For this, we have our [from scratch installation guide](pelias_from_scratch.md)
### Installing in production?
By far the most well tested way to install Pelias is to use [Kubernetes](https://github.com/pelias/kubernetes).
Kubernetes is perfect for managing systems that have many different components, like Pelias.
We would love to add additional, well tested ways to install Pelias in production. Reach out to us
if you have something to share or want to get started.
### Doing a full planet build?
Running Pelias for a city or small country is pretty easy. However, due to the amount of data
involved, a full planet build is harder to pull off.
See our [full planet build guide](full_planet_considerations.md) for some recommendations on how to
make it easier and more performant.

pelias_from_scratch.md (317 changed lines)

@@ -0,0 +1,317 @@
# Installing Pelias from Scratch
These instructions will help you set up the Pelias geocoder from scratch. We strongly recommend
using our [Docker](http://github.com/pelias/docker/) tools for your first Pelias installation.
However, for more in-depth usage, or to learn more about the internals of Pelias, use this guide.
It assumes some knowledge of the command line and Node.js, but we'd like as many people as possible
to be able to install Pelias, so if anything is confusing, please don't hesitate to reach out. We'll
do what we can to help and also improve the documentation.
## Installation Overview
These are the steps for fully installing Pelias:
1. [Check that the hardware and software requirements are met](#system-requirements)
1. [Decide which datasets to use and download them](#choose-your-datasets)
1. [Download the Pelias code](#download-the-pelias-repositories)
1. [Customize Pelias Configuration file `~/pelias.json`](#customize-pelias-config)
1. [Install the Elasticsearch schema using pelias-schema](#set-up-the-elasticsearch-schema)
1. [Use one or more importers to load data into Elasticsearch](#run-the-importers)
1. [Install and start the Pelias services](#install-and-start-the-pelias-services)
1. [Start the API server to begin handling queries](#start-the-api)
## System Requirements
See our [software requirements](requirements.md) and ensure all of them are installed before moving forward.
### Hardware recommendations
* At a minimum 50GB disk space to download, extract, and process data
* Lots of RAM, 8GB is a good minimum for a small import like a single city or small country. A full North America OSM import just fits in 16GB RAM
## Choose your datasets
Pelias can currently import data from [four different sources](data-sources.md), using five different importers.
Only one dataset is _required_: [Who's on First](https://whosonfirst.org/). This dataset is used to enrich all data imported into Pelias with [administrative information](glossary.md). For more on this process, see the [wof-admin-lookup](https://github.com/pelias/wof-admin-lookup) documentation.
**Note:** You don't have to run the `whosonfirst` importer, but you do have to have Who's on First
data available on disk for use by the other importers.
Here's an overview of how to download each dataset.
### Who's on First
The [Who's on First](https://github.com/pelias/whosonfirst#downloading-the-data) importer can download all the Who's
on First data quickly and easily.
### Geonames
The [pelias/geonames](https://github.com/pelias/geonames/#installation) importer contains code and
instructions for downloading Geonames data automatically. Individual countries, or the entire planet
(1.3GB compressed) can be specified.
### OpenAddresses
The Pelias Openaddresses importer can [download specific files from
OpenAddresses](https://github.com/pelias/openaddresses/#data-download).
Additionally, the [OpenAddresses](https://results.openaddresses.io/) project includes numerous download options,
all of which are `.zip` downloads. The full dataset is just over 6 gigabytes compressed (the
extracted files are around 30GB), but there are numerous subdivision options.
### OpenStreetMap
OpenStreetMap (OSM) has a nearly limitless array of download options, and any of them should work as long as
they're in [PBF](http://wiki.openstreetmap.org/wiki/PBF_Format) format. Generally the files will
have the extension `.osm.pbf`. Good sources include [download.geofabrik.de](http://download.geofabrik.de/), [Nextzen Metro Extracts](https://metro-extracts.nextzen.org/), [Interline OSM Extracts](https://www.interline.io/osm/extracts/), and planet files listed on the [OSM wiki](http://wiki.openstreetmap.org/wiki/Planet.osm).
A full planet PBF file is about 41GB.
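For example, downloading a single Geofabrik extract might look like the following. The region, URL, and target directory are illustrative; browse download.geofabrik.de for the area you actually need:

```shell
# Fetch a small Geofabrik extract (Oregon) into the directory the
# openstreetmap importer will read from. Adjust both to your setup.
DATA_DIR="/mnt/pelias/openstreetmap"
EXTRACT_URL="http://download.geofabrik.de/north-america/us/oregon-latest.osm.pbf"
mkdir -p "$DATA_DIR" 2>/dev/null
wget -nc -q -P "$DATA_DIR" "$EXTRACT_URL"
```

The `-nc` flag skips the download if the file already exists, which is handy when re-running setup scripts.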
#### Street Data (Polylines)
To download and import [street data](https://github.com/pelias/polylines#download-data) from OSM, a separate importer is used that operates on a preprocessed dataset
derived from the OSM planet file.
## Installation
### Download the Pelias repositories
At a minimum, you'll need
1. [Pelias schema](https://github.com/pelias/schema/)
2. The Pelias [API](https://github.com/pelias/api/) and other Pelias services
3. Importer(s)
Here's a bash snippet that will download all the repositories (they are all small enough that you don't
have to worry about the space of the code itself), check out the production branch (which is
probably the one you want), and install all the node module dependencies.
```bash
for repository in schema whosonfirst geonames openaddresses openstreetmap \
    polylines api placeholder interpolation pip-service; do
  git clone https://github.com/pelias/${repository}.git # clone from GitHub
pushd $repository > /dev/null # switch into importer directory
git checkout production # or remove this line to stay with master
npm install # install npm dependencies
popd > /dev/null # return to code directory
done
```
<details>
<summary>Not sure which branch to use?</summary>
Pelias uses three different branches as part of our release process.
`production` **(recommended)**: contains only code that has been well tested, generally against a
full-planet build. This is the "safest" branch and it will change the least frequently, although we
generally release new code at least once a week.
`staging`: these branches contain the code that is currently being tested against a full planet
build for imminent release. It's useful to track what code will be going out in the next release,
but not much else.
`master`: master branches contain the latest code that has passed code review, unit/integration
tests, and is reasonably functional. While we try to avoid it, the nature of the master branch is
that it will sometimes be broken. That said, these are the branches to use for development of new
features.
</details>
### Customize Pelias Config
Nearly all configuration for Pelias is driven through a single config file: `pelias.json`. By
default, Pelias will look for this file in your home directory, but you can configure where it
looks. For more details, see the [pelias-config](https://github.com/pelias/config) repository.
#### Where on the network to find Elasticsearch
Pelias will by default look for Elasticsearch on `localhost` at port 9200 (the standard
Elasticsearch port).
Take a look at the [default config](https://github.com/pelias/config/blob/master/config/defaults.json#L2). You can see the Elasticsearch configuration looks something like this:
```js
{
  "esclient": {
    "hosts": [{
      "host": "localhost",
      "port": 9200
    }]
    // ... rest of esclient config
  }
}
```
If you want to connect to Elasticsearch somewhere else, change `localhost` as needed. You can
specify multiple hosts if you have a large cluster. In fact, the entire `esclient` section of the
config is sent along to the [elasticsearch-js](https://github.com/elastic/elasticsearch-js) module, so
any of its [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/configuration.html)
are valid.
#### Where to find the downloaded data files
The other major section, `imports`, defines settings for each importer. `adminLookup` has its own section and its value applies to all importers. The defaults look like this:
```json
{
  "imports": {
    "adminLookup": {
      "enabled": true
    },
    "geonames": {
      "datapath": "/mnt/pelias/geonames"
    },
    "openstreetmap": {
      "datapath": "/mnt/pelias/openstreetmap",
      "leveldbpath": "/tmp",
      "import": [{
        "filename": "planet.osm.pbf"
      }]
    },
    "openaddresses": {
      "datapath": "/mnt/pelias/openaddresses",
      "files": []
    },
    "whosonfirst": {
      "datapath": "/mnt/pelias/whosonfirst"
    },
    "polyline": {
      "datapath": "/mnt/pelias/polyline",
      "files": []
    }
  }
}
```
Note: The datapath must be an _absolute path._
As you can see, the default datapaths are meant to be changed.
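Since `pelias.json` must be strict JSON (no comments, no trailing commas), a quick syntax check can save debugging time later. Here is a sketch using Node.js that validates an inline sample; swap the here-doc for a redirect from your real file (`< ~/pelias.json`) once you've edited it:

```shell
# Validate JSON syntax with Node.js (already a Pelias requirement).
# The here-doc is a stand-in for your real pelias.json.
RESULT=$(node -e 'JSON.parse(require("fs").readFileSync(0, "utf8")); console.log("valid JSON")' <<'EOF'
{ "imports": { "whosonfirst": { "datapath": "/mnt/pelias/whosonfirst" } } }
EOF
)
echo "$RESULT"
```

If the file is malformed, `JSON.parse` throws and nothing is printed, so you find out before an importer fails halfway through.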
### Install Elasticsearch
Please refer to the [official 2.4 install docs](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup.html) for how to install Elasticsearch.
Be sure to modify the Elasticsearch [heap size](https://www.elastic.co/guide/en/elasticsearch/guide/2.x/heap-sizing.html) as appropriate to your machine.
Make sure Elasticsearch is running and connectable, and then you can continue with the Pelias
specific setup and importing. Using a plugin like [Sense](https://github.com/bleskes/sense) [(Chrome extension)](https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig?hl=en), [head](https://mobz.github.io/elasticsearch-head/)
or [Marvel](https://www.elastic.co/products/marvel) can help monitor Elasticsearch as you import
data.
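Before continuing, a quick connectivity check from the command line confirms Elasticsearch is reachable. This assumes the default `localhost:9200`; on a single-node setup a "yellow" cluster status is normal and fine:

```shell
# Basic liveness checks against a local Elasticsearch.
ES="http://localhost:9200"
curl -s "${ES}/"                        # version banner (should show 2.3.x or 2.4.x)
curl -s "${ES}/_cluster/health?pretty"  # "green" or "yellow" status is fine
```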
### Set up the Elasticsearch Schema
Pelias requires specific configuration settings for both performance and accuracy reasons. Fortunately, now that your `pelias.json` file is configured with how to connect to Elasticsearch,
the schema repository can automatically create the Pelias index and configure it exactly as needed.
```bash
cd schema # assuming you have just run the bash snippet to download the repos from earlier
node scripts/create_index.js
```
The Elasticsearch Schema is analogous to the layout of a table in a traditional relational database,
like MySQL or PostgreSQL. While Elasticsearch attempts to auto-detect a schema that works when
inserting new data, this generally leads to non-optimal results. In the case of Pelias, inserting
data without first applying the Pelias schema will cause all queries to fail completely.
### Run the importers
Now that the schema is set up, you're ready to begin importing data.
For each importer, you can start the import process with the `npm start` command:
```bash
cd importer_directory; npm start
```
Depending on how much data you've imported, now may be a good time to grab a coffee.
You can expect around 800-2000 inserts per second.
The order of imports does not matter. Multiple importers can be run in parallel to speed up the setup process.
Each of our importers operates independent of the data that is already in Elasticsearch.
For example, you can import OSM data without importing WOF data first.
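A simple way to run several importers in parallel is with shell background jobs. This sketch assumes the repositories were cloned side by side as in the earlier bash snippet, and that your machine has the RAM and CPU for three concurrent imports:

```shell
# Run three importers in parallel, logging each to its own file.
IMPORTERS="whosonfirst openaddresses openstreetmap"
for importer in $IMPORTERS; do
  (cd "$importer" && npm start) > "${importer}.log" 2>&1 &
done
wait   # block until every import has finished
```

Tailing the individual log files (`tail -f whosonfirst.log`) is an easy way to watch progress.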
#### Aside: When to delete the data already in Elasticsearch
If you have previously run a build and are looking to start another one, it is generally a good idea
to delete the existing Pelias index and re-create it. Here's how:
```bash
# !! WARNING: this will remove all your data from pelias!!
node scripts/drop_index.js # it will ask for confirmation first
node scripts/create_index.js
```
When is this necessary? Here's a guideline: when in doubt, delete the index, re-create it, and start
fresh.
This is because Elasticsearch has no analog to a schema migration like a relational database, and
all the importers start over when re-run.
The only time when this isn't necessary is if the following conditions are true:
1. You are trying to re-import the exact same data again (for example, because the build failed, or
you are testing changes to an importer)
2. The Pelias schema has not changed
## Install and start the Pelias Services
Pelias is made up of several different services, each providing a specific aspect of Pelias's
functionality.
The [list of Pelias services](services.md) describes the functionality of each service, and can be
used to determine if you need to install that service. It also includes links to setup instructions
for each service.
When in doubt, install everything except the interpolation engine (it requires a long download or
build process).
### Configure `pelias.json` for services
The Pelias API needs to know about each of the other services available to it. Once again, this is
configured in `pelias.json`. The following section will tell the API to use all services running
locally and on their default ports.
```js
{
  "api": {
    "services": {
      "placeholder": {
        "url": "http://localhost:3000"
      },
      "libpostal": {
        "url": "http://localhost:8080"
      },
      "pip": {
        "url": "http://localhost:3102"
      },
      "interpolation": {
        "url": "http://localhost:3000"
      }
    }
  }
}
```
### Start the API
Now that the API knows how to connect to Elasticsearch and all other Pelias services, all that is
required to start the API is:
```bash
npm start
```
## Geocode with Pelias
Pelias should now be up and running and will respond to your queries.
For a quick check, a request to `http://localhost:3100` should display a link to the documentation
for handy reference.
*Here are some queries to try:*
[http://localhost:3100/v1/search?text=london](http://localhost:3100/v1/search?text=london): a search
for the city of London.
[http://localhost:3100/v1/autocomplete?text=londo](http://localhost:3100/v1/autocomplete?text=londo): another query for London, but using the autocomplete endpoint which supports partial matches and is intended to be sent queries as a user types (note the query is for `londo` but London is returned)
[http://localhost:3100/v1/reverse?point.lon=-73.986027&point.lat=40.748517](http://localhost:3100/v1/reverse?point.lon=-73.986027&point.lat=40.748517): a reverse geocode for results near the Empire State Building in New York City.
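The same queries can be run from the command line. Here is a sketch that extracts just the top result's label from the GeoJSON response (assumes `jq` is installed):

```shell
# Query the search endpoint and pull out the top result's label.
API="http://localhost:3100"
curl -s "${API}/v1/search?text=london" | jq -r '.features[0].properties.label'
```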
For information on everything Pelias can do, see our [documentation
index](README.md).
Happy geocoding!

requirements.md (41 changed lines)

@@ -0,0 +1,41 @@
# Pelias Software requirements
This is the list of all software requirements for Pelias. We highly recommend using our
[Docker images](https://hub.docker.com/r/pelias/) to avoid having to even attempt to correctly
install all our dependencies yourself.
## Node.js
Version 6 or newer
Most Pelias code is written in Node.js. Node.js 8 is recommended.
Node.js 10 is not as well tested with Pelias yet, but should offer notable performance increases and
may become the recommendation soon.
We will probably drop support for Node.js 6 in the near future, so that we can use the many features
supported only in version 8 and above.
## Elasticsearch
Version 2.3 or 2.4
The core data storage for Pelias is Elasticsearch. We recommend the latest in the 2.4 release line.
We do not _yet_ support Elasticsearch 5 or 6, but work is [ongoing](https://github.com/pelias/pelias/issues/461).
## SQLite
Version 3.11 or newer
Some components of Pelias need a relational database, and Elasticsearch does not provide good
relational support. We use SQLite in these cases since it's simple to manage and quite performant.
## Libpostal
Pelias relies heavily on the [Libpostal](https://github.com/openvenues/libpostal#installation)
address parser. Libpostal requires about 4GB of disk space to download all the required data.
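A quick way to sanity-check the installed versions against the requirements above (assumes the tools are on your `PATH`; the Elasticsearch check additionally assumes it is already running on the default port):

```shell
# Minimum versions from this document, for reference in the checks below.
NODE_MIN="6"
SQLITE_MIN="3.11"
node --version                            # must be >= v${NODE_MIN}; v8 recommended
sqlite3 --version                         # must be >= ${SQLITE_MIN}
curl -s localhost:9200 | grep '"number"'  # should report Elasticsearch 2.3.x or 2.4.x
```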
## Windows Support
Pelias is not well tested on Windows, but we do wish to support it, and will accept patches to fix
any issues with Windows support.

services.md (62 changed lines)

@@ -0,0 +1,62 @@
# Pelias services
A running Pelias installation is composed of several different services. Each service is well suited
to a particular task.
## Service Use Cases
Here's a list of which services provide which features in Pelias. If you don't need everything Pelias
does, you may be able to get by without installing and running all the Pelias services.
| Service | /v1/search | /v1/autocomplete | /v1/reverse | /v1/reverse (coarse) | Multiple language support (any endpoint) |
| ------ | ----- | ----- | --------- | ------- | ----- |
| API | **required** | **required** | **required** | **required** | **required** |
| Placeholder | **required** | | | | **required** |
| Libpostal | **required** | | | | |
| PIP | | | recommended | **required** | |
| Interpolation | optional | | | | |
## Descriptions
### [API](https://github.com/pelias/api)
This is the core of Pelias. It talks to all other services (if available), Elasticsearch, and
provides the interface for all queries to Pelias.
### [Placeholder](https://github.com/pelias/placeholder)
Placeholder is used specifically to handle the relational component of geocoding. Placeholder
understands, for example, that Paris is a city in a country called France, but that there is another
city called Paris in the state of Texas, USA.
Placeholder also stores the translations of administrative areas in multiple languages. Therefore it
is required if any support for multiple languages is desired.
Currently, Placeholder is used only for forward geocoding on the `/v1/search` endpoint. In the
future, it will also be used for autocomplete.
### Libpostal
Libpostal is a library that provides an address parser using a statistical natural language processing
model trained on OpenStreetMap, OpenAddresses, and other open data. It is quite good at parsing
fully specified input, but cannot handle autocomplete very well.
The data required for Libpostal to run is around 3GB, and has to be loaded into memory, so this
service is fairly expensive to run, even for small installations.
Unlike the other Pelias services, we didn't actually write a Pelias Libpostal service. We recommend
using the [go-whosonfirst-libpostal](https://github.com/whosonfirst/go-whosonfirst-libpostal)
service created by the [Who's on First](https://whosonfirst.org) team.
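For reference, a parse request against the service might look like the following. The `/parse` endpoint and `address` parameter here are assumptions based on the upstream go-whosonfirst-libpostal project; check its README for the exact routes supported by your version:

```shell
# Ask the libpostal service to parse a free-form address string.
# Assumes the go-whosonfirst-libpostal server is listening on port 8080.
LIBPOSTAL="http://localhost:8080"
curl -s "${LIBPOSTAL}/parse?address=30+w+26th+st+new+york+ny"
```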
### [Point-in-Polygon (PIP)](https://github.com/pelias/pip-service)
The PIP service loads polygon data representing the boundaries of cities, states, regions, countries,
etc. into memory, and can perform calculations on that geometric data. It's used to determine if a
given point lies in a particular polygon. Thus, it's highly recommended for reverse geocoding.
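A lookup against a running PIP service might look like this. The `/:lon/:lat` route and default port 3102 are assumptions; double-check the pip-service README for your version:

```shell
# Ask the PIP service which polygons contain a point near the
# Empire State Building. Assumes the default port 3102.
PIP="http://localhost:3102"
curl -s "${PIP}/-73.986027/40.748517"
```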
### [Interpolation](https://github.com/pelias/interpolation)
The interpolation service combines street geometries with known addresses and address ranges, to
allow estimating the position of addresses that might exist, but aren't in existing open
data sources. It is only used by the `/v1/search` endpoint, but [autocomplete support may be added in
the future](https://github.com/pelias/interpolation/issues/131).