# Considerations for full-planet builds

Pelias is designed to work with data ranging from a small city to the entire planet. Small cities do not require particularly significant resources and should be easy. However, full-planet builds present many challenges of their own.

Current [full planet builds](https://pelias-dashboard.geocode.earth) weigh in at around 550 million documents, and require about 375GB of total storage in Elasticsearch.

Fortunately, thanks to services like AWS and the scalability of Elasticsearch, full-planet builds are possible without too much extra effort. The process is no different from a smaller build; it just requires more hardware and takes longer.

To set expectations: a cluster of 4 [r4.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS instances (30GB RAM each) running Elasticsearch, plus one m4.4xlarge instance running the importers and the PIP service, can complete a full-planet build in about two days.

## Recommended processes

### Use Docker containers and orchestration

We strongly recommend using Docker to run Pelias. All our services include Dockerfiles, and the resulting images are pushed to [Docker Hub](https://hub.docker.com/r/pelias/) by our CI. Using these images will drastically reduce the amount of work it takes to set up Pelias and will ensure you are on a known-good configuration, minimizing the number of issues you encounter.

Additionally, there are many great tools for managing container workloads. Simple ones like [docker-compose](https://github.com/pelias/docker/) can be used for small installations, and more complex tools like [Kubernetes](https://github.com/pelias/kubernetes) can be great for larger installations. Pelias is extensively tested on both.

### Use separate Pelias installations for indexing and production traffic

The requirements for a performant and reliable Elasticsearch cluster are very different when importing new data compared to serving queries. It is _highly_ recommended to use one cluster to do imports, save the resulting Elasticsearch index into a snapshot, and then load that snapshot into the cluster used to perform actual geocoding.
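As a sketch of what this flow looks like using the Elasticsearch snapshot API (the repository name, snapshot name, and shared filesystem location here are all hypothetical, and the index is assumed to use the default `pelias` name):

```bash
# on the importing cluster: register a shared-filesystem snapshot repository
curl -XPUT 'http://localhost:9200/_snapshot/pelias_backup' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/backups/pelias" }
}'

# snapshot the freshly built index
curl -XPUT 'http://localhost:9200/_snapshot/pelias_backup/full_planet?wait_for_completion=true'

# on the query cluster (with the same repository registered): restore it
curl -XPOST 'http://localhost:9200/_snapshot/pelias_backup/full_planet/_restore'
```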
### Shard count

Historically, Mapzen Search used 24 Elasticsearch shards for its builds. However, the latest guidance from the Elasticsearch team is that shards should be no larger than 50GB, but that otherwise having as few shards as possible is best. At [geocode.earth](https://geocode.earth) we are experimenting with 12-shard builds, and may eventually move to 6. We would appreciate performance feedback from anyone doing large builds.

The `elasticsearch` section of `pelias.json` can be used to configure the shard count:

```js
{
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_shards": "5"
      }
    }
  }
}
```

### Force merge your Elasticsearch indices

Pelias Elasticsearch indices are generally static, as we do not recommend querying from and importing to an Elasticsearch cluster simultaneously. In such cases, the highest levels of performance can be achieved by [force-merging](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html) the Elasticsearch index.
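For example, a force merge down to a single segment per shard after an import has finished might look like this (assuming the index is named `pelias` and Elasticsearch is reachable on `localhost:9200`):

```bash
# merge each shard down to one segment; this can take a long time on a full-planet index
curl -XPOST 'http://localhost:9200/pelias/_forcemerge?max_num_segments=1'
```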
## Recommended hardware

For a production-ready instance of Pelias, capable of supporting a few hundred queries per second across a full-planet build, a setup like the following should be sufficient.

### Elasticsearch cluster for importing

The main requirement for the Elasticsearch cluster is plenty of disk: 400GB across the cluster is a good minimum. Increased CPU power is useful for achieving higher query throughput, but is not as important as RAM.

### Elasticsearch cluster for querying

For queries, essentially the only bottleneck is CPU, although more RAM is helpful so Elasticsearch data can be cached. On AWS, `c5` instances are significantly more performant than even the `c4` instances, and should be used if high performance is needed.

_Example configuration:_ 4 `c5.4xlarge` (16 CPU, 32GB RAM) to serve 250 RPS

### Importer machine

The importers are each single-threaded Node.js processes, and each requires around 8GB of RAM with admin lookup enabled. Faster CPUs will help increase the import speed. Running multiple importers in parallel is recommended if the importer machine has enough RAM and CPU to support them.

_Example configuration:_ 1 `c4.4xlarge` (16 CPU, 30GB RAM), running two parallel importers

### Pelias services

Each Pelias service has different memory and CPU requirements. Here are some rough guidelines:

#### API
RAM: 200MB per instance
CPU: Single-threaded; one instance can serve around 500 RPS
Disk: None

#### Placeholder
RAM: 200MB per instance
CPU: Single-threaded; supports [clustering](https://nodejs.org/api/cluster.html)
Disk: Requires about 2GB for a full-planet index

#### Libpostal
RAM: 3GB per instance
CPU: Multi-threaded, but extremely fast; a single core can serve 8000+ RPS
Disk: About 2-3GB of data storage required

#### PIP
RAM: ~6GB
CPU: 2 cores per instance recommended, which is enough to serve 5000-7000 RPS

#### Interpolation
RAM: 3GB per instance currently (please follow our efforts to [un-bundle libpostal](https://github.com/pelias/interpolation/issues/106) from the interpolation service)
CPU: Single core; one instance can serve around 200 RPS
Disk: 40GB needed for a full-planet interpolation dataset
## Getting started with Pelias

Looking to install and set up Pelias? You've come to the right place. We have several different tools and pieces of documentation to help you.

### Installing for the first time?

We _strongly_ recommend using our [Docker](http://github.com/pelias/docker/) based installation for your first install. It removes the need to deal with most of the complexity and dependencies of Pelias. On a fast internet connection you should be able to get a small city like Portland, Oregon installed in under 30 minutes.

### Want to go more in depth?

The Pelias Docker installation should work great for any small area, and is great for managing the different Pelias services during development. However, we understand that not everyone can or wants to use Docker, and that people want more details on how things work.

For this, we have our [from scratch installation guide](pelias_from_scratch.md).

### Installing in production?

By far the most well-tested way to install Pelias is to use [Kubernetes](https://github.com/pelias/kubernetes). Kubernetes is perfect for managing systems that have many different components, like Pelias.

We would love to add additional, well-tested ways to install Pelias in production. Reach out to us if you have something to share or want to get started.

### Doing a full planet build?

Running Pelias for a city or small country is pretty easy. However, due to the amount of data involved, a full planet build is harder to pull off.

See our [full planet build guide](full_planet_considerations.md) for some recommendations on how to make it easier and more performant.
# Installing Pelias from Scratch

These instructions will help you set up the Pelias geocoder from scratch. We strongly recommend using our [Docker](http://github.com/pelias/docker/) tools for your first Pelias installation.

However, for more in-depth usage, or to learn more about the internals of Pelias, use this guide.

It assumes some knowledge of the command line and Node.js, but we'd like as many people as possible to be able to install Pelias, so if anything is confusing, please don't hesitate to reach out. We'll do what we can to help and also improve the documentation.

## Installation Overview

These are the steps for fully installing Pelias:

1. [Check that the hardware and software requirements are met](#system-requirements)
1. [Decide which datasets to use and download them](#choose-your-datasets)
1. [Download the Pelias code](#download-the-pelias-repositories)
1. [Customize the Pelias configuration file `~/pelias.json`](#customize-pelias-config)
1. [Install the Elasticsearch schema using pelias-schema](#set-up-the-elasticsearch-schema)
1. [Use one or more importers to load data into Elasticsearch](#run-the-importers)
1. [Install and start the Pelias services](#install-and-start-the-pelias-services)
1. [Start the API server to begin handling queries](#start-the-api)

## System Requirements

See our [software requirements](requirements.md) and ensure all of them are installed before moving forward.

### Hardware recommendations

* At a minimum, 50GB of disk space to download, extract, and process data
* Lots of RAM; 8GB is a good minimum for a small import like a single city or small country. A full North America OSM import just fits in 16GB of RAM

## Choose your datasets

Pelias can currently import data from [four different sources](data-sources.md), using five different importers.

Only one dataset is _required_: [Who's on First](https://whosonfirst.org/). This dataset is used to enrich all data imported into Pelias with [administrative information](glossary.md). For more on this process, see the [wof-admin-lookup](https://github.com/pelias/wof-admin-lookup) documentation.

**Note:** You don't have to run the `whosonfirst` importer, but you do have to have Who's on First data available on disk for use by the other importers.

Here's an overview of how to download each dataset.

### Who's on First

The [Who's on First](https://github.com/pelias/whosonfirst#downloading-the-data) importer can download all the Who's on First data quickly and easily.
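At the time of writing, the importer's README (linked above) documents a download script; assuming that's available, fetching the data can be as simple as:

```bash
cd whosonfirst     # the importer repository, cloned later in this guide
npm install
npm run download   # downloads Who's on First data to imports.whosonfirst.datapath
```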
### Geonames

The [pelias/geonames](https://github.com/pelias/geonames/#installation) importer contains code and instructions for downloading Geonames data automatically. Individual countries, or the entire planet (1.3GB compressed), can be specified.

### OpenAddresses

The Pelias OpenAddresses importer can [download specific files from OpenAddresses](https://github.com/pelias/openaddresses/#data-download).

Additionally, the [OpenAddresses](https://results.openaddresses.io/) project includes numerous download options, all of which are `.zip` downloads. The full dataset is just over 6GB compressed (the extracted files are around 30GB), but there are numerous subdivision options.

### OpenStreetMap

OpenStreetMap (OSM) has a nearly limitless array of download options, and any of them should work as long as they're in [PBF](http://wiki.openstreetmap.org/wiki/PBF_Format) format. Generally the files will have the extension `.osm.pbf`. Good sources include [download.geofabrik.de](http://download.geofabrik.de/), [Nextzen Metro Extracts](https://metro-extracts.nextzen.org/), [Interline OSM Extracts](https://www.interline.io/osm/extracts/), and planet files listed on the [OSM wiki](http://wiki.openstreetmap.org/wiki/Planet.osm). A full planet PBF file is about 41GB.
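For example, downloading a single-region Geofabrik extract into the datapath used later in this guide might look like this (the region and target directory are just illustrations):

```bash
mkdir -p /mnt/pelias/openstreetmap
wget http://download.geofabrik.de/north-america/us/oregon-latest.osm.pbf \
  -O /mnt/pelias/openstreetmap/extract.osm.pbf
```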
#### Street Data (Polylines)

To download and import [street data](https://github.com/pelias/polylines#download-data) from OSM, a separate importer is used that operates on a preprocessed dataset derived from the OSM planet file.

## Installation

### Download the Pelias repositories

At a minimum, you'll need

1. [Pelias schema](https://github.com/pelias/schema/)
2. The Pelias [API](https://github.com/pelias/api/) and other Pelias services
3. Importer(s)

Here's a bash snippet that will download all the repositories (they are all small enough that you don't have to worry about the space of the code itself), check out the production branch (which is probably the one you want), and install all the Node module dependencies:

```bash
for repository in schema whosonfirst geonames openaddresses openstreetmap polylines api placeholder \
                  interpolation pip-service; do
  git clone https://github.com/pelias/${repository}.git # clone from GitHub
  pushd $repository > /dev/null                         # switch into the repository directory
  git checkout production                               # or remove this line to stay on master
  npm install                                           # install npm dependencies
  popd > /dev/null                                      # return to the code directory
done
```

<details>
<summary>Not sure which branch to use?</summary>

Pelias uses three different branches as part of our release process.

`production` **(recommended)**: contains only code that has been well tested, generally against a full-planet build. This is the "safest" branch and it will change the least frequently, although we generally release new code at least once a week.

`staging`: these branches contain the code that is currently being tested against a full planet build for imminent release. It's useful for tracking what code will be going out in the next release, but not much else.

`master`: master branches contain the latest code that has passed code review and unit/integration tests, and is reasonably functional. While we try to avoid it, the nature of the master branch is that it will sometimes be broken. That said, these are the branches to use for development of new features.
</details>

### Customize Pelias Config

Nearly all configuration for Pelias is driven through a single config file: `pelias.json`. By default, Pelias will look for this file in your home directory, but you can configure where it looks. For more details, see the [pelias-config](https://github.com/pelias/config) repository.
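For example, pelias-config supports a `PELIAS_CONFIG` environment variable for pointing at a config file elsewhere (the path here is just an illustration):

```bash
# point a Pelias process at a shared config file instead of ~/pelias.json
PELIAS_CONFIG=/etc/pelias/pelias.json npm start
```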
#### Where on the network to find Elasticsearch

Pelias will by default look for Elasticsearch on `localhost` at port 9200 (the standard Elasticsearch port).

Take a look at the [default config](https://github.com/pelias/config/blob/master/config/defaults.json#L2). You can see the Elasticsearch configuration looks something like this:

```js
{
  "esclient": {
    "hosts": [{
      "host": "localhost",
      "port": 9200
    }]
  },
  ... // rest of config
}
```

If you want to connect to Elasticsearch somewhere else, change `localhost` as needed. You can specify multiple hosts if you have a large cluster. In fact, the entire `esclient` section of the config is sent along to the [elasticsearch-js](https://github.com/elastic/elasticsearch-js) module, so any of its [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/configuration.html) are valid.
|
|
||||||
|
#### Where to find the downloaded data files |
||||||
|
The other major section, `imports`, defines settings for each importer. `adminLookup` has it's own section and its value applies to all importers. The defaults look like this: |
||||||
|
|
||||||
|
```json |
||||||
|
{ |
||||||
|
"imports": { |
||||||
|
"adminLookup": { |
||||||
|
"enabled": true |
||||||
|
}, |
||||||
|
"geonames": { |
||||||
|
"datapath": "/mnt/pelias/geonames", |
||||||
|
}, |
||||||
|
"openstreetmap": { |
||||||
|
"datapath": "/mnt/pelias/openstreetmap", |
||||||
|
"leveldbpath": "/tmp", |
||||||
|
"import": [{ |
||||||
|
"filename": "planet.osm.pbf" |
||||||
|
}] |
||||||
|
}, |
||||||
|
"openaddresses": { |
||||||
|
"datapath": "/mnt/pelias/openaddresses", |
||||||
|
"files": [] |
||||||
|
}, |
||||||
|
"whosonfirst": { |
||||||
|
"datapath": "/mnt/pelias/whosonfirst" |
||||||
|
}, |
||||||
|
"polyline": { |
||||||
|
"datapath": "/mnt/pelias/polyline", |
||||||
|
"files": [] |
||||||
|
} |
||||||
|
} |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
Note: The datapath must be an _absolute path._ |
||||||
|
As you can see, the default datapaths are meant to be changed. |
||||||
|
|
||||||
|
### Install Elasticsearch |
||||||
|
|
||||||
|
Please refer to the [official 2.4 install docs](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/setup.html) for how to install Elasticsearch. |
||||||
|
|
||||||
|
Be sure to modify the Elasticsearch [heap size](https://www.elastic.co/guide/en/elasticsearch/guide/2.x/heap-sizing.html) as appropriate to your machine. |
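On Elasticsearch 2.x this is most easily done with the `ES_HEAP_SIZE` environment variable; the value below is only an example (a common rule of thumb is half the machine's RAM, but no more than about 30GB):

```bash
# give Elasticsearch a 16GB heap, then start it
export ES_HEAP_SIZE=16g
./bin/elasticsearch
```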
Make sure Elasticsearch is running and connectable, and then you can continue with the Pelias-specific setup and importing. Using a plugin like [Sense](https://github.com/bleskes/sense) [(Chrome extension)](https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig?hl=en), [head](https://mobz.github.io/elasticsearch-head/), or [Marvel](https://www.elastic.co/products/marvel) can help monitor Elasticsearch as you import data.

### Set up the Elasticsearch Schema

Pelias requires specific configuration settings for both performance and accuracy reasons. Fortunately, now that your `pelias.json` file is configured with how to connect to Elasticsearch, the schema repository can automatically create the Pelias index and configure it exactly as needed:

```bash
cd schema # assuming you have just run the bash snippet to download the repos from earlier
node scripts/create_index.js
```

The Elasticsearch schema is analogous to the layout of a table in a traditional relational database, like MySQL or PostgreSQL. While Elasticsearch attempts to auto-detect a schema that works when inserting new data, this generally leads to non-optimal results. In the case of Pelias, inserting data without first applying the Pelias schema will cause all queries to fail completely.
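To sanity-check that the schema was applied before you start importing, you can inspect the index directly (this assumes the default index name, `pelias`):

```bash
# should return the Pelias mapping, not a 404 or an empty auto-detected mapping
curl 'http://localhost:9200/pelias/_mapping'
```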
### Run the importers

Now that the schema is set up, you're ready to begin importing data.

For each importer, you can start the import process with the `npm start` command:

```bash
cd importer_directory; npm start
```

Depending on how much data you've imported, now may be a good time to grab a coffee. You can expect around 800-2000 inserts per second.

The order of imports does not matter, and multiple importers can be run in parallel to speed up the setup process. Each of our importers operates independently of the data that is already in Elasticsearch; for example, you can import OSM data without importing WOF data first.
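As a sketch, assuming you've downloaded data for these importers and the machine has RAM to spare, running several in parallel from the directory containing the repositories might look like:

```bash
# run three importers concurrently and wait for all of them to finish
for importer in whosonfirst openaddresses openstreetmap; do
  (cd $importer && npm start) &
done
wait
```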
#### Aside: When to delete the data already in Elasticsearch

If you have previously run a build and are looking to start another one, it is generally a good idea to delete the existing Pelias index and re-create it. Here's how:

```bash
# !! WARNING: this will remove all your data from Pelias !!
node scripts/drop_index.js   # it will ask for confirmation first
node scripts/create_index.js
```

When is this necessary? Here's a guideline: when in doubt, delete the index, re-create it, and start fresh.

This is because Elasticsearch has no analog to the schema migrations of a relational database, and all the importers start over when re-run.

The only time this isn't necessary is when both of the following conditions are true:

1. You are trying to re-import the exact same data again (for example, because the build failed, or you are testing changes to an importer)
2. The Pelias schema has not changed

## Install and start the Pelias Services

Pelias is made up of several different services, each providing a specific aspect of Pelias's functionality.

The [list of Pelias services](services.md) describes the functionality of each service, and can be used to determine whether you need to install that service. It also includes links to setup instructions for each service.

When in doubt, install everything except the interpolation engine (it requires a long download or build process).

### Configure `pelias.json` for services

The Pelias API needs to know about each of the other services available to it. Once again, this is configured in `pelias.json`. The following section will tell the API to use all services running locally and on their default ports:

```js
{
  "api": {
    "services": {
      "placeholder": {
        "url": "http://localhost:3000"
      },
      "libpostal": {
        "url": "http://localhost:8080"
      },
      "pip": {
        "url": "http://localhost:3102"
      },
      "interpolation": {
        "url": "http://localhost:3000"
      }
    }
  }
}
```

### Start the API

Now that the API knows how to connect to Elasticsearch and all other Pelias services, all that is required to start the API is:

```bash
npm start
```

## Geocode with Pelias

Pelias should now be up and running and will respond to your queries.

For a quick check, a request to `http://localhost:3100` should display a link to the documentation for handy reference.

*Here are some queries to try:*

[http://localhost:3100/v1/search?text=london](http://localhost:3100/v1/search?text=london): a search for the city of London.

[http://localhost:3100/v1/autocomplete?text=londo](http://localhost:3100/v1/autocomplete?text=londo): another query for London, but using the autocomplete endpoint, which supports partial matches and is intended to be sent queries as a user types (note the query is for `londo`, but London is returned).

[http://localhost:3100/v1/reverse?point.lon=-73.986027&point.lat=40.748517](http://localhost:3100/v1/reverse?point.lon=-73.986027&point.lat=40.748517): a reverse geocode for results near the Empire State Building in New York City.
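The same queries work from the command line; for example, using `curl` (and `jq`, if you have it installed) to pull out just the top result's label:

```bash
curl -s 'http://localhost:3100/v1/search?text=london' | jq '.features[0].properties.label'
```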
For information on everything Pelias can do, see our [documentation index](README.md).

Happy geocoding!
# Pelias Software requirements

This is the list of all software requirements for Pelias. We highly recommend using our [Docker images](https://hub.docker.com/r/pelias/) to avoid having to even attempt to correctly install all our dependencies yourself.

## Node.js

Version 6 or newer

Most Pelias code is written in Node.js. Node.js 8 is recommended. Node.js 10 is not yet as well tested with Pelias, but should offer notable performance increases and may become the recommendation soon.

We will probably drop support for Node.js 6 in the near future, so that we can use the many features supported only in version 8 and above.

## Elasticsearch

Version 2.3 or 2.4

The core data storage for Pelias is Elasticsearch. We recommend the latest release in the 2.4 line.

We do not _yet_ support Elasticsearch 5 or 6, but work is [ongoing](https://github.com/pelias/pelias/issues/461).

## SQLite

Version 3.11 or newer

Some components of Pelias need a relational database, and Elasticsearch does not provide good relational support. We use SQLite in these cases since it's simple to manage and quite performant.

## Libpostal

Pelias relies heavily on the [Libpostal](https://github.com/openvenues/libpostal#installation) address parser. Libpostal requires about 4GB of disk space to download all the required data.

## Windows Support

Pelias is not well tested on Windows, but we do wish to support it, and will accept patches to fix any issues with Windows support.
# Pelias services

A running Pelias installation is composed of several different services. Each service is well suited to a particular task.

## Service Use Cases

Here's a list of which services provide which features in Pelias. If you don't need everything Pelias does, you may be able to get by without installing and running all the Pelias services.

| Service | /v1/search | /v1/autocomplete | /v1/reverse | /v1/reverse (coarse) | Multiple language support (any endpoint) |
| ------ | ----- | ----- | --------- | ------- | ----- |
| API | **required** | **required** | **required** | **required** | **required** |
| Placeholder | **required** | | | | **required** |
| Libpostal | **required** | | | | |
| PIP | | | recommended | **required** | |
| Interpolation | optional | | | | |

## Descriptions

### [API](https://github.com/pelias/api)

This is the core of Pelias. It talks to all other services (if available) and Elasticsearch, and provides the interface for all queries to Pelias.

### [Placeholder](https://github.com/pelias/placeholder)

Placeholder is used specifically to handle the relational component of geocoding. Placeholder understands, for example, that Paris is a city in a country called France, but that there is another city called Paris in the state of Texas, USA.

Placeholder also stores the translations of administrative areas in multiple languages. It is therefore required if any support for multiple languages is desired.

Currently, Placeholder is used only for forward geocoding on the `/v1/search` endpoint. In the future, it will also be used for autocomplete.

### Libpostal

Libpostal is a library that provides an address parser using a statistical natural language processing model trained on OpenStreetMap, OpenAddresses, and other open data. It is quite good at parsing fully specified input, but cannot handle autocomplete very well.

The data required for Libpostal to run is around 3GB, and has to be loaded into memory, so this service is fairly expensive to run, even for small installations.

Unlike the other Pelias services, we didn't actually write a Pelias Libpostal service. We recommend using the [go-whosonfirst-libpostal](https://github.com/whosonfirst/go-whosonfirst-libpostal) service created by the [Who's on First](https://whosonfirst.org) team.

### [Point-in-Polygon (PIP)](https://github.com/pelias/pip-service)

The PIP service loads polygon data representing the boundaries of cities, states, regions, countries, etc. into memory, and can perform calculations on that geometric data. It's used to determine whether a given point lies in a particular polygon, and is thus highly recommended for reverse geocoding.

### [Interpolation](https://github.com/pelias/interpolation)

The interpolation service combines street geometries with known addresses and address ranges, to allow estimating the position of addresses that might exist but aren't in existing open data sources. It is only used by the `/v1/search` endpoint, but [autocomplete support may be added in the future](https://github.com/pelias/interpolation/issues/131).