From 94c4b7d09302458328c596eb4adf069dc5de1dca Mon Sep 17 00:00:00 2001
From: Diana Shkolnikov
Date: Thu, 27 Oct 2016 11:12:02 -0400
Subject: [PATCH 1/2] Moving installation instructions to pelias/pelias

This is not being displayed anywhere in the Mapzen Search docs and belongs on
pelias.io. Moving it to pelias/pelias will ensure there is only a single place
to update and reference in the future.
---
 installing.md | 347 --------------------------------------------------
 1 file changed, 347 deletions(-)
 delete mode 100644 installing.md

diff --git a/installing.md b/installing.md
deleted file mode 100644
index 2ca45b7..0000000
--- a/installing.md
+++ /dev/null
@@ -1,347 +0,0 @@
# Installing Pelias

Mapzen offers the Mapzen Search service in the hope that as many people as possible will use it,
but we also encourage people to set up their own Pelias instance.

For most use cases it's helpful to have much of the installation process automated, so we suggest
looking at the [Pelias Vagrant image](https://github.com/pelias/vagrant).

However, for more in-depth usage, to learn more about the inner workings of Pelias, or to contribute
back, a manual setup is useful. These instructions will help you install Pelias from scratch.

## Installation Overview

The steps for fully installing Pelias look like this:

1. Decide which datasets and settings will be used
2. Download the appropriate data
3. Download the Pelias code, using the appropriate branches
4. Set up Elasticsearch
5. Install the Elasticsearch schema using pelias-schema
6. Use one or more importers to load data into Elasticsearch
7. Install the libpostal text analyzer (optional)
8. Start the API server to begin handling queries

## System Requirements

In general, Pelias requires:

* A working [Elasticsearch](https://www.elastic.co/products/elasticsearch) 2.3 cluster. It can be on
  a single machine or spread across several
* [Node.js](https://nodejs.org/) 4.0 or newer (the latest in the Node 4 or 6 series is recommended).
  Node.js 0.10 and 0.12 are no longer supported
* Up to 100GB of disk space to download and extract data
* Plenty of RAM; 8GB is a good minimum. A full North America OSM import just fits in 16GB of RAM

## Choose your datasets

Pelias can currently import data from four different sources. The contents and description of these
sources are available on our [data sources page](./data-sources.md). Here we'll focus only on what
to download for each one.

### Who's on First

The [Who's on First](https://github.com/pelias/whosonfirst#data) importer can download all the Who's
on First data quickly and easily. See its README for the most up-to-date instructions.

### Geonames

The [pelias/geonames](https://github.com/pelias/geonames/#importing-data) importer contains code and
instructions for downloading Geonames data automatically. Individual countries, or the entire planet
(1.3GB compressed), can be specified.

### OpenAddresses

The OpenAddresses project offers [numerous download options](https://results.openaddresses.io/),
all of which are `.zip` downloads. The full dataset is just over 6GB compressed (the extracted files
are around 30GB), but there are numerous subdivision options. In any case, the `.zip` files simply
need to be extracted to a directory of your choice, and Pelias can be configured to either import
every `.csv` in that directory, or only selected files.
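For example, a minimal fetch-and-extract sketch might look like this (the download URL is
illustrative only; substitute a real `.zip` link from the OpenAddresses results page. The target
directory matches the default `imports.openaddresses.datapath` shown later):

```bash
# Illustrative only: replace the URL with a real .zip from https://results.openaddresses.io/
mkdir -p /mnt/pelias/openaddresses
wget -O /tmp/openaddresses.zip 'https://results.openaddresses.io/some-region.zip'
unzip /tmp/openaddresses.zip -d /mnt/pelias/openaddresses
```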
### OpenStreetMap

OpenStreetMap has a nearly limitless array of download options, and any of them should work as long
as they're in [PBF](http://wiki.openstreetmap.org/wiki/PBF_Format) format. Generally the files will
have the extension `.osm.pbf`. Good sources include the
[Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/) (which has popular cities available
immediately, or custom areas that take only a few minutes to build), and the planet files listed on
the [OSM wiki](http://wiki.openstreetmap.org/wiki/Planet.osm). A full planet PBF file is about 36GB.

## Choose your import settings

There are several options to consider before starting any data imports, as they involve a trade-off
between import speed and the quality and richness of the resulting data.

### Admin Lookup

Most data imported by Pelias arrives incomplete: many data sources don't supply what we call admin
hierarchy information, meaning the neighbourhood, city, country, or other region that contains the
record. In OpenAddresses, for example, many records contain only a house number, street name, and
coordinates.

Fortunately, Who's on First contains a well-developed set of geometries for all admin regions from
the neighbourhood to the continent level. Through
[point-in-polygon](https://en.wikipedia.org/wiki/Point_in_polygon) lookup, our importers can
[derive](https://github.com/pelias/wof-admin-lookup) this information!

The downsides to enabling admin lookup are increased memory requirements and longer import times.
Because the geometry data is quite large, expect to use about 6GB of RAM (not disk) during import
just for this geometry data. And because of the complexity of the required calculations, imports
with admin lookup are up to 10 times slower than without.

Who's on First, of course, always includes full hierarchy information because it's built into the
dataset itself, so there's no trade-off to be made. Who's on First data will always import quickly
and with full hierarchy information.

### Address Deduplication

OpenAddresses data contains lots of addresses, but it also contains lots of duplicates. To help
reduce this problem we've built an [address-deduplicator](https://github.com/pelias/address-deduplicator)
that can be run at import time. It uses the [OpenVenues deduplicator](https://github.com/openvenues/address_deduper)
to remove records that are near each other and have names that are likely duplicates. Note that it's
considerably smarter than simply doing exact comparisons of names and coordinates: it uses
[Geohash prefixes](https://en.wikipedia.org/wiki/Geohash) to compare nearby records, and the
[libpostal address normalizer](https://github.com/openvenues/libpostal#examples-of-normalization) to
compare names, so it can tell that records with `101 Main St` and `101 Main Street` are likely to
refer to the same place (see the sketch at the end of this section).

Unfortunately, our current implementation is very slow, and requires about 50GB of scratch disk
space during a full-planet import. It's worth noting that Mapzen Search currently does _not_
deduplicate any data, although we hope to improve the performance of deduplication and resume using
it eventually.
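To make that normalization step concrete, here's a rough sketch using the
[node-postal](https://github.com/openvenues/node-postal) binding mentioned later in this guide. This
is an illustration, not the deduplicator's actual code, and it assumes libpostal is already
installed:

```js
// Sketch only: assumes libpostal and the node-postal binding are installed.
// Both spellings expand to a common normalized form, e.g. '101 main street'.
var postal = require('node-postal');

console.log(postal.expand.expand_address('101 Main St'));
console.log(postal.expand.expand_address('101 Main Street'));
```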
## Considerations for full-planet builds

As may be evident from the dataset section above, importing all the data from all four supported
datasets is worthy of its own discussion. Current
[full planet builds](https://pelias-dashboard.mapzen.com/pelias) weigh in at over 320 million
documents, and require about 230GB of total storage in Elasticsearch. Needless to say, a full-planet
build is not likely to succeed on most personal computers.

Fortunately, thanks to services like AWS and the scalability of Elasticsearch, full-planet builds
are possible without too much extra effort. To set expectations: a cluster of four
[r3.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS instances running Elasticsearch, plus
one c4.8xlarge instance running the importers, can complete a full-planet build in about two days.

## Choose your Pelias code branch

As part of the setup instructions below, you'll be downloading several Pelias packages from source
on GitHub. All of these packages offer three branches for different use cases. Based on your needs,
pick one of these branches and use the same one across all of the Pelias packages.

`production`: contains only code that has been tested against a full-planet build and is live on
Mapzen Search. This is the "safest" branch and it changes the least frequently, although we
generally release new code at least once a week.

`staging`: contains the code currently being tested against a full-planet build for imminent release
to Mapzen Search. It's useful for tracking what code will go out in the next release, but not much
else.

`master`: contains the latest code that has passed code review and unit/integration tests and is
ready to be included in the next release. While we try to avoid it, the nature of the master branch
is that it will sometimes be broken. That said, this is the branch to use for developing new
features.

## Installation

### Download the Pelias repositories

At a minimum, you'll need the Pelias [schema](https://github.com/pelias/schema/) and
[api](https://github.com/pelias/api/) repositories, as well as at least one of the importers. Here's
a bash snippet that downloads all the repositories (they're all small enough that you don't need to
worry about disk space for the code itself), checks out the production branch (which is probably the
one you want), and installs all the Node module dependencies:

```bash
for repository in schema api whosonfirst geonames openaddresses openstreetmap; do
  git clone git@github.com:pelias/${repository}.git
  pushd $repository > /dev/null
  git checkout production # or staging, or remove this line to stay on master
  npm install
  popd > /dev/null
done
```

### Customize Pelias Config

Nearly all configuration for Pelias is driven through a single config file: `pelias.json`. By
default, Pelias looks for this file in your home directory, but you can configure where it looks.
For more details, see the [pelias-config](https://github.com/pelias/config) repository.

The two main things to configure are where on the network to find Elasticsearch, and where to find
the downloaded data files.

By default, Pelias looks for Elasticsearch on `localhost` at port 9200 (the standard Elasticsearch
port).

Taking a look at the [default config](https://github.com/pelias/config/blob/master/config/defaults.json#L2),
you can see that the Elasticsearch configuration looks something like this:

```js
{
  "esclient": {
    "hosts": [{
      "host": "localhost",
      "port": 9200
    }]
  }
  // rest of config
}
```

If you want to connect to Elasticsearch somewhere else, change `localhost` as needed. You can
specify multiple hosts if you have a large cluster. In fact, the entire `esclient` section of the
config is passed along to the [elasticsearch-js](https://github.com/elastic/elasticsearch-js)
module, so any of its [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/configuration.html)
are valid.
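For example, a larger cluster might list several hosts (the hostnames below are placeholders, not
real defaults):

```js
{
  "esclient": {
    "hosts": [
      { "host": "es1.example.com", "port": 9200 },
      { "host": "es2.example.com", "port": 9200 },
      { "host": "es3.example.com", "port": 9200 }
    ]
  }
  // rest of config
}
```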
The other major section, `imports`, defines settings for each importer. The defaults look like this:

```json
{
  "imports": {
    "geonames": {
      "datapath": "./data",
      "adminLookup": false
    },
    "openstreetmap": {
      "datapath": "/mnt/pelias/openstreetmap",
      "adminLookup": false,
      "leveldbpath": "/tmp",
      "import": [{
        "filename": "planet.osm.pbf"
      }]
    },
    "openaddresses": {
      "datapath": "/mnt/pelias/openaddresses",
      "adminLookup": false,
      "files": []
    },
    "whosonfirst": {
      "datapath": "/mnt/pelias/whosonfirst"
    }
  }
}
```

As you can see, the default datapaths are meant to be changed. This is also where you can enable
admin lookup, by overriding the default `adminLookup` value.

### Elasticsearch Configuration

Of special note in `pelias.json` are the Elasticsearch settings. The
[default](https://github.com/pelias/config/blob/master/config/defaults.json) settings (see the
`elasticsearch` section) are fine for development, but in particular the shard count should be
increased for production use. Mapzen Search uses 24 shards in production (for a full-planet build).
Smaller installations should probably use at least the Elasticsearch default of 5 shards:

```js
{
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_shards": "5"
      }
    }
  }
}
```

### Install Elasticsearch

Other than requiring Elasticsearch 2.3, nothing special in the Elasticsearch setup is required for
Pelias, so please refer to the [official 2.3 install docs](https://www.elastic.co/guide/en/elasticsearch/reference/2.3/setup.html).

Older versions of Elasticsearch are not supported.

Make sure Elasticsearch is running and connectable, and then you can continue with the
Pelias-specific setup and importing. Using a plugin like
[head](https://mobz.github.io/elasticsearch-head/) or [Marvel](https://www.elastic.co/products/marvel)
can help you monitor Elasticsearch as you import data.

If you're using a terminal, you can also search and/or monitor Elasticsearch using its
[APIs](https://www.elastic.co/guide/en/elasticsearch/reference/2.3/api-conventions.html).

**Note:** During large imports, Elasticsearch can be very sensitive to memory issues. Be sure to
increase its [heap size](https://www.elastic.co/guide/en/elasticsearch/guide/2.x/heap-sizing.html)
from the default configuration to something more appropriate for your machine.
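For example, on a machine with 8GB of RAM you might give Elasticsearch roughly half (a sketch;
`ES_HEAP_SIZE` is the environment variable read by the Elasticsearch 2.x startup scripts):

```bash
# Give Elasticsearch a 4GB heap before starting it.
export ES_HEAP_SIZE=4g
./bin/elasticsearch
```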
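It's also worth confirming the cluster is reachable before moving on (assuming the default
`localhost:9200` configured earlier):

```bash
# A running cluster responds with JSON that includes its name and version.
curl http://localhost:9200/
```

If this fails, fix your Elasticsearch setup before proceeding; every later step depends on it.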
### Set up the Elasticsearch Schema

The Elasticsearch schema is analogous to the layout of a table in a traditional relational database,
like MySQL or PostgreSQL. While Elasticsearch attempts to auto-detect a schema that works when
inserting new data, this generally leads to suboptimal results. In the case of Pelias, inserting
data without first applying the Pelias schema will cause all queries to fail completely: Pelias
requires specific configuration settings for both performance and accuracy reasons.

Fortunately, now that your `pelias.json` file is configured with how to connect to Elasticsearch,
the schema repository can automatically create the Pelias index and configure it exactly as needed:

```bash
cd schema # assuming you've just run the bash snippet from earlier to download the repos
node scripts/create_index.js
```

If you want to reset the schema later (to start over with a new import, or because the schema code
has been updated), you can drop the index and start over like so:

```bash
# !! WARNING: this will remove all your data from Pelias !!
node scripts/drop_index.js # it will ask for confirmation first
node scripts/create_index.js
```

Note that Elasticsearch has no analog of a database migration, so you generally have to delete and
reindex all your data after making schema changes.

### Run the importers

Now that the schema is set up, you're ready to begin importing data.

For all importers except Geonames, you can start the import process with the `npm start` command:

```bash
cd $importer_directory; npm start
```

For the [Geonames](https://github.com/pelias/geonames/) importer, please see its
[README](https://github.com/pelias/geonames/blob/master/README.md) file for the most up-to-date
instructions. We are working towards giving all the importers
[the same interface](https://github.com/pelias/pelias/issues/255), so the Geonames importer will
soon behave the same as the others.

Depending on how much data you're importing, now may be a good time to grab a coffee. Without admin
lookup, the fastest speeds you'll see are around 10,000 records per second. With admin lookup,
expect around 800-2000 inserts per second.

### Install libpostal (optional, but recommended)

Pelias can now use the [libpostal](https://github.com/openvenues/libpostal) address parser, which
greatly increases the quality of search results. Libpostal must be installed on the machines running
the Pelias API, and requires about 4GB of disk space to download all the required data. This data
represents a statistical natural-language model of address parsing, trained on OpenStreetMap data.
The API will also require about 2GB of memory (it used only a few hundred megabytes before) to store
the data needed for queries.

First, install libpostal following its [installation docs](https://github.com/openvenues/libpostal#installation).
This will also download the training data, so be sure to have enough free disk space.

Next, configure the Pelias API to use libpostal (it won't by default) by adding a section like this
to `pelias.json`:

```json
{
  "api": {
    "textAnalyzer": "libpostal"
  }
}
```

In the future, libpostal may become the default, and we may drop support for
[addressit](https://github.com/DamonOehlman/addressit), the current default text parser. Until then,
the `textAnalyzer` property can be changed back to `addressit` (or removed) to stop using libpostal.

Once configured, the API will use libpostal via the [node-postal](https://github.com/openvenues/node-postal)
NPM module.

### Start the API

As soon as you have any data in Elasticsearch, you can start running queries against the
[Pelias API server](https://github.com/pelias/api/).

Again, thanks to `pelias.json`, the API already knows how to connect to Elasticsearch, so all that's
required to start the API is `npm start`. You can now send queries to `http://localhost:3100/`!
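For a quick smoke test, try a search against the API. This is a sketch: it assumes the default port
from above and the standard `/v1/search` endpoint, and that you've imported data covering the
queried place:

```bash
# Expects a GeoJSON FeatureCollection of matching places in response.
curl 'http://localhost:3100/v1/search?text=london'
```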
From d040111bcab97a0d3cf731e0d639b84202238c90 Mon Sep 17 00:00:00 2001
From: Diana Shkolnikov
Date: Thu, 27 Oct 2016 13:11:54 -0400
Subject: [PATCH 2/2] replace text with link to new location

---
 installing.md | 9 +++++++++
 1 file changed, 9 insertions(+)
 create mode 100644 installing.md

diff --git a/installing.md b/installing.md
new file mode 100644
index 0000000..ad03436
--- /dev/null
+++ b/installing.md
@@ -0,0 +1,9 @@
Dear Pelias user,

Hey there! :wave: You know how this used to be the place to find the Pelias installation guide?
Well, it didn't really belong here, so we had to move it somewhere else. :grimacing:

Please head over to [pelias/pelias/INSTALL.md](https://github.com/pelias/pelias/blob/master/INSTALL.md)
to find it. It can also be viewed on [pelias.io](http://pelias.io/install.html).

See you there!