First draft

9 years ago · f71fac7742
1 changed files with 285 additions and 0 deletions
--- a/installing.md
+++ b/installing.md
@ -0,0 +1,285 @@
 # Installing Pelias
 Mapzen offers the Mapzen Search service in hopes that as many people as possible will use it,
 but we also encourage people to set up their own Pelias instance. Whether it's to import their own data,
 make their own tweaks to Pelias code, or to help with Pelias development, its important that we
 document how this can be done. Similarly, while there are ways this process can be
 [automated](https://github.com/pelias/vagrant), these instructions are written as if the setup is
 manual, to illustrate all the moving pieces of Pelias.
 ## Gather the Ingredients
 In general, Pelias will require:
 * A working [Elasticsearch](https://www.elastic.co/products/elasticsearch) 1.7 cluster. It can be on
  a single machine or across several
 * [Node.js](https://nodejs.org/) 0.12 or newer (Node 4 or 5 is recommended)
 * Up to 100GB disk space to download and extract data
 * Lots of RAM. A full North America OSM import just barely fits on a machine with 16GB RAM
 ## Choose your branch
 As part of the setup instructions below, you'll be downloading several Pelias packages from source
 on Github. All of these packages offer 3 branches for various use cases. Based on your needs, you
 should pick one of these branches and use the same one across all of the Pelias packages.
 `production`: contains only code that has been tested against a full-planet build and is live on
 Mapzen Search. This is the "safest" branch and it will change the least frequently, although we
 generally release new code at least once a week.
 `staging`: these branches contain the code that is currently being tested against a full planet
 build for imminent release to Mapzen Search. It's useful to track what code will be going out in the
 next release, but not much else.
 `master`: master branches contain the latest code that has passed code review, unit/integration
 tests, and is ready to be included in the next release. While we try to avoid it, the nature of the
 master branch is that it will sometimes be broken. That said, these are the branches to use for
 development of new features.
 ## Choose your datasets
 Pelias can currently import data from four different sources. The contents and description of these
 sources are available on our [data sources page](./data_sources). Here we'll just focus on what to
 download for each one.
 ### Whosonfirst
 There are two ways to download Whosonfirst data. The first is to use the pre-created
 [bundles](https://whosonfirst.mapzen.com/bundles/). These consist of a series of archives that can
 be easily extracted (instructions are on the page).
 For more advanced uses, or to contribute back to Whosonfirst, use the
 [whosonfirst-data](https://github.com/whosonfirst/whosonfirst-data) Github repository. Again, there
 are [instructions](https://github.com/whosonfirst/whosonfirst-data#git-and-github). Note that this
 repo requires [git-lfs](https://git-lfs.github.com/), a lot of bandwidth, and 27GB (currently) of
 disk space.
 ### Geonames
 The [pelias/geonames](https://github.com/pelias/geonames/#importing-data) importer contains code and
 instructions for downloading Geonames data automatically. Individual countries, or the entire planet
 (1.3GB) can be specified.
 ### Openaddresses
 The Openaddresses project includes [numerous download options](https://results.openaddresses.io/),
 all of which are `.zip` downloads. The full dataset is several gigabytes, but there are numerous
 subdivision options. In any case, the `.zip` files simply need to be extracted to a directory of
 your choice, and Pelias can be configured to either import every `.csv` in that directory, or only
 selected files.
 ### Openstreetmap
 Openstreetmap has a nearly limitless array of download options, and any of them will work as long as
 they're in [PBF](http://wiki.openstreetmap.org/wiki/PBF_Format) format. Generally the files will
 have the extension `.osm.pbf`. Good sources include the [Mapzen Metro Extracts](https://mapzen.com/data/metro-extracts/)
 (feel free to submit pull requests for additional cities or regions if needed), and planet files
 listed on the [OSM wiki](http://wiki.openstreetmap.org/wiki/Planet.osm).
 ## Choose your import options
 There are several options that should be discussed before starting any data imports, as they require
 a compromise between import speed and resulting data quality and richness.
 ### Admin Lookup
 Most data that is imported by Pelias comes to us incomplete: many data sources don't supply what we
 call admin hierarchy information: the neighbourhood, city, country, or other region that contains
 the record. In Openaddresses, for example, many records contain only a housenumber, street name, and
 coordinates.
 Fortunately, Whosonfirst contains a well-developed set of geometries for all admin regions from the
 neighbourhood to continent level. Through
 [point-in-polygon](https://en.wikipedia.org/wiki/Point_in_polygon) lookup, our importers can
 [derive](https://github.com/pelias/wof-admin-lookup) this information!
 The downsides to enabling admin lookup are increased memory requirements and longer import times.
 Because geometry data is quite large, expect to use about 6GB of RAM (not disk) during import just
 for this geometry data. And because of the complexity of the required calculations, imports with
 admin lookup are up to 10 times slower than without.
 Whosonfirst, of course, always includes full hierarchy information because it's built into the
 dataset itself, so there's no tradeoff to be made. Whosonfirst data will always import quite fast
 and with full hierarchy information.
 ### Address Deduplication
 Openaddresses data contains lots of addresses, but it also contains lots of duplicate data. To help
 reduce this problem we've built an [address-deduplicator](https://github.com/pelias/address-deduplicator)
 that can be run at import. It uses the [OpenVenues deduplicator](https://github.com/openvenues/address_deduper)
 to remove records that are near each other and have names that are likely to be duplicates. Note
 that it's considerably smarter than simply doing exact comparisons of names and coordinates: it uses
 [Geohash prefixes](https://en.wikipedia.org/wiki/Geohash) to compare nearby records, and the
 [libpostal address normalizer](https://github.com/openvenues/libpostal#examples-of-normalization) to
 compare names, so it can tell that records with `101 Main St` and `101 Main Street` are likely to
 refer to the same place.
 Unfortunately, our current implementation is very slow, and requires about 50GB of scratch disk
 space during a full planet import.
 ## Considerations for full-planet builds
 As may be evident from the dataset section above, importing all the data in all four supported datasets is
 worthy of its own discussion. Current [full planet builds](https://pelias-dashboard.mapzen.com/pelias)
 weigh in at over 300 million documents, and about 140GB total storage in Elasticsearch. Needless to
 say, a full planet build is not likely to succeed on most personal computers.
 Fortunately, because of services like AWS and the scalability of Elasticsearch, full planet builds
 are possible without too much extra effort. To set expectations, a cluster of 4
 [r3.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS instances running Elasticsearch, and one
 c4.8xlarge instance running the importers can complete a full planet build in about two days.
 ## Installation
 ### Download the Pelias repositories
 At a minimum, you'll need the Pelias [schema] and [api] repositories, as well as at least one of the
 importers. Here's a bash snippet that will download all the repositories (they are all small enough
 that you don't have to worry about the space of the code itself), check out the production branch
 (which is probably the one you want), and install all the node module dependencies.
 ```bash
 for repository in schema api whosonfirst geonames openaddresses openstreetmap; do
 	git clone git@github.com:pelias/${repository}.git
 	pushd $repository > /dev/null
 	git checkout production # or staging, or remove this line to stay with master
 	npm install
 	popd > /dev/null
 done
 ```
 ### Customize Pelias Config
 Nearly all configuration for Pelias is driven through a single config file: `pelias.json`. By
 default, Pelias will look for this file in your home directory, but you can configure where it
 looks. For more details, see the [pelias-config](https://github.com/pelias/config) repository.
 The two main things of note to configure are where on the network to find Elasticsearch, and where
 to find the downloaded data files.
 Pelias will by default look for Elasticsearch on `localhost` at port 9200 (the standard
 Elasticsearch port).
 By taking a look at the [default config](https://github.com/pelias/config/blob/master/config/defaults.json#L2),
 you can see the Elasticsearch configuration looks something like this:
 ```json
 {
  esclient: {
  "hosts": [{
    "host": "localhost",
    "port": 9200
  }]
  ... // rest of config
 }
 ```
 If you want to connect to Elasticsearch somewhere else, change `localhost` as needed. You can
 specify multiple hosts if you have a large cluster. In fact, the entire `esclient` section of the
 config is sent along to the [elasticsearch-js](https://github.com/elastic/elasticsearch-js) module, so
 any of its [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/configuration.html)
 are valid.
 The other major section, `imports`, defiens settings for each importer. The defaults look like this:
 ```json
 {
 "imports": {
    "geonames": {
      "datapath": "./data",
      "adminLookup": false
    },
    "openstreetmap": {
      "datapath": "/mnt/pelias/openstreetmap",
 	  "adminLookup": false,
      "leveldbpath": "/tmp",
      "import": [{
        "filename": "planet.osm.pbf"
      }]
    },
    "openaddresses": {
      "datapath": "/mnt/pelias/openaddresses",
      "files": []
    },
    "whosonfirst": {
      "datapath": "/mnt/pelias/whosonfirst"
    }
 }
 ```
 As you can see, the default datapaths are meant to be changed. This is also where you can enable
 admin lookup by overriding the default value.
 Two caveats to this config section. First, the array structure of the Openstreetmap `import` section
 suggests you can specify multiple files to import. Unfortunately, you can't, although we'd like to
 [support that in the future](https://github.com/pelias/openstreetmap/issues/55).
 Second, note that the Openaddresses section does _not_ have an `adminLookup` flag. The Openaddresses
 importer only supports controlling this option by a command line flag currently. Again this is
 something [we'd like to fix](https://github.com/pelias/openaddresses/issues/51). See the importer
 [readme](https://github.com/pelias/openaddresses/blob/master/README.md) for details on how to
 configure admin lookup and deduplication for Openaddresses.
 ### Install Elasticsearch
 Other than requiring Elasticsearch 1.7, nothing special in the Elasticsearch setup is required for
 Pelias, so please refer to the [official 1.7 install docs](https://www.elastic.co/guide/en/elasticsearch/reference/1.7/setup.html).
 Make sure Elasticsearch is running and connectable, and then you can continue with the Pelias
 specific setup and importing. Using a plugin like [head](https://mobz.github.io/elasticsearch-head/)
 or [Marvel](https://www.elastic.co/products/marvel) can help monitor Elasticsearch as you import
 data.
 ### Set up the Elasticsearch Schema
 The Elasticsearch Schema is analogous to the layout of a table in a traditional relational database,
 like MySQL or PostgreSQL. While Elasticsearch attempts to auto-detect a schema that works when
 inserting new data, this generally leads to non-optimal results. In the case of Pelias, inserting
 data without first applying the Pelias schema will cause all queries to fail completely: Pelias
 requires specific configuration settings for both performance and accuracy reasons.
 Fortunately, now that your `pelias.json` file is configured with how to connect to Elasticsearch,
 the Schema repository can automatically create the Pelias index and configure it exactly as needed:
 ```bash
 cd schema # assuming you've just run the bash snippet to download the repos from earlier
 node scripts/create_index.js
 ```
 If you want to reset the schema later (to start over with a new import or because the schema code
 has been updated), you can drop the index and start over like so:
 ```bash
 # warning: this will remove all your data from pelias~
 node scripts/drop_index.js # it will ask for confirmation first
 node scripts/create_index.js
 ```
 Note that Elasticsearch has no analogy to a database migration, so you generally have to delete and
 reindex all your data after making schema changes.
 ### Run the importers
 Now that the schema is set up, you're ready to begin importing data!
 Our [goal](https://github.com/pelias/pelias/issues/255) is that eventually you'll be able to run all
 the importers with simply `cd $importer_directory; npm start`. Unfortunately only the Whosonfirst
 and Openstreetmap importers works that way right now.
 For [Geonames](https://github.com/pelias/geonames/) and [Openaddresses](https://github.com/pelias/openaddresses),
 please see their respective READMEs, which detail the process of running them. We'd love to see pull
 requests that allow them to read configuration from `pelias.json` like the other importers.
 Depending on how much data you've imported, now may be a good time to grab a coffee. Without admin
 lookup, the fastest speeds you'll see are around 10,000 records per second. With admin lookup,
 1000/sec is pretty fast.
 ### Start the API
 As soon as you have any data in Elasticsearch, you can start running queries against the
 [Pelias API server](https://github.com/pelias/api/).
 Again thanks to `pelias.json`, the API already knows how to connect to Elasticsearch, so all that's
 required to star the API is `npm start`. You can now send queries to `http://localhost:3100/`!