# Considerations for full-planet builds
Pelias is designed to work with data ranging from a single small city to the entire planet. Small
cities do not require significant resources and are straightforward to set up. Full-planet builds,
however, present their own challenges.
Current [full planet builds](https://pelias-dashboard.geocode.earth) weigh in at around 550 million
documents, and require about 375GB total storage in Elasticsearch.
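For reference, you can check the corresponding numbers on your own cluster once an import has
finished. Below is a minimal sketch using the Python Elasticsearch client (7.x-style API), assuming
the default index name `pelias` and a node reachable on `localhost:9200`:

```python
from elasticsearch import Elasticsearch

# Assumptions: Elasticsearch is reachable locally and the index uses the
# default Pelias index name "pelias".
es = Elasticsearch(["http://localhost:9200"])

doc_count = es.count(index="pelias")["count"]
# "total" store size includes replica copies; use "primaries" for primary shards only.
store_bytes = es.indices.stats(index="pelias")["_all"]["total"]["store"]["size_in_bytes"]

print(f"documents: {doc_count:,}")
print(f"storage:   {store_bytes / 1e9:.0f} GB")
```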
Fortunately, because of services like AWS and the scalability of Elasticsearch, full planet builds
are possible without too much extra effort. The process is no different; it just requires more
hardware and takes longer.
To set expectations, a cluster of four [r4.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS
instances (30GB RAM each) running Elasticsearch, plus one m4.4xlarge instance running the importers
and the PIP service, can complete a full-planet build in about two days.
## Recommended processes
### Use Docker containers and orchestration
We strongly recommend using Docker to run Pelias. All our services include Dockerfiles and the
resulting images are pushed to [Docker Hub](https://hub.docker.com/r/pelias/) by our CI. Using these
images will drastically reduce the amount of work it takes to set up Pelias and will ensure you are
on a known good configuration, minimizing the number of issues you will encounter.
Additionally, there are many great tools for managing container workloads. Simple ones like
[docker-compose](https://github.com/pelias/docker/) can be used for small installations, and more
complex tools like [Kubernetes](https://github.com/pelias/kubernetes) can be great for larger
installations. Pelias is extensively tested on both.
### Use separate Pelias installations for indexing and production traffic
The requirements for a performant and reliable Elasticsearch cluster are very different when
importing new data than when serving queries. It is _highly_ recommended to use one cluster to do imports,
save the resulting Elasticsearch index into a snapshot, and then load that snapshot into the cluster
used to perform actual geocoding.
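Elasticsearch's built-in snapshot and restore APIs are one way to move the finished index between
the two clusters. The sketch below uses the Python Elasticsearch client (7.x-style API); the
repository name, S3 bucket, snapshot name, and cluster hostnames are illustrative assumptions, and
the `repository-s3` plugin must be installed on both clusters for the `s3` repository type to work.

```python
from elasticsearch import Elasticsearch

# On the *import* cluster: register a snapshot repository and snapshot the finished index.
# Repository type, bucket, and snapshot name below are illustrative assumptions.
import_es = Elasticsearch(["http://import-cluster:9200"])

import_es.snapshot.create_repository(
    repository="pelias-snapshots",
    body={"type": "s3", "settings": {"bucket": "my-pelias-snapshots"}},
)
import_es.snapshot.create(
    repository="pelias-snapshots",
    snapshot="full-planet-build",
    body={"indices": "pelias"},
    wait_for_completion=True,
)

# On the *query* cluster: register the same repository and restore the snapshot.
query_es = Elasticsearch(["http://query-cluster:9200"])

query_es.snapshot.create_repository(
    repository="pelias-snapshots",
    body={"type": "s3", "settings": {"bucket": "my-pelias-snapshots"}},
)
query_es.snapshot.restore(
    repository="pelias-snapshots",
    snapshot="full-planet-build",
    body={"indices": "pelias"},
    wait_for_completion=True,
)
```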
### Shard count
Historically, Mapzen Search used 24 Elasticsearch shards for its builds. However, the latest
guidance from the Elasticsearch team is that shards should be no larger than 50GB, but otherwise
having as few shards as possible is best. At [geocode.earth](https://geocode.earth) we are
experimenting with 12-shard builds, and may eventually move to 6. We would appreciate performance
feedback from anyone doing large builds.
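As a worked example of the 50GB guideline: a 375GB build split across 12 primary shards works out
to roughly 31GB per shard, comfortably under the limit. Shard count can only be set when an index
is created; in a normal Pelias install the index is created by
[pelias/schema](https://github.com/pelias/schema), which takes its index settings from
`pelias.json`, but the hypothetical Python sketch below shows where the setting lives in the raw
Elasticsearch index settings:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Illustrative only: a real Pelias index should be created by pelias/schema so
# the correct mappings and analyzers are applied. Shard count cannot be
# changed after the index is created.
es.indices.create(
    index="pelias",
    body={
        "settings": {
            "index": {
                "number_of_shards": 12,   # ~375GB / 12 shards => ~31GB per shard
                "number_of_replicas": 0,  # add replicas after the import completes
            }
        }
    },
)
```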
### Force merge your Elasticsearch indices
Pelias Elasticsearch indices are generally static, as we do not recommend querying from and
importing to an Elasticsearch cluster simultaneously. In such cases, the highest levels of
performance can be achieved by [force-merging](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html) the Elasticsearch index.
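A force-merge is a single API call, sketched below with the Python Elasticsearch client (7.x-style
API). Merging a full-planet index down to one segment per shard (`max_num_segments=1` is a common
choice, not a requirement) is I/O-heavy and can take a long time, so run it after the import
finishes and before taking the snapshot.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Merge the now-static index down to one segment per shard. The generous
# request_timeout is an assumption; a full-planet merge can take hours.
es.indices.forcemerge(index="pelias", max_num_segments=1, request_timeout=7200)
```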
## Recommended hardware
For a production-ready instance of Pelias, capable of supporting a few hundred queries per second
across a full planet build, a setup like the following should be sufficient.
### Elasticsearch cluster for importing
The main requirement for the importing cluster is plenty of disk: 400GB across the
cluster is a good minimum. Increased CPU power is useful to achieve higher import throughput,
but not as important as RAM.
### Elasticsearch cluster for querying
For queries, essentially the only bottleneck is CPU, although more RAM is helpful so Elasticsearch
data can be cached. On AWS, `c5` instances are significantly more performant than even the `c4`
instances, and should be used if high performance is needed.
_Example configuration:_ 4 `c5.4xlarge` (16 CPU, 32GB RAM) to serve 250 RPS
### Importer machine
The importers are single-threaded Node.js processes, each requiring around 8GB of RAM
with admin lookup enabled. Faster CPUs will help increase the import speed. Running multiple
importers in parallel is recommended if the importer machine has enough RAM and CPU to support
them (a sketch of this follows the example configuration below).
_Example configuration:_ 1 `c4.4xlarge` (16 CPU, 30GB RAM), running two parallel importers
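A minimal sketch of running two importers side by side, assuming each importer repository has
already been cloned and configured and is started with `npm start` as described in its README (the
directory names are assumptions):

```python
import subprocess

# Hypothetical local checkouts of two Pelias importers.
importers = ["./openaddresses", "./whosonfirst"]

# Start each importer as its own process; each is single-threaded, so running
# them in parallel only pays off if the machine has the RAM and CPU to spare.
processes = [subprocess.Popen(["npm", "start"], cwd=path) for path in importers]

# Wait for all imports to finish and report their exit codes.
for path, proc in zip(importers, processes):
    proc.wait()
    print(f"{path} finished with exit code {proc.returncode}")
```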
### Pelias services
Each Pelias service has different memory and CPU requirements. Here are some rough guidelines:
#### API
* RAM: 200MB per instance
* CPU: Single threaded; one instance can serve around 500 RPS
* Disk: None
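When smoke-testing a freshly deployed API instance, a single `/v1/search` request is enough to
confirm it is wired up to Elasticsearch. A hedged sketch (the host and port are assumptions; adjust
them to wherever your API listens):

```python
import requests

# Host and port are assumptions for a self-hosted API instance.
resp = requests.get(
    "http://localhost:4000/v1/search",
    params={"text": "Portland, OR", "size": 1},
)
resp.raise_for_status()

# Pelias responses are GeoJSON; print the label of the top result, if any.
features = resp.json()["features"]
print(features[0]["properties"]["label"] if features else "no results")
```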
#### Placeholder
* RAM: 200MB per instance
* CPU: Single threaded, supports [clustering](https://nodejs.org/api/cluster.html)
* Disk: Requires about 2GB for a full planet index
#### Libpostal
* RAM: 3GB per instance
* CPU: Multi-threaded, but extremely fast. A single core can serve 8000+ RPS
* Disk: About 2-3GB of data storage required
#### PIP
* RAM: ~6GB
* CPU: 2 cores per instance recommended, which is enough to serve 5000-7000 RPS
#### Interpolation
* RAM: 3GB per instance currently (please follow our efforts to [un-bundle libpostal](https://github.com/pelias/interpolation/issues/106) from the interpolation service)
* CPU: Single core. One instance can serve around 200 RPS
* Disk: 40GB needed for a full planet interpolation dataset