From ecf0e459fb3b8edb14b547ea56e52d59b20ea3ae Mon Sep 17 00:00:00 2001
From: Julian Simioni
Date: Sun, 15 Jul 2018 17:23:00 -0400
Subject: [PATCH] Add full planet build considerations

---
 full_planet_considerations.md | 108 ++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 full_planet_considerations.md

diff --git a/full_planet_considerations.md b/full_planet_considerations.md
new file mode 100644
index 0000000..7e8fff4
--- /dev/null
+++ b/full_planet_considerations.md
@@ -0,0 +1,108 @@
+# Considerations for full-planet builds
+
+Pelias is designed to work with data ranging from a small city to the entire planet. Small cities do
+not require particularly significant resources and are straightforward to set up. However, full
+planet builds present a number of challenges of their own.
+
+Current [full planet builds](https://pelias-dashboard.geocode.earth) weigh in at around 550 million
+documents and require about 375GB of total storage in Elasticsearch.
+
+Fortunately, because of services like AWS and the scalability of Elasticsearch, full planet builds
+are possible without too much extra effort. The process is no different; it just requires more
+hardware and takes longer.
+
+To set expectations, a cluster of four [r4.xlarge](https://aws.amazon.com/ec2/instance-types/) AWS
+instances (30GB RAM each) running Elasticsearch, plus one m4.4xlarge instance running the importers
+and the PIP service, can complete a full planet build in about two days.
+
+## Recommended processes
+
+### Use Docker containers and orchestration
+
+We strongly recommend using Docker to run Pelias. All our services include Dockerfiles, and the
+resulting images are pushed to [Docker Hub](https://hub.docker.com/r/pelias/) by our CI. Using these
+images will drastically reduce the amount of work it takes to set up Pelias and will ensure you are
+on a known-good configuration, minimizing the number of issues you will encounter.
+
+Additionally, there are many great tools for managing container workloads. Simple ones like
+[docker-compose](https://github.com/pelias/docker/) can be used for small installations, and more
+complex tools like [Kubernetes](https://github.com/pelias/kubernetes) can be great for larger
+installations. Pelias is extensively tested on both.
+
+### Use separate Pelias installations for indexing and production traffic
+
+The requirements for performant and reliable Elasticsearch clusters are very different for importing
+new data compared to serving queries. It is _highly_ recommended to use one cluster to do imports,
+save the resulting Elasticsearch index into a snapshot, and then load that snapshot into the cluster
+used to perform actual geocoding.
+
+### Shard count
+
+Historically, Mapzen Search used 24 Elasticsearch shards for its builds. However, the latest
+guidance from the Elasticsearch team is that shards should be no larger than 50GB and that, within
+that limit, having as few shards as possible is best. At [geocode.earth](https://geocode.earth) we
+are experimenting with 12-shard builds, and may eventually move to 6. We would appreciate
+performance feedback from anyone doing large builds.
+
+### Force merge your Elasticsearch indices
+
+Pelias Elasticsearch indices are generally static, as we do not recommend querying from and
+importing to an Elasticsearch cluster simultaneously. In such cases, the highest levels of
+performance can be achieved by [force-merging](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html) the Elasticsearch index.
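+
+As a concrete sketch, assuming the default Pelias index name of `pelias` and an Elasticsearch node
+reachable at `localhost:9200` (adjust both for your setup), a force merge down to a single segment
+per shard looks like this:
+
+```bash
+# Merge every shard of the pelias index down to one segment.
+# On a full planet index this can take a while; the HTTP request may time out,
+# but the merge will continue in the background.
+curl -XPOST 'http://localhost:9200/pelias/_forcemerge?max_num_segments=1'
+```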
+
+## Recommended hardware
+
+For a production-ready instance of Pelias, capable of supporting a few hundred queries per second
+across a full planet build, a setup like the following should be sufficient.
+
+### Elasticsearch cluster for importing
+
+The main requirement of Elasticsearch is that it has lots of disk; 400GB across the cluster is a
+good minimum. Increased CPU power is useful to achieve higher import throughput, but it is not as
+important as RAM.
+
+### Elasticsearch cluster for querying
+
+For queries, essentially the only bottleneck is CPU, although more RAM is helpful so Elasticsearch
+data can be cached. On AWS, `c5` instances are significantly more performant than even the `c4`
+instances, and should be used if high performance is needed.
+
+_Example configuration:_ 4 `c5.4xlarge` instances (16 CPU, 32GB RAM each) to serve 250 RPS
+
+### Importer machine
+
+The importers are each single-threaded Node.js processes, which require around 8GB of RAM
+each with admin lookup enabled. Faster CPUs will help increase the import speed. Running multiple
+importers in parallel is recommended if the importer machine has enough RAM and CPU to support them.
+
+_Example configuration:_ 1 `c4.4xlarge` (16 CPU, 30GB RAM), running two parallel importers
+
+### Pelias services
+
+Each Pelias service has different memory and CPU requirements. Here are some rough guidelines:
+
+#### API
+- RAM: 200MB per instance
+- CPU: Single-threaded; one instance can serve around 500 RPS
+- Disk: None
+
+#### Placeholder
+- RAM: 200MB per instance
+- CPU: Single-threaded; supports [clustering](https://nodejs.org/api/cluster.html)
+- Disk: Requires about 2GB for a full planet index
+
+#### Libpostal
+- RAM: 3GB per instance
+- CPU: Multi-threaded, but extremely fast; a single core can serve 8000+ RPS
+- Disk: About 2-3GB of data storage required
+
+#### PIP
+- RAM: ~6GB per instance
+- CPU: 2 cores per instance recommended, which is enough to serve 5000-7000 RPS
+
+#### Interpolation
+- RAM: 3GB per instance currently (please follow our efforts to [un-bundle libpostal](https://github.com/pelias/interpolation/issues/106) from the interpolation service)
+- CPU: Single core; one instance can serve around 200 RPS
+- Disk: 40GB needed for a full planet interpolation dataset
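+
+To put the per-service figures above into practice with docker-compose, memory limits along these
+lines are a reasonable starting point. This is only a sketch: the service and image names here are
+assumptions, so check the [pelias/docker](https://github.com/pelias/docker/) project for the
+canonical compose file, and leave some headroom above each service's typical usage.
+
+```yaml
+# Illustrative memory limits based on the guidelines above; image names may differ in your setup.
+version: '2'
+services:
+  api:
+    image: pelias/api:master
+    mem_limit: 512m
+  placeholder:
+    image: pelias/placeholder:master
+    mem_limit: 512m
+  libpostal:
+    image: pelias/libpostal-service
+    mem_limit: 4g
+  pip:
+    image: pelias/pip-service:master
+    mem_limit: 8g
+  interpolation:
+    image: pelias/interpolation:master
+    mem_limit: 4g
+```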