Short autocomplete inputs are very difficult to serve with low latency. The shorter the input, the more documents match just about any input string.
In our testing, one to three character input texts generally match up to
100 million documents out of a 560 million document full planet build.
There's really no way to make scoring 100 million documents fast,
so in order to achieve acceptable performance (ideally, <100ms P99
latency), it's worth looking at ways to either avoid querying
Elasticsearch entirely or reduce the scope of autocomplete queries.
Short autocomplete queries without a `focus.point` parameter can be
cached. There are only around 48,000 possible 1-3 character alphanumeric
inputs (36 + 36² + 36³). At this time, caching is outside the scope of
Pelias itself, but it can easily be implemented with Varnish, Nginx,
Fastly, CloudFront, and many other tools and services.
Queries with a `focus.point` are effectively uncacheable, however, since
the coordinate chosen will often be unique.
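As a rough sketch of the rule this implies, assuming the check lives in the caching layer (in practice it would be Varnish VCL, Nginx config, or similar rather than application code; `text` and the `focus.point.*` names are real /v1/autocomplete query parameters, but this helper is illustrative):

```ts
// Illustrative cacheability rule for an HTTP cache in front of
// /v1/autocomplete; a sketch, not actual Pelias code.
function isCacheableAutocomplete(params: URLSearchParams): boolean {
  const text = (params.get('text') ?? '').trim().toLowerCase();
  const hasFocus =
    params.has('focus.point.lat') || params.has('focus.point.lon');
  // Only short, focus-free queries have a small enough key space to cache
  return text.length > 0 && text.length <= 3 && !hasFocus;
}

isCacheableAutocomplete(new URLSearchParams('text=lo'));                    // true
isCacheableAutocomplete(new URLSearchParams('text=lo&focus.point.lat=40')); // false
```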
This PR uses the `focus.point` coordinate to build a
hard filter limiting the search to only those documents within a certain
radius of the coordinate. This can reduce the number of documents
searched and improve performance, while still returning useful results.
It takes two parameters, driven by `pelias-config`:
- `api.autocomplete.focusHardLimitTextLength`: input texts shorter than
  this length will have a hard distance filter constructed
- `api.autocomplete.focusHardLimitMultiplier`: the length of the input
  text is multiplied by this number to get the total hard filter
  radius in kilometers
For example, with `focusHardLimitTextLength` 4, and
`focusHardLimitMultiplier` 50, the following hard filters would be
constructed:
| text length | max distance (km) |
| ----------- | ----------------- |
| 1           | 50                |
| 2           | 100               |
| 3           | 150               |
| 4+          | unlimited         |
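A minimal sketch of the radius calculation these settings imply, matching the table above (illustrative only, not the actual Pelias query-building code):

```ts
// Values from the example above, as read via pelias-config
const focusHardLimitTextLength = 4;  // api.autocomplete.focusHardLimitTextLength
const focusHardLimitMultiplier = 50; // api.autocomplete.focusHardLimitMultiplier

// Returns the hard filter radius in kilometers, or undefined when the
// input is long enough that no hard filter is applied.
function hardFilterRadiusKm(text: string): number | undefined {
  if (text.length >= focusHardLimitTextLength) {
    return undefined; // 4+ characters: unlimited
  }
  return text.length * focusHardLimitMultiplier; // e.g. 2 chars -> 100 km
}
```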
This adds a structured and detailed log line for each Elasticsearch
query.
It includes information like the total number of Elasticsearch hits, how
long Elasticsearch took to process the request, query parameters, etc.
This is extremely useful for later analysis, as the structured nature of
the log line allows for powerful filtering.
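For illustration, such a log entry might look like the sketch below; the field names are hypothetical, chosen only to mirror the values mentioned above, not the exact fields Pelias emits:

```ts
// Hypothetical structured log entry for one Elasticsearch query
const exampleLogEntry = {
  message: 'elasticsearch_query',
  endpoint: '/v1/autocomplete',
  es_hits_total: 98034521,         // total number of Elasticsearch hits
  es_took_ms: 412,                 // time Elasticsearch reported for the query
  params: { text: 'lo', size: 10 } // query parameters
};

// One JSON object per line keeps the log easy to filter later,
// e.g. with jq or by indexing the log stream itself.
console.log(JSON.stringify(exampleLogEntry));
```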
It's possible for the `text` input to /v1/autocomplete to be of non-zero
length after trimming whitespace and quotes, but still be insufficient
to use for geocoding.
One common case is that it contains only commas, slashes, or other
delimiters.
Our query logic currently does not handle this case, and will generate
Elasticsearch queries that do not have a primary `must` clause and end
up searching every document in the index. These queries are slow, take
up cluster resources, and are not useful.
By detecting unsubstantial inputs, we can prevent this.
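A minimal sketch of such a check, assuming an input counts as substantial only if it contains at least one letter or digit (a hypothetical helper, not the actual Pelias sanitizer):

```ts
// Hypothetical helper: reject inputs that contain no letters or digits
// after trimming surrounding whitespace and quotes.
function isSubstantialInput(raw: string): boolean {
  const text = raw.trim().replace(/^["']+|["']+$/g, '');
  // \p{L} matches any Unicode letter, \p{N} any Unicode digit
  return /[\p{L}\p{N}]/u.test(text);
}

isSubstantialInput(',,//');        // false: delimiters only, skip the query
isSubstantialInput('"1 main st"'); // true
```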
By definition, all `boundary.country` query clause matches are
identical: a document either matches the requested country or it
doesn't, so the clause contributes nothing to relevance scoring. Thus,
it does not make sense to put the query clause for `boundary.country`
in the `must` section of the query.
In theory, because our queries generally combine this `must`
clause with others, there shouldn't be any performance improvement (or
regression) from this change.
However, semantically this clause fits better as a `filter`, and in the
case of a bug causing a degenerate query with the `boundary.country`
clause as the only one under the `must` section, moving it to `filter`
would have a big impact.
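The shape of the change, sketched on a simplified query (the field name and surrounding clauses are illustrative, not the exact Pelias query):

```ts
// Before: the country clause sits under `must` and participates in scoring
const before = {
  query: {
    bool: {
      must: [
        { match_phrase: { 'parent.country_a': 'USA' } } // boundary.country clause
        // ...other scoring clauses
      ]
    }
  }
};

// After: the same clause under `filter`, where matching is yes/no,
// no relevance score is computed, and Elasticsearch can cache the result
const after = {
  query: {
    bool: {
      must: [
        // ...other scoring clauses
      ],
      filter: [
        { match_phrase: { 'parent.country_a': 'USA' } }
      ]
    }
  }
};
```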
In the case where a min lat/lon is larger than a max lat/lon, the error
message was a bit confusing, as it did not show the actual property name
or the values that are causing errors.
This condition will cause Elasticsearch to throw an error, so we should
catch it ourselves first.
The error we return is friendlier than the one Elasticsearch throws when
min > max, but it is still an error.
Connects https://github.com/pelias/api/pull/1050
If bounding box lat/lon values are outside the correct range,
Elasticsearch throws very alarming errors.
With a little validation code we can provide more friendly and
actionable error messages.
Fixes https://github.com/pelias/pelias/issues/750
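A sketch of the kind of validation both of these bounding box changes describe; the `boundary.rect.*` names match the /v1 API parameters, but the helper itself is hypothetical, not the actual Pelias sanitizer:

```ts
// Hypothetical validator: return friendly messages that name the
// offending property and value instead of letting Elasticsearch throw.
function validateRect(rect: Record<string, number>): string[] {
  const errors: string[] = [];

  const checkRange = (name: string, min: number, max: number) => {
    const value = rect[name];
    if (value < min || value > max) {
      errors.push(`${name} value ${value} is outside the valid range of ${min} to ${max}`);
    }
  };

  checkRange('boundary.rect.min_lat', -90, 90);
  checkRange('boundary.rect.max_lat', -90, 90);
  checkRange('boundary.rect.min_lon', -180, 180);
  checkRange('boundary.rect.max_lon', -180, 180);

  if (rect['boundary.rect.min_lat'] > rect['boundary.rect.max_lat']) {
    errors.push('boundary.rect.min_lat must not be larger than boundary.rect.max_lat');
  }
  if (rect['boundary.rect.min_lon'] > rect['boundary.rect.max_lon']) {
    errors.push('boundary.rect.min_lon must not be larger than boundary.rect.max_lon');
  }
  return errors;
}

validateRect({
  'boundary.rect.min_lat': 95, // invalid: latitude cannot exceed 90
  'boundary.rect.max_lat': 40,
  'boundary.rect.min_lon': -74,
  'boundary.rect.max_lon': -73
});
```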