Expensive queries in Elasticsearch refer to queries that require excessive CPU/memory utilization. They can cause a multitude of problems including high response times, timeouts and can even bring about the dreaded HTTP service errors. In this post we’ll show you some ways to recognize and avoid expensive queries to keep your cluster running in tip-top shape.
We discuss Elasticsearch here, but the same applies to OpenSearch, which is an Elasticsearch fork and shares the same core.
The official Elasticsearch documentation groups expensive queries into several categories, and we can follow those categories to explain each one.
We often refer to this as doing a "full table scan", even though Elasticsearch doesn't maintain any tables. It means that when running these query types we aren't leveraging any index structures; everything is computed ad hoc at query time, which slows down searches and demands ever more resources.
It's easy to see why script queries could be considered expensive. Although scripts are compiled and then cached for reuse, they still have to be evaluated against every document the query touches. Scripts do not use Elasticsearch's index structures or related optimizations, so they typically run slower than the built-in query DSL. This does not mean all script queries will take several seconds to run; short scripts that are just a few lines of code can typically be evaluated in milliseconds. The problem lies with larger, more complex scripts. Do you remember the Big-O notation they taught in Computer Science classes? You want your scripts to be as close to linear (O(N)) as possible. The more complex your script is, the longer it will take to run and the slower your query will be.
You should try to avoid scripting whenever possible. For example, if your script simply adds two values from two other fields and then assesses whether one is greater than the other, consider doing that operation and comparison within your indexing application or ingest pipeline and storing the result as a separate field. Then you can use a simple term query to evaluate whether that field is true or false. This approach has the drawbacks of taking additional time at indexing and growing your index with the extra field, but it can be well worth the trade-off to speed up your query.
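To make the trade-off concrete, here is a minimal sketch of the two request bodies as Python dicts (the field names "price", "discount" and "price_exceeds_discount" are hypothetical, chosen just for illustration):

```python
# Expensive: a script query that compares two fields per matching document.
script_query = {
    "query": {
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "source": "doc['price'].value > doc['discount'].value",
                        "lang": "painless",
                    }
                }
            }
        }
    }
}

# Cheaper: precompute the comparison at index time into a boolean field,
# then query it with a simple term query that uses the inverted index.
term_query = {
    "query": {
        "term": {"price_exceeds_discount": True}
    }
}
```

The term query pushes all the work to index time, where it is paid once per document instead of once per query execution.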
While it is possible, it is not recommended to query a field that is not indexed but has doc values enabled, unless you are going to query it very infrequently. The only reason to configure a field this way in the first place is storage: it is essentially a trade-off of query performance for less disk usage. That said, this configuration can be a viable option for those who can use the disk savings and don't plan on querying the field. And you can always reindex if you do end up needing to query that field more in the future, right?
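A mapping sketch of this configuration (the field name "session_id" is hypothetical):

```python
# A keyword field with no inverted index but doc values kept:
# smaller on disk, but term queries on it fall back to a slow
# scan over doc values instead of an index lookup.
mapping = {
    "mappings": {
        "properties": {
            "session_id": {
                "type": "keyword",
                "index": False,      # no inverted index built
                "doc_values": True,  # still usable for aggregations, sorting, rare queries
            }
        }
    }
}
```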
Due to the way Elasticsearch and OpenSearch operate, some queries are executed differently than you'd expect. In fact, most queries in Elasticsearch and OpenSearch are rewritten internally into primitive term and bool queries. That means many query types may hide complexity, and sometimes slowness, that you should be aware of.
These queries use regular expressions to evaluate matches on text fields. They can be very expensive depending on the regex used, but recent versions of Elasticsearch added the 'wildcard' field type, which enables internal optimizations that speed these queries up significantly. Because of this, it is recommended that any field that will regularly be queried with a regular expression use the wildcard field type.
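A minimal sketch of the mapping and a regexp query against it (the field name "log_line" and the pattern are hypothetical):

```python
# The 'wildcard' field type stores the value in an n-gram-based structure
# internally, which accelerates wildcard and regexp queries on it.
mapping = {
    "mappings": {
        "properties": {
            "log_line": {"type": "wildcard"}
        }
    }
}

regexp_query = {
    "query": {
        "regexp": {
            "log_line": {"value": "err.*timeout"}
        }
    }
}
```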
Allowing queries with leading wildcards is strongly discouraged, regardless of the indexing method used. Consider the reverse token filter as an alternative instead.
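The idea behind the reverse-token-filter workaround: index a reversed copy of the text, so a leading-wildcard search like `*gmail.com` becomes a fast prefix search on the reversed field. A sketch, with hypothetical field and analyzer names:

```python
# Index settings: a sub-field that stores the value reversed.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "reversed": {
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "reverse"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "email": {
                "type": "text",
                "fields": {
                    "reversed": {"type": "text", "analyzer": "reversed"}
                },
            }
        }
    },
}

# Instead of {"wildcard": {"email": "*gmail.com"}}, reverse the pattern
# and run a prefix query on the reversed sub-field.
pattern = "*gmail.com"
reversed_prefix = pattern[::-1].rstrip("*")  # -> "moc.liamg"

query = {"query": {"prefix": {"email.reversed": reversed_prefix}}}
```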
Range queries are best used on date and numeric field types, and they are most effective in the filter context to narrow down your result set. Internal optimizations on these field types let range queries perform quickly. The same optimizations are not possible for text/keyword fields, which makes range queries on them perform poorly: a large list of candidate terms is built and a query is generated from it, and if that list is large, the query will be slow.
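A sketch of a range query in the filter context on a date field (the `@timestamp` field name is a common convention; the date math values are illustrative):

```python
# A range filter on a date field. In the filter context there is no
# scoring to compute and the result can be cached by the cluster.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-7d/d", "lt": "now/d"}}}
            ]
        }
    }
}
```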
For the very same reason (a potentially large candidate term list), prefix queries perform poorly on fields with high cardinality, especially when the prefix used in the query is short.
One way to mitigate this is the 'index_prefixes' mapping parameter for text fields known to be used for prefix searches. Setting this parameter enables internal optimizations that speed up prefix queries significantly, so it is recommended for any field on which you will run prefix queries regularly. It is another optimization that trades storage space and indexing latency for query speed, so you may not want to set it on fields that use prefix queries infrequently (or not at all).
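A sketch of the mapping and a prefix query that benefits from it (the field name "product_name" and the character bounds are hypothetical; `min_chars`/`max_chars` default to 2 and 5):

```python
# With index_prefixes set, prefixes of 2-10 characters are indexed
# separately at index time, so prefix queries become simple term
# lookups instead of term expansions at query time.
mapping = {
    "mappings": {
        "properties": {
            "product_name": {
                "type": "text",
                "index_prefixes": {"min_chars": 2, "max_chars": 10},
            }
        }
    }
}

prefix_query = {"query": {"prefix": {"product_name": {"value": "lapt"}}}}
```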
Fuzziness in a query means that it has to account for multiple different variations of the query string. The different variations it allows for include changing, removing and/or inserting a character as well as transposing two adjacent characters. It is most often used to allow users to have errors/typos in a search query and still return relevant results. This query type is considered expensive because of all of the different query string variations that have to be accounted for. Each variation takes additional processing power and memory utilization.
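A sketch of a match query with fuzziness enabled (the field name "title" and the misspelled query string are illustrative):

```python
# A fuzzy match: "AUTO" scales the allowed edit distance with term
# length (0 edits for very short terms, up to 2 for longer ones).
query = {
    "query": {
        "match": {
            "title": {
                "query": "elastcsearch",  # note the typo; fuzziness absorbs it
                "fuzziness": "AUTO",
            }
        }
    }
}
```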
The script_score query allows you to generate a score based on logic defined in a script. It can be completely different from Elasticsearch's built-in scoring mechanism, or it can incorporate the built-in score and modify it. These queries are considered expensive for the same reasons mentioned in the 'Script Queries' section above.
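A sketch of a script_score query that modifies the built-in score with a stored field (the "popularity" field is hypothetical):

```python
# Combine the built-in relevance score (_score) with a popularity
# signal stored on each document; the script runs per matching document.
query = {
    "query": {
        "script_score": {
            "query": {"match": {"title": "elasticsearch"}},
            "script": {
                "source": "_score * Math.log(2 + doc['popularity'].value)"
            },
        }
    }
}
```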
Joining queries are queries used on nested fields (the nested query) or documents with a parent-child relationship (the has_child and has_parent queries). They are considered expensive because additional processing is required to preserve the relationships at query time. In general, documents (and the fields within them) in Elasticsearch are meant to be as decoupled and flat as possible, and whenever possible you should follow this edict. That isn't always possible, however; sometimes using nested or parent-child fields is essential. If that's the case, use these relationships as sparingly as possible.
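A sketch of a nested query that matches conditions inside the same nested object (the "comments" path and its sub-fields are hypothetical):

```python
# Both conditions must hold within the SAME nested "comments" object;
# preserving that per-object scoping is what makes joining queries
# costlier than a flat bool query.
nested_query = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"comments.author": "alice"}},
                        {"range": {"comments.votes": {"gte": 10}}},
                    ]
                }
            },
        }
    }
}
```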
Naming some of these query types ‘expensive’ is sort of a misnomer. The queries themselves may be quite efficient when run in the right circumstances, but you can run into the same problems as the expensive queries above if you’re not careful.
There is a very good reason Elasticsearch limits how deep you can page through results by default ('index.max_result_window', 10,000). Elasticsearch queries can span multiple shards and consume heap memory and time proportional to the 'from' and 'size' parameters. While increasing 'index.max_result_window' allows you to return more results than the default allows, returning very large result sets in a single query response can wreak havoc on your cluster by consuming large portions of heap memory and processing power. Instead, it is recommended to paginate search results with the 'search_after' parameter.
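A sketch of search_after pagination (field names and the example sort values are hypothetical):

```python
# First page: a normal sorted search. The _id tiebreaker keeps the
# sort order stable across pages.
first_page = {
    "size": 100,
    "query": {"match": {"status": "published"}},
    "sort": [{"@timestamp": "desc"}, {"_id": "asc"}],
}

# Each subsequent page passes the sort values of the LAST hit of the
# previous page; no deep from/size offset is ever accumulated.
last_sort_values = ["2024-05-01T12:00:00Z", "doc-941"]  # hypothetical values

next_page = dict(first_page, search_after=last_sort_values)
```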
Just as queries that return large result sets can strain the cluster, very large documents can have the same effect. For this case there is a simple workaround, however: within your query you can specify the fields you want returned in the response, allowing Elasticsearch to process only a small subset of each large document.
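A sketch using source filtering to return only the needed fields (all field names are hypothetical):

```python
# _source filtering: only "title" and "updated_at" are returned per hit,
# so a large "body" field is never serialized into the response.
query = {
    "query": {"match": {"status": "published"}},
    "_source": ["title", "updated_at"],
}
```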
When querying large datasets (terabytes of data), you have to optimize both your query and your indexing structure for your queries to perform effectively. For example, if you have a time-series dataset you can apply an index-time sort on the date/time field your data uses. Then, when querying this data, you can filter on that date (e.g. get the last three months' worth of data that matches 'xyz'). Because the data is presorted, the Elasticsearch query can limit its search to only the shards that contain data within that time frame, eliminating the need to search across potentially hundreds of other shards.
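A sketch of the index-sort settings described above (note that index sorting must be set at index creation time; the `@timestamp` field name follows common convention):

```python
# Index sorted by @timestamp descending at index time, so queries that
# sort or filter by time can terminate early instead of scanning
# every segment.
settings = {
    "settings": {
        "index": {
            "sort.field": "@timestamp",
            "sort.order": "desc",
        }
    },
    "mappings": {
        "properties": {"@timestamp": {"type": "date"}}
    },
}
```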
When using a multi-match query to search across multiple fields, your query can slow down considerably if there are dozens or hundreds of fields to search. Each field that is queried takes additional processing power and memory. Ideally you want to limit your searches to the fields that contain the most relevant data. For example, if your search query is for an ecommerce site and you are searching across a media URL field, it's unlikely you will get any relevant results from that field, so it would be best not to include it in your query.
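A sketch of a multi_match restricted to a few curated fields, with per-field boosts (the field names and boosts are hypothetical):

```python
# Three curated fields instead of a broad "fields": ["*"] pattern;
# the ^N suffix boosts a field's contribution to the score.
query = {
    "query": {
        "multi_match": {
            "query": "wireless headphones",
            "fields": ["title^3", "description", "brand^2"],
        }
    }
}
```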
Expensive queries can wreak havoc on a cluster if not properly implemented. They are best to be avoided entirely whenever possible. Whenever they are considered necessary, it is best to implement them in the most efficient manner possible in order to take advantage of the built-in caching mechanisms. Hopefully these suggestions can help you build much more efficient queries in your search application!
You may also want to check our Query Analytics solution in Pulse for Elasticsearch. It was designed to isolate the queries that are heavier to run, diagnose them and recommend alternative, faster methods of achieving the same. Sounds interesting? Reach out to us to see a demo.