The costs associated with Elasticsearch's n-gram tokenizer are not documented enough, and it's being widely used with severe consequences to cluster cost and performance. In this post we will go through the use-cases where it's useful, and suggest alternative, more efficient approaches.
Elasticsearch and OpenSearch are the de-facto standard engines for text search, and both are often used as search engines over text documents corpus - for internal usage or serving external users.
Most searches are done using the default, built-in analyzers and tokenizers that break the text into tokens. Those are usually just words. For example, using the default analyzer
The quick brown fox jumps over the lazy dog becomes a list of words
[ the, quick, brown, fox, jumps, over, the, lazy, dog ], each of those is then searchable. This is what we often call “full-text search”, that is - finding a document by a single word or a phrase that exists in it.
For some use cases more specialized analyzers are used which do things a little differently, not necessarily emitting words in a human language, but some other output that is useful in other ways.
One of those specialized components in Elasticsearch is the n-gram tokenizer. It allows you to tokenize the text differently and by that enables some different and additional use-cases. Let’s explore what it does, how it can be used, and what common misuse we often see.
The n-gram tokenizer creates "chunks" of characters, emitting tokens that are not necessarily words like the standard analyzer, but are consecutive characters from those words instead. For the string "Quick Fox" the output would be (using a configuration of min length 1 and max length 2 for grams):
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
You can see, rather than just emitting the two words [ quick, fox ] like the standard analyzer would, the n-gram tokenizer emits “grams” of characters from the words, and those are the tokens that end up in the index.
In this instance, you would go from two tokens using the standard analyzer to 17 using the n-gram tokenizer. Over 8x the number of tokens were emitted, and this is with a minimal configuration for the min and max grams and only a two word (or 9 character) search string! You can just imagine how the number of tokens can explode with larger search strings and/or a larger gap between min and max grams (more on this later). You can read more about the ngram tokenizer here.
Another variant is the Edge n-gram tokenizer which emits less tokens thus is suitable for fewer of the use-cases detailed below, but is still rather expensive to use.
There is a lot of advice on the net about cases where you supposedly should use n-grams, and the advice is usually surrounding very specific use cases.
- Spelling mistakes - Typos are common in search. When someone is typing "dainel" and actually searching for "daniel", the default analyzers will fail to make a match. Using n-grams it is possible to overcome this, as there will be several token matches (even if not all of them) for that search such as "da" and "el". Hence the advice to use the ngram analyzer to still support matching in those cases.
- Search-as-you-type - Performing a search query for every character a user types in. This is an attempt to provide valid results before the user even finishes typing the query and making an educated guess as to what a user might want to search for (usually based on some text the user has already entered). For example, suggesting the user search for the term ‘socket wrench’ if he/she types in ‘socke’.
- Prefix searches - Searching for text at the beginning of a term (e.g. "qu" should match "quick"). Since the prefix will be indexed as a token of its own, this will “just work” without requiring the use of a prefix query, which is categorized as an “expensive query”.
- Suffix searches - The same as prefix searches, but often required to match by just the last characters. This is very often a requirement with things like license plates and phone numbers search (e.g. search the last 4 digits of a phone number)
- Infix searches - Similarly, a requirement to find a set of characters in the middle of the word. We usually see this in phone- or license-plate searches; or in products that have previously been using a SQL engine and the LIKE operator and were just converted to use Elasticsearch.
Though n-grams can be used for each of these use cases, there are often better, more efficient means to get the desired results. Moreover, if n-grams are to be used, there are often better, more efficient ways to configure them. Before we look at these alternatives, let’s review some of the more common mistakes that are made when using n-grams.
The root issue with ngrams, is the explosion of work they bring with them. Like we saw, a simple 2 word phrase is indexed as 18 tokens in the most simple case. This means much more CPU is required for indexing or search, whenever ngrams are generated from the text; and also a whole lot of storage. Indexes with ngram fields are significantly larger than what they could have been.
N-gram search also pretty much destroys relevance ranking - tf/idf and BM25 ranking is in effect but pretty much useless for all intents and purposes.
Those costs become a lot worse when n-grams are misused or misconfigured:
- Using a large token min-max gap - This was alluded to in the example given in the overview. It is a very common problem where we see a min gram of 1 and max gram of 10 and even sometimes much higher than that. This eventually results in an exponential explosion of tokens in the index which leads to significantly bloated disk usage and very slow indexing and search operations. Modern Elasticsearch versions are now forbidding this, but it’s still very common to see in older clusters.
- Using the n-gram tokenizer at both index and query time - This means you are not only increasing your indexing pressure by emitting and storing all these tokens to the index, but you are also instructing Elasticsearch to tokenize the incoming text that a user types into the search box into all of these tokens as well and bloat the query. Analyzing the input text with the n-gram tokenizer would emit dozens of tokens that then need to be searched which would lead to very poor query performance, and except for the spelling error correction use-case, it’s just pure waste of resources.
- Tokenizing unneeded character classes - The n-gram tokenizer lets you configure what character classes you wish to tokenize including letters, numbers, punctuation, etc. Oftentimes it can be configured to include all or most of the character classes when some of these characters provide no real value to the search experience. This just ends up leading to more tokens being stored and searched over when there will never really be any matches on these tokens.
Now that we’ve reviewed some of the common mistakes that are made when using the n-gram tokenizer and the costs associated with using it, let’s review some of the alternatives.
To recommend some alternatives, let’s look back at the list of use cases for n-grams:
- Spelling mistakes and Query suggestions - Spelling mistakes can be better handled by correctly using the term and phrase suggesters for matching on spelling mistakes or mistypes. A great “did you mean?” functionality can be achieved by using either, or sometimes both, of those Suggester APIs in tandem with great search; thus eliminating the use of ngrams almost completely (because Phrase Suggester will use ngrams internally).
- Search-as-you-type - This use case is actually where n-grams shine, but it has to be implemented correctly in order to have the most meaningful impact. Utilizing the search-as-you-type field type is typically the simplest, most elegant solution. It even includes an optimization for prefix queries that transforms them into term queries for increased performance. We also recommend using the completion suggester and context suggester as alternatives, more disk and CPU efficient approaches.
- Prefix searches - Elasticsearch specifically has its own built-in prefix query that is optimized for this type of search. If using prefix queries on a field, you can set the index_prefixes mapping parameter to better improve query efficiency. Again, take care not to set the
max_charsparameter to a large value. The default of five will suffice for most use cases, but this is not a requirement and prefix queries can run great on any field as long as you make sure the prefix is at least 2-3 characters long.
- Suffix searches - The reverse token filter can efficiently be used for this use case. Here’s how it works: essentially the text is stored in reverse order, so in the case of a phone number such as 855-123-4567 it would be stored as 7654321558 (omit the dashes in analysis). Then when searching for the last 4 digits of the phone number you would use a match_phrase_prefix query to match the reversed output. Searching for 4567 would reverse the input to 7654 and the match_phrase would match the reversed text.
- Infix searches - In our experience, this is often a misconception and a “bug” in product requirements. Oftentimes the performance and cost trade-off are not well-understood; or a requirement for a product coming from a SQL engine and trying to replace the LIKE operator there. Instead, the word delimiter graph token filter can be used for use cases requiring infix searches. Instead of using n-grams which are a brute-force approach, use this token filter which surgically extracts meaningful tokens to search on (while also optionally keeping existing tokens), thus enabling infix searches for things that matter with exponentially less cost.
As with all technologies, some features in Elasticsearch and OpenSearch need to be better understood to avoid mis-use. Unfortunately, the costs associated with n-grams are not documented enough, and there are too many blog posts out there just suggesting this feature without worrying about the consequences. Our experts were able to cut cluster costs by half and sometimes more by simply avoiding ngrams. Feel free to reach out to consult us about your clusters, and let’s see how we can boost performance and reduce costs for your cluster too!