There's More Lucene in Solr than You Think!
There is an interesting blog post about Lucene & Solr consultancy and training services and how these two technologies are perceived by different companies and their technical teams. The blog post highlights how many Solr users do not realize the importance of understanding the concepts of Lucene and provides some interesting examples too.
Apache Solr and Lucene 3.6.0 released
The Lucene PMC is pleased to announce the release of Apache Solr and Lucene 3.6.0. As this may be the last release from the 3.x line of releases, it is highly recommended that users upgrade. The releases include the new Kuromoji morphological analysis framework for Japanese, improvements to suggester implementations and query-time joining, a new SolrJ client connector based on Apache HttpComponents, and a wide variety of bug fixes. Solr and Lucene 3.6.0 can be downloaded from here and here respectively.
Different ways to make auto suggestions with Solr
Nowadays almost every website has a full-text search box with an auto-suggestion feature that helps users find what they are looking for by typing as few characters as possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typos. It is a meaningful example because it offers multi-term suggestions based on the most popular queries, combined with spelling correction.
There are different ways to make auto-complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is to compare the available options in order to identify the best solution for your needs, rather than to describe any one specific approach in depth.
It's common practice to base auto-suggestions on the indexed data. A user is usually looking for something that can be found within the index, which is why we'd like to show words that are similar to the current query and at the same time relevant within the index. It is also recommended to provide query suggestions: we can, for example, capture all the user queries that return more than zero results and index them in a dedicated Solr core, then use that information to make auto-suggestions as well. What actually matters is that we are going to make suggestions based on what's inside the index. For this purpose it doesn't matter whether the index contains user queries or “normal data”; the solutions we are going to consider can be applied in both cases.
Some questions before starting
In order to make the right choice you should first of all ask yourself some questions:
- Which Solr version are you working with? If you're working with an old version (1.x for example), it is worth upgrading. If you can't upgrade you'll probably have fewer options to choose from, unless you're willing to apply some patches manually.
- Do you want to make single term or multiple term suggestions? You should basically decide if you want to suggest single words which can complete the word the user has partially written, or even complete sentences.
- Do you want to filter the suggestions based on the actual search? The user could have previously selected a facet entry, filtering their results to a specific subset. Every search should match that specific context, so it is common practice to have the auto-suggestions reflect the user's filters. Unfortunately, some of the available solutions don't support filtering at all.
- How do you want to sort the auto-suggestions? It's important to show the best suggestion at the top, and each solution you are going to explore offers different sorting options.
- Do you want to make auto-suggestions based on multivalued fields? Multivalued fields are commonly used for tags, for example, since every document can have more than one tag and you may want to suggest a tag while the user is typing it.
- Do you want to make auto-suggestions based on prefix queries or even infix queries? While it's always possible to suggest words starting with a prefix, not all the solutions are able to suggest words that contain the actual query.
- What's the impact of each solution in terms of performance and index size? The answer depends on the index you're working with and needs to take into account that some solutions can increase the index size, while all of them will affect performance.
Faceting using the prefix parameter
The first option has been available since Solr 1.2. It is based on a special facet that includes only the values starting with the prefix the user has partially typed, which is passed through the facet.prefix parameter. This solution works only for single-term suggestions starting with a particular prefix (not infix), and you can sort results only alphabetically or by count. It works even with multivalued fields, and it is possible to apply any filter queries so that the suggestions reflect the current context of the search.
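To make the request shape concrete, here is a minimal sketch in Python that builds such a facet request URL. The host, handler path, field name and filter are assumptions made up for the illustration, and the function name facet_prefix_url is hypothetical; only the facet and fq parameter names come from Solr itself.

```python
from urllib.parse import urlencode


def facet_prefix_url(base_url, field, prefix, filters=(), limit=10):
    """Build a Solr select URL that returns facet values starting with `prefix`.

    `base_url`, `field` and `filters` are placeholders chosen for this sketch.
    """
    params = [
        ("q", "*:*"),               # match everything; only the facet counts matter
        ("rows", "0"),              # no documents needed, just the facet values
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", prefix),   # what the user has typed so far
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),    # skip values with no matching documents
        ("facet.sort", "count"),    # Solr 1.4+ syntax; older versions use true/false
        ("wt", "json"),
    ]
    # Each filter query narrows the suggestions to the current search context.
    params += [("fq", fq) for fq in filters]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(facet_prefix_url(
        "http://localhost:8983/solr/select",  # assumed local Solr instance
        "tags",                               # hypothetical multivalued field
        "sol",
        filters=["category:books"],           # hypothetical facet selection
    ))
```

Since the facet values are indexed terms, the prefix usually has to be normalized on the client side (lowercased, for instance) in the same way the field's analysis chain normalizes the terms.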
Use of NGrams as part of the analysis chain
The second solution has been available since Solr 1.3 and relies on the use of NGramFilterFactory or EdgeNGramFilterFactory as part of the analysis chain. It means you'll have a dedicated field that can be searched by typing word fragments, much as wildcard queries would allow. Every word in the index will be split into several NGrams; you can reduce the number of NGrams (and the size of the index) by increasing the minGramSize parameter or by switching to the EdgeNGramFilterFactory, which works in only one direction, by default from the beginning edge of an input token. With NGramFilterFactory you can use infix and prefix queries, while with EdgeNGramFilterFactory only prefix queries are possible. This is a really flexible way to make auto-suggestions, since it relies on a specific field with its own configurable analysis chain. You can easily filter your results and have them sorted by relevance, also using boosting and the eDisMax query parser. Furthermore, this solution is faster than the previous one. On the other hand, if we want to make auto-suggestions based on a field which contains many terms, we should consider that the index size will increase considerably: with EdgeNGrams, each term is indexed as up to (term length - minGramSize + 1) grams, capped by maxGramSize. This option works even with multivalued fields, but the index size would obviously increase even more.
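The effect on index size is easier to see with a small illustration. The following Python sketch only mimics what the two filters emit for a single token; it is not Solr code, the function names are hypothetical, and the default gram sizes are arbitrary values picked for the example.

```python
def edge_ngrams(term, min_gram=1, max_gram=15):
    """Grams anchored at the start of the term, as a front-edge n-gram filter would emit."""
    longest = min(len(term), max_gram)
    return [term[:size] for size in range(min_gram, longest + 1)]


def ngrams(term, min_gram=1, max_gram=2):
    """All substrings whose length is between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # Slide a window of the current size over the term.
        for start in range(len(term) - size + 1):
            grams.append(term[start:start + size])
    return grams


if __name__ == "__main__":
    # Prefix matching only: 4 - 2 + 1 = 3 grams for a four-letter term.
    print(edge_ngrams("solr", min_gram=2))
    # Infix matching as well, at the cost of more grams per term.
    print(ngrams("solr", min_gram=2, max_gram=3))
```

Raising min_gram shrinks both lists, which is the same trade-off as increasing minGramSize in the analysis chain: fewer indexed grams, but suggestions only start once the user has typed that many characters.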
Use of the TermsComponent
One more solution, available since Solr 1.4, is based on the TermsComponent, which provides access to the indexed terms in a field and the number of documents that match each term. This option is even faster than the previous one. You can make prefix queries using the terms.prefix parameter, or infix queries using the terms.regex parameter, which is available starting from Solr 3.1. Only single-term suggestions are possible, and unfortunately you can't apply any filter. Furthermore, user queries will not be analyzed in any way: you'll have access to the raw indexed data, which means you could have problems with whitespace or case-sensitive queries, since you'll be searching directly through the indexed terms.
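As a rough sketch of the two request styles, the Python snippet below builds a TermsComponent URL for either a prefix or an infix lookup. The handler path, host and field name are assumptions for the illustration, terms_url is a hypothetical helper, and the infix pattern assumes the regular expression has to match the whole term.

```python
import re
from urllib.parse import urlencode


def terms_url(base_url, field, typed, infix=False, limit=10):
    """Build a TermsComponent URL for prefix (default) or infix suggestions."""
    params = [
        ("terms", "true"),
        ("terms.fl", field),           # field whose indexed terms are listed
        ("terms.limit", str(limit)),
        ("wt", "json"),
    ]
    if infix:
        # Escape the user input so it is treated literally, then allow
        # anything before and after it (assumes whole-term matching).
        params.append(("terms.regex", ".*" + re.escape(typed) + ".*"))
        params.append(("terms.regex.flag", "case_insensitive"))
    else:
        params.append(("terms.prefix", typed))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base = "http://localhost:8983/solr/terms"  # assumed handler path
    print(terms_url(base, "name", "sol"))
    print(terms_url(base, "name", "ol", infix=True))
```

Because the typed text is not analyzed, the prefix variant still needs client-side normalization (lowercasing, for example) when the indexed terms were lowercased at index time.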
Use of the Suggester
Due to the limitations of the above solutions, Solr developers have worked on a new component created exactly for this task. This option is the most recent and the recommended one, available since Solr 3.1 and based on the SpellCheckComponent, the same component you can use for spelling correction. What's new is the SolrSpellChecker implementation that makes suggestions, called Suggester, which relies on the Lucene suggest module. It all started with the SOLR-1316 issue, from which the Suggester was created. The collate functionality was then improved with the SOLR-2010 issue. After that, the task was finalized with LUCENE-3135, which backported the Lucene suggest module used by the Solr Suggester class to the 3.x branch. This solution has its own separate index, which you can build automatically on every commit. Using collation you can have multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes this solution even more flexible.
The following table contains pros and cons for each solution I mentioned above, from the slowest to the fastest one. Even though the last option is the most flexible, it requires more tuning. Of course more power also means more responsibility, so if your requirements are just single-term suggestions with filtering and you don't have particular performance problems, the old-fashioned facet approach works perfectly out of the box.
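Since the Suggester sits behind the SpellCheckComponent, a suggestion request uses the usual spellcheck parameters. The Python sketch below builds such a request; the handler path /suggest, the dictionary name "suggest" and the helper name suggester_url are assumptions that depend entirely on how the component is configured in your solrconfig.xml.

```python
from urllib.parse import urlencode


def suggester_url(base_url, typed, dictionary="suggest", count=5):
    """Build a request for a Suggester exposed through the SpellCheckComponent."""
    params = [
        ("q", typed),                            # partial query typed by the user
        ("spellcheck", "true"),
        ("spellcheck.dictionary", dictionary),   # must match the configured name
        ("spellcheck.count", str(count)),        # suggestions per term
        ("spellcheck.onlyMorePopular", "true"),  # prefer more frequent suggestions
        ("spellcheck.collate", "true"),          # ask for a combined multi-term suggestion
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed handler path on a local Solr instance.
    print(suggester_url("http://localhost:8983/solr/suggest", "solr su"))
```

In practice most of these parameters would be set once as defaults on the request handler, leaving only q to be sent by the client.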
This blog entry has hopefully shown you some ways in which you can use auto-suggestions with Solr, along with the related pros and cons. I hope this will help you make the right choices for your requirements from the beginning. Please do share any additional considerations I may not have covered, as well as your own experiences. We're also intrigued to hear how you deal with the same problems in your search applications. Leave a comment or ask a question if you have any doubts!
Berlin Buzzwords is back and will take place on 4th & 5th June 2012! It's a conference for developers and users of open source software projects, focusing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations courtesy of international speakers specific to the three tags: "search", "store" and "scale". It goes without saying of course that many of the contributors from our SearchWorkings.org community site will be present too.
Furthermore, based on last year's success there will also be an opportunity to participate in search-related training sessions (with a massive discount for Berlin Buzzwords attendees), organized by some of our contributors. For more information click here.
Lucene Revolution 2012 will take place in Boston on May 7-10. It's the largest conference for the Apache Lucene / Solr open source search community. A large contingent of the project committers will be there, as well as most of the 400+ fellow Lucene / Solr enthusiasts. The two-day agenda consists of 40 sessions, workshops, panels and keynotes dedicated to all things related to open source search. We're proud that many of our contributors have been invited as speakers; more information is available here. Furthermore, you can gain deeper insights into Lucene, Solr and Big Data by attending a two-day training workshop, which will take place May 7-8. Register now and take advantage of special savings! Visit lucenerevolution.org for more information.