Apache Solr is an open-source, Java-based search server built on top of Lucene. It comes in two flavors:
- A standalone enterprise application that can be installed as a service. It has a RESTful architecture and is platform and programming language agnostic. Updates (document indexing) and queries are executed using simple HTTP GET and POST requests.
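As a sketch of what this HTTP interaction looks like (the host, port and core path below are illustrative defaults, not part of any particular deployment):

```
# Index a document, then commit
POST http://localhost:8983/solr/update?commit=true
Content-Type: text/xml

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Apache Solr in action</field>
  </doc>
</add>

# Query for it, asking for a JSON response
GET http://localhost:8983/solr/select?q=title:solr&wt=json
```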
- An embedded library that can run as a service within another application. Communication with this service happens through a direct API and does not require an intermediate protocol such as HTTP.
Although the embedded solution is superior to the standalone one in terms of raw performance, the latter is still regarded as the preferred approach in most cases, mainly due to its language/platform-neutral nature and its enterprise readiness in terms of clustering and infrastructure topology.
Much of Solr's indexing and searching is implemented on top of the Lucene library and other complementary libraries. Features supported by Solr out of the box include spell checking, result highlighting and 'more like this' suggestions. One of the most powerful features Solr adds to Lucene is the notion of a schema. While with Lucene it is possible to index any document with any set of fields, Solr's schema enforces more structure on the indexed data, which in turn enables Solr to make more assumptions about the data and thus optimize its work even further. Solr also includes multiple forms of caching, such as document, query result and field caching.
To handle search requests, Solr provides RequestHandlers, which can be extended and configured to support almost any kind of request. RequestHandlers can either handle the request themselves or pass it on to a series of SearchComponents. SearchComponents are smaller, more reusable parts of the search process and can be shared by multiple RequestHandlers. A commonly used SearchComponent is FacetComponent, which implements faceted classification search. Once the result of a search has been computed, Solr's ResponseWriters allow it to be returned in almost any format.
Solr is highly configurable through a series of XML documents such as solrconfig.xml and schema.xml. In these documents it is possible to define the schema for documents, add and remove text analyzers, adjust caching mechanisms and control RequestHandlers. Extensions to Solr are usually wired up in these documents.
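As a minimal, illustrative sketch (the field names here are made up for the example), a schema.xml fragment declaring the structure the index will enforce could look like this:

```
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
```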
A must-have for a true Solr specialist, and you can feel good about the purchase too, knowing that 5% of each sale goes to support the Apache Software Foundation. We encourage you to buy directly from the publisher, since the base amount from which the ASF's (and the author's) percentage is calculated is then higher than if you buy through other channels.
This book naturally covers the latest features in Solr as of version 3.4, such as Result Grouping and Geospatial, but it is not a small update to the first book. No chapter has been left untouched: faceting gets its own chapter, all search relevancy matters are discussed in one chapter, auto-complete approaches are all discussed together, much of the chapter on integration has been rewritten to cover newer technologies, and the first chapter has been greatly streamlined. Furthermore, each chapter opens with a tip advising readers in a hurry which parts to read now and which to save for later. In summary: a new and improved version of the existing content, with about 25% more by page count.
The author, David Smiley, and his co-author, Eric Pugh, have worked hard on this book, and it's a great resource for our search community.
Different ways to make auto suggestions with Solr
Nowadays almost every website has a full-text search box with an auto-suggestion feature that helps users find what they are looking for while typing as few characters as possible. The example below shows what this feature looks like in Google: it progressively suggests how to complete the current word and/or phrase, and corrects typos. It's a meaningful example, combining multi-term suggestions based on the most popular queries with spelling correction.
There are different ways to make auto-complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is to compare the available options in order to identify the solution best tailored to your needs, rather than describe any one specific approach in depth.
It's common practice to make auto-suggestions based on the indexed data. After all, a user is usually looking for something that can be found within the index, which is why we'd like to show words that are similar to the current query and at the same time relevant within the index. It can also be worthwhile to provide query suggestions: for example, we can capture all the user queries that return more than zero results, index them in a dedicated Solr core, and use that information to make auto-suggestions as well. What actually matters is that we make suggestions based on what's inside an index; for this purpose it doesn't matter whether the index contains user queries or "normal" data, as the solutions we are going to consider can be applied in both cases.
Some questions before starting
In order to make the right choice you should first of all ask yourself some questions:
- Which Solr version are you working with? If you're working with an old version (1.x, for example), it is worth an upgrade. If you can't upgrade, you'll probably have fewer options to choose from, unless you're willing to manually apply some patches.
- Do you want to make single-term or multi-term suggestions? Basically, you should decide whether you want to suggest single words that complete the word the user has partially typed, or entire sentences.
- Do you want to filter the suggestions based on the actual search? The user may have previously selected a facet entry, narrowing the results to a specific subset. Every search should then match that specific context, so it is common practice to have the auto-suggestions reflect the user's filters. Unfortunately, some of the available solutions don't support filtering at all.
- How do you want to sort the auto-suggestions? It's important to show the best suggestions first, and each solution you are going to explore offers different sorting options.
- Do you want to make auto-suggestions based on multivalued fields? Multivalued fields are commonly used for tags, for example, since every document can have more than one tag, and you may want to suggest a tag while the user is typing it.
- Do you want to make auto-suggestions based on prefix queries or even infix queries? While it's always possible to suggest words starting with a given prefix, not all the solutions are able to suggest words that merely contain the typed fragment.
- What's the impact of each solution in terms of performance and index size? The answer depends on the index you're working with, and it needs to take into account that some solutions can increase the index size, while all of them will affect performance.
Faceting using the prefix parameter
The first option has been available since Solr 1.2 and is based on a special facet that includes only the terms starting with the prefix the user has partially typed, making use of the facet.prefix parameter. This solution works only for single-term suggestions starting with a particular prefix (not infix), and you can sort results only alphabetically or by count. It works even with multivalued fields, and it is possible to apply filter queries so that the suggestions reflect the current context of the search.
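Assuming a multivalued tags field and a user who has typed "so" so far (the field name, prefix and filter below are illustrative), a request along these lines would return, as facet counts, the tags starting with that prefix, while the fq parameter restricts the suggestions to the current search context:

```
http://localhost:8983/solr/select?q=*:*&rows=0
    &facet=true&facet.field=tags
    &facet.prefix=so&facet.mincount=1&facet.limit=10
    &fq=category:books
```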
Use of NGrams as part of the analysis chain
The second solution is available from Solr 1.3 and relies on the use of NGramFilterFactory or EdgeNGramFilterFactory as part of the analysis chain. This means you'll have a dedicated field that can be searched by typing word fragments. Every word in the index is split into several n-grams; you can reduce the number of n-grams (and the size of the index) by increasing the minGramSize parameter or by switching to EdgeNGramFilterFactory, which works in only one direction, by default from the front edge of an input token. With NGramFilterFactory you can use both infix and prefix queries, while with EdgeNGramFilterFactory only prefix queries. This is a really flexible way to make auto-suggestions, since it relies on a dedicated field with its own configurable analyzer chain. You can easily filter your results and have them sorted by relevance, also using boosting and the eDisMax query parser. Furthermore, this solution is faster than the previous one. On the other hand, if we want to make auto-suggestions based on a field that contains many terms, we should consider that the index size will increase considerably, since for each term we index a number of grams equal to term length − minGramSize (using edge n-grams). This option also works with multivalued fields, but the index size would obviously increase even more.
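A possible field type for this approach (the parameter values are just a starting point to tune): grams are produced only at index time, while the query-side analyzer leaves the user's input whole so it matches the indexed grams directly:

```
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Swapping EdgeNGramFilterFactory for NGramFilterFactory in the index analyzer enables infix matching, at the cost of an even larger index.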
Use of the TermsComponent
One more solution, available from Solr 1.4, is based on the TermsComponent, which provides access to the indexed terms in a field and the number of documents that match each term. This option is even faster than the previous one: you can make prefix queries using the terms.prefix parameter, or infix queries using the terms.regex parameter, available starting from Solr 3.1. Only single-term suggestions are possible, and unfortunately you can't apply any filter. Furthermore, user queries will not be analyzed in any way; you'll have access to the raw indexed data, which means you could have problems with whitespace or case-sensitive queries, since you'll be searching directly through the indexed terms.
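For example, assuming the /terms request handler from the example solrconfig.xml is enabled (the field name and fragments here are illustrative), a prefix and an infix request could look like this:

```
# prefix: terms starting with "so"
http://localhost:8983/solr/terms?terms.fl=title&terms.prefix=so&terms.limit=10

# infix (Solr 3.1+): terms containing "olr"
http://localhost:8983/solr/terms?terms.fl=title&terms.regex=.*olr.*
    &terms.regex.flag=case_insensitive&terms.limit=10
```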
Use of the Suggester
Due to the limitations of the above solutions, the Solr developers have worked on a new component created exactly for this task. This option is the most recent and recommended one, available since Solr 3.1 and based on the SpellCheckComponent, the same component you can use for spelling correction. What's new is the SolrSpellChecker implementation that makes the suggestions, called Suggester, which in turn makes use of the Lucene suggest module. It all started with the SOLR-1316 issue, out of which the Suggester was created. Then the collate functionality was improved with the SOLR-2010 issue. After that, the work was finalized with LUCENE-3135 by backporting the Lucene suggest module to the 3.x branch, where it is used by the Solr Suggester class. This solution has its own separate index, which you can automatically rebuild on every commit. Using collation you can have multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes this solution even more flexible.
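A typical solrconfig.xml setup for this approach (the field name and lookup implementation are just one possible choice) wires a Suggester-based spellchecker into a dedicated request handler:

```
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">title</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

A request like /suggest?q=so would then return up to five suggestions, plus a collated multi-term suggestion when collation applies.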
The following table contains pros and cons for each solution mentioned above, from the slowest to the fastest. Even though the last option is the most flexible, it requires more tuning. Of course, more power also means more responsibility, so if your requirements are just single-term suggestions with filtering and you don't have particular performance problems, the old-fashioned facet approach works perfectly out of the box.
This blog entry has hopefully shown you some of the ways you can implement auto-suggestions with Solr, along with their pros and cons. I hope it helps you make the right choices, tailored to your requirements, from the beginning. Please do share any additional considerations I may not have covered, as well as your own experiences. We're also intrigued to hear how you deal with these problems in your search applications. Leave a comment, or ask a question if you have any doubts!