Featured Items
News Bites
There's More Lucene in Solr than You Think!
There is an interesting blog post about Lucene & Solr consultancy and training services and how these two technologies are perceived by different companies and their technical teams. The blog post highlights how many Solr users do not realize the importance of understanding the concepts of Lucene and provides some interesting examples too.
Apache Solr and Lucene 3.6.0 released
The Lucene PMC is pleased to announce the release of Apache Solr and Lucene 3.6.0. As this may be the last release from the 3.x line of releases, it is highly recommended that users upgrade. The releases include the new Kuromoji morphological analysis framework for Japanese, improvements to suggester implementations and query-time joining, a new SolrJ client connector based on Apache HttpComponents, and a wide variety of bug fixes. Solr and Lucene 3.6.0 can be downloaded from here and here respectively.
Blogs
New index statistics in Lucene 4.0
Fortunately, scoring is vastly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, language models, Divergence from Randomness models and information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.
To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":
- document 1: The Lion, the Witch, and the Wardrobe
- document 2: The Da Vinci Code
Assume we tokenize on whitespace, commas are removed, all terms are downcased and we don't discard stop-words. Here are the statistics Lucene tracks:
- TermsEnum.docFreq() - How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.
- Terms.getSumDocFreq() - Number of postings, i.e. sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.
- TermsEnum.totalTermFreq() - Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.
- Terms.getSumTotalTermFreq() - Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.
- Terms.getDocCount() - How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.
- Terms.getUniqueTermCount() - How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).
- Fields.getUniqueTermCount() - Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.
- Fields.getUniqueFieldCount() - Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.
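To double-check the numbers above by hand, here is a minimal plain-Java sketch that recomputes them for the two example titles. It does not use Lucene at all: the class IndexStatsSketch and its method names are made up for this illustration and only mirror the statistic names, and the tokenizer is a rough stand-in for the analysis assumed above (split on whitespace, strip commas, downcase, keep stop-words).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: none of these methods are Lucene APIs.
final class IndexStatsSketch {

    // The two example "title" values from the text.
    static List<String> exampleTitles() {
        return Arrays.asList(
                "The Lion, the Witch, and the Wardrobe",
                "The Da Vinci Code");
    }

    // Split on whitespace, strip commas, downcase, keep stop-words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.replace(",", "").toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Per term: number of documents containing it at least once (docFreq).
    static Map<String, Integer> docFreqs(List<String> docs) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(tokenize(doc))) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Per term: number of occurrences across all documents (totalTermFreq).
    static Map<String, Long> totalTermFreqs(List<String> docs) {
        Map<String, Long> freqs = new TreeMap<>();
        for (String doc : docs) {
            for (String term : tokenize(doc)) {
                freqs.merge(term, 1L, Long::sum);
            }
        }
        return freqs;
    }

    // Number of postings: docFreq summed over all terms (sumDocFreq).
    static long sumDocFreq(List<String> docs) {
        long sum = 0;
        for (int df : docFreqs(docs).values()) {
            sum += df;
        }
        return sum;
    }

    // Number of term occurrences in the field (sumTotalTermFreq).
    static long sumTotalTermFreq(List<String> docs) {
        long sum = 0;
        for (long ttf : totalTermFreqs(docs).values()) {
            sum += ttf;
        }
        return sum;
    }

    // Documents with at least one term in the field (docCount).
    static int docCount(List<String> docs) {
        int count = 0;
        for (String doc : docs) {
            if (!tokenize(doc).isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = exampleTitles();
        System.out.println("docFreq(lion)       = " + docFreqs(docs).get("lion"));
        System.out.println("docFreq(the)        = " + docFreqs(docs).get("the"));
        System.out.println("sumDocFreq          = " + sumDocFreq(docs));
        System.out.println("totalTermFreq(the)  = " + totalTermFreqs(docs).get("the"));
        System.out.println("sumTotalTermFreq    = " + sumTotalTermFreq(docs));
        System.out.println("docCount            = " + docCount(docs));
        System.out.println("uniqueTermCount     = " + docFreqs(docs).size());
    }
}
```

Running main should reproduce the values listed above (1, 2, 9, 4, 11, 2 and 8). A real index computes these per segment during indexing rather than by re-tokenizing the documents, so treat this purely as a way to see where each number comes from.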
3.x indices only store TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.
In addition to what's stored in the index, these statistics are also available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method, for both 3.x and 4.0:
- length - How many tokens in the document. For document 1 it's 7; for document 2 it's 4.
- uniqueTermCount - For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.
- maxTermFrequency - What was the count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.
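The three per-document values can be checked the same way. The sketch below is plain Java, not Lucene: PerDocStatsSketch is a hypothetical class whose field names merely echo the statistic names above, and its tokenization again assumes whitespace splitting, comma removal and downcasing.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only: this is not Lucene's FieldInvertState.
final class PerDocStatsSketch {
    final int length;            // number of tokens in the field
    final int uniqueTermCount;   // number of distinct terms in the field
    final int maxTermFrequency;  // count of the most frequent term

    PerDocStatsSketch(String fieldText) {
        Map<String, Integer> freqs = new HashMap<>();
        int tokens = 0;
        int maxFreq = 0;
        for (String raw : fieldText.split("\\s+")) {
            String term = raw.replace(",", "").toLowerCase(Locale.ROOT);
            if (term.isEmpty()) {
                continue;
            }
            tokens++;
            // merge returns the updated count for this term
            maxFreq = Math.max(maxFreq, freqs.merge(term, 1, Integer::sum));
        }
        this.length = tokens;
        this.uniqueTermCount = freqs.size();
        this.maxTermFrequency = maxFreq;
    }

    public static void main(String[] args) {
        PerDocStatsSketch doc1 = new PerDocStatsSketch("The Lion, the Witch, and the Wardrobe");
        PerDocStatsSketch doc2 = new PerDocStatsSketch("The Da Vinci Code");
        System.out.println(doc1.length + " " + doc1.uniqueTermCount + " " + doc1.maxTermFrequency);
        System.out.println(doc2.length + " " + doc2.uniqueTermCount + " " + doc2.maxTermFrequency);
    }
}
```

For the example documents this should print 7 5 3 for document 1 and 4 4 1 for document 2, matching the list above.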
In 3.x, if you want to consume these indexing-time statistics, you'll have to save them away yourself (e.g., somehow encoding them into the single-byte norm value). However, since 4.0 uses doc values for norms, you have more freedom to encode these statistics however you'd like. Your custom similarity can then pull from these.
From these available statistics you're now free to derive other commonly used statistics:
- Average document length across the collection is Terms.getSumTotalTermFreq() divided by maxDoc (or by Terms.getDocCount(), if not all documents have the field).
- Average within-document term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.
- Average number of unique terms per document is Terms.getSumDocFreq() divided by maxDoc (or by Terms.getDocCount(), if not all documents have the field).
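As a quick worked example, the sketch below plugs the numbers from our two example documents into these ratios. It is plain Java arithmetic with hypothetical method names (DerivedStatsSketch is not part of Lucene); in a real similarity you would read the inputs from the statistics described above. Both documents have the title field, so maxDoc and the doc count are both 2 here.

```java
// Plain-Java illustration only: the inputs are the example values from the text.
final class DerivedStatsSketch {

    // sumTotalTermFreq / number of documents (maxDoc, or docCount if some lack the field)
    static double averageDocumentLength(long sumTotalTermFreq, int docs) {
        return (double) sumTotalTermFreq / docs;
    }

    // sumDocFreq / number of documents (maxDoc, or docCount if some lack the field)
    static double averageUniqueTermsPerDocument(long sumDocFreq, int docs) {
        return (double) sumDocFreq / docs;
    }

    // length / uniqueTermCount for one field of one document
    static double averageWithinDocTermFreq(int length, int uniqueTermCount) {
        return (double) length / uniqueTermCount;
    }

    public static void main(String[] args) {
        System.out.println(averageDocumentLength(11, 2));          // 11 / 2
        System.out.println(averageUniqueTermsPerDocument(9, 2));   // 9 / 2
        System.out.println(averageWithinDocTermFreq(7, 5));        // document 1
        System.out.println(averageWithinDocTermFreq(4, 4));        // document 2
    }
}
```

So the average document length is 5.5 tokens, the average number of unique terms per document is 4.5, and the average within-document term frequency is 1.4 for document 1 and 1.0 for document 2.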
Remember that the statistics do not reflect deleted documents, until those documents are merged away; in general this also means that segment merging will alter scores! Similarly, if the field omits term frequencies, then the statistics will not be correct (though they will still be consistent with one another: we will pretend each term occurred once per document).
Click here to read more of my blog posts.
Berlin Buzzwords is back and will take place on 4th & 5th June 2012! It's a conference for developers and users of open source software projects, focusing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations courtesy of international speakers specific to the three tags: "search", "store" and "scale". It goes without saying of course that many of the contributors from our SearchWorkings.org community site will be present too.
Registration is open so get your tickets now! More info via berlinbuzzwords.de
Furthermore, based on last year's success, there will also be an opportunity to participate in search-related training sessions (with a massive discount for Berlin Buzzwords attendees), organized by some of our contributors. For more information click here.
Lucene Revolution 2012 will take place in Boston on May 7-10. It's the largest conference for the Apache Lucene / Solr open source search community. A large contingent of the project committers will be there, as well as 400+ fellow Lucene / Solr enthusiasts. The two-day agenda consists of 40 sessions, workshops, panels and keynotes dedicated to all things related to open source search. We're proud that many of our contributors have been invited as speakers; more information is available here. Furthermore, gain deeper insights into Lucene, Solr and Big Data by attending a two-day training workshop, which will take place May 7-8. Register now and take advantage of special savings! Visit lucenerevolution.org for more information.
Lucene Today, Tomorrow & Beyond