«Back

Simon says: Single Byte Norms are Dead!

 

Apache Lucene turned 10 last year with a limitation that bugged many many users from day one. You may know Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among pure TF/IDF factors Similarity also provides a norm value per document that is, by default a float value composed out of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte and written down to the index. Lucene trades some precision lost for space on disk and eventually in memory since norms are loaded into memory per field upon first access. 

In lots of cases this precision lost is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing you end up writing your own auxiliary data structure or “fork” Lucene for your app and mess with the source. 

The upcoming version of Lucene already added support for a lot more scoring models like:

The abstractions added to Lucene to implement those models already opens the door for applications that either want to roll their own “awesome” scoring model or modify the low level scorer implementations. Yet, norms are still one byte!

Use DocValues to add scoring factors

Lets look at some real world examples where additional scoring factors are required. Imagine you have a hierarchical system like a geo-search application and beside your ordinary document boost you also want to boost your documents based on their hierarchical entity.

In the previous Lucene version you could add the hierarchy information as a payload for each term and fetch it at runtime. Using payloads however have multiple downsides; besides the redundant data you are storing you also sometimes have massive runtime costs. In Lucene 4 you can use DocValues to store your boosts / categories per document in a nice & efficient datastructure; one of my previous posts shows how they can be used for scoring. Nice, problem solved! But wait, norms are still one byte?

Exposing the power of DocValues via Similarity

The limitation of not being able to emit more than 8bits worth of information during indexing becomes important once you are looking into machine-learing problems where boosts need to be fine grained or alternative scoring models like the lnu Ÿweighting scheme (Singhal et. al. in New Retrieval Approaches in SMART: TREC 4). The lnu scheme needs to store the number of unique terms in a document as well as the total number of terms in the document. Those values won’t fit in a single byte nor can you store them in a simple DocValues field since this information is not available until the document is indexed. 

To overcome this limitation, Lucene 4 replaces it’s old single byte norms implementation in favor of DocValues which are already available for other per-document values. This already happened two weeks ago but simply changing the implementation didn’t help much. Lucene needed a way to allow the user to encode the information from FieldInvertState into the index. We decided to keep the notion of a Norm but allow the user to write out arbitrary fixed size values directly from the Similarity.  We changed the Similarity#computeNorm method accordingly:

 @Override
public void computeNorm(FieldInvertState state, Norm norm) {
    norm.setInt(state.getLength() << 16 | state.getUniqueTermCount());
}
Figure 1: Encoding field length and num unique terms as a norm value in Lucene 4.0

This allows you to encode norms in Lucene 4 with the precision and the information you need for your application especially if the defaults don’t suit your needs. Good luck and let us know how you get on!

Comments
Trackback URL:

Sign in to comment


Hi Simon,

This sounds like a really useful feature. If I understand it correctly, will I be able to store values like the real doc length instead of an encoded/decoded one? If that's the case is there an example that I can look at that shows how this can be done? I apologise in advance if the question isn't relevant. emoticon

h.

Posted on 1/21/12 1:08 PM.

Top
Hey Hany,

if you look at the new Similarity#computeNorm method you can simply use the Norm to set whatever value you want. To use the play length of your document you can get it directly from FieldInvertState#getLength().

in my example above I use 16 bits for the length and 16 bits for the uniqueTermCount but you can also use a long and therefor 32 bits for each of them.

hope that helps

Posted on 1/22/12 9:04 PM.

Top
Thanks Simon for the reply. That's really helpful. For the above computeNorm method then, what would be the decodeNormValue?

Posted on 1/23/12 12:13 AM.

Top
The reason I am asking is that if I want the new similarity functions such as BM25 to use my encoded dl then I will have to override the encodeNormValue and decodeNormValue methods. Correct?

Posted on 1/23/12 3:48 PM in reply to Hany Azzam.

Top
Hey Hany,

I understand your question. The encode/decodeNormValue only apply if you use the DefaultSimilarity. The Lucene 4 similarity defines ExactDocScoere & SloppyDocScorer, in those classes you can just load the norm values via the passed in DocValue norms. Look at the ExactTFIDFDocScorer in lucene trunk, this uses the decodeNorm during scoring but if you write different values that don't need to be decoded you don't need this method at all.
In lucene 3 this is not available unfortunately.

Posted on 1/23/12 7:16 PM.

Top
Hi Simon,

Now it all works! I am running an extended version of the BM25Similarity function in lucene 4.0 without encoded/decoded doc length. That's great emoticon Thanks a lot.

h.

Posted on 1/23/12 7:59 PM in reply to Simon Willnauer.

Top
Hany, thats great! good to see that this work is of real use!

simon

Posted on 1/23/12 8:38 PM in reply to Hany Azzam.

Top