Smoothing language model by Lucene

cheyenne.lin
I have an old implementation, Lucene-lm by ilps, which is a good start. However, that implementation doesn't include a smoothing algorithm, and I have found it particularly hard to rewrite the core scoring mechanism to support smoothing.

(Background: in a language model, smoothing gives a small nonzero weight to query terms that do not occur in a document, so the document is not scored as zero for them. Of course this makes little difference for a single keyword, but for a multiple-keyword query, when a document is strongly relevant to a few distinguishing keywords and lacks the others, smoothing can be important.)
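To make the background concrete, here is a minimal sketch in plain Java (not Lucene code) of query likelihood with Jelinek-Mercer smoothing, one of the methods from the Zhai/Lafferty paper. The class name, the term counts, the collection probabilities and the choice of lambda = 0.1 are all made up for illustration. Document A is strong on two of the three keywords but lacks the third; document B contains each keyword once.

```java
import java.util.Map;

// Hypothetical illustration only: none of these names come from Lucene.
class SmoothingSketch {
    static final String[] QUERY = {"lucene", "smoothing", "scoring"};
    // Term frequencies; both documents are assumed to be 100 terms long.
    static final Map<String, Integer> DOC_A = Map.of("lucene", 10, "smoothing", 8);
    static final Map<String, Integer> DOC_B = Map.of("lucene", 1, "smoothing", 1, "scoring", 1);
    static final int DOC_LEN = 100;
    // Assumed collection-wide term probabilities p(t|C); must cover every query term.
    static final Map<String, Double> COLLECTION =
            Map.of("lucene", 0.001, "smoothing", 0.0005, "scoring", 0.002);

    // log p(q|d) = sum over query terms of log((1 - lambda) * tf/len + lambda * p(t|C)).
    // lambda = 0 means no smoothing at all.
    static double logLikelihood(String[] query, Map<String, Integer> docTf, int docLen,
                                Map<String, Double> collectionProb, double lambda) {
        double sum = 0.0;
        for (String term : query) {
            int tf = docTf.getOrDefault(term, 0);
            double pDoc = (double) tf / docLen;
            double pColl = collectionProb.get(term);
            sum += Math.log((1 - lambda) * pDoc + lambda * pColl);
        }
        return sum;
    }

    public static void main(String[] args) {
        for (double lambda : new double[] {0.0, 0.1}) {
            System.out.println("lambda=" + lambda
                    + " A=" + logLikelihood(QUERY, DOC_A, DOC_LEN, COLLECTION, lambda)
                    + " B=" + logLikelihood(QUERY, DOC_B, DOC_LEN, COLLECTION, lambda));
        }
    }
}
```

Without smoothing, the single missing keyword drives document A's log-likelihood to negative infinity, so B wins by default. With smoothing, A gets a finite score, and with these made-up numbers it ranks above B.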

In the Lucene framework, for a multiple-keyword query (say, the simplest unigram, non-positional query), the following happens, as I understand it:

1) QueryParser parses the query string into a BooleanQuery with one clause (and weight) per term.
2) The scorer for the BooleanQuery merges the document scores from each clause.
3) The problem: each clause's TermDocs only contains the inverted-index postings for that clause's own term, so a document is never scored against the query terms it does not contain, which seems to make a smoothing strategy impossible.

What can I do about that? What class should I concentrate on?

Re: Smoothing language model by Lucene

Robert Muir
On Thu, Feb 2, 2012 at 3:40 AM, cheyenne.lin <[hidden email]> wrote:
>
> What can I do about that? What class should I concentrate on?
>

As an example, you could take a look at Lucene's trunk, which has two
of the methods from the Zhai/Lafferty paper,
"A Study of Smoothing Methods for Language Models Applied to Ad Hoc
Information Retrieval":

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/similarities/LMDirichletSimilarity.java
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.java
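
The reason smoothing can still work when each clause only sees its own postings is a rewrite discussed in the Zhai/Lafferty paper: the smoothed score splits into a sum over the query terms that do occur in the document, plus a part that does not need the postings of the missing terms. For Jelinek-Mercer that second part is the same for every document, so it can be dropped without changing the ranking. The sketch below is plain Java written for illustration, not the code of the classes linked above; the class name, method names and lambda value are assumptions.

```java
import java.util.Map;

// Hypothetical illustration of the matched-terms-only rewrite for Jelinek-Mercer smoothing.
class MatchedTermScoringSketch {
    static final double LAMBDA = 0.1; // assumed smoothing weight

    // Full smoothed log-likelihood: every query term contributes, present or not.
    static double fullScore(String[] query, Map<String, Integer> docTf, int docLen,
                            Map<String, Double> pColl) {
        double sum = 0.0;
        for (String t : query) {
            int tf = docTf.getOrDefault(t, 0);
            sum += Math.log((1 - LAMBDA) * tf / docLen + LAMBDA * pColl.get(t));
        }
        return sum;
    }

    // Only terms that occur in the document contribute, which is all that a
    // postings-driven scorer ever sees.
    static double matchedOnlyScore(String[] query, Map<String, Integer> docTf, int docLen,
                                   Map<String, Double> pColl) {
        double sum = 0.0;
        for (String t : query) {
            int tf = docTf.getOrDefault(t, 0);
            if (tf == 0) continue; // term absent: contributes log(1) = 0
            sum += Math.log(1 + ((1 - LAMBDA) * tf / docLen) / (LAMBDA * pColl.get(t)));
        }
        return sum;
    }

    // Document-independent part: identical for every document, so it does not affect ranking.
    static double queryConstant(String[] query, Map<String, Double> pColl) {
        double sum = 0.0;
        for (String t : query) {
            sum += Math.log(LAMBDA * pColl.get(t));
        }
        return sum;
    }
}
```

For any document, fullScore equals matchedOnlyScore plus queryConstant (up to floating-point error), so summing per-clause scores over matching terms ranks documents the same way as the full smoothed model. Dirichlet smoothing needs an extra term that depends on document length, so it is not covered by this sketch.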

--
lucidimagination.com