I have an old implementation, Lucene-lm by ilps, which is a good start. However, that implementation does not include a smoothing algorithm, and I have found it particularly hard to rewrite the core scoring mechanism to enable smoothing.
(Background: in a language model, a smoothing strategy gives a small non-zero weight to query terms that a document does not contain, so that a single missing term does not rule the document out. Of course this changes nothing for a single-keyword query, but for a multiple-keyword query, where a document may be strongly relevant to a few distinguishing keywords while missing the others, smoothing may be important.)
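To make the background concrete, here is a minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing, one of the standard smoothing methods. It is not taken from Lucene-lm or from Lucene itself: the class name, the toy statistics, and the lambda value are all assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Map;

/** Toy query-likelihood scorer; all names and numbers are illustrative assumptions. */
final class SmoothingSketch {

    /**
     * Computes log p(q|d) with Jelinek-Mercer smoothing:
     *   p(t|d) = (1 - lambda) * tf(t,d) / |d| + lambda * p(t|C)
     * With lambda = 0 this is the unsmoothed maximum-likelihood model.
     */
    static double logQueryLikelihood(List<String> query,
                                     Map<String, Integer> docTermFreqs,
                                     int docLength,
                                     Map<String, Double> collectionProb,
                                     double lambda) {
        double logP = 0.0;
        for (String term : query) {
            double pMl = docTermFreqs.getOrDefault(term, 0) / (double) docLength;
            // Terms unknown to the collection model get background probability 0.
            double pBackground = collectionProb.getOrDefault(term, 0.0);
            double p = (1.0 - lambda) * pMl + lambda * pBackground;
            logP += Math.log(p); // Math.log(0) is -Infinity
        }
        return logP;
    }

    public static void main(String[] args) {
        List<String> query = List.of("lucene", "smoothing");
        Map<String, Double> collectionProb = Map.of("lucene", 0.01, "smoothing", 0.001);
        // A 10-token document that mentions "lucene" often but never "smoothing".
        Map<String, Integer> doc = Map.of("lucene", 5);

        System.out.println("unsmoothed: " + logQueryLikelihood(query, doc, 10, collectionProb, 0.0));
        System.out.println("smoothed:   " + logQueryLikelihood(query, doc, 10, collectionProb, 0.1));
    }
}
```

Without smoothing, the one missing term drives the whole score to negative infinity, however relevant the document is to the other keyword. With smoothing, the missing term only contributes a penalty based on its collection probability.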
In the Lucene framework, for a multiple-keyword query (say, the simplest unigram, non-positional query), the following happens, as far as I understand:
1) QueryParser parses the query string into BooleanQuery.clauses (weights).
2) The scorer corresponding to the BooleanQuery merges the document scores from each clause.
3) The problem is that each clause's TermDocs only contains the inverted index entries for that clause's term. A document is therefore never scored against the query terms it does not contain, which seems to make smoothing impossible.
What can I do about that? What class should I concentrate on?
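For what it is worth, the difficulty in step 3 can sometimes be sidestepped algebraically. For Jelinek-Mercer smoothing, log p(q|d) splits into a sum over only the query terms the document contains, plus a constant that depends on the query alone, so a scorer that walks postings lists can still produce the smoothed ranking. The sketch below illustrates that rewrite under stated assumptions: the `Posting` class, the in-memory maps, and all numbers are hypothetical stand-ins, not Lucene API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy term-at-a-time scorer over postings lists; not Lucene API. */
final class PostingsOnlySmoothingSketch {

    /** One posting: a document id and the term's frequency in that document. */
    static final class Posting {
        final int docId;
        final int termFreq;
        Posting(int docId, int termFreq) { this.docId = docId; this.termFreq = termFreq; }
    }

    /**
     * For every document found in at least one postings list, accumulates
     *   sum over matching terms of log(1 + (1 - lambda) * (tf / |d|) / (lambda * p(t|C))).
     * Adding queryConstant(...) to a score gives the full smoothed log p(q|d).
     */
    static Map<Integer, Double> score(List<String> query,
                                      Map<String, List<Posting>> index,
                                      Map<Integer, Integer> docLengths,
                                      Map<String, Double> collectionProb,
                                      double lambda) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : query) {
            double background = lambda * collectionProb.get(term);
            for (Posting p : index.getOrDefault(term, List.of())) {
                int docLength = docLengths.get(p.docId);
                double pMl = p.termFreq / (double) docLength;
                double gain = Math.log(1.0 + (1.0 - lambda) * pMl / background);
                scores.merge(p.docId, gain, Double::sum);
            }
        }
        return scores;
    }

    /** Document-independent part: sum over query terms of log(lambda * p(t|C)). */
    static double queryConstant(List<String> query,
                                Map<String, Double> collectionProb,
                                double lambda) {
        double c = 0.0;
        for (String term : query) {
            c += Math.log(lambda * collectionProb.get(term));
        }
        return c;
    }

    public static void main(String[] args) {
        List<String> query = List.of("lucene", "smoothing");
        Map<String, Double> collectionProb = Map.of("lucene", 0.01, "smoothing", 0.001);
        Map<Integer, Integer> docLengths = Map.of(1, 10, 2, 10);
        Map<String, List<Posting>> index = Map.of(
                "lucene", List.of(new Posting(1, 5), new Posting(2, 1)),
                "smoothing", List.of(new Posting(2, 1)));

        Map<Integer, Double> scores = score(query, index, docLengths, collectionProb, 0.1);
        double constant = queryConstant(query, collectionProb, 0.1);
        scores.forEach((doc, s) ->
                System.out.println("doc " + doc + ": rank score " + s + ", log p(q|d) " + (s + constant)));
    }
}
```

Because the constant is the same for every document, dropping it does not change the ranking, and a document that contains none of the query terms implicitly scores zero. Smoothing methods whose background weight depends on document length (Dirichlet, for example) need an extra per-document term, which this sketch does not cover.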
On Thu, Feb 2, 2012 at 3:40 AM, cheyenne.lin <[hidden email]> wrote:
> What can I do about that? What class should I concentrate on?
Maybe as an example you can take a look at Lucene's trunk; it has two
of the methods from the Zhai/Lafferty paper:
"A study of smoothing methods for language models applied to Ad Hoc