On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote:

> After reading the code, I found the similarity measure in Lucene is not the

> same as the cosine coefficient measure commonly used. I dont know it is

> correct. And I wonder whether i can use the cosine coefficient measure in

> lucene or maybe the Dice's coefficient, Jaccard's coefficient and overlap

> coefficient measure.

No one seems to have answered this yet, so I'll have a go.

I wrote down the following a while ago; I'm omitting boosts and coords

here, since you don't have to use them. It assumes that you are using

DefaultSimilarity and not a custom similarity implementation. You will

have to pick through the LaTeX code; it's rather difficult to render

formulas in ASCII.

Lucene uses a modified vector-space model; the main scoring formula is

\begin{equation}
\label{eq:lucenescore}
\score(\qu, \doc) = \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \cdot
\idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
\sqrt{\vphantom{\sum_{\term\in\qu} \idf(\term)^2}\sum_{\term\in\doc} \tf(\term, \doc)}}
\end{equation}

where

\[ \idf(\term) = \log\frac{|\Doc|}{\docfreq(\term) + 1} + 1 \]

Scores are normalized to fall in a range of 0.0 to 1.0.
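For readers who would rather read code than LaTeX, here is a minimal Python sketch of the simplified formula above. It is not Lucene code: the function names, the representation of documents as token lists, and the treatment of the query as a set of distinct terms are all assumptions made for illustration. It computes the raw score only and leaves out boosts, coords, and the final normalization.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1, as defined above."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1.0


def lucene_score(query_terms, doc, docs):
    """Simplified score from the main formula (no boosts, no coord).

    `doc` is a list of tokens and `docs` is the whole collection;
    both representations are assumptions of this sketch.
    """
    tf = Counter(doc)
    query = set(query_terms)  # each query term counts once

    # Numerator: sum over query terms of sqrt(tf(t, d)) * idf(t)^2
    numerator = sum(math.sqrt(tf[t]) * idf(t, docs) ** 2 for t in query)

    # Query normalization: sqrt of the sum of idf(t)^2 over query terms
    query_norm = math.sqrt(sum(idf(t, docs) ** 2 for t in query))

    # Document normalization: sqrt of the sum of tf(t, d) over the document,
    # i.e. the square root of the document length in tokens
    length_norm = math.sqrt(sum(tf.values()))

    if query_norm == 0.0 or length_norm == 0.0:
        return 0.0
    return numerator / (query_norm * length_norm)


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
    for d in docs:
        print(d, lucene_score(["a", "b"], d, docs))
```

Note that the document normalization uses term frequencies only, with no idf in it.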

This weighting scheme is easily related to the standard vector-space
model by using \(\sqrt{\tf(\term, \doc)}\) instead of \(\tf(\term, \doc)\)
and defining \(\tf(\term,\qu)\equiv 1\) for every term that occurs in the
query (and 0 for all other terms). Then

\begin{align*}
\score(\qu, \doc) &= \cos\angle(\vec{\qu}, \vec{\doc}) =
\frac{\vec{\qu}\cdot\vec{\doc}}{\|\vec{\qu}\|\cdot \|\vec{\doc}\|}\\
&= \frac{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \qu)}
\idf(\term)\right)\left(\sqrt{\tf(\term, \doc)}
\idf(\term)\right)}{ \sqrt{\sum_{\term\in\Term}
\left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)^2}
\sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \doc)}
\idf(\term)\right)^2}}\\
&= \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)}
\idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}}
\end{align*}
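The cosine form derived here can be sketched the same way. Again, this is an illustrative Python sketch and not Lucene code; the names and the token-list representation are assumptions. Query weights are \(\idf(\term)\) (since the query term frequency is taken to be 1) and document weights are \(\sqrt{\tf(\term, \doc)}\,\idf(\term)\).

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1.0


def cosine_score(query_terms, doc, docs):
    """Cosine of the angle between the query and document vectors."""
    tf = Counter(doc)

    # Query vector: tf(t, q) = 1 for query terms, so the weight is just idf(t)
    q_vec = {t: idf(t, docs) for t in set(query_terms)}

    # Document vector: sqrt(tf(t, d)) * idf(t)
    d_vec = {t: math.sqrt(n) * idf(t, docs) for t, n in tf.items()}

    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))

    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)


if __name__ == "__main__":
    docs = [["a", "b"], ["b", "c"], ["c", "d", "d"]]
    for d in docs:
        print(d, cosine_score(["a", "b"], d, docs))
```

Because this is a true cosine, a document whose terms all occur once and that is used verbatim as the query scores 1 (up to rounding).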

By omitting the factor \(\idf(\term)^2\) from the expression
\(\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}\) in the
denominator, one arrives at the main scoring formula in
equation~(\ref{eq:lucenescore}). Omitting the inverse document
frequency from the document normalization factor allows one to
precompute this factor and store it in the index. Otherwise, because
\(\idf(\term)\) depends on the size of the collection and on the
document frequencies, it would be necessary to recompute the
normalization factors every time a document is added to or deleted
from the index.
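A small Python sketch illustrates why the idf-free normalization factor can be precomputed. It is illustrative only; the function names and the toy collection are assumptions, not anything taken from Lucene. The length-based factor depends on the document alone, while the idf-weighted factor changes as soon as the collection changes.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1.0


def length_norm(doc):
    """sqrt of the sum of tf(t, d): depends only on the document itself."""
    return math.sqrt(len(doc))


def idf_weighted_norm(doc, docs):
    """sqrt of the sum of tf(t, d) * idf(t)^2: depends on the whole collection."""
    tf = Counter(doc)
    return math.sqrt(sum(n * idf(t, docs) ** 2 for t, n in tf.items()))


# Toy collection (hypothetical data)
docs = [["a", "b", "a"], ["b", "c"]]
before = (length_norm(docs[0]), idf_weighted_norm(docs[0], docs))

# Adding a document changes |D| and the document frequencies ...
docs.append(["a", "d"])
after = (length_norm(docs[0]), idf_weighted_norm(docs[0], docs))

# ... so the idf-weighted norm of the first document moves,
# while its length-based norm stays the same.
print("length norm:      ", before[0], "->", after[0])
print("idf-weighted norm:", before[1], "->", after[1])
```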

--
Sebastian Kirsch <[hidden email]> [http://www.sebastian-kirsch.org/]
