for the similarity measure

jason-51
Hi,

After reading the code, I found that the similarity measure in Lucene is not
the same as the commonly used cosine coefficient measure. I don't know whether
it is correct. I also wonder whether I can use the cosine coefficient measure
in Lucene, or perhaps Dice's coefficient, Jaccard's coefficient, or the overlap
coefficient.

Re: for the similarity measure

Sebastian Marius Kirsch
On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote:
> After reading the code, I found that the similarity measure in Lucene is not
> the same as the commonly used cosine coefficient measure. I don't know whether
> it is correct. I also wonder whether I can use the cosine coefficient measure
> in Lucene, or perhaps Dice's coefficient, Jaccard's coefficient, or the overlap
> coefficient.

No one seems to have answered this yet, so I guess I'll have a go.

I wrote down the following a while ago; I'm omitting boosts and coords
here, since you don't have to use them. It assumes that you are using
DefaultSimilarity and not a custom similarity implementation. You will
have to pick through the LaTeX code; it's rather difficult to render
formulas in ASCII.


Lucene uses a modified vector-space model; the main scoring formula is
\begin{equation}
\label{eq:lucenescore}
\score(\qu, \doc) = \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \cdot
  \idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
  \sqrt{\vphantom{\sum_{\term\in\qu} \idf(\term)^2}\sum_{\term\in\doc} \tf(\term, \doc)}}
\end{equation}
where
\[ \idf(\term) = \log\frac{|\Doc|}{\docfreq(\term) + 1} + 1 \]
Scores are normalized to fall in a range of 0.0 to 1.0.
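
As a rough illustration, the formula above can be written out in a few lines of Python. This is only a sketch of the equation, not Lucene's actual code: the function names `idf` and `lucene_score`, the representation of documents as token lists, and the toy collection are assumptions made for the example. It returns the raw value of the formula and does not apply any further rescaling.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1, as defined above."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1


def lucene_score(query, doc, docs):
    """Raw value of the scoring formula above (no boosts, no coord).

    query: set of terms; doc: list of tokens; docs: the whole collection.
    """
    if not query or not doc:
        return 0.0
    tf = Counter(doc)  # tf[t] is 0 for terms that do not occur in doc
    numerator = sum(math.sqrt(tf[t]) * idf(t, docs) ** 2 for t in query)
    query_norm = math.sqrt(sum(idf(t, docs) ** 2 for t in query))
    length_norm = math.sqrt(sum(tf.values()))  # sqrt of the token count
    return numerator / (query_norm * length_norm)


# Hypothetical toy collection, already tokenized.
docs = [["a", "b", "a"], ["b", "c"], ["c", "d", "d", "d"]]
scores = [lucene_score({"a"}, d, docs) for d in docs]
```

Only the first document contains the query term, so it is the only one with a non-zero entry in `scores`.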

This weighting scheme is easily related to the standard vector-space
model: use \(\sqrt{\tf(\term, \doc)}\) instead of \(\tf(\term, \doc)\)
as the term-frequency weight, and define \(\tf(\term,\qu)\equiv 1\) for
every term that occurs in the query. Then
\begin{align*}
  \score(\qu, \doc) &= \cos\angle(\vec{\qu}, \vec{\doc}) =
  \frac{\vec{\qu}\cdot\vec{\doc}}{\|\vec{\qu}\|\cdot \|\vec{\doc}\|}\\
  &= \frac{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \qu)}
      \idf(\term)\right)\left(\sqrt{\tf(\term, \doc)}
      \idf(\term)\right)}{ \sqrt{\sum_{\term\in\Term}
      \left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)^2}
    \sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \doc)}
        \idf(\term)\right)^2}}\\
  &= \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)}
    \idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
    \sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}}
\end{align*}
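
The last step of this derivation can be checked numerically. The sketch below is an illustration under assumed names (`cosine_full`, `cosine_simplified`, and the toy collection are hypothetical): it builds the full weight vectors with \(\sqrt{\tf}\cdot\idf\) weights and query term frequency 1, and compares their cosine with the simplified closed form.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1


def cosine_full(query, doc, docs):
    """Cosine of the explicit weight vectors, w(t, x) = sqrt(tf(t, x)) * idf(t)."""
    if not query or not doc:
        return 0.0
    vocab = set(query) | set(doc)  # all other terms have weight 0 in both vectors
    tf_d = Counter(doc)
    w_q = {t: (1.0 if t in query else 0.0) * idf(t, docs) for t in vocab}
    w_d = {t: math.sqrt(tf_d[t]) * idf(t, docs) for t in vocab}
    dot = sum(w_q[t] * w_d[t] for t in vocab)
    norm_q = math.sqrt(sum(v * v for v in w_q.values()))
    norm_d = math.sqrt(sum(v * v for v in w_d.values()))
    return dot / (norm_q * norm_d)


def cosine_simplified(query, doc, docs):
    """Closed form from the last line of the derivation."""
    if not query or not doc:
        return 0.0
    tf_d = Counter(doc)
    num = sum(math.sqrt(tf_d[t]) * idf(t, docs) ** 2 for t in query)
    norm_q = math.sqrt(sum(idf(t, docs) ** 2 for t in query))
    norm_d = math.sqrt(sum(tf_d[t] * idf(t, docs) ** 2 for t in tf_d))
    return num / (norm_q * norm_d)


# Hypothetical toy collection, already tokenized.
docs = [["a", "b", "a"], ["b", "c"], ["c", "d", "d", "d"]]
query = {"a", "c"}
pairs = [(cosine_full(query, d, docs), cosine_simplified(query, d, docs)) for d in docs]
```

Both functions should agree up to floating-point rounding for every document, and, being cosines of vectors with non-negative entries, the values lie between 0 and 1.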
By omitting the factor \(\idf(\term)^2\) from the document norm
\(\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}\) in the
denominator, one arrives at the main scoring formula in
equation~(\ref{eq:lucenescore}).  Omitting the inverse document
frequency from the document normalization factor allows one to
precompute this factor and store it in the index; otherwise it would
be necessary to recompute the normalization factors every time a
document is added to or deleted from the index, because every such
change alters \(|\Doc|\) and hence the inverse document frequencies.
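
To make the precomputation argument concrete, here is a small Python sketch (the names `lucene_doc_norm` and `cosine_doc_norm` and the toy documents are assumptions for illustration, and any encoding Lucene applies when storing the value is ignored). It computes both normalization factors for one document before and after another document is added to the collection.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1


def lucene_doc_norm(doc):
    """1 / sqrt(sum_t tf(t, d)): depends on the document alone."""
    return 1.0 / math.sqrt(len(doc))


def cosine_doc_norm(doc, docs):
    """1 / sqrt(sum_t tf(t, d) * idf(t)^2): depends on the whole collection."""
    tf = Counter(doc)
    return 1.0 / math.sqrt(sum(tf[t] * idf(t, docs) ** 2 for t in tf))


# Hypothetical toy collection, already tokenized.
docs = [["a", "b", "a"], ["b", "c"]]
doc = docs[0]
before = (lucene_doc_norm(doc), cosine_doc_norm(doc, docs))

docs.append(["a", "d"])  # adding a document changes |D| and docfreq
after = (lucene_doc_norm(doc), cosine_doc_norm(doc, docs))
```

The first component of `before` and `after` is identical, so it can be stored once at indexing time; the second component changes even though the document itself did not.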

--
Sebastian Kirsch <[hidden email]> [http://www.sebastian-kirsch.org/]
