# for the similarity measure

On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote:
> After reading the code, I found the similarity measure in Lucene is not the
> same as the cosine coefficient measure commonly used. I don't know whether it
> is correct. And I wonder whether I can use the cosine coefficient measure in
> Lucene, or maybe Dice's coefficient, Jaccard's coefficient, or the overlap
> coefficient measure.

No one seems to have answered this yet, so I guess I'll have a go.

I wrote down the following a while ago; I'm omitting boosts and coords here, since you don't have to use them. It assumes that you are using DefaultSimilarity and not a custom similarity implementation. You will have to pick through the LaTeX code; it's rather difficult to render formulas in ASCII.

Lucene uses a modified vector-space model; the main scoring formula is

\begin{equation}
  \label{eq:lucenescore}
  \score(\qu, \doc) =
  \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \cdot \idf(\term)^2}
       {\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
        \sqrt{\vphantom{\sum_{\term\in\qu} \idf(\term)^2}\sum_{\term\in\doc} \tf(\term, \doc)}}
\end{equation}

where $\idf(\term) = \log\frac{|\Doc|}{\docfreq(\term) + 1} + 1$, with $|\Doc|$ the number of documents in the index and $\docfreq(\term)$ the number of documents containing $\term$.

Scores are normalized to fall in a range of 0.0 to 1.0.

This weighting scheme is easily related to the standard vector-space model: use $\sqrt{\tf(\term, \doc)}$ instead of $\tf(\term, \doc)$ as the term-frequency weight, and define $\tf(\term,\qu)\equiv 1$ for every term in the query.
Then

\begin{align*}
  \score(\qu, \doc)
  &= \cos\angle(\vec{\qu}, \vec{\doc})
   = \frac{\vec{\qu}\cdot\vec{\doc}}{\|\vec{\qu}\|\cdot \|\vec{\doc}\|}\\
  &= \frac{\sum_{\term\in\Term}
           \left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)
           \left(\sqrt{\tf(\term, \doc)} \idf(\term)\right)}
          {\sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)^2}
           \sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \doc)} \idf(\term)\right)^2}}\\
  &= \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \idf(\term)^2}
          {\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
           \sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}}
\end{align*}

By omitting the factor $\idf(\term)^2$ from the document norm $\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}$ in the denominator, one arrives at the main scoring formula in equation~(\ref{eq:lucenescore}).

Omitting the inverse document frequency from the document normalization factor allows one to precompute this factor and store it in the index. Because $\idf(\term)$ depends on collection-wide statistics, it would otherwise be necessary to recompute the normalization factors every time a document is added to or deleted from the index.

-- 
Sebastian Kirsch <[hidden email]> [http://www.sebastian-kirsch.org/]
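
For anyone who wants to check the algebra numerically, here is a minimal Python sketch of the two formulas: the Lucene-style score with the idf-free document norm, and the full cosine with idf kept in the document norm. This is not Lucene code. The function names (`idf`, `lucene_score`, `cosine_score`), the toy corpus, and the choice to represent documents as plain token lists are all assumptions made for illustration, and boosts and coords are omitted as in the formulas.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf(t) = log(|D| / (df(t) + 1)) + 1, following the definition in the text."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (doc_freq + 1)) + 1.0


def _numerator_and_query_norm(query, doc, docs):
    """Parts shared by both formulas; tf(t, q) is taken to be 1 for query terms."""
    tf = Counter(doc)
    q_terms = set(query)
    numerator = sum(math.sqrt(tf[t]) * idf(t, docs) ** 2 for t in q_terms)
    query_norm = math.sqrt(sum(idf(t, docs) ** 2 for t in q_terms))
    return numerator, query_norm


def lucene_score(query, doc, docs):
    """Main scoring formula: the document norm is sqrt(sum of tf), without idf."""
    numerator, query_norm = _numerator_and_query_norm(query, doc, docs)
    doc_norm = math.sqrt(len(doc))  # sum of tf(t, d) over t in d is the token count
    return numerator / (query_norm * doc_norm)


def cosine_score(query, doc, docs):
    """Full cosine: the document norm keeps the idf(t)^2 factor."""
    numerator, query_norm = _numerator_and_query_norm(query, doc, docs)
    tf = Counter(doc)
    doc_norm = math.sqrt(sum(n * idf(t, docs) ** 2 for t, n in tf.items()))
    return numerator / (query_norm * doc_norm)


if __name__ == "__main__":
    # Hypothetical toy corpus; each document is a non-empty list of tokens.
    docs = [
        ["lucene", "search", "engine"],
        ["search", "index", "index"],
        ["cosine", "similarity", "search"],
    ]
    query = ["search", "index"]
    for d in docs:
        print(d, lucene_score(query, d, docs), cosine_score(query, d, docs))
```

The two functions differ only in `doc_norm`, which mirrors the single omitted factor in the derivation. The idf-free norm depends on the document alone, whereas the cosine norm has to be recomputed whenever the collection statistics change.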