[
https://issues.apache.org/jira/browse/LUCENE1896?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12753377#action_12753377 ]
Mark Miller edited comment on LUCENE-1896 at 9/9/09 7:09 PM:

Okay, I think I was a tad off base.
Here is the cosine def used:
{code}
cos(a) = V(q) dot V(d) / (|V(q)| |V(d)|)
{code}
So the cosine is the query vector dot the document vector, divided by the product of the magnitudes of the two vectors. Classically, |V(q)||V(d)| is a normalization factor that takes the vectors to unit vectors (so you get the real cosine). Writing v(q) and v(d) for those unit vectors:
{code}
cos(a) = v(q) dot v(d)
{code}
This is because the magnitude of a unit vector is 1 by definition.
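As a quick sanity check of the two forms above, here is a minimal Python sketch. The vectors and helper names are made up for illustration and are not Lucene code; it only shows that dividing by the magnitudes and dotting the unit vectors give the same number.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    # Euclidean length |V|.
    return math.sqrt(dot(v, v))

def unit(v):
    # Scale V to length 1, i.e. v = V / |V|.
    m = magnitude(v)
    return [x / m for x in v]

# Toy term-weight vectors (assumed values, for illustration only).
q = [1.0, 2.0, 0.0]
d = [2.0, 1.0, 2.0]

# cos(a) = V(q) dot V(d) / (|V(q)| |V(d)|)
cos_from_raw = dot(q, d) / (magnitude(q) * magnitude(d))

# cos(a) = v(q) dot v(d), using unit vectors
cos_from_unit = dot(unit(q), unit(d))

print(cos_from_raw, cos_from_unit)
```

Both values agree up to floating-point rounding, which is all the "magnitude of a unit vector is 1" step is saying.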
But we don't care about absolute numbers, just relative numbers (as has often been pointed out), so the IR guys already fudge this stuff.
While I thought before that the queryNorm corresponds to |V(q)||V(d)|, I was off: it's just |V(q)|. |V(d)| is replaced with the document length normalization, a much faster calculation with similar properties, since a longer doc would most likely have a larger magnitude. *edit* Not just similar properties but many times better properties: the standard normalization would not factor in document length at all; it essentially removes it.
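To illustrate the point that the classic normalization removes document length, here is a rough Python sketch. It assumes raw term counts as weights and a length norm of 1/sqrt(number of terms); both are simplifications chosen for the example, not a description of the actual scoring code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    return math.sqrt(dot(v, v))

def cosine_score(q, d):
    # Classic normalization: divide by both magnitudes.
    return dot(q, d) / (magnitude(q) * magnitude(d))

def length_norm_score(q, d):
    # Assumption for this sketch: |V(d)| is replaced by a length norm
    # of 1 / sqrt(total number of terms in the document).
    num_terms = sum(d)
    return dot(q, d) / magnitude(q) * (1.0 / math.sqrt(num_terms))

q = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 1.0]  # 3 terms in total
long_doc = [3.0, 3.0, 3.0]   # same proportions, 9 terms in total

# The cosine cannot tell the two documents apart.
cos_short = cosine_score(q, short_doc)
cos_long = cosine_score(q, long_doc)

# The length-normalized score can.
ln_short = length_norm_score(q, short_doc)
ln_long = length_norm_score(q, long_doc)

print(cos_short, cos_long)
print(ln_short, ln_long)
```

The two cosine scores are identical because scaling a vector does not change its direction, while the length-normalized scores differ, so document length still has a say in the result.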
So one strategy is just to not normalize the query, though the literature I've seen doing this calculates the query norm very inefficiently in the inner loop. We are not doing that, so it's not much of an optimization for us.
{code}
score = V(q) dot V(d) / |V(d)| = cos(a) * |V(q)| = V(q) dot v(d)
{code}
And skipping the query norm does make scores less comparable across queries (comparability is an odd goal, I know, but for free?) ;)
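A small Python sketch of that trade-off, with made-up vectors and helper names: leaving out the |V(q)| division keeps the ranking for a single query, but the scores then scale with the query's magnitude.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    return math.sqrt(dot(v, v))

def cosine(q, d):
    return dot(q, d) / (magnitude(q) * magnitude(d))

def score_without_query_norm(q, d):
    # Only the document side is normalized: equals cos(a) * |V(q)|.
    return dot(q, d) / magnitude(d)

# Toy documents and query (assumed values, for illustration only).
docs = {
    "d1": [2.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 3.0],
    "d3": [1.0, 1.0, 1.0],
}
q = [1.0, 2.0, 0.0]

# |V(q)| is the same for every document, so the ordering is unchanged.
ranking_cos = sorted(docs, key=lambda n: cosine(q, docs[n]), reverse=True)
ranking_raw = sorted(docs, key=lambda n: score_without_query_norm(q, docs[n]), reverse=True)

# A "heavier" version of the same query points in the same direction...
q_scaled = [10.0 * x for x in q]

# ...so the cosine is unchanged, but the unnormalized score is 10x larger.
cos_ratio = cosine(q_scaled, docs["d1"]) / cosine(q, docs["d1"])
raw_ratio = score_without_query_norm(q_scaled, docs["d1"]) / score_without_query_norm(q, docs["d1"])

print(ranking_cos, ranking_raw)
print(cos_ratio, raw_ratio)
```

Within one query the two rankings match; across queries only the cosine stays on a common scale.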
Sorry I was a little off earlier. I just tried to learn all this myself, linear algebra was years ago, and open-book tests lured my younger, more irresponsible self into not going to the classes ...
Anyhow, that's my current understanding. Please point it out if you know I have something wrong.
