

For a long time now, we've been telling people not to compare scores across queries, yet we maintain the queryNorm() code as an attempt to do this, and the javadocs even promote it. I'm in the process of researching this some more (references welcomed), but wanted to hear what people think about it here. I haven't profiled it just yet, but it seems like a good chunk of wasted computation to me (loops, divisions and square roots). At a minimum, I think we might be able to refactor the callback mechanism for it just as we did for the collectors, such that we push the actual calculation of the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs). That way, when people want to override queryNorm() to return 1, they are saving more than just the 1/sqrt calculation. I haven't tested it yet, but wanted to find out what others think.
Thoughts?
Grant

To unsubscribe, email: [hidden email]
For additional commands, email: [hidden email]


Grant Ingersoll wrote:
> For a long time now, we've been telling people not to compare scores across queries, yet we maintain the queryNorm() code as an attempt to do this, and the javadocs even promote it.

Here is the old discussion: http://issues.apache.org/jira/browse/LUCENE-1896.
It's essentially no cost and has minor benefits, so I'm still +1 for
keeping it.



The fact that Lucene Similarity is most decidedly *not* cosine similarity, but strongly resembles it with the queryNorm() in there, means that I personally would certainly like to see this called out, at least in the documentation.
As for performance, is queryNorm() ever called in any loops? It's all set up in the construction of the Weight, right? Which means that by the time you're doing scoring, all the weighting factors are already factored into one? What's the performance issue which would be saved here?
jake


Mark Miller wrote:
> Grant Ingersoll wrote:
>
>> At a minimum, I think we might be able to refactor the callback mechanism for it just as we did for the collectors, such that we push the actual calculation of the sum of squares into Similarity, instead of just doing 1/sqrt(sumSqs). That way, when people want to override queryNorm() to return 1, they are saving more than just the 1/sqrt calculation.
By the way, this issue does just that:
http://issues.apache.org/jira/browse/LUCENE-1907. I don't think it's
really any measurable perf savings, but it certainly makes more sense.



I should add in my $0.02 on whether to just get rid of queryNorm() altogether: -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time). And for people who actually do machine-learning training of their per-field query boosts, this is pretty critical.
jake


On Nov 20, 2009, at 11:19 AM, Jake Mannix wrote:
> I should add in my $0.02 on whether to just get rid of queryNorm() altogether: -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time). And for people who actually do machine-learning training of their per-field query boosts, this is pretty critical.

Do you have some references on this? I'm interested in reading more on the subject. I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.


On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll <[hidden email]> wrote:
> Do you have some references on this? I'm interested in reading more on the subject. I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.
References on how people do this *with Lucene*, or just how this is done in general? There are lots of papers on fancy things which can be done, but I'm not sure where to point you to start out.

The technique I'm referring to is really just the simplest possible thing beyond setting your weights "by hand": let's assume you have a boolean OR query, Q, built up out of subqueries q_i (hitting, for starters, different fields, although you can overlap as well with some more work), each with a boost b_i. Then if you have a training corpus (good matches, bad matches, or ranked lists of matches in order of relevance for the queries at hand), *and* scores (at the q_i level) which are comparable, you can do a simple regression (linear or logistic, depending on whether you map your final scores to a logit or not) on the b_i to fit for the best boosts to use.

What is critical here is that scores from different queries are comparable. If they're not, then queries where the best possible score is 0.5 overly affect the training in comparison to the queries where the best document scores 2.0: you're training to increase scores of matching documents, so the system tries to make that 0.5-scoring document score much higher by raising boosts higher and higher, while the good matches already scoring 2.0 don't need any more boosting, if that makes sense.

There are of course far more complex "state of the art" training techniques, but probably someone like Ted would be able to give a better list of references on where is easiest to read those from. But I can try to dredge up some places where I've read about doing this, and post again later if I can find any.
jake
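[Editorial aside: the boost-fitting procedure Jake describes can be sketched numerically. The following is an invented illustration (toy per-subquery scores and relevance labels, plain least squares), not Lucene API:]

```python
# Sketch of the idea above: given per-(query, doc) subquery scores that are
# comparable across queries, fit the boosts b_i by linear least squares.
# All numbers here are invented for the demo.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x

def fit_boosts(score_rows, labels):
    """Least squares: minimize sum (b . scores - label)^2 via (S^T S) b = S^T y."""
    n = len(score_rows[0])
    sts = [[sum(row[i] * row[j] for row in score_rows) for j in range(n)]
           for i in range(n)]
    sty = [sum(row[i] * y for row, y in zip(score_rows, labels)) for i in range(n)]
    return solve(sts, sty)

# Per-(query, doc) scores of two subqueries (say, title and body fields);
# labels are generated from "true" boosts 2.0 and 0.5 for the demo.
rows = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (0.7, 0.3), (0.1, 0.9)]
labels = [2.0 * s1 + 0.5 * s2 for s1, s2 in rows]
boosts = fit_boosts(rows, labels)
print(boosts)  # recovers approximately [2.0, 0.5]
```

If the per-query scores were not comparable, each row of scores would carry an arbitrary per-query scale factor and the fitted boosts would be skewed toward the low-scoring queries, as described above.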


On Nov 20, 2009, at 1:24 PM, Jake Mannix wrote:
> References on how people do this *with Lucene*, or just how this is done in general?

In general. Academic references, etc.

> What is critical here is that scores from different queries are comparable.

This makes sense in a mathematical sense, assuming scores are comparable. What I would like to get at is why anyone thinks scores are comparable across queries to begin with. I agree it is beneficial in some cases (as you described) if they are. Probably a question better suited for an academic IR list...


Grant Ingersoll wrote:
>
> What I would like to get at is why anyone thinks scores are
> comparable across queries to begin with.
>
They are somewhat comparable because we are using the approximate cosine
between the document/query vectors for the score (plus boosts 'n stuff):
how close the vectors are to each other. If q1 has a smaller angle diff
with d1 than q2 does with d2, then you can do a comparison. It's just
vector similarities. It's approximate because we fudge the normalization.

Why do you think the scores within a query search are comparable? What's
the difference when you try another query? The query is the difference,
and the query norm is what makes it more comparable. It's just a
different query vector with another query. It's still going to be a
given "angle" from the doc vectors. Closer is considered a better match.
We don't do it to improve anything, or because someone discovered
something; it's just part of the formula for calculating the cosine. It's
the dot product formula. You can lose it and keep the same relative
rankings, but then you are further from the cosine for the score: you
start scaling by the magnitude of the query vector. When you do that
they are not so comparable.

If you take out the queryNorm, it's much less comparable. You are
effectively multiplying the cosine by the magnitude of the query vector,
so different queries will scale the score differently, and not in a
helpful way: a term vector and query vector can have very different
magnitudes, but very similar term distributions. That's why we are using
the cosine rather than euclidean distance in the first place. Pretty
sure it's more linear algebra than IR, or the vector stuff from calc 3
(or wherever else different schools put it).
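[Editorial aside: Mark's argument can be checked with toy vectors. A minimal sketch (plain vector arithmetic with invented numbers, nothing Lucene-specific):]

```python
# Toy check: without the query norm (1/|q|), the score scales with the
# magnitude of the query vector, so two queries pointing in the same
# "direction" score the same document differently.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

doc = [1.0, 2.0, 0.0]   # document vector (invented weights)
q1 = [1.0, 1.0, 0.0]    # a query vector
q2 = [3.0, 3.0, 0.0]    # same direction, three times the magnitude

# Without a query norm: the score is scaled by the query-vector magnitude.
raw1 = dot(q1, doc) / norm(doc)
raw2 = dot(q2, doc) / norm(doc)

# With queryNorm = 1/|q|: this is the full cosine, identical for q1 and q2.
cos1 = raw1 / norm(q1)
cos2 = raw2 / norm(q2)

print(raw1, raw2)  # raw2 is exactly 3x raw1
print(cos1, cos2)  # equal
```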



Remember: we're not really doing cosine at all here. The factor of IDF^2 on the top, with the factor of 1/sqrt(numTermsInDocument) on the bottom, couples together to end up with the following effect:

q1 = "TERM1"
q2 = "TERM2"

doc1 = "TERM1"
doc2 = "TERM2"

score(q1, doc1) = idf(TERM1)
score(q2, doc2) = idf(TERM2)

Both are perfect matches, but one scores higher (possibly much higher) than the other.

Boosts work just fine with cosine (it's just a way of putting "tf" into the query side as well as in the document side), but normalizing documents without taking the idf of terms in the document into consideration blows away the ability to compare scores in default Lucene scoring, even *with* the queryNorm() factored in.

I know you probably know this Mark, but it's important to make sure we're stating that in Lucene as currently structured, scores can be *wildly* different between queries, even with queryNorm() factored in, for the sake of people reading this who haven't worked through the scoring in detail.

jake


Yes, it's a good point. I'm coming at it from a more pure angle. And I'm
not so elegant in my thought patterns :)
Right though, our document vector normalization is, uh, quick and
dirty :) It's about the cheapest one I've seen other than root(length).
I don't think that scores between queries are very comparable in general
in Lucene either, but they would be even less so if we dropped the
query norm. As I've argued in the past, if it had any real perf hit,
I'd be on the side of dropping it, but from what I can see, it really
doesn't, so I don't see why we should further skew the scores.



Jake Mannix wrote:
> Remember: we're not really doing cosine at all here.
This, I think, is fuzzy, right? It seems to be common to still call this
cosine scoring loosely; pretty much every practical impl fudges things
somewhat when doing the normalization (though we are on the heavy side
of fudgers). I think it's pretty rare to do the true cosine because it's
so expensive. It can be somewhat misleading though.

Have you looked at the Similarity scoring explanation page that was
recently improved? Have any suggestions on changes to it? Doron put a
fair amount of work into improving it recently, but I think it could
always get better. It's currently leaning towards presenting this as
cosine; that seems in line with the few text books I've seen, but I'm
admittedly not that deep into any of this.



On Fri, Nov 20, 2009 at 2:50 PM, Mark Miller <[hidden email]> wrote:
> Jake Mannix wrote:
> > Remember: we're not really doing cosine at all here.
> This, I think, is fuzzy, right? It seems to be common to still call this
> cosine scoring loosely; pretty much every practical impl fudges things
> somewhat when doing the normalization (though we are on the heavy side
> of fudgers). I think it's pretty rare to do the true cosine because it's
> so expensive. It can be somewhat misleading though.

Cosine isn't expensive; it's just *impossible* to do when you're doing incremental indexing: to normalize to actually do cosine similarity requires knowing the idf at *index time*, so that the cosine norm (the norm Lucene does use on the query side) of fields, 1/sqrt(sum_{i}(tf_(term_i)^2 * idf(term_i)^2)), is impossible to calculate. Incremental indexing has its costs!

In cases where you *do* have the exact idf values for your corpus's term vocabulary (or a good way to approximate them; since idf only varies logarithmically, this isn't too hard), then if you override lengthNorm to always return 1, and take your document's Fields and boost() them by the factor I listed above, you can get exact (well, as exact as you can with only one-byte field norms :\ [are we really that starved for RAM that we have to have that little precision?]) cosine similarity, at really no additional *query-time* performance hit. Just a little math done at indexing time.
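[Editorial aside: the index-time factor Jake describes can be sketched as follows. A toy illustration in plain Python, not a Lucene Similarity override; the idf estimates are invented:]

```python
# Sketch of the index-time boost described above: with lengthNorm overridden
# to return 1, boost each field by 1/sqrt(sum_i (tf_i * idf_i)^2) so that
# query-time scoring approximates a true cosine.
import math
from collections import Counter

# Invented idf estimates for the corpus vocabulary (assumption, not real data).
idf_estimates = {"web": 1.2, "search": 2.5}

def cosine_field_boost(tokens, idf):
    """1 / sqrt(sum_i (tf_i * idf_i)^2): the cosine document norm."""
    tf = Counter(tokens)
    sum_sqs = sum((count * idf.get(term, 1.0)) ** 2 for term, count in tf.items())
    return 1.0 / math.sqrt(sum_sqs)

boost = cosine_field_boost(["web", "search", "web"], idf_estimates)
print(boost)
# Applying this as the field boost (with lengthNorm == 1) makes the stored
# norm the cosine document norm, up to one-byte norm precision.
```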
> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. It's currently leaning towards presenting this as
> cosine; that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.

Yeah, that page is a lot better than it used to be, but I'd really try not to call this cosine. When people think of cosines, they think of:

a) angles,
b) numbers which are less than one, and
c) when they're equal to 1, you have a perfect match.

We have a), kinda. We do not have b) or c), and we don't have scores which are really comparable, and maybe Grant's question should be answered with the statement: they shouldn't be.

If I query the google with "web" (getting a little less than three billion hits, so it maybe has an idf of 3 or 4), and find a document with 10 terms, one of which is "web" (and the others are equally common in terms of idf), then lucene says we should score that as:

idf(web) / sqrt(10) =~ 0.9

while cosine would say:

idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2 + ...) =~ 1/sqrt(10) = 0.31

If I instead query with "floccinaucinilipilification" (which comes up with about 45K hits, so maybe has an idf of 20), and find a document with 10 terms, one of which is floccinaucinilipilification, and the other terms are also similarly rare (also idf 20), lucene says the score is

idf(floccinaucinilipilification) / sqrt(10) =~ 6.5

while cosine would say:

idf(floccinaucinilipilification) / sqrt(1^2 * idf(flocci...)^2 + tf_rareTerm2^2 * idf(rareTerm2)^2 + ...) =~ 1/sqrt(10) = 0.31

What is nice about *real* cosine is that it captures a sense of absolute scoring: the hit on the really rare word in a document which does not have a lot of other rare words (and is in general pretty short) is indeed measured by the fact that the score is close to 1.0 (perfect match), whereas the Lucene default scoring does not give a good measure: "web" scores close to 1.0 on its hit, and you only notice that this isn't as good as the rare word hit because that one scores way *higher* than 1.0.

On the other hand, the thing which Lucene scoring does which cosine does *not* is measure the fact that hitting the rare word is *way* better than a hit on a common word, from a user perspective, so giving them equal scores because their cosines are the same is maybe not the right thing to do (and so "yay Doug!", heh).

> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it?

I think it's way better than it used to be (although it's even more dense, which I guess is required, given that it's a dense subject), but I really don't think we want to let people think it's a cosine. It's "like" a cosine, but with a rather different normalization. Doron's got it stated like that already, but it needs to be crystal clear that scores are sometimes higher than 1, and in fact the "best" score for one query may be rather different than the "best" for another query.

jake
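[Editorial aside: Jake's two worked examples can be reproduced in a few lines. Toy formulas only, not the real Lucene scoring pipeline; the idf values are the rough guesses from the email:]

```python
# Reproduce the worked example: a 10-term document matching a single-term
# query, where all 10 document terms share the same idf.
# lucene_like: idf / sqrt(numTerms); true cosine: idf / sqrt(numTerms * idf^2).
import math

def lucene_like_score(idf, num_terms):
    """Single-term perfect hit under Lucene's default normalization."""
    return idf / math.sqrt(num_terms)

def cosine_score(idf, num_terms):
    """Same hit under a true cosine, when all num_terms terms share that idf."""
    return idf / math.sqrt(num_terms * idf ** 2)

for term, idf in [("web", 3.0), ("floccinaucinilipilification", 20.0)]:
    print(term, round(lucene_like_score(idf, 10), 2), round(cosine_score(idf, 10), 2))
# The lucene-like scores differ wildly (~0.95 vs ~6.32) while the true cosine
# is 1/sqrt(10) =~ 0.32 for both perfect matches.
```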


Jake Mannix wrote:
> Cosine isn't expensive; it's just *impossible* to do when you're doing
> incremental indexing: to normalize to actually do cosine similarity
> requires
> knowing the idf at *index time*, so that the cosine norm
But cosine has two norms? The query norm and the document norm, taking
the two vectors to the unit sphere. It looks expensive to me to do both
of them properly. The IR lit I've seen fudges them down to root(L) even
in the non-incremental case for the document vector normalization.
> (the norm Lucene does
> use on the query side) of fields  1/sqrt(sum_{i}(tf_(term_i)^2 *
> idf(term_i)^2)) 
> is impossible to calculate.
But isn't that *not* the query side norm? Looks like the document vector
norm to me.
>
> Incremental indexing has its costs!
>
> In cases where you *do* have either the exact idf values for your corpus's
> termvocabulary (or a good way to approximate it, since it only varies
> logarithmically, this isn't too hard), then if you override lengthNorm
> to always
> return 1, and then take your document's Fields, and boost() them by
> the factor
> I listed above, then you can get exact (well, as exact as you can with
> only
> onebyte fieldnorms :\ [are we really that starved for RAM that we
> have to
> have that little precision?] ) cosine similarity, at really no
> additional *querytime*
> performance hit. Just a little math done at indexing time.
Where are you doing the correct document vector normalization to get the
true cosine? I'm not following. It looks like the above formula is the
doc vector norm, so what about the query vector norm then?
>
> Yeah, that page is a lot better than it used to be, but I'd really try
> not to
> call this cosine  when people think of cosines, they think of:
>
> a) angles,
> b) numbers which are less than one, and
> c) when they're equal to 1, you have a perfect match.
When they think of cosine, yes. But IR lit seems to fudge this: none of
the common document length normalizations are real cosine, they are
just square-root variations, with Lucene's being the simplest in the
family.
>
> We have a), kinda. We do not have b) or c), and we don't have scores
> which are really comparable, and maybe Grant's question should be
> answered with the statement: they shouldn't be:
Right, we have always said they are not really comparable. But there
are degrees of that. I think they have a degree of comparability; they
must, if we can make them less comparable, and we can: drop queryNorm.
>
> If I query the google with "web" (getting a little less than three
> billion hits,
> so it maybe has an idf of 3 or 4), and find a document with 10 terms, one
> of which is "web" (and the others are equally common in terms of idf),
> then lucene says we should score that as:
>
> idf(web) / sqrt(10) =~ 0.9,
>
> while cosine would say:
>
> idf(web) / sqrt(1^2 * idf(web)^2 + tf_common_term2^2 * idf(term2)^2
> + ... )
>
> =~ 1/sqrt(10) = 0.31
>
> If I instead query with "floccinaucinilipilification" (which comes up
> with
> about 45K hits, so maybe has an idf of 20), and find a document with
> 10 terms, one of which is floccinaucinilipilification, and the other terms
> are also similarly rare (also idf 20), lucene says the score is
>
> idf(floccinaucinilipilification) / sqrt(10), = 6.5
>
> while cosine would say:
>
> idf(floccinaucinilipilification) / sqrt(1^2 * idf(flocci...)^2 +
> tf_rareTerm2^2 * idf(rareTerm2)^2 + ... )
>
> =~ 1/sqrt(10) = 0.31
>
> What is nice about *real* cosine is that it captures a sense of absolute
> scoring - the hit on the really rare word in a document which does not
> have a lot of other rare words (and is in general pretty short) is indeed
> measured by the fact that the score is close to 1.0 (perfect match),
> whereas the Lucene default scoring does not give a good measure -
> "web" scores close to 1.0 on its hit, and you only notice that this
> isn't as
> good as the rare word hit because that one scores way *higher* than 1.0.
>
> On the other hand, the thing which Lucene scoring does which cosine
> does *not* is measure the fact that hitting the rare word is *way* better
> than hit on a common word, from a user perspective, so giving them
> equal scores because their cosines are the same is maybe not the
> right thing to do (and so "yay Doug!", heh).
>
> Have you looked at the Similarity scoring explanation page that was
> recently improved? Have any suggestions on changes to it? Doron put a
> fair amount of work into improving it recently, but I think it could
> always get better. It's currently leaning towards presenting this as
> cosine - that seems in line with the few text books I've seen, but I'm
> admittedly not that deep into any of this.
>
>
> I think it's way better than it used to be (although it's even more
> dense, which I guess is required, given that it's a dense subject),
> but I really don't think we want to let people think it's a cosine -
> it's "like" a cosine, but with a rather different normalization.
We can amp that up - but like I said, I think that's standard practice -
in the IR books I've seen, they say: here is the cosine formula for term
vector scoring, and then they go on to fudge it in a manner very similar
to how Lucene does it.
> Doron's got it stated like that already, but
> it needs to be crystal clear that scores are sometimes higher than 1,
> and in fact the "best" score for one query may be rather different
> than the "best" for another query.
I think this is general knowledge with Lucene, but that may be a stretch -
perhaps it's not as widely known as I now assume for new users. Hits used
to normalize to a 0-1 score, actually, but it's no longer around.
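The arithmetic in Jake's "web" vs. "floccinaucinilipilification" example above can be checked in a few lines. This is a hypothetical sketch using the simplified assumptions from the thread (a 10-term document, tf = 1 for every term, all document terms sharing the query term's idf) - it is not Lucene's actual scoring code, and the class and method names are illustrative only:

```java
// Hypothetical sketch of the two scoring schemes compared in the thread.
// Assumptions: 10-term document, tf = 1 everywhere, uniform idf per document.
public class ScoreSketch {

    // Lucene-style default: idf * lengthNorm, with lengthNorm = 1/sqrt(docLength)
    static double luceneStyle(double idf, int docLength) {
        return idf / Math.sqrt(docLength);
    }

    // True cosine: idf / ||docVector||, where each component is tf * idf
    static double trueCosine(double idf, int docLength) {
        double normSq = 0;
        for (int i = 0; i < docLength; i++) {
            normSq += idf * idf; // tf^2 * idf^2 with tf = 1
        }
        return idf / Math.sqrt(normSq);
    }

    public static void main(String[] args) {
        // "web": idf ~ 3; "floccinaucinilipilification": idf ~ 20
        System.out.println(luceneStyle(3, 10));  // ~0.95 (the thread rounds to 0.9)
        System.out.println(trueCosine(3, 10));   // ~0.316
        System.out.println(luceneStyle(20, 10)); // ~6.32 (the thread says ~6.5)
        System.out.println(trueCosine(20, 10));  // ~0.316 - identical to "web"
    }
}
```

Note how the true-cosine score collapses to 1/sqrt(10) for both queries, which is exactly Jake's point: cosine is bounded and comparable, but it erases the rare-term advantage that Lucene's scheme preserves.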
>
> jake



Mark Miller wrote:
>
> it looks expensive to me to do both
> of them properly.
Okay - I guess that somewhat makes sense - you can calculate the
magnitude of the doc vectors at index time. How is that impossible with
incremental indexing though? Isn't it just expensive? Seems somewhat
expensive in the non-incremental case as well - you're just eating it at
index time rather than query time - though the same could be done for
incremental? The information is all there in either case.



On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller <[hidden email]> wrote:
But cosine has two norms? The query norm and the document norm - taking
the two vectors to the unit sphere - it looks expensive to me to do both
of them properly. The IR lit I've seen fudges them down to Root(L) even
in the non-incremental case for the document vector normalization.
Lucene has two norms too: lengthNorm (of the field), produced at index
time, and queryNorm. The expense of queryNorm is paid once per query, not
once per document, and so, as you yourself say, is not a performance
problem. The lengthNorm is factored into the document at index time, and
so is not a hit at *query time* at all.
> (the norm Lucene does
> use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
> idf(term_i)^2)) -
> is impossible to calculate.
But isn't that *not* the query-side norm? Looks like the document vector
norm to me.
This is exactly the query norm - 1/sqrt(sumOfSquaredWeights), and the sum
of squared weights is just the sum of tf^2 * idf^2, except queries have
"boost" in place of "tf". The document vector norm that *Lucene* uses is
independent of the idf of the terms in the document (it's
1/sqrt(doc_length) for DefaultSimilarity).
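The two norms being contrasted here can be written out side by side. A minimal sketch, with hypothetical helper names (in Lucene itself these computations live in Similarity and DefaultSimilarity):

```java
// Sketch of the two norms discussed above; names are illustrative, not Lucene API.
public class NormSketch {

    // Query norm: 1/sqrt(sumOfSquaredWeights), where weight_i = boost_i * idf_i
    // (boost plays the role that tf plays on the document side).
    static float queryNorm(float[] boosts, float[] idfs) {
        double sumSq = 0;
        for (int i = 0; i < boosts.length; i++) {
            double w = boosts[i] * idfs[i];
            sumSq += w * w;
        }
        return (float) (1.0 / Math.sqrt(sumSq));
    }

    // Document-side norm in DefaultSimilarity: idf-independent, just 1/sqrt(length).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Single-term query, boost 1, idf 2 -> norm 0.5; 4-term field -> norm 0.5
        System.out.println(queryNorm(new float[]{1f}, new float[]{2f}));
        System.out.println(lengthNorm(4));
    }
}
```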
Where are you doing the correct document vector normalization to get the
true cosine? I'm not following. It looks like the above formula is the
doc vector norm - so what about the query vector norm then?
Ok, the technique I use to do this (I should really just refactor it to a
point where I can post it for contrib/queries so that other people can use
it, if it's not that commonly known) is, at index time, to calculate the
true document vector normalizations for the fields, then feed them into
their boosts, and make sure that Similarity does *not* do field
normalization itself.
The formula for query vector normalization and document vector
normalization is *exactly* the same if you translate boosts of terms into
the termFrequency of said term in the query "document". It's just applied
to a different bag of words.
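Jake doesn't post his implementation, but the index-time step he describes could be sketched roughly as follows. This is a hypothetical illustration, assuming a precomputed idf^2 map and a Similarity whose length normalization has been disabled (returns 1); none of these names are Lucene API:

```java
import java.util.Map;

// Hypothetical sketch of the trick described above: compute the *true*
// cosine norm of a field from cached idf^2 values and feed it in as the
// field boost, with Similarity's own length normalization turned off.
public class TrueNormBoost {

    static float trueNorm(Map<String, Integer> termFreqs,
                          Map<String, Float> idfSquared) {
        double sumSq = 0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Unknown terms fall back to idf^2 = 1 (an assumption, not from the thread).
            float idf2 = idfSquared.getOrDefault(e.getKey(), 1.0f);
            sumSq += (double) e.getValue() * e.getValue() * idf2; // tf^2 * idf^2
        }
        return (float) (1.0 / Math.sqrt(sumSq));
    }

    public static void main(String[] args) {
        // One term, tf = 1, idf^2 = 4 -> norm = 1/sqrt(4) = 0.5
        System.out.println(trueNorm(Map.of("a", 1), Map.of("a", 4.0f)));
    }
}
```

Apart from the tf-for-boost substitution, this is the same sum-of-squares computation as queryNorm, which is Jake's point about the two formulas being identical.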
> a) angles,
> b) numbers which are less than one, and
> c) when they're equal to 1, you have a perfect match.
When they think of cosine, yes - but the IR lit seems to fudge this - none of
the common document length normalizations are real cosine - they are
just square-root variations, with Lucene's being the simplest in the
family.
This is true, which is why I'm not really giving a better wording than what
Doron already wrote - I doubt I could do better. I'm concerned myself
because I've seen lots of questions over the years on the lists from people
asking how to compare scores across queries, no matter how much we tell
them not to. And now here again we're telling them "they're not comparable,
but they're more comparable than if we didn't have this queryNorm in
there!". What exactly does that mean? I do search, and I'm not really sure
what it means - what is someone who is just learning this stuff going to
think?
>
> I think it's way better than it used to be (although it's even more
> dense, which I guess is required, given that it's a dense subject),
> but I really don't think we want to let people think it's a cosine -
> it's "like" a cosine, but with a rather different normalization.
We can amp that up - but like I said, I think that's standard practice -
in the IR books I've seen, they say: here is the cosine formula for term
vector scoring, and then they go on to fudge it in a manner very similar
to how Lucene does it.
Well, just because it's standard practice doesn't mean it's a good idea. I
came to search from a math background, not CS, so when I see "vector
space" and "cosine", I maybe think different things than the average coder
learning IR. So maybe I'm not the right person to ask how such things
should be explained!
> Doron's got it stated like that already, but
> it needs to be crystal clear that scores are sometimes higher than 1,
> and in fact the "best" score for one query may be rather different
> than the "best" for another query.
I think this is general knowledge with Lucene, but that may be a stretch -
perhaps it's not as widely known as I now assume for new users. Hits used
to normalize to a 0-1 score, actually, but it's no longer around.
It was definitely a good idea to get rid of normalizing by the top score,
but what could replace it is normalizing by the "best possible score". This
tells you how close to a perfect match you could have gotten. For cosine,
this would just be the cosine of the angle from the document vector to the
query vector; for Lucene, it would be something similar, but would have
absolute meaning (maybe... maybe not, because the theoretical best score is
likely to be far from the actual best score in a given corpus).
jake


On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <[hidden email]> wrote:
Mark Miller wrote:
>
> it looks expensive to me to do both
> of them properly.
Okay - I guess that somewhat makes sense - you can calculate the
magnitude of the doc vectors at index time. How is that impossible with
incremental indexing though? Isn't it just expensive? Seems somewhat
expensive in the non-incremental case as well - you're just eating it at
index time rather than query time - though the same could be done for
incremental? The information is all there in either case.
The expense, if you have the idfs of all terms in the vocabulary (keep them
in the form of idf^2 for efficiency at index time), is pretty trivial,
isn't it? If you have a document with 1000 terms, it's maybe 3000 floating
point operations - all CPU, in memory, no disk seeks.
What it does require is knowing, even when you have no documents on disk
yet, what the idf of the terms in the first few documents is. Where do you
get this, in Lucene, if you haven't externalized some notion of idf?
jake



On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <[hidden email]> wrote:
Mark Miller wrote:
Okay - I guess that somewhat makes sense - you can calculate the
magnitude of the doc vectors at index time. How is that impossible with
incremental indexing though? Isn't it just expensive? Seems somewhat
expensive in the non-incremental case as well - you're just eating it at
index time rather than query time - though the same could be done for
incremental? The information is all there in either case.
Ok, I think I see what you were imagining I was doing: you take the current
state of the index as gospel for idf (when the index is already large, this
is a good approximation) and look up these factors at index time - this
means grabbing docFreq(Term) for each term in my document, and yes, that
would be very expensive, I'd imagine. I've done it by pulling a monstrous
(the most common 1 million terms, say) Map<String, Float> (effectively)
outside of Lucene entirely, which gives term idfs, and housing this in
memory so that computing field norms for cosine is a very fast operation at
index time.
Doing it like this is hard from scratch, but is fine incrementally, because
I've basically fixed idf using some previous corpus (and I update the
idfMap every once in a while, in cases where it doesn't change much). This
also has the effect of providing a global notion of idf in a distributed
corpus.
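The externalized idf snapshot Jake describes could look something like the following. This is a hypothetical sketch - the class and method names are invented, and the idf formula shown is the Lucene-style 1 + ln(numDocs / (docFreq + 1)), which is an assumption about which variant he used:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the externalized idf map described above: snapshot
// docFreqs from a previous corpus, precompute idf^2 once, and hold the map
// in memory so index-time norm computation never touches the live index.
public class IdfSnapshot {

    static Map<String, Float> idfSquaredMap(Map<String, Integer> docFreqs,
                                            int numDocs) {
        Map<String, Float> idf2 = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            // Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))
            float idf = (float) (1.0 + Math.log((double) numDocs / (e.getValue() + 1)));
            idf2.put(e.getKey(), idf * idf);
        }
        return idf2;
    }

    public static void main(String[] args) {
        // A term in 99 of 100 docs: idf = 1 + ln(100/100) = 1, so idf^2 = 1
        System.out.println(idfSquaredMap(Map.of("web", 99), 100).get("web"));
    }
}
```

Because the map is fixed between refreshes, every shard computes norms against the same idf values, which is what gives the "global notion of idf in a distributed corpus".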
jake



Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 4:09 PM, Mark Miller < [hidden email]
> <mailto: [hidden email]>> wrote:
>
>
> But cosine has two norms? The query norm and the document norm -
> taking
> the two vectors to the unit sphere - it looks expensive to me to do
> both
> of them properly. The IR lit I've seen fudges them down to Root(L)
> even
> in the non-incremental case for the document vector normalization.
>
>
> Lucene has two norms too: lengthNorm (of the field), produced at index
> time,
> and queryNorm. The expense of queryNorm is paid once per query, not
> once per document, and so, as you yourself say, is not a performance
> problem. The lengthNorm is factored into the document at index time, and
> so is not a hit at *query time* at all.
Right - Lucene's two norms come from the cosine similarity. The
queryNorm is our version of the Euclidean length of the query vector.
The lengthNorm is our version of the Euclidean length of the document
vector. I don't think tf/idf systems that use cosine similarity, even
less-fudged cosine similarity, generally produce scores that are
actually the cosine - the cosine similarity is just a factor to help
remove document length from the score, not to actually produce the score.
And there are a few different options for how the document length
normalization is done. I've never seen it done correctly in the lit,
though - it's always an approximation.
I know lengthNorm is not a hit at query time - but it can be a
performance issue at index time as well. And it can be even worse with
incremental indexing, as you said - but impossible?
>
>
> > (the norm Lucene does
> > use on the query side) of fields - 1/sqrt(sum_{i}(tf_(term_i)^2 *
> > idf(term_i)^2)) -
> > is impossible to calculate.
> But isn't that *not* the query-side norm? Looks like the document
> vector
> norm to me.
>
>
> This is exactly the query norm - 1/sqrt(sumOfSquaredWeights), and the
> sum of squared weights is just the sum of tf^2 * idf^2, except queries
> have "boost" in place of "tf". The document vector norm that *Lucene*
> uses is independent of the idf of the terms in the document (it's
> 1/sqrt(doc_length) for DefaultSimilarity).
Right - but that's not the real doc norm. That's a fake. You won't get the
real cosine with it. You'd have to do the real document norm. I was
talking not in terms of Lucene, but just cosine similarity in general.
>
>
> Where are you doing the correct document vector normalization to
> get the
> true cosine? I'm not following. It looks like the above formula is the
> doc vector norm - so what about the query vector norm then?
>
>
> Ok, the technique I use to do this (I should really just refactor it
> to a point
> where I can post it for contrib/queries so that other people can use
> it, if it's not that commonly known) is, at index time, to calculate the
> true document vector normalizations for the fields, then feed them into
> their
> boosts, and make sure that Similarity does *not* do field
> normalization itself.
>
> The formula for query vector normalization and document vector
> normalization
> is *exactly* the same if you translate boosts of terms into the
> termFrequency of
> said term in the query "document". It's just applied to a different
> bag of
> words.
>
>
> > a) angles,
> > b) numbers which are less than one, and
> > c) when they're equal to 1, you have a perfect match.
>
>
>
> When they think of cosine, yes - but the IR lit seems to fudge this -
> none of
> the common document length normalizations are real cosine - they are
> just square-root variations, with Lucene's being the simplest
> in the
> family.
>
>
> This is true, which is why I'm not really giving a better wording than
> what
> Doron already wrote - I doubt I could do better. I'm concerned myself
> because I've seen lots of questions over the years on the lists from
> people asking how to compare scores across queries, no matter how much
> we tell them not to. And now here again we're telling them "they're not
> comparable, but they're more comparable than if we didn't have this
> queryNorm in there!". What exactly does that mean? I do search, and
> I'm not really sure what it means - what is someone who is just learning
> this stuff going to think?
Who knows? Most of us working on this don't know - good luck to those
just learning :)
>
>
>
> >
> > I think it's way better than it used to be (although it's even more
> > dense, which I guess is required, given that it's a dense subject),
> > but I really don't think we want to let people think it's a cosine -
> > it's "like" a cosine, but with a rather different normalization.
>
>
>
> We can amp that up - but like I said, I think that's standard practice -
> in the IR books I've seen, they say: here is the cosine formula
> for term
> vector scoring, and then they go on to fudge it in a manner very
> similar to
> how Lucene does it.
>
>
> Well, just because it's standard practice doesn't mean it's a good idea.
> I came to search from a math background, not CS, so when I see "vector
> space" and "cosine", I maybe think different things than the average coder
> learning IR. So maybe I'm not the right person to ask how such things
> should be explained!
As good as anyone else here :) I'm taking what I think of it from the
IR books I'm looking at, though. They don't appear to consider cosine
similarity to require that the score actually be the cosine - it's just a
factor in the score. And they fudge that factor with shortcuts.
>
>
> > Doron's got it stated like that already, but
> > it needs to be crystal clear that scores are sometimes higher than 1,
> > and in fact the "best" score for one query may be rather different
> > than the "best" for another query.
>
>
>
> I think this is general knowledge with Lucene, but that may be a
> stretch -
> perhaps it's not as widely known as I now assume for new users. Hits
> used
> to normalize to a 0-1 score, actually, but it's no longer around.
>
>
> It was definitely a good idea to get rid of normalizing by the top
> score, but
> what could replace it is normalizing by the "best possible score".
> This tells
> you how close to a perfect match you could have gotten. For cosine,
> this would
> just be the cosine of the angle from the document vector to the query
> vector;
> for Lucene, it would be something similar, but would have absolute
> meaning
> (maybe... maybe not, because the theoretical best score is likely to
> be far
> from the actual best score in a given corpus).
>
> jake



Okay - my fault - I'm not really talking in terms of Lucene. Though even
there I consider it possible. You'd just have to, like, rewrite it :) And
it would likely be pretty slow.
Jake Mannix wrote:
>
>
> On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller < [hidden email]
> <mailto: [hidden email]>> wrote:
>
> Mark Miller wrote:
> >
> > it looks expensive to me to do both
> > of them properly.
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible
> with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non-incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.
>
>
> The expense, if you have the idfs of all terms in the vocabulary (keep
> them
> in the form of idf^2 for efficiency at index time), is pretty trivial,
> isn't it? If
> you have a document with 1000 terms, it's maybe 3000 floating point
> operations - all CPU, in memory, no disk seeks.
>
> What it does require is knowing, even when you have no documents on
> disk yet, what the idf of the terms in the first few documents is. Where
> do you get this, in Lucene, if you haven't externalized some notion of idf?
>
> jake
>
>
>
>


