Whither Query Norm?

Re: Whither Query Norm?

Jake Mannix


On Fri, Nov 20, 2009 at 4:51 PM, Mark Miller <[hidden email]> wrote:
Okay - my fault - I'm not really talking in terms of Lucene. Though even
there I consider it possible. You'd just have to like, rewrite it :) And
it would likely be pretty slow.

Rewrite it how?  When you index the very first document, the docFreq of every term is 1, out of numDocs = 1 docs in the corpus, so everybody's idf is the same.  No matter how you normalize this, it'll be wrong once you've indexed a million documents.  This isn't a matter of Lucene architecture; it's a matter of idf being a value that is only exactly known at query time (you can approximate it partway through indexing, but you don't know it at all when you start).
 
  -jake
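
A minimal sketch of the point, in plain Java (no Lucene dependency), using the classic DefaultSimilarity-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the class name and the numbers are illustrative only:

    public class IdfDrift {
        // Classic Lucene-style idf; only exact once the final corpus stats are known.
        static double idf(int docFreq, int numDocs) {
            return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
        }

        public static void main(String[] args) {
            // While indexing the very first document: docFreq = 1, numDocs = 1 for every term.
            System.out.println("first doc: " + idf(1, 1));            // ~0.31, identical for all terms
            // After a million documents the picture looks completely different:
            System.out.println("rare term: " + idf(10, 1000000));     // ~12.4
            System.out.println("stop word: " + idf(900000, 1000000)); // ~1.1
        }
    }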

Re: Whither Query Norm?

Mark Miller-3
Go back and put it in after you have all the documents for that commit point. Or on reader load, calculate it. 

- Mark


Re: Whither Query Norm?

Jake Mannix
In reply to this post by Grant Ingersoll-2
Back to Grant's original question, for a second...

On Fri, Nov 20, 2009 at 1:59 PM, Grant Ingersoll <[hidden email]> wrote:
 
This makes sense from a mathematical sense, assuming scores are comparable.  What I would like to get at is why anyone thinks scores are comparable across queries to begin with.  I agree it is beneficial in some cases (as you described) if they are.   Probably a question suited for an academic IR list...

Well, without getting into the academic IR which I'm not really qualified to argue about, what is wrong with comparing two queries by saying that a document which "perfectly" matches a query should score 1.0, and scale with respect to that?

Maybe it's better to turn the question around: can you give examples of two queries where it clearly *doesn't* make sense to compare scores?  Let's imagine we're doing pure, properly normalized tf-idf cosine scoring (not default Lucene scoring) on a couple of different fields at once.  Then whenever a sub-query is exactly equal to the field it's hitting (or the field is that query repeated some number of times), the score for that sub-query will be 1.0, and when the match isn't perfect, the score goes down.  Sub-queries hitting longer fields (ones that aren't pathologically made up of repetitions of a small set of terms) will in general have even their best scores be very low compared to the best scores on short fields (this is true for Lucene as well, of course), but this makes sense: if you query with a very small set of terms (as is usually done, unless you're doing a MoreLikeThis kind of query) and you find a match in the "title" field which is exactly what you were looking for, that field match is far and away better than anything else you could get in a body match.

To put it more simply - if you really do have cosine similarity (or Jaccard/Tanimoto or something like that, if you don't care about idf for some reason), then query scores are normalized relative to "how close did I find documents to *perfectly* matching my query" - 1.0 means you found your query in the corpus, and less than that means some fractional proximity.  This is an absolute measure, across queries.

Of course, then you ask: in reality, in web and enterprise search, documents are big, queries are small, and you never really find documents which are perfect matches.  So if the best match for q1, out of your whole corpus, is 0.1 for doc1, and the best match for q2 is 0.25 for doc2, is it really true that the best match for the second query is "better" than the best match for the first?  I've typically tried to remain agnostic on that front, and instead ask the related question: if the user (or really, a sampling of many users) queried for (q1 OR q2), and assuming for simplicity that q1 didn't match any of the good hits for q2 and vice versa, does the user (i.e. your gold-standard training set) say that the best result is doc1, or doc2?  If it's doc1, then you'd better have found a way to boost q1's score contribution higher than q2's, right?  Is this wrong (in the theoretical sense)?

  -jake
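
To make the "a perfect field match scores 1.0" claim concrete, here is a minimal sketch of plain cosine scoring over tf-idf term-weight maps (this is not Lucene's default Similarity; the terms and weights below are made up):

    import java.util.HashMap;
    import java.util.Map;

    public class CosineSketch {
        // cos(q, d) = (q . d) / (|q| * |d|); identical vectors score 1.0.
        static double cosine(Map<String, Double> q, Map<String, Double> d) {
            double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
            for (Map.Entry<String, Double> e : q.entrySet()) {
                qNorm += e.getValue() * e.getValue();
                Double w = d.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
            }
            for (double w : d.values()) dNorm += w * w;
            return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
        }

        public static void main(String[] args) {
            Map<String, Double> query = new HashMap<String, Double>();
            query.put("whither", 2.3);  // idf-weighted query terms
            query.put("query", 0.8);
            query.put("norm", 1.5);

            Map<String, Double> title = new HashMap<String, Double>(query); // title == query
            Map<String, Double> body = new HashMap<String, Double>(query);  // longer field
            body.put("lucene", 1.9);
            body.put("scoring", 1.7);

            System.out.println("title: " + cosine(query, title)); // ~1.0: a "perfect" match
            System.out.println("body:  " + cosine(query, body));  // lower: extra terms dilute it
        }
    }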

Re: Whither Query Norm?

Jake Mannix
In reply to this post by Mark Miller-3


On Fri, Nov 20, 2009 at 5:02 PM, Mark Miller <[hidden email]> wrote:
Go back and put it in after you have all the documents for that commit point. Or on reader load, calculate it. 

Ah!  Now I see what you mean by "expensive". :)  Basically, run through all the documents you've indexed all over again, fixing their norms.  That's pretty expensive.  Not as bad as doing it at reader load, but still pretty harsh - though yes, not impossible!

 -jake
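
A rough sketch of that two-pass idea, outside of Lucene proper (the types here are stand-ins, not Lucene classes): once the final docFreqs for the commit point are known, run back over the documents and recompute each one's cosine norm.

    import java.util.List;
    import java.util.Map;

    public class SecondPassNorms {
        // norm(d) = sqrt( sum_t (tf(t,d) * idf(t))^2 ), using idfs from the *final* corpus stats.
        static float cosineNorm(Map<String, Integer> termFreqs, Map<String, Float> idfMap) {
            double sumSq = 0.0;
            for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                Float idf = idfMap.get(e.getKey());
                double w = e.getValue() * (idf == null ? 1.0 : idf);
                sumSq += w * w;
            }
            return (float) Math.sqrt(sumSq);
        }

        // Second pass over everything indexed so far: expensive, but not impossible.
        static void fixNorms(List<Map<String, Integer>> docTermFreqs, Map<String, Float> idfMap, float[] norms) {
            for (int docId = 0; docId < docTermFreqs.size(); docId++) {
                norms[docId] = cosineNorm(docTermFreqs.get(docId), idfMap);
            }
        }
    }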

Re: Whither Query Norm?

Otis Gospodnetic-2
In reply to this post by Jake Mannix
I'm late to the thread, and although it looks like the discussion is over, I'll inline a Q for Jake.

>>> I should add in my $0.02 on whether to just get rid of queryNorm() altogether:
>>>
>>>  -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time).
>>
>> Do you have some references on this?  I'm interested in reading more on the subject.  I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.
>
> References on how people do this *with Lucene*, or just how this is done in general?  There are lots of papers on fancy things which can be done, but I'm not sure where to point you to start out.  The technique I'm referring to is really just the simplest possible thing beyond setting your weights "by hand": let's assume you have a boolean OR query, Q, built up out of sub-queries q_i (hitting, for starters, different fields, although you can overlap as well with some more work), each with a set of weights (boosts) b_i, then if you have a training corpus (good matches, bad matches, or ranked lists of matches in order of relevance for the queries at hand), *and* scores (at the q_i level) which are comparable,

You mentioned this about 3 times in this thread (contrib/queries wants you!)
It sounds like you've done this before (with Lucene?).  But how, if the scores are not comparable, and that's required for this "field boost learning/training" to work?

Thanks,
Otis

> then you can do a simple regression (linear or logistic, depending on whether you map your final scores to a logit or not) on the b_i to fit for the best boosts to use.  What is critical here is that scores from different queries are comparable.  If they're not, then queries where the best document scores 2.0 overly affect the training in comparison to the queries where the best possible score is 0.5 (actually, wait, it's the reverse: you're training to increase the scores of matching documents, so the system tries to make that 0.5-scoring document score much higher by raising its boosts higher and higher, while the good matches already scoring 2.0 don't need any more boosting, if that makes sense).
>
> There are of course far more complex "state of the art" training techniques, but probably someone like Ted would be able to give a better list of references on where it's easiest to read about them.  But I can try to dredge up some places where I've read about doing this, and post again later if I can find any.
>
>  -jake
>
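
A minimal sketch of the regression step being described, assuming you already have, for each training (query, document) pair, a vector of comparable per-sub-query scores and a binary relevance label (all names here are illustrative):

    public class BoostTrainer {
        // Logistic regression by stochastic gradient ascent; the learned weights play the role of the boosts b_i.
        static double[] train(double[][] scores, int[] labels, int epochs, double learningRate) {
            double[] b = new double[scores[0].length];
            for (int epoch = 0; epoch < epochs; epoch++) {
                for (int k = 0; k < scores.length; k++) {
                    double z = 0.0;
                    for (int i = 0; i < b.length; i++) z += b[i] * scores[k][i];
                    double p = 1.0 / (1.0 + Math.exp(-z)); // predicted relevance in (0, 1)
                    double err = labels[k] - p;             // gradient of the log-likelihood
                    for (int i = 0; i < b.length; i++) b[i] += learningRate * err * scores[k][i];
                }
            }
            return b;
        }
    }

If the per-sub-query scores are not on a common scale, the pairs whose best attainable scores are small dominate the gradient, which is exactly the skew described in the quoted text above.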



Re: Whither Query Norm?

Otis Gospodnetic-2
In reply to this post by Jake Mannix
Hello,

Regarding that monstrous term->idf map.
Is this something that one could use to adjust the scores in the http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations scenario?  Say a map like that was created periodically for each shard and distributed to all other nodes (so in the end each node has all maps locally).  Couldn't the local scorer in the Solr instance (and in a distributed Lucene setup) consult the idfs for the relevant terms in all those maps and adjust the local scores before returning results?

Otis

> From: Jake Mannix <[hidden email]>
> To: [hidden email]
> Sent: Fri, November 20, 2009 7:49:34 PM
> Subject: Re: Whither Query Norm?
>
> On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <[hidden email]> wrote:
>> Okay - I guess that somewhat makes sense - you can calculate the
>> magnitude of the doc vectors at index time. How is that impossible with
>> incremental indexing though? Isn't it just expensive? Seems somewhat
>> expensive in the non incremental case as well - you're just eating it at
>> index time rather than query time - though the same could be done for
>> incremental? The information is all there in either case.
>
> Ok, I think I see what you were imagining I was doing: you take the current
> state of the index as gospel for idf (when the index is already large, this
> is a good approximation), and look up these factors at index time - this
> means grabbing docFreq(Term) for each term in my document, and yes,
> this would be very expensive, I'd imagine.  I've done it by pulling a
> monstrous (the most common 1-million terms, say) Map<String, Float>
> (effectively) outside of Lucene entirely, which gives term idfs, and housing
> this in memory so that computing field norms for cosine is a very fast
> operation at index time.
>
> Doing it like this is hard from scratch, but is fine incrementally, because
> I've basically fixed idf using some previous corpus (and update the idfMap
> every once in a while, in cases where it doesn't change much).  This has
> the effect of also providing a global notion of idf in a distributed corpus.
>
>   -jake



Re: Whither Query Norm?

Jake Mannix
In reply to this post by Otis Gospodnetic-2


On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <[hidden email]> wrote:
I'm late to the thread, and although it looks like the discussion is over, I'll inline a Q for Jake.

You mentioned this about 3 times in this thread (contrib/queries wants you!)
It sounds like you've done this before (with Lucene?).  But how, if the scores are not comparable, and that's required for this "field boost learning/training" to work?

Well, that's the point, right?  You need to make the scores comparable, somehow.  The most general thing you can do is figure out what the maximum possible score for a query is (if there is a maximum, which for most scoring systems there will be, given strictly positive doc norms) and normalize with respect to that.  For Lucene, the simplest possible way to do this (I think?) is to swap in a true cosine (or something like Tanimoto) similarity instead of the doc-length normalized one (which may require externalizing the IDF).

When the score for a tf-idf weighted document and a boost-idf weighted query (with both normalized on the same scale) is exactly the cosine of the angle between them, scores become fairly comparable - they're all on a 0 to 1 scale.  Now, longer fields/documents are still going to score way lower than shorter documents for typical user-generated queries, but at least "way lower" now has more meaning than before (because it's "way lower *relative to 1.0*").

Frankly, I've done this kind of logistic regression training of weights even with raw Lucene scoring, and while it doesn't work completely (because the scores are so incomparable), it's remarkable how well it ends up working (I guess in comparison to randomly setting your boosts by hand and simple A/B tests...).

  -jake
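
A tiny sketch of the "normalize with respect to the maximum possible score" idea (the class and method names are made up; how you compute the maximum attainable score depends entirely on your Similarity):

    public final class ScoreRescaler {
        private final double maxAttainable;

        // maxAttainableScore: the score the query would get against an ideal, perfectly matching document.
        public ScoreRescaler(double maxAttainableScore) {
            this.maxAttainable = maxAttainableScore;
        }

        // Maps a raw engine score onto a 0-to-1 scale that is comparable across queries.
        public double normalize(double rawScore) {
            return maxAttainable > 0.0 ? rawScore / maxAttainable : 0.0;
        }
    }

With a true cosine (or Tanimoto) similarity the maximum attainable score is simply 1.0, which is why no extra rescaling step is needed there.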




Re: Whither Query Norm?

Jake Mannix
In reply to this post by Otis Gospodnetic-2


On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <[hidden email]> wrote:
I'm late to the thread, and although it looks like the discussion is over, I'll inline a Q for Jake.
 
You mentioned this about 3 times in this thread (contrib/queries wants you!)

Yeah, I've got to get some of the stuff I've written pulled out of all of our own code - now that Lucene is on java 1.5, it'll be much easier for me to contribute this stuff.  Soon!

  -jake

Re: Whither Query Norm?

Jake Mannix
In reply to this post by Otis Gospodnetic-2


On Tue, Nov 24, 2009 at 9:31 PM, Otis Gospodnetic <[hidden email]> wrote:
Hello,

Regarding that monstrous term->idf map.
Is this something that one could use to adjust the scores in the http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations scenario?  Say a map like that was created periodically for each shard and distributed to all other nodes (so in the end each node has all maps locally).  Couldn't the local scorer in the Solr instance (and in a distributed Lucene setup) consult the idfs for the relevant terms in all those maps and adjust the local scores before returning results?


Why would you want all nodes to have all maps?  Why not merge them into one map, then redistribute it out to all nodes, which would be far smaller than many maps anyway?  Then yes, the scoring can be done locally using this big idfMap to produce scores, instead of using reader.docFreq() for idf - that's what I do.  But then what are you implying should be done?  Just rescale the top scores based on the idfs before returning your top results?  You'd need to know exactly which terms hit those top-scoring documents, right?  Which implies basically the cost of explain(), doesn't it?

Although with the per-field scoring (the thing I do to be able to train on sub-query field-match scores), this gets easier, because then you can try to hang onto this information if the query isn't too big, but this isn't something normal BooleanQueries will handle for you naturally.

  -jake
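
A sketch of the merge-then-redistribute idea: combine per-shard (term -> docFreq) maps and per-shard document counts into one global idf map, and ship that single map to every node (names are illustrative; this is not an existing Solr/Lucene API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GlobalIdfMerger {
        static Map<String, Float> merge(List<Map<String, Integer>> shardDocFreqs, List<Integer> shardNumDocs) {
            // Sum docFreqs and doc counts across shards to get global corpus statistics.
            Map<String, Integer> globalDf = new HashMap<String, Integer>();
            long globalNumDocs = 0;
            for (int shard = 0; shard < shardDocFreqs.size(); shard++) {
                globalNumDocs += shardNumDocs.get(shard);
                for (Map.Entry<String, Integer> e : shardDocFreqs.get(shard).entrySet()) {
                    Integer df = globalDf.get(e.getKey());
                    globalDf.put(e.getKey(), df == null ? e.getValue() : df + e.getValue());
                }
            }
            // Turn global docFreqs into the single idf map that gets redistributed.
            Map<String, Float> idfMap = new HashMap<String, Float>();
            for (Map.Entry<String, Integer> e : globalDf.entrySet()) {
                float idf = (float) (1.0 + Math.log(globalNumDocs / (double) (e.getValue() + 1)));
                idfMap.put(e.getKey(), idf);
            }
            return idfMap;
        }
    }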

Re: Whither Query Norm?

Jake Mannix
In reply to this post by Grant Ingersoll-2
Now that Otis reminded me that this thread existed (I've got a brain like a sieve these days, I tell you)...

On Fri, Nov 20, 2009 at 10:08 AM, Grant Ingersoll <[hidden email]> wrote:

  -1 from me, even though it's confusing, because having that call there (somewhere, at least) allows you to actually compare scores across queries if you do the extra work of properly normalizing the documents as well (at index time).

Do you have some references on this?  I'm interested in reading more on the subject.  I've never quite been sold on how it is meaningful to compare scores and would like to read more opinions.

So I couldn't find any really good papers on this specifically, but I seem to remember seeing this stuff done a lot in Manning and Schütze's IR book - they go over training field boosts with logistic regression and all that, but they don't specifically look at the Lucene case (although they consider similar scoring functions).  They must talk about the necessity of comparable scores to do this, I'm sure.

  -jake