The values which compute scores.

wls-2
Hopefully I'm not opening myself up to public ridicule with what may
be a very stupid question, but...

At the moment, I'm trying to wrap my head around some of the math that
happens when Lucene does scoring.  Let's put aside the big equation
for a moment and focus on a simple method, such as tf().  [term
frequency]

I understand that tf(freq) is supposed to return larger values when
freq is large, and smaller values when freq is small.  Though here's
what's making me scratch my head today:

a) Where does freq come from?  (Not what is it, but who computes it and how?)

Reason I ask is:

b) How do I know what "large" and "small" are, as I don't really have a
relative scale of what the max and min values are?  Should I just
assume a linear scale of 1.0 to 0.0 will be passed to the method?

But then that begs the question...

c) What values should I be passing out of a function like this?
Should I normalize my outgoing scores to some scale, or do I simply
just need to provide numbers that "have the right shaped curve".

I wish the documentation shed a smidgen more light in those areas.


I look at things like idf() which returns 1+log(ratio) and then has
that value squared.  Clearly that isn't on a scale of 1.0 to 0.0.

I feel like there may be some mathematical trickery going on and that
maybe the actual score values themselves don't matter inside the
ranking code, only their values relative to one another.

This then makes me ponder how the normalization process is done
between queries, allowing for a mix'n'match of results as these
numbers spill to the outside world.  Obviously normalization has to
happen at that point for the mixing query results magic to work.


Is there a math wizard in the group who can talk to me like I'm four years old?

-wls
http://www.wwco.com/~wls/blog/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: The values which compute scores.

Yonik Seeley-2
On 5/30/07, Walt Stoneburner <[hidden email]> wrote:
> a) Where does freq come from?  (Not what is it, but who computes it and how?)

For a single term, it's determined at index time and stored in the index.
TermDocs gives you a list of documents containing the term, and for
each document, the number of times the term appears (the freq).

Note: TermScorer calls the tf(int) version on Similarity rather than
the tf(float) version.
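
As a rough illustration of where freq comes from, here is a minimal Python sketch that counts term occurrences per document at index time. It is not Lucene's indexing code or on-disk format; the whitespace "analyzer", the build_postings name, and the sample documents are assumptions made up for the example.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """Map each term to {doc_id: freq}, counted once at index time."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Hypothetical analyzer: lowercase and split on whitespace.
        for term, freq in Counter(text.lower().split()).items():
            postings[term][doc_id] = freq
    return postings


# Hypothetical sample documents.
docs = {
    1: "caesar came caesar saw caesar conquered",
    2: "caesar came saw and conquered",
}
postings = build_postings(docs)

# For a single term, freq is a raw per-document count.
caesar_freqs = postings["caesar"]
```

In this model the value handed to tf() for a single term is a whole-number count of 1 or more for each matching document, not something pre-scaled to a range of 0.0 to 1.0.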

> c) What values should I be passing out of a function like this?
> Should I normalize my outgoing scores to some scale, or do I simply
> just need to provide numbers that "have the right shaped curve".

Hits normalizes to 1 based on the max score, if that max score is
greater than 1.
Scores across queries aren't really comparable though.
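
The Hits behaviour described above can be sketched roughly as follows. This is a simplified Python model rather than the Lucene source; normalize_hits and the sample scores are made up for illustration.

```python
def normalize_hits(raw_scores):
    """Scale scores so the top one is 1.0, but only if the max exceeds 1."""
    if not raw_scores:
        return []
    max_score = max(raw_scores)
    if max_score > 1.0:
        return [score / max_score for score in raw_scores]
    # Max score is already at or below 1: leave the scores untouched.
    return list(raw_scores)


# Hypothetical raw scores from two unrelated queries.
query_a = normalize_hits([4.0, 2.0, 1.0])
query_b = normalize_hits([0.8, 0.4])
```

The top hit of query_a ends up at 1.0 while the top hit of query_b stays at 0.8, yet nothing says the first is a "better" match than the second. The scale is per query, which is one way to see why the numbers aren't comparable across queries.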

> I look at things like idf() which returns 1+log(ratio) and then has
> that value squared.  Clearly that isn't on a scale of 1.0 to 0.0.
>
> I feel like there may be some mathematical trickery going on and that
> maybe the actual score values themselves don't matter inside the
> ranking code, so long as their relative values to one another.

Pretty much.

> This then makes me ponder how the normalization process is done
> between queries, allowing for a mix'n'match of results as these
> numbers spill to the outside world.  Obviously normalization has to
> happen at that point for the mixing query results magic to work.

Lucene doesn't currently do this "mixing", and it's not really clear
to me how it should be done.

-Yonik

Re: The values which compute scores.

Grant Ingersoll-2
In reply to this post by wls-2
Hi Walt,

One question that comes to mind is: what are you looking to do?  Are
you not happy with the current scoring, or are you just trying to better
understand scoring?  The calls to Similarity.tf(), etc. are callbacks
from within the scoring algorithm (have a look at TermScorer in
the code) and provide a means for an application to change the score,
but in many cases there really isn't too much incentive to do so.

-Grant


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



Re: The values which compute scores.

Daniel Einspanjer
This may be a five year old explaining to a four year old why the sky
is blue, but I'll share some of the stuff I've picked up. :)

My application isn't so much a search engine as a matching engine.  I
take a large list of movie documents from a customer like a movie
channel or a cable provider and match that list against the movies our
company has classified.  I wrote a query parser on top of the native
query parser that understands interpolated terms such as
+title_fuzzy_multivalued:"${LongTitle}" and it will pull the LongTitle
field from the customer movie and plug it into that term.

The huge problem I ran into was one of scoring.  Since this is
matching, not searching, and since the interpolation causes the query
from item A to be different from item B and likely wildly different
from the queries used in a different customer's matching, I really
needed a good score that could be compared across the board.

The solution I opted for was what I call perfect score normalization.
Basically, I index both the customer feed and the classified feed.
When the user of my system is adding a new feed to the system, they
define field alignments, e.g. they map the customer's LongTitle field
to the title field of the classified feed.  Then, they define the
appropriate indices to use for each field alignment, e.g. they might
index the title fields using the title_strict_string_multivalue and
the title_fuzzy_terms_multivalue indices.

Now that I have these common indices, when I perform the matching run,
I interpolate the query using the values for the source item and get
both the best match from the classified feed (using a Solr filter
query to restrict the result set to only items with the classified
feed id) and the match for the customer item (using a filter query on
that item's ID).  Now that I have these two scores, they are
comparable in the sense that the score of the customer item is "as
good as it gets".  I divide the match score by the reference item
score and if the value is greater than one for some reason, I subtract
the amount above one from one to penalize it for being "too good".
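
The arithmetic of that normalization can be sketched in a few lines of Python. The function name and the sample scores are hypothetical; only the divide-then-penalize rule comes from the description above.

```python
def perfect_score_ratio(match_score, reference_score):
    """Normalize a match score against the 'as good as it gets' reference."""
    ratio = match_score / reference_score
    if ratio > 1.0:
        # Penalize "too good": subtract the amount above one from one.
        ratio = 1.0 - (ratio - 1.0)
    return ratio


# Hypothetical scores: the reference item scored 4.0 against its own query.
under = perfect_score_ratio(3.0, 4.0)
over = perfect_score_ratio(5.0, 4.0)
exact = perfect_score_ratio(4.0, 4.0)
```

Here a candidate scoring 3.0 and one scoring 5.0 both normalize to 0.75 against a reference of 4.0. Taken literally, the rule goes negative once the ratio exceeds 2, and a reference score of zero would raise a division error; how those cases are handled isn't stated, so the sketch leaves them alone.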

This strategy required a few tweaks in the Similarity class.  I have
actor name phrase queries with a word slop of two so that I can match
First Last to Last, First. I made my tf(float) function return 0 or 1
so that the scores for those two items look the same.  tf also matters
in the case of multiple hits of a term within a field such as title.
If I am matching a movie with the title "Caesar Came Saw and
Conquered", I don't want the title "Caesar Came, Caesar Saw, Caesar
Conquered" to have a higher score just because the word Caesar is
repeated.
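
A minimal Python sketch of that tf tweak, placed next to the sqrt(freq) curve that the stock DefaultSimilarity uses. The function names are hypothetical and this only models the curve, not the Similarity class itself.

```python
import math


def default_tf(freq):
    """Stock curve: tf grows with the square root of freq."""
    return math.sqrt(freq)


def binary_tf(freq):
    """Flattened curve: any match counts as 1, no match counts as 0."""
    return 1.0 if freq > 0 else 0.0


# "Caesar" appears once in one title and three times in the other.
stock_once, stock_thrice = default_tf(1), default_tf(3)
flat_once, flat_thrice = binary_tf(1), binary_tf(3)
```

With the stock curve the title that repeats Caesar three times gets a larger tf contribution; with the flattened curve both titles contribute the same amount. The same flattening would also level out any fractional freq passed to tf(float), which is presumably what makes the two name orderings score alike.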

I customize the idf() function to always return a 1 for year fields
because it could do funny things to a score if the source item had a
year 1984 and my query term was year_year:[${year -1} TO ${year +1}]
and there was only one item with a year of 1983. The 1983 would
actually score higher than the 1984.
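
To see why the rare year wins, here is a small Python sketch using the stock idf formula, log(numDocs / (docFreq + 1)) + 1. The index size and document counts are invented for the example, and year_idf is a hypothetical stand-in for the override.

```python
import math


def default_idf(doc_freq, num_docs):
    """Stock formula: rarer terms get a larger idf."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def year_idf(doc_freq, num_docs):
    """Override for year fields: every year counts the same."""
    return 1.0


NUM_DOCS = 1000   # hypothetical index size
DOCS_1983 = 1     # only one item has year 1983
DOCS_1984 = 50    # hypothetical count for year 1984

stock_1983 = default_idf(DOCS_1983, NUM_DOCS)
stock_1984 = default_idf(DOCS_1984, NUM_DOCS)
flat_1983 = year_idf(DOCS_1983, NUM_DOCS)
flat_1984 = year_idf(DOCS_1984, NUM_DOCS)
```

Under the stock formula the 1983 term gets the larger idf simply because it is rare, even though 1984 is the exact year being matched. Returning a constant removes that bias for the year field.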

I'm currently looking at whether overriding queryNorm() to always
return 1 is a good thing or not.  I saw reference in a recent thread
that doing that might cause ^ boosts in terms or clauses to not work
right so I need to go back and study that again.

The other big thing that I'm doing is that the user doesn't define the
query in one big lump. They break it down into scoring sections. All
the title-related terms are in one section and all the year-related
terms in a different one.  The user defines weights that each of these
sections should contribute to my "weighted score".  I run individual
queries for each of these scoring sections against the source and
target items and record those normalized scores then multiply them by
their weights and add them up to get my weighted score.
This strategy is working pretty well, but it is slow because of all
the extra queries.  I know that I can eliminate them by getting access
to the Explanation object and parsing out the scores I want there, but
that is what I am in the middle of researching how to do now. :)
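
The weighted score itself is just a weighted sum of the per-section normalized scores. A minimal Python sketch, with hypothetical section names, weights, and scores:

```python
def weighted_score(section_scores, weights):
    """Multiply each section's normalized score by its weight and add them up."""
    return sum(section_scores[name] * weights[name] for name in weights)


# Hypothetical user-defined weights and per-section normalized scores.
weights = {"title": 0.75, "year": 0.25}
section_scores = {"title": 0.5, "year": 1.0}

total = weighted_score(section_scores, weights)
```

With these made-up numbers a weak title match (0.5) and a perfect year match (1.0) combine to 0.625. Keeping the per-section scores around, rather than only the total, is what lets someone later see which section pulled a match up or down.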

Anyway.. some of this might be useful to you or maybe it is all
babble. You are either welcome or asked for forgiveness respectively.
:)

Daniel

Re: The values which compute scores.

Doron Cohen
I have no particular experience with matching
problems so the following might be off target...

Anyhow, if I understand correctly, the problem is that,
currently, given a set of customer film descriptions
{D1, D2, ... , Dn}, a set of n queries is created
and each query can match at most one film in the
classification index, and it is impossible for both D1
and D2 to match the same classified film F. But if
both q1 and q2 matched a certain film F, some score
normalization is required in order to select: F matches
D1, or F matches D2.

If so, I am not sure that deep tuning of Lucene
scores is the best way to go.

How about the following alternative:
(1) Create an auxiliary index containing all (and
only) the customer documents. So each document in
the aux index represents a customer film description.
(2) For each customer document D, create the query Q,
and run it on the combined index (using a MultiReader
over the auxiliary index (reader) and the classification
index (reader)). The customer document D (from the aux
index) is supposed to be the top result. Ignore all
other scores from the aux index. Use this (max) score
of D to normalize all the scores of film docs from the
classification index.
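
Step (2) can be sketched in Python as post-processing of one combined result list. The tuple layout, the "aux" and "classified" labels, and the sample scores are assumptions for illustration; no Lucene calls are modeled here.

```python
def normalize_by_self_match(results, customer_doc_id):
    """results: list of (doc_id, source, score) from the combined index.

    Uses the customer document's own score as the maximum, ignores every
    other aux-index hit, and rescales the classification-index hits.
    """
    self_score = next(
        score
        for doc_id, source, score in results
        if source == "aux" and doc_id == customer_doc_id
    )
    return {
        doc_id: score / self_score
        for doc_id, source, score in results
        if source == "classified"
    }


# Hypothetical combined results for the query built from customer doc D1.
results = [
    ("D1", "aux", 8.0),
    ("F7", "classified", 6.0),
    ("D2", "aux", 3.0),
    ("F9", "classified", 2.0),
]
normalized = normalize_by_self_match(results, "D1")
```

In this sketch a single search yields both the reference score and the candidate scores. If D is missing from the results, next() raises StopIteration, which a real implementation would need to handle.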

HTH,
Doron

Re: The values which compute scores.

Daniel Einspanjer
The score normalization is actually more important for purposes of
review. It actually is possible that both D1 and D2 properly match to
F1. Some customers have repeats of the same film (e.g. Spiderman 2 and
Spiderman 2 in HD).  When the system goes through and records the
potential matches, our review team needs to be able to determine
whether these matches are correct or not.  In order to do so quickly
and accurately, they want the ability to judge things like "This is
why it chose this as a match, the title was a poor score, but the
actors and year were very highly scored".  They might mark that as a
good match or as a bad match. Things that are recorded as bad matches
will be avoided on subsequent runs (e.g. the engine will match the
highest scoring item that is not in the blacklist for this source
item).
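
The blacklist rule in that last sentence amounts to picking the highest-scoring candidate that reviewers have not rejected for this source item. A minimal Python sketch, with hypothetical ids and scores:

```python
def best_allowed_match(candidates, blacklist):
    """Return the highest-scoring (doc_id, score) not in the blacklist, or None."""
    allowed = [
        (doc_id, score) for doc_id, score in candidates if doc_id not in blacklist
    ]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])


# Hypothetical candidates for one source item; F7 was marked a bad match.
candidates = [("F7", 0.9), ("F9", 0.7), ("F3", 0.4)]
blacklist = {"F7"}

choice = best_allowed_match(candidates, blacklist)
```

Here F7 would have won on score alone, but since it was rejected on an earlier review the next run settles on F9.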

Having all the customer feeds and the classified feed in one big index
hasn't really caused me any problems so far. The filter queries seem
to work fine for restricting a particular query to one set or the
other.

On 5/31/07, Doron Cohen <[hidden email]> wrote:

> I have no particular experience with matching
> problems so the following might be off target...
>
> Anyhow, if I understand correctly, problem is that,
> currently, given a set of customer film descriptions
> {D1, D2, ... , Dn}, a set of n queries are created
> and each query can match at most one film in the
> classification index, and it is impossible for both D1
> and D2 to match the same classified film F. But if
> both q1 and q2 matched a certain film F, some score
> normalization is required in order to select: F matches
> D1, or F matches D2.
>
> If so, I am not sure that deep tuning of Lucene
> scores is the best way to go.
>
> How about the following alternative:
> (1) Create an auxiliary index containing all (and
> only) the customer documents. So each document in
> the aux index represents a customer film description.
> (2) For each customer document D, create the query Q,
> and run it on the combined index (using a MultiReader
> over the auxiliary index (reader) and the classification
> index (reader). The customer document D (from the aux
> index) is supposed to be the top result. Ignore all
> other scores from the aux index. Use this (max) score
> of D to normalize all the scores of film docs from the
> classification index.
>
> HTH,
> Doron
>
> "Daniel Einspanjer" <[hidden email]> wrote on 30/05/2007 17:18:10:
>
> > This may be a five year old explaining to a four year old why the sky
> > is blue, but I'll share some of the stuff I've picked up. :)
> >
> > My application isn't so much a search engine as a matching engine.  I
> > take a large list of movie documents from a customer like a movie
> > channel or a cable provider and match that list against the movies our
> > company has classified.  I wrote a query parser on top of the native
> > query parser that understands interpolated terms such as
> > +title_fuzzy_multivalued:"${LongTitle}" and it will pull the LongTitle
> > field from the customer movie and plug it into that term.
> >
> > The huge problem I ran into was one of scoring.  Since this is
> > matching not searching, and since the interpolation causes the query
> > from item A to be different from item B and likely wildly different
> > from the queries used in a different customer's matching, I really
> > needed a good score that could be compared across the board.
> >
> > The solution I opted for was what I call perfect score normalization.
> > Basically, I index both the customer feed and the classified feed.
> > When the user of my system is adding a new feed to the system, they
> > define field alignments, e.g. they map the customer's LongTitle field
> > to the title field of the classified feed.  Then, they define the
> > appropriate indices to use for each field alignment, e.g. they might
> > index the title fields using the title_strict_string_multivalue and
> > the title_fuzzy_terms_multivalue indices.
> >
> > Now that I have these common indices, when I perform the matching run,
> > I interpolate the query using the values for the source item and get
> > both the best match from the classified feed (using a Solr filter
> > query to restrict the result set to only items with the classified
> > feed id) and the match for the customer item (using a filter query on
> > that item's ID).  Now that I have these two scores, they are
> > comparable in the sense that the score of the customer item is "as
> > good as it gets".  I divide the match score by the reference item
> > score and if the value is greater than one for some reason, I subtract
> > the amount above one from one to penalize it for being "too good".
> >
> > This strategy required a few tweaks in the Similarity class.  I have
> > actor name phrase queries with a word slop of two so that I can match
> > First Last to Last, First. I made my tf(float) function return 0 or 1
> > so that the scores for those two items look the same.  tf also matters
> > in the case of multiple hits of a term within a field such as title.
> > If I am matching a movie with the title "Caesar Came Saw and
> > Conquered", I don't want the title "Caesar Came, Caesar Saw, Caesar
> > Conquered" to have a higher score just because the word Caesar is
> > repeated.
> >
> > I customize the idf() function to always return a 1 for year fields
> > because it could do funny things to a score if the source item had a
> > year 1984 and my query term was year_year:[${year -1} TO ${year +1}]
> > and there was only one item with a year of 1983. The 1983 would
> > actually score higher than the 1984.
> >
> > I'm currently looking at whether overriding queryNorm() to always
> > return 1 is a good thing or not.  I saw reference in a recent thread
> > that doing that might cause ^ boosts in terms or clauses to not work
> > right so I need to go back and study that again.
> >
> > The other big thing that I'm doing is that the user doesn't define the
> > query in one big lump. They break it down into scoring sections. all
> > the title related terms are in one section and all the year related
> > terms in a different one.  The user defines weights that each of these
> > sections should contribute to my "weighted score".  I run individual
> > queries for each of these scoring sections against the source and
> > target items and record those normalized scores then multiply them by
> > their weights and add them up to get my weighted score.
> > This strategy is working pretty well, but it is slow because of all
> > the extra queries.  I know that I can eliminate them by getting access
> > to the Explanation object and parsing out the scores I want there, but
> > that is what I am in the middle of researching how to do now. :)
> >
> > Anyway.. some of this might be useful to you or maybe it is all
> > babble. You are either welcome or asked for forgiveness respectively.
> > :)
> >
> > Daniel
> >
> > On 5/30/07, Grant Ingersoll <[hidden email]> wrote:
> > > Hi Walt,
> > >
> > > One question that comes to mind, is what are you looking to do?  Are
> > > you not happy with the current scoring or you just trying to better
> > > understand scoring?  The calls to Similarity.tf(), etc. are call
> > > backs from within the scoring algorithm (have a look at TermScorer in
> > > the code) and provide a means for an application to change the score,
> > > but in many cases there really isn't too much incentive to do so.
> > >
> > > -Grant
> > >
> > >
> > > On May 30, 2007, at 4:45 PM, Walt Stoneburner wrote:
> > >
> > > > Hopefully I'm not opening myself up to public ridicule with what may
> > > > be a very stupid question, but...
> > > >
> > > > At the moment, I'm trying to wrap my head around some of
> > the math that
> > > > happens when Lucene does scoring.  Let's put aside the big equation
> > > > for a moment and focus on a simple method, such as tf().  [term
> > > > frequency]
> > > >
> > > > I understand that tf(freq) is supposed to return larger values when
> > > > freq is large, and smaller values when freq is small.  Though here's
> > > > what making me scratch my head today:
> > > >
> > > > a) Where does freq come from?  (Not what is it, but who computes it
> > > > and how?)
> > > >
> > > > Reason I ask is:
> > > >
> > > > b) How do I know what "large" and "small" is, as I don't really have a
> > > > relative scale of what the max and min values are?  Should I just
> > > > assume a linear scale of 1.0 to 0.0 will be passed to the method?
> > > >
> > > > But then that begs the question...
> > > >
> > > > c) What values should I be passing out of a function like this?
> > > > Should I normalize my outgoing scores to some scale, or do I simply
> > > > just need to provide numbers that "have the right shaped curve".
> > > >
> > > > I wish the documentation shed a smidgen bit more light in those areas.
> > > >
> > > >
> > > > I look at things like idf() which returns 1+log(ratio) and then has
> > > > that value squared.  Clearly that isn't on a scale of 1.0 to 0.0.
> > > >
> > > > I feel like there may be some mathematical trickery going on and that
> > > > maybe the actual score values themselves don't matter inside the
> > > > ranking code, so long as their relative values to one another.
> > > >
> > > > This then makes me ponder how the normalization process is done
> > > > between queries, allowing for a mix'n'match of results as these
> > > > numbers spill to the outside world.  Obviously normalization has to
> > > > happen at that point for the mixing query results magic to work.
> > > >
> > > >
> > > > Is there a math wizard in the group who can talk to me like I'm
> > > > four years old?
> > > >
> > > > -wls
> > > > http://www.wwco.com/~wls/blog/
> > > >
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > Center for Natural Language Processing
> > > http://www.cnlp.org/tech/lucene.asp
> > >
> > > Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
> > > LuceneFAQ
> > >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: The values which compute scores.

wls-2
In reply to this post by wls-2
Grant writes:
> One question that comes to mind, is what are you looking to do?

What I'm trying to do is prevent Lucene from ranking documents that
repeat one term many times above documents that match more distinct
terms.

I've got some huge queries with quite a number of unique terms.  I
want the documents that hit more unique terms to float to the top,
while documents that hit only a few of the terms sink to the
bottom (even if they have more occurrences of those terms).

Lucene, as I understand things, does this for the most part, though it
is possible that term frequency can play a significant role and drown
out the part of the desired behavior that I'd like to keep.

My users want to grab search results using the current Lucene
method, then select a checkbox and search again without term frequency
contributing to the score.  I, on the other hand, have a vested
interest in not maintaining two indexes.


Daniel wrote:
> I don't want the title "Caesar Came, Caesar Saw, Caesar Conquered" to have a higher
> score just because the word Caesar is repeated.

This is very much the kind of problem I'm trying to address.

> I'm currently looking at whether overriding queryNorm() to always
> return 1 is a good thing or not.

I vaguely recall reading something about boosts as well; I'm not sure
you want to mess with this one.  For me, idf() is the bigger question.


Yonik writes:
> For a single term, [freq is] determined at index time and stored in the index.

I guess what I'm asking is: is freq, the value passed to tf(), the
count of the term, or a ratio of the term to total terms in the index?

> Scores across queries aren't really comparable though.
> ... Lucene doesn't currently do this "mixing", ...

This has always been my understanding of search engines, that the raw
scores are effectively meaningless if used to mix other query results
together.

However, according to the documentation at
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html,
it would seem that #4 queryNorm() says:
"This factor does not affect document ranking (since all ranked
documents are multiplied by the same factor), but rather just attempts
to make scores from different queries (or even different indexes)
comparable."

This to me is black magic.  It implies that one can do two different
queries and a merge-sort, and further that the content can come from
different indexes.

Either I'm reading this completely wrong, or the documentation may
need an update.


Hope this provides better clarification.

As for why I am asking all this: what I'd like to do is get a
firm understanding of how Lucene does scoring, in such a way that the
behavior can be modified, and, if I'm ambitious enough, fill in some of
the holes in the documentation.

Thanks all,
-wls



Re: The values which compute scores.

Chris Hostetter-3

: What I'm trying to do is prevent Lucene from ranking documents that
: repeat one term many times above documents that match more distinct
: terms.
:
: I've got some huge queries with quite a number of unique terms.  I
: want the documents that hit more unique terms to float to the top,
: while documents that hit only a few of the terms sink to the
: bottom (even if they have more occurrences of those terms).
:
: Lucene, as I understand things, does this for the most part, though it
: is possible that term frequency can play a significant role and drown
: out the part of the desired behavior that I'd like to keep.

your best two choices for tweaking this behavior are to make term
frequency less significant, or make the coord factor for boolean queries
more significant.
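
To make those two knobs concrete, here is a minimal sketch in plain Java.
It is not Lucene code: the class TfCoordSketch and the method toyScore are
made up for illustration, and idf, length norms, and boosts are left out
entirely. It assumes the classic default shapes, tf = sqrt(freq) and
coord = overlap / maxOverlap, and compares them with a flattened tf that
returns 1 for any match.

```java
// Simplified illustration only; these are NOT Lucene's classes.
// idf, length norms, and boosts are deliberately omitted.
final class TfCoordSketch {

    // Assumed shape of the classic default: square root of the raw count.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Flattened variant: any match counts once, repeats add nothing.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // Fraction of the query terms that the document matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // freqs[i] is the raw count of query term i in the document (0 = no match).
    static float toyScore(int[] freqs, boolean flat) {
        float sum = 0.0f;
        int overlap = 0;
        for (int freq : freqs) {
            if (freq > 0) {
                overlap++;
                sum += flat ? flatTf(freq) : defaultTf(freq);
            }
        }
        return sum * coord(overlap, freqs.length);
    }

    public static void main(String[] args) {
        int[] repeatsOneTerm = {400, 0, 0, 0}; // one of four terms, 400 times
        int[] hitsEveryTerm = {1, 1, 1, 1};    // all four terms, once each
        System.out.println("sqrt tf: " + toyScore(repeatsOneTerm, false)
                + " vs " + toyScore(hitsEveryTerm, false));
        System.out.println("flat tf: " + toyScore(repeatsOneTerm, true)
                + " vs " + toyScore(hitsEveryTerm, true));
    }
}
```

In this toy model the square-root tf lets the document that repeats one
term 400 times score 5.0 against 4.0 for the document that hits all four
terms once; with the flat tf the order reverses (0.25 against 4.0). As far
as I can tell, the same idea in a real application would go into an
overridden tf() on a Similarity subclass chosen at search time, so a second
index should not be needed for the checkbox case.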

: I guess what I'm asking is: is freq, the value passed to tf(), the
: count of the term, or a ratio of the term to total terms in the index?

for term queries it is the literal term frequency (you can see this by
looking at the Explanation info for a query)

: "This factor does not affect document ranking (since all ranked
: documents are multiplied by the same factor), but rather just attempts
: to make scores from different queries (or even different indexes)
: comparable."
:
: This to me is black magic.  It implies that one can do two different
: queries and a merge-sort, and further that the content can come from
: different indexes.

it is, in fact, black magic ... as the phrase says, it *attempts* to make
scores from different queries comparable ... it does not actually make
them mathematically comparable, since scores are completely unbounded.
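
The "does not affect document ranking" half of that sentence is easy to
check with a small sketch. The class QueryNormSketch and its methods are
invented for illustration and have nothing to do with Lucene's internals;
the only assumption is that every hit of one query is multiplied by the
same positive factor.

```java
import java.util.Arrays;

// Illustration only: a shared positive factor rescales scores
// without changing the order of the hits.
final class QueryNormSketch {

    // Multiply every raw score by the same factor.
    static float[] scale(float[] scores, float factor) {
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * factor;
        }
        return out;
    }

    // Document positions sorted by descending score.
    static Integer[] rankOrder(float[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        float[] raw = {0.8f, 3.2f, 1.5f, 0.1f};
        System.out.println(Arrays.toString(rankOrder(raw)));
        System.out.println(Arrays.toString(rankOrder(scale(raw, 0.37f))));
    }
}
```

Both lines print the same order. The scaled values are still unbounded,
though, so nothing here makes a score from one query meaningful next to a
score from another.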

As i recall, a more practical purpose for the queryNorm is that when
dealing with large complex query structures consisting of "container"
queries (BooleanQueries, DisjunctionMaxQueries, SpanNearQueries, etc...)
the queryNorm is applied to the "leaf" queries as the computation
proceeds, which helps keep the scores from getting unmanageably large
(and losing precision) as they are aggregated up.

when dealing with floats, where 0<n<1 ...
  A*n + B*n + C*n + ... + Z*n
...results in a more "precise" calculation than...
  (A + B + C + ... + Z)*n

...correct?
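
One way to test that question rather than guess is to compute both forms
in float and compare each against the same sum done in double. The sketch
below does that; the class name FloatSumSketch and the sample values are
arbitrary choices for illustration, not anything taken from Lucene.

```java
// Illustration only: compares the two float evaluation orders
// against a double-precision reference.
final class FloatSumSketch {

    // Scale each term first, then add: A*n + B*n + ... + Z*n
    static float scaleThenSum(float[] terms, float n) {
        float sum = 0.0f;
        for (float t : terms) {
            sum += t * n;
        }
        return sum;
    }

    // Add first, then scale once: (A + B + ... + Z)*n
    static float sumThenScale(float[] terms, float n) {
        float sum = 0.0f;
        for (float t : terms) {
            sum += t;
        }
        return sum * n;
    }

    // Same quantity accumulated in double, used as the reference value.
    static double reference(float[] terms, float n) {
        double sum = 0.0;
        for (float t : terms) {
            sum += (double) t;
        }
        return sum * (double) n;
    }

    public static void main(String[] args) {
        float[] terms = {12.5f, 0.03f, 871.2f, 4.4f, 0.0007f, 65.1f};
        float n = 0.137f;
        double ref = reference(terms, n);
        System.out.println("scale then sum error: "
                + Math.abs(scaleThenSum(terms, n) - ref));
        System.out.println("sum then scale error: "
                + Math.abs(sumThenScale(terms, n) - ref));
    }
}
```

Both forms round at every float operation, and floats keep roughly the
same relative precision at any magnitude short of overflow, so which one
lands closer to the reference may well depend on the particular values.
Trying it with scores of the size a real query produces would settle it
for a given case.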




-Hoss

