Newbie questions re: scoring

Newbie questions re: scoring

Lee, Andrew J (CA - Toronto)
Hi,

I am new to Lucene and this mailing list, so my apologies if these
questions have already been answered.

1)  I create an index with one document with a searchable field of "All
dogs are brown."  If I search on that index with a query of "All dogs
are brown." I do not get a hit with score 1.0, but something low like
0.38.  I tried looking at the scoring algorithm and can't make heads or
tails of it.  Can anybody explain it to me in simple terms?

2)  I have an index of documents, then run a search against it.  I run
through the list of hits, building a Vector of documents whose score is
above a certain threshold.  If I run the program with a threshold of
say, 0.15, I'll get a Vector of documents with scores >= 0.15 (as
expected).  If I set the threshold higher (0.30, for example) and rerun
the program, I see some of the same documents that I thought would have
been trimmed off with the higher threshold.  With a threshold of 0.15
they would score 0.17, and with a threshold of 0.30 they are scoring
something like 0.33.  Can anybody explain this?  My trimming is coming
post-index-searching, so this is pretty confusing.

Thanks in advance for any help.

Andrew Lee




Re: Newbie questions re: scoring

Chris Hostetter-3

: 1)  I create an index with one document with a searchable field of "All
: dogs are brown."  If I search on that index with a query of "All dogs
: are brown." I do not get a hit with score 1.0, but something low like
: 0.38.  I tried looking at the scoring algorithm and can't make heads or
: tails of it.  Can anybody explain it to me in simple terms?

I've been using Lucene for about 16 months now, and I've never found a
simple way to explain the scoring.  But a big factor that you need to
realize is that there is a difference between the "raw" score and the
normalized score.  If you use a HitCollector or TopDocs object you get the
raw score -- which is unconstrained.  If you use a Hits object then your
scores will be normalized so that *if* the highest scoring document has a
score above 1, then all scores will be divided by the highest score -- if
the highest score is less than one, nothing changes.
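
To make the normalization rule above concrete, here is a minimal sketch in
plain Java.  It is not Lucene's own code: the class and method names are
hypothetical, and it only mimics the rule as described (divide by the top
score when that score is above 1, otherwise leave everything alone).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: mimics the normalization rule
// described above for scores returned through a Hits object.
final class HitsNormalizationSketch {

    /** Divides every score by the top score, but only when the top score exceeds 1. */
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        // Work on a copy so the caller's raw scores are left untouched.
        float[] out = Arrays.copyOf(rawScores, rawScores.length);
        if (max > 1.0f) {
            for (int i = 0; i < out.length; i++) {
                out[i] = out[i] / max;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Top raw score is above 1, so every score is rescaled by it.
        System.out.println(Arrays.toString(normalize(new float[] {6.6f, 2.2f})));
        // Top raw score is below 1, so the scores come back unchanged.
        System.out.println(Arrays.toString(normalize(new float[] {0.38f, 0.17f})));
    }
}
```

Note that under this rule the normalized value of a document depends on
whatever the top-scoring document happens to be for that particular search.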

My best advice for understanding how scores are calculated is to look
at the toString() of an Explanation object from searcher.explain() for a
bunch of queries on a bunch of documents you know match, and think about
how those explanations correspond to the equation in the Similarity class
javadocs.

: 2)  I have an index of documents, then run a search against it.  I run
: through the list of hits, building a Vector of documents whose score is
: above a certain threshold.  If I run the program with a threshold of
: say, 0.15, I'll get a Vector of documents with scores >= 0.15 (as
: expected).  If I set the threshold higher (0.30, for example) and rerun
: the program, I see some of the same documents that I thought would have
: been trimmed off with the higher threshold.  With a threshold of 0.15
: they would score 0.17, and with a threshold of 0.30 they are scoring
: something like 0.33.  Can anybody explain this?  My trimming is coming
: post-index-searching, so this is pretty confusing.

Are you doing this with the exact same index and Query each time?

1) That shouldn't happen ... can you email some code that demonstrates this
problem (ideally code that builds a small index and then searches it and
shows the same document getting two different scores without the index
changing)?

2) independent of the scores being different, it is not safe to try and
pick a score threshold, this is mentioned in the FAQ...

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Newbie questions re: scoring

John Hamilton
In reply to this post by Lee, Andrew J (CA - Toronto)

> 2) independent of the scores being different, it is not safe to try and
> pick a score threshold, this is mentioned in the FAQ...
>
> http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03



That link appears to be referring to normalized scores (everything is < 1.0).  Is it also not safe to use a threshold for raw scores?



RE: Newbie questions re: scoring

Chris Hostetter-3

: That link appears to be referring to normalized scores (everything is <
: 1.0).  Is it also not safe to use a threshold for raw scores?

Nope.  The basic flaw in comparing scores between two queries still holds
... early messages in the threads linked to go into more detail, but as I
recall, the basic problem has to do with the way idf and docFreq come into
play.  Just because a term query for foo:bar says that document A has a
score of 2.2 and B has a score of 6.6, and a term query for yak:baz says
that document X has a score of 2.2 and Y has a score of 6.6, doesn't mean
X is as relevant to yak:baz as A is to foo:bar -- it just means that the
relative quality of B compared to A is the same as the relative quality of
Y compared to X for their respective queries.  (Once they're normalized,
even that goes out the window.)
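
As a rough illustration of how idf and docFreq enter the picture, the sketch
below computes idf for two terms in the same index.  It assumes the classic
DefaultSimilarity formula, ln(numDocs / (docFreq + 1)) + 1; other Similarity
implementations may differ.  The class name and the docFreq values are made
up for the example, and the real score involves more factors than idf alone.

```java
// Hypothetical illustration, not Lucene code: the idf formula is assumed to
// be the classic DefaultSimilarity one, ln(numDocs / (docFreq + 1)) + 1.
final class IdfComparisonSketch {

    /** Inverse document frequency under the assumed classic formula. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1000;            // made-up index size
        double rare = idf(9, numDocs);     // e.g. foo:bar, term in 9 documents
        double common = idf(499, numDocs); // e.g. yak:baz, term in 499 documents

        // The rarer term gets a much larger idf, so identical raw scores for
        // the two queries do not reflect the same degree of match.
        System.out.println("idf(rare term)   = " + rare);
        System.out.println("idf(common term) = " + common);

        // A one-document index where the term appears in that document.
        System.out.println("idf(1 of 1 docs) = " + idf(1, 1));
    }
}
```

Under this assumed formula a term that appears in the only document of a
one-document index has an idf below 1, which is one reason an exact match
need not come back with a score of 1.0.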

The only way I can think of to fairly compare scores from queries for
foo:bar with queries for yak:baz is to normalize them relative to a maximum
possible score across the entire term query space -- but finding that
maximum is a pretty complicated problem just for simple term queries ...
when you start talking about more complicated query structures it really
gets messy -- and even then it's only fair as long as the query structures
are identical; you can never compare the scores from apples and oranges.





-Hoss

