Implementation of a ScoreObject ?


Implementation of a ScoreObject ?

Robichaud, Jean-Philippe
Hi Everyone,

 

            Lucene is incredible for a lot of reasons.  I've been using it
for the past months and it served me quite well.  I'm using the subversion
snapshots, which I update every now and then.  Almost every functionality I
need is already present and well implemented, but sadly some crucial ones
are missing. I think the most crucial for me is to have something like a
ScoreObject that would contain in a simple (bean?) way all the score
information.  Having to use the reader.explain() function is just impractical
since it basically reruns the entire search to give a STRING representation
of the scores of ONE document.  Parsing the explanation is pretty slow for
an application that handles 1k-1.5k queries per minute.  I would really
need that "term level" information to enhance my application.

 

Probably the simplest/ideal schema of the ScoreObject would be something
like a hashtable with Term being the keys and a TermScoreObject the value.
The TermScoreObject would be filled at search time (if asked) and would
contain all values used in the calculation of the "similarity score".  That
way we could easily know what is the contribution of a specific term to the
overall score.  
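
A minimal sketch of what the proposal above might look like, for illustration only. Nothing here is existing Lucene API: the names ScoreObject, TermScoreObject, record, total and share are hypothetical, String keys such as "body:lucene" stand in for Lucene's Term class, and the contribution is simplified to a plain product of the recorded factors rather than Lucene's full scoring formula.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small demo of the hypothetical classes defined below. */
class ScoreObjectDemo {
    public static void main(String[] args) {
        ScoreObject score = new ScoreObject();
        score.record("body:lucene", new TermScoreObject(2f, 1.5f, 1f, 0.5f));
        score.record("body:score", new TermScoreObject(1f, 2f, 1f, 0.25f));
        System.out.println("total = " + score.total());
        System.out.println("share of body:lucene = " + score.share("body:lucene"));
    }
}

/** Hypothetical per-term record holding the factors behind one term's score. */
final class TermScoreObject {
    final float tf;        // term-frequency factor for this document
    final float idf;       // inverse-document-frequency factor
    final float boost;     // query-time boost applied to the term
    final float fieldNorm; // field length normalization

    TermScoreObject(float tf, float idf, float boost, float fieldNorm) {
        this.tf = tf;
        this.idf = idf;
        this.boost = boost;
        this.fieldNorm = fieldNorm;
    }

    /** Simplified contribution (an assumption): the product of the factors. */
    float contribution() {
        return tf * idf * boost * fieldNorm;
    }
}

/** Hypothetical container: one entry per query term that matched the document. */
final class ScoreObject {
    // Insertion order is kept so terms can be listed in query order.
    private final Map<String, TermScoreObject> terms = new LinkedHashMap<>();

    void record(String term, TermScoreObject details) {
        terms.put(term, details);
    }

    TermScoreObject get(String term) {
        return terms.get(term);
    }

    /** Sum of all per-term contributions. */
    float total() {
        float sum = 0f;
        for (TermScoreObject t : terms.values()) {
            sum += t.contribution();
        }
        return sum;
    }

    /** Fraction of the total owed to one term, or 0 if the term is unknown. */
    float share(String term) {
        TermScoreObject t = terms.get(term);
        float total = total();
        return (t == null || total == 0f) ? 0f : t.contribution() / total;
    }
}
```

With a structure like this, the per-term contribution is a map lookup instead of a second search plus string parsing.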

 

Is this something that would be useful to others also?  Is this a feature
that was on the dev whiteboard?

 

Thanks,

 

Jp

 

____________________________________________________________________________
_________
SpeechWorks solutions from ScanSoft. Inspired Applications, Exceptional
Results

 

<Jean-Philippe Robichaud >  ::  Solution Speech Scientist

ScanSoft :: Professional Services

5100-75 Queen Street, Montreal, QC

P +1 514 843 4884

 

 


Re: Implementation of a ScoreObject ?

Chuck Williams
Robichaud, Jean-Philippe wrote:

>Probably the simplest/ideal schema of the ScoreObject would be something
>like a hashtable with Term being the keys and a TermScoreObject the value.
>The TermScoreObject would be filled at search time (if asked) and would
>contain all values used in the calculation of the "similarity score".  That
>way we could easily know what is the contribution of a specific term to the
>overall score.  
>  
>
Jean-Philippe,

Some of us have talked about a score object in the past and agree that
this would be a very good thing.  In addition to providing a sounder
foundation for explanation, such a mechanism could help to provide
better scoring.  For example, one limitation in Lucene now is that score
normalization is ad hoc -- all scores are divided by the highest score
IF the highest score is greater than 1, and whether or not the highest
unnormalized score is greater than 1 is pretty much random.  This yields
a situation where scores across multiple searches are not comparable
(notwithstanding many applications that do compare them, getting random
results).  With a score object, one would like to keep additional
information, e.g., a count of boost-weighted query terms and the
boost-weighted percentage of such terms that were matched by each
result.  This could provide a more intrinsic normalization scheme, e.g.,
defining the highest score as the boost-weighted percentage of matched
query terms and dividing all scores by the same constant to achieve
this.  (Some additional refinements are necessary to handle things like
MultiTermQuery objects, which rewrite to BooleanQuery objects with coord disabled --
such lists of alternate query terms should count as one term).
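
One possible reading of the normalization idea above, as a rough sketch rather than anything that exists in Lucene. The class MatchNormalization and its methods are hypothetical, query terms are plain strings, raw scores are assumed to be sorted with the top hit first, and the refinement for rewritten MultiTermQuery clauses is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper illustrating match-based score normalization. */
class MatchNormalization {

    /** Boost-weighted fraction of the query terms that a document matched. */
    static double matchedFraction(Map<String, Double> queryBoosts, Set<String> matched) {
        double totalBoost = 0.0;
        double matchedBoost = 0.0;
        for (Map.Entry<String, Double> e : queryBoosts.entrySet()) {
            totalBoost += e.getValue();
            if (matched.contains(e.getKey())) {
                matchedBoost += e.getValue();
            }
        }
        return totalBoost == 0.0 ? 0.0 : matchedBoost / totalBoost;
    }

    /**
     * Divides every raw score by one constant, chosen so that the top score
     * becomes the top document's matched fraction.
     * Assumes rawScores[0] is the highest raw score.
     */
    static double[] normalize(double[] rawScores, double topMatchedFraction) {
        double[] out = new double[rawScores.length];
        if (rawScores.length == 0 || rawScores[0] <= 0.0 || topMatchedFraction <= 0.0) {
            return out; // nothing meaningful to scale; all zeros
        }
        double divisor = rawScores[0] / topMatchedFraction;
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] / divisor;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("lucene", 2.0);
        boosts.put("score", 1.0);
        boosts.put("object", 1.0);

        Set<String> matchedByTopHit = new HashSet<>();
        matchedByTopHit.add("lucene");
        matchedByTopHit.add("score");

        double fraction = matchedFraction(boosts, matchedByTopHit);
        double[] normalized = normalize(new double[] {3.0, 1.5}, fraction);
        System.out.println("top matched fraction = " + fraction);
        System.out.println("normalized top score = " + normalized[0]);
    }
}
```

Because the divisor depends on how much of the query the best hit matched, not on whether the raw top score happens to exceed 1, a top score from one search can be compared with a top score from another.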

That is one additional example of something score objects could be used
for.  A general mechanism should provide for easy extension such that
different scoring classes could collect, record and aggregate different
information for various purposes.

I've wanted to work on this for a while but haven't found the time.  I
know Doug has had a score object mechanism on his radar screen (he first
suggested this approach to me as a solution to the normalization issue
I'm concerned about).  I expect he has a good approach in mind.  It
would be great if you'd tackle this -- I'd be happy to help if that
makes sense.

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Implementation of a ScoreObject ?

Robichaud, Jean-Philippe
I would gladly help.  I fear that my Java skills are probably a little
limited for the task, but hey, why not.  I would certainly need some
guidance as to where to start.  I'm just too unfamiliar with complex
query structures and scoring methodology.  While I'm pretty sure reading
the entire code would be a great exercise, I'm not sure I can afford the
time needed to learn everything the hard way...  

Doug, do you have any pointers on where I could start?

Thanks,

Jp



-----Original Message-----
From: Chuck Williams [mailto:[hidden email]]
Sent: Wednesday, April 27, 2005 12:30 PM
To: [hidden email]
Subject: Re: Implementation of a ScoreObject ?


