Overriding Similarity

Overriding Similarity

Grant Ingersoll
Hey Luceners,

I am doing some documentation on scoring and I am interested in use  
cases people have for overriding the DefaultSimilarity.  If you can  
share what you did and why you did it, it would be much appreciated.

For example, Daniel Naber posted his at:
http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967

Thanks,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Overriding Similarity

Chris Hostetter-3

: I am doing some documentation on scoring and I am interested in use
: cases people have for overriding the DefaultSimilarity.  If you can
: share what you did and why you did it, it would be much appreciated.

I touched on this a little bit when I committed SweetSpotSimilarity...

http://www.nabble.com/Re%3A-SweetSpotSimiliarity-p4536312.html

...really any situation where you know more about your data than just that
it's "text" is a situation where it *might* make sense to override your
Similarity methods.


-Hoss


Re: Overriding Similarity

MH H
I had a situation where I was only interested in whether the term was
there or not (not how many times), and I didn't want to penalize long
fields. So I wrote a Similarity subclass in which I overrode the
following methods like this:

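   // Every non-empty field gets the same norm, so long fields are not penalized.
   public float lengthNorm(String fieldName, int numTerms) {
      return numTerms > 0 ? 1.0f : 0.0f;
   }

   // Only the presence of the term matters, not how often it occurs.
   public float tf(float freq) {
      return freq > 0 ? 1.0f : 0.0f;
   }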
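
To make the effect of those two overrides concrete, here is a small standalone comparison. It is only a sketch and does not use Lucene: the class and method names are made up for illustration, and the "default" formulas (sqrt(freq) for tf and 1/sqrt(numTerms) for lengthNorm) are my understanding of what DefaultSimilarity did at the time, so treat them as an assumption.

```java
// Toy comparison, not Lucene code. SimilarityFactors and its method names are
// hypothetical; the "default" formulas are assumed to match DefaultSimilarity.
final class SimilarityFactors {
    // Assumed default: term frequency is dampened with a square root.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed default: longer fields get a smaller norm.
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Override from the post: the term is either present or not.
    static float binaryTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // Override from the post: every non-empty field gets the same norm.
    static float flatLengthNorm(int numTerms) {
        return numTerms > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        int[][] cases = { {1, 4}, {9, 100} }; // {term frequency, field length}
        for (int[] c : cases) {
            float byDefault = defaultTf(c[0]) * defaultLengthNorm(c[1]);
            float overridden = binaryTf(c[0]) * flatLengthNorm(c[1]);
            System.out.println("freq=" + c[0] + " length=" + c[1]
                    + " default=" + byDefault + " overridden=" + overridden);
        }
    }
}
```

With the overrides, both example documents contribute the same tf * lengthNorm factor, whereas the assumed defaults favor the short field and reward the repeated term.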

And then I made this subclass the default similarity. It worked well
for tf but not for lengthNorm. The reason appears to be that the
TermScorer class does not call lengthNorm, but instead uses a cache
implemented as a static array in Similarity, made available through
static methods in Similarity. Since TermScorer calls these static
methods in Similarity, changing the default similarity has no effect
in this regard. So I ended up having to customize the core Lucene
code by changing the following code in Similarity:

   static {
      for (int i = 0; i < 256; i++)
         // Originally: NORM_TABLE[i] = SmallFloat.byte315ToFloat((byte)i);
         NORM_TABLE[i] = 1.0f;
   }

This worked well, but I had hoped not to have to change core Lucene, so
if anyone has a better solution, I would appreciate some tips.

MHH

Re: Overriding Similarity

Chris Hostetter-3
: And then I made this subclass the default similarity. It worked well
: for tf but not for lengthNorm. The reason appears to be that the
: TermScorer class does not call lengthNorm, but instead uses a cache

Actually, the lengthNorm method is used by the IndexWriter; it compresses
the float returned by lengthNorm into a representation that uses a single
byte, and writes it to a file (one per field) which is exposed by
IndexReader.norms(field) for use in the Scorers.

:          NORM_TABLE[i] = 1.0f; //Originally: NORM_TABLE[i] =
: SmallFloat.byte315ToFloat((byte)i);

That norm table is just used as a cache of mappings from the "byte
encoded" values to the nearest float value so that Scorers don't need to
call SmallFloat.byte315ToFloat((byte)i) every time.
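
To illustrate the single-byte encoding and the decode cache described above, here is a standalone sketch. It is a re-implementation written for illustration, assuming the layout the name byte315 suggests (3 mantissa bits, exponent zero point 15); NormCodec is a hypothetical class, and the code is not guaranteed to be identical to Lucene's SmallFloat.

```java
// Standalone sketch of a one-byte float encoding plus a 256-entry decode
// cache. NormCodec is a hypothetical name; this is not Lucene source.
final class NormCodec {
    // One decoded float per possible byte value, filled once at class load,
    // so decoding at search time is a single array lookup.
    static final float[] NORM_TABLE = new float[256];

    static {
        for (int i = 0; i < 256; i++) {
            NORM_TABLE[i] = byte315ToFloat((byte) i);
        }
    }

    // Index time: keep 3 mantissa bits and a shifted exponent (lossy).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Zero and negatives map to 0; tiny positives map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable code (0xFF)
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Inverse mapping from the byte code back to the nearest float.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Search time: table lookup instead of recomputing the decode.
    static float decode(byte b) {
        return NORM_TABLE[b & 0xff];
    }

    public static void main(String[] args) {
        float norm = (float) (1.0 / Math.sqrt(10)); // a 10-term field, assumed default formula
        System.out.println(norm + " is stored as " + decode(floatToByte315(norm)));
    }
}
```

Note that 1.0f survives the round trip exactly, which is why a lengthNorm that always returns 1.0f for non-empty fields needs no change to the table itself: the table only decodes whatever was written at index time.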

If you use Similarity.setDefault (or IndexWriter.setSimilarity) before
building your index you shouldn't need that change.
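
The reason the timing matters can be shown with a toy model. This is a sketch only: NormTimingDemo, buildIndex and score are hypothetical stand-ins for the index writer and scorer, not Lucene APIs, and the default formula 1/sqrt(numTerms) is an assumption. The point is that lengthNorm runs once while the index is built, and scoring afterwards only reads the stored value.

```java
// Toy model of index-time norms. All names here are hypothetical stand-ins,
// not Lucene classes or methods.
final class NormTimingDemo {
    interface LengthNorm {
        float lengthNorm(int numTerms);
    }

    // Assumed default behavior: longer fields get smaller norms.
    static final LengthNorm DEFAULT = n -> (float) (1.0 / Math.sqrt(n));

    // The override from this thread: every non-empty field gets 1.0.
    static final LengthNorm FLAT = n -> n > 0 ? 1.0f : 0.0f;

    // "Indexing": lengthNorm is called here and the result is stored per document.
    static float[] buildIndex(int[] fieldLengths, LengthNorm sim) {
        float[] norms = new float[fieldLengths.length];
        for (int i = 0; i < fieldLengths.length; i++) {
            norms[i] = sim.lengthNorm(fieldLengths[i]);
        }
        return norms;
    }

    // "Scoring": only the stored norm is read; lengthNorm is never called again.
    static float score(float[] storedNorms, int doc, float tf) {
        return tf * storedNorms[doc];
    }

    public static void main(String[] args) {
        int[] lengths = {4, 100};
        float[] builtWithDefault = buildIndex(lengths, DEFAULT);
        float[] builtWithFlat = buildIndex(lengths, FLAT);
        // Switching the similarity after indexing cannot change these values.
        System.out.println("default index, long doc: " + score(builtWithDefault, 1, 1.0f));
        System.out.println("flat index, long doc:    " + score(builtWithFlat, 1, 1.0f));
    }
}
```

In this model, an index built with the default formula keeps penalizing the 100-term field no matter which similarity is active at search time; only rebuilding the index with the custom similarity changes the stored norms.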




-Hoss


Re: Overriding Similarity

MH H
Ah, I see, I should of course use the same similarity during indexing
and searching. Many thanks!

(In reply to Chris Hostetter's message of 20/08/06.)