PerFieldSimilarity


PerFieldSimilarity

Robichaud, Jean-Philippe
Hi Everyone,

I've been searching the archive without success to answer this one: is it
possible to specify one similarity class per field, just like we can do with
an analyzer?  I know I can change the similarity of the searcher, but that
forces me to break some complex queries into separate chunks and sum the
scores "by hand" rather than letting the fast internal implementation do the
job.  What I would really like is something like PerFieldAnalyzerWrapper, but
for similarity.  Is this possible?

Jp

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: PerFieldSimilarity

Erik Hatcher

On May 3, 2005, at 5:57 PM, Robichaud, Jean-Philippe wrote:

> Hi Everyone,
>
> I've been searching the archive without success to answer this one: is it
> possible to specify one similarity class per field, just like we can do
> with an analyzer?  I know I can change the similarity of the searcher, but
> that forces me to break some complex queries into separate chunks and sum
> the scores "by hand" rather than letting the fast internal implementation
> do the job.  What I would really like is something like
> PerFieldAnalyzerWrapper, but for similarity.  Is this possible?

I'm interested in what your use case is in desiring this.  What  
factors would you vary per field?  The only factor that seems to make  
sense is lengthNorm which is computed at indexing time and does allow  
per-field tweaking.  A custom Similarity subclass could be used to  
affect the lengthNorm based on the field name parameter.
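
A minimal sketch of that idea in plain Java, with no Lucene dependency so it
runs on its own.  The class name FieldAwareNorm, the field names "title" and
"body", and the decision to ignore length for "title" are all hypothetical.
Only the 1/sqrt(numTerms) fallback is meant to mirror the default behaviour;
in a real index the same branching would sit in a Similarity subclass that
overrides lengthNorm(String fieldName, int numTerms).

```java
// Stand-in for a Similarity subclass: NOT a Lucene class, just the same idea.
final class FieldAwareNorm {

    // Same shape as lengthNorm(fieldName, numTerms): the field name picks the rule.
    static float lengthNorm(String fieldName, int numTerms) {
        if ("title".equals(fieldName)) {
            // Hypothetical choice: field length does not influence the score.
            return 1.0f;
        }
        // Fallback: shorter fields get a larger norm.
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println("title, 4 terms: " + lengthNorm("title", 4));
        System.out.println("body,  4 terms: " + lengthNorm("body", 4));
    }
}
```

Because the norm is baked in at indexing time, changing the rule for a field
means re-indexing the documents that use it.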

     Erik



RE: PerFieldSimilarity

Robichaud, Jean-Philippe

I have an application where I use Lucene to retrieve "made up" documents,
just like many people do.  In my case, I need the score to be meaningful,
really meaningful.  For certain fields, the similarity should be a
frequency count with no idf factor; for others the idf should be the real
idf; for others again it should be sqrt(idf).  Again, I can change
the similarity of the reader at run-time and issue specific queries, summing
the scores myself, but that is pretty inefficient.  A ScoreObject
(http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/%3c42
[hidden email]%3e) would save me a little bit, but that's
another topic.
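
To make the three cases concrete, here is a rough sketch in plain Java (no
Lucene classes involved).  The class PerFieldIdf and the field names "count",
"plain" and "damped" are made up for illustration; the point is only that the
field name selects how a raw idf value is transformed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper: maps a field name to the idf rule wanted for it.
final class PerFieldIdf {

    private static final Map<String, DoubleUnaryOperator> RULES = new HashMap<>();
    static {
        RULES.put("count", raw -> 1.0);             // frequency count only, idf ignored
        RULES.put("plain", raw -> raw);             // the real idf
        RULES.put("damped", raw -> Math.sqrt(raw)); // sqrt(idf)
    }

    // Fields without a rule keep the raw idf unchanged.
    static double idfFor(String field, double rawIdf) {
        DoubleUnaryOperator rule = RULES.get(field);
        return rule == null ? rawIdf : rule.applyAsDouble(rawIdf);
    }

    public static void main(String[] args) {
        for (String field : new String[] {"count", "plain", "damped"}) {
            System.out.println(field + ": " + idfFor(field, 4.0));
        }
    }
}
```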

I understand that Lucene's objective is to be a generic search engine
rather than a semantic/specialized IR system, but it is so close to being one
that it is too tempting to use it as is.

Jp
 

-----Original Message-----
From: Erik Hatcher [mailto:[hidden email]]
Sent: Tuesday, May 03, 2005 7:40 PM
To: [hidden email]
Subject: Re: PerFieldSimilarity



Re: PerFieldSimilarity

Doug Cutting
Robichaud, Jean-Philippe wrote:
> Again, I can change
> the similarity of the reader at run-time and issue specific queries, summing
> the score myself, but that is pretty inefficient.

You can also specify a Similarity implementation per Query node in a
complex query, e.g.:

import org.apache.lucene.search.*;

BooleanQuery query = new BooleanQuery() {
   // Use a custom Similarity for this query node only.
   public Similarity getSimilarity(Searcher searcher) {
     return new DefaultSimilarity() {
        // ... override Similarity methods here ...
     };
   }
};

Doug


RE: PerFieldSimilarity

Robichaud, Jean-Philippe
How cool, I did not know that...  That may help me.  If I understand you
correctly, I can create a boolean query where each "clause" uses a different
similarity?

Thanks,

Jp

 
<Jean-Philippe Robichaud >  ::  Solution Speech Scientist
ScanSoft :: Professional Services
5100-75 Queen Street, Montreal, QC
P +1 514 843 4884
 

-----Original Message-----
From: Doug Cutting [mailto:[hidden email]]
Sent: Wednesday, May 04, 2005 4:45 PM
To: [hidden email]
Subject: Re: PerFieldSimilarity


Re: PerFieldSimilarity

Doug Cutting
Robichaud, Jean-Philippe wrote:
> How cool, I did not know that...  That may help me.  If I understand you
> correctly, I can create a boolean query where each "clause" uses a
> different similarity?

Yes.  That would look something like:

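import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

BooleanQuery booleanQuery = new BooleanQuery();
// TermQuery takes a Term (field, text), not two strings.
TermQuery clause1 = new TermQuery(new Term("foo", "bar")) {
     public Similarity getSimilarity(Searcher searcher) {
       return new DefaultSimilarity() {
         // Constant idf for this clause only.
         public float idf(int docFreq, int numDocs) { return 1.0f; }
       };
     }
  };
booleanQuery.add(clause1, true, false);  // required, not prohibited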
...

Doug



RE: PerFieldSimilarity

Robichaud, Jean-Philippe
Thanks for the clarification...

While studying the Similarity documentation in more depth, I discovered
something that is troubling me a little.  The idf is calculated using the
following formula:

log(numDocInIndex / (numDocWithTerm_t + 1)) + 1

While I agree this is fine for most applications, it is not quite right for
mine.  numDocWithTerm_t is really numDocWith_t.text_in_field_t.field.  That's
fine with me; the problem is the other one, numDocInIndex.  I would like to
use numDocInIndex_having_t.field instead.  The reason is, again, that I want
the similarity score to be really meaningful.  I have 'classes' of documents
in the same index:
Document1: MeaningA="something here",ContentA="searchable text 1"
Document2: MeaningB="something else",ContentB="searchable text 2"
...

I have an unequal number of "A" and "B" documents.  The same query text will
be sent to ContentA and ContentB at the same time.  Since there are more
documents in class B than in class A, the idf for each field should use a
different numDocInIndex value.  Is there a good way to achieve that?
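
To show the size of the effect, here is a small stand-alone sketch in plain
Java.  It only evaluates the formula above with two different denominators;
the class name PerClassIdf and all the document counts are invented for the
example, and it says nothing about how the per-class counts would be obtained
from a real index.

```java
// Hypothetical illustration: same term, same docFreq, two choices of numDocs.
final class PerClassIdf {

    // The formula quoted above: log(numDocs / (docFreq + 1)) + 1, natural log.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int docsInClassA = 100;   // documents that have a ContentA field (made up)
        int docsInClassB = 900;   // documents that have a ContentB field (made up)
        int totalDocs = docsInClassA + docsInClassB;
        int docFreq = 9;          // ContentA documents containing the term (made up)

        // Whole-index count, as in the quoted formula.
        System.out.println("idf with numDocInIndex:    " + idf(docFreq, totalDocs));
        // Count restricted to documents that actually have the field.
        System.out.println("idf with class-A doc count: " + idf(docFreq, docsInClassA));
    }
}
```

With the whole-index count, a term in the smaller class looks rarer than it
is within its own class, so its idf is inflated relative to the larger class.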

Thanks for all your help,

Jp


-----Original Message-----
From: Doug Cutting [mailto:[hidden email]]
Sent: Wednesday, May 04, 2005 5:10 PM
To: [hidden email]
Subject: Re: PerFieldSimilarity
