Custom lucene scoring - Dot product between field boost and query boost

Custom lucene scoring - Dot product between field boost and query boost

Yuval Kesten
Hi,
I want to use Lucene with the following scoring logic:
When I index my documents, I want to set a score/weight for each field.
When I query my index, I want to set a score/weight for each query term.

I will NEVER index or query with multiple instances of the same field: each query (and each document) will contain at most one instance of any given field name.
My fields and query terms are not analyzed; each already consists of a single token.

I want the score to be simply the dot product between the query's fields and the document's fields, counting a field only when both have the same value for it.

For example:
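Query:

Field Name | Field Value | Field Score
-----------+-------------+------------
1          | AA          | 0.1
7          | BB          | 0.2
8          | CC          | 0.3

Document 1:

Field Name | Field Value | Field Score
-----------+-------------+------------
1          | AA          | 0.2
2          | DD          | 0.8
7          | CC          | 0.999
10         | FFF         | 0.1

Document 2:

Field Name | Field Value | Field Score
-----------+-------------+------------
7          | BB          | 0.3
8          | CC          | 0.5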

The scores should be:
Score(q,d1) = FIELD_1_SCORE_Q * FIELD_1_SCORE_D1 = 0.1 * 0.2 = 0.02
Score(q,d2) = FIELD_7_SCORE_Q * FIELD_7_SCORE_D2 + FIELD_8_SCORE_Q * FIELD_8_SCORE_D2 = (0.2 * 0.3) + (0.3 * 0.5) = 0.21
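
As a reference point for the arithmetic above, here is a minimal sketch of the intended scoring in plain Java, with no Lucene involved. The class name DotProductScore, the Entry type and the example* helpers are hypothetical names made up for this illustration; the data is the query and the two documents from the example. Note that field 7 in Document 1 holds CC while the query asks for BB, so it contributes nothing.

```java
import java.util.Arrays;
import java.util.List;

public class DotProductScore {

    /** One weighted entry: field name, single-token value, and weight. (Hypothetical helper type.) */
    static final class Entry {
        final String field;
        final String value;
        final double weight;

        Entry(String field, String value, double weight) {
            this.field = field;
            this.value = value;
            this.weight = weight;
        }
    }

    /**
     * Sums queryWeight * docWeight over every pair of entries that share
     * both the field name and the field value.
     */
    static double score(List<Entry> query, List<Entry> doc) {
        double sum = 0.0;
        for (Entry q : query) {
            for (Entry d : doc) {
                if (q.field.equals(d.field) && q.value.equals(d.value)) {
                    sum += q.weight * d.weight;
                }
            }
        }
        return sum;
    }

    // Data taken from the example tables in the post.
    static List<Entry> exampleQuery() {
        return Arrays.asList(
                new Entry("1", "AA", 0.1),
                new Entry("7", "BB", 0.2),
                new Entry("8", "CC", 0.3));
    }

    static List<Entry> exampleDoc1() {
        return Arrays.asList(
                new Entry("1", "AA", 0.2),
                new Entry("2", "DD", 0.8),
                new Entry("7", "CC", 0.999),
                new Entry("10", "FFF", 0.1));
    }

    static List<Entry> exampleDoc2() {
        return Arrays.asList(
                new Entry("7", "BB", 0.3),
                new Entry("8", "CC", 0.5));
    }

    public static void main(String[] args) {
        System.out.println("Score(q,d1) = " + score(exampleQuery(), exampleDoc1()));
        System.out.println("Score(q,d2) = " + score(exampleQuery(), exampleDoc2()));
    }
}
```

Because the weights are binary floating-point values, the printed results may differ from 0.02 and 0.21 in the last digits, so any comparison against expected scores should use a small tolerance.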

What would be the best way to implement this, in terms of accuracy and performance? (I don't need the TF and IDF calculations.)

I currently implement it by setting boosts on the fields and on the query terms.
Then I overrode the DefaultSimilarity class:

public class MySimilarity extends DefaultSimilarity {

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return state.getBoost();
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1;
    }

    @Override
    public float tf(float freq) {
        return 1;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }

}

Based on http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html this should work.
Problems:
1. Performance: I am calculating all the TF/IDF values and norms for nothing...
2. The score I get from the TopScoreDocCollector is not the same as the one I get from the Explanation.
Here is part of my code:

indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
TopScoreDocCollector collector = TopScoreDocCollector.create(iTopN, true);
indexSearcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = indexSearcher.doc(docId);
    double score = hits[i].score;
    String id = d.get(FIELD_ID);
    Explanation explanation = indexSearcher.explain(query, docId);
}

Thanks!

RE: Custom lucene scoring - Dot product between field boost and query boost

Yuval Kesten
The same question is formatted nicer here:
http://stackoverflow.com/questions/9380188/custom-lucene-scoring-dot-product-between-field-boost-and-query-boost

Thanks!


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Custom lucene scoring - Dot product between field boost and query boost

Em
In reply to this post by Yuval Kesten
Hi Yuval,

> 1. Performances: I am calculating all the TF/IDF stuff and NORMS for
> nothing...
You aren't calculating that much, since you declared all those values as
constants. What are you worried about?

> 2. The score I get from the TopScoreDocCollector is not the same as I
> get from the Explanation.
> Here is part of my code:
Could you provide us the code where you are setting the Similarity, please?

Kind regards,
Em

RE: Custom lucene scoring - Dot product between field boost and query boost

Yuval Kesten
Hi Em,
1. Regarding performance: the Similarity class (and my subclass as well) receives the IDF, TF and sum-of-squared-weights calculations as inputs; the methods just factor them differently. Even though I ignore the values, they are still being computed.
2. I have written this code:
    static {
        Similarity.setDefault(new MySimilarity());
    }
This means that I am setting the default similarity before doing the indexing, and therefore also before searching.
Thanks!

Re: Custom lucene scoring - Dot product between field boost and query boost

Em
Hi Yuval,

> 1. Regarding the performances - the similarity class (And my subtype
> as well) gets the IDF and TF and SQUARED SUMS calculations as inputs -
> they just factor them differently. Even though I ignore the values they
> are being computed.

Good point. However, I think these values are relatively cheap to compute and
nothing to worry about, unless they measurably harm your performance.

> 2. I have written this code:
>     static {
>         Similarity.setDefault(new MySimilarity());
>     }

What class do you get back when you call getSimilarity() on your searcher?

Could you please provide us with the output of your scores and your
Explanations?

Regards,
Em


Re: Custom lucene scoring - Dot product between field boost and query boost

Alan Woodward
In reply to this post by Yuval Kesten
Hi Yuval,

You can just override Similarity, rather than DefaultSimilarity - that way you don't burn any CPU cycles on TF/IDF calculations.

Alan

RE: Custom lucene scoring - Dot product between field boost and query boost

Yuval Kesten
Hi all,
Inspired by another thread here (Question about CustomScoreQuery), I am now using the following solution, which works really well (with one drawback).
I discovered that some of my problems were due to a wrong assumption:
I did have many fields/query terms with the same field ID.
This ruined my approach, because the query boosts were aggregated and my calculations came out wrong.

What I did: during indexing, I appended the field value to the field ID (joined by '_') and used the desired score as the field value.

At search time I use a plain FieldScoreQuery (as-is, no modifications needed) with the composite field ID.
Here I can still use setBoost to set the query-side score, because my fields are now unique.

Logic-wise this is perfect: a dot product using Lucene.

Drawback: lots and lots of distinct field names, which affects memory usage dramatically.

If anyone has better ideas, please share!
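
To make the composite-key idea concrete, here is a small plain-Java sketch of the same logic outside Lucene. It is only an illustration under assumptions: the class CompositeKeyScore and its key and score methods are hypothetical names, and a HashMap stands in for the index, where the key plays the role of the composite field name and the map value plays the role of the stored weight.

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeKeyScore {

    /** Builds the composite name "fieldId_value", mirroring the '_' concatenation described above. */
    static String key(String fieldId, String value) {
        return fieldId + "_" + value;
    }

    /** Dot product over the composite keys that occur in both the query and the document. */
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> q : query.entrySet()) {
            Double docWeight = doc.get(q.getKey());
            if (docWeight != null) {
                sum += q.getValue() * docWeight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Field ID 7 occurs twice in this document with different values;
        // the composite keys keep the two weights apart.
        Map<String, Double> doc = new HashMap<>();
        doc.put(key("7", "BB"), 0.3);
        doc.put(key("7", "CC"), 0.999);
        doc.put(key("8", "CC"), 0.5);

        Map<String, Double> query = new HashMap<>();
        query.put(key("7", "BB"), 0.2);
        query.put(key("8", "CC"), 0.3);

        // Only 7_BB and 8_CC match: (0.2 * 0.3) + (0.3 * 0.5)
        System.out.println(score(query, doc));
    }
}
```

One caveat with a plain '_' separator: if a field ID or a value can itself contain '_', two different pairs can produce the same key (for example "a_b" + "c" and "a" + "b_c"), so a separator that cannot occur in either part is safer.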

RE: Custom lucene scoring - Dot product between field boost and query boost

Yuval Kesten
One important thing:
Since the weight is now the value of the field, I no longer use the indexed fields' norms, so I am now indexing the fields with:
Field field = new Field(field_name, Float.toString(weight), Store.YES, Index.NOT_ANALYZED_NO_NORMS);
And the memory usage is back to normal... So cool!

-----Original Message-----
From: Yuval Kesten [mailto:[hidden email]]
Sent: Wednesday, February 22, 2012 7:29 PM
To: [hidden email]
Subject: RE: Custom lucene scoring - Dot product between field boost and query boost

Hi all,
Inspired by another thread here (Question about CustomScoreQuery), I am now using a solution that works really well (with one drawback).
I discovered that some of my problems came from a wrong assumption:
I did have many fields/query terms with the same field ID.
This ruined my approach because the query boosts were aggregated and my calculations were wrong.

During indexing, I appended the field value to the field ID (joined with '_') to form the field name, and used the desired score as the field value.

At search time I use a plain FieldScoreQuery (as-is, no modifications needed) on the composite field ID.
Here I can still use setBoost to set the query-side score because my fields are now unique.

Logic-wise this is perfect: a dot product using Lucene.
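
To make the intended arithmetic concrete, here is a minimal plain-Java model of the composite-key dot product. It is not Lucene code and does not call FieldScoreQuery; the class and method names are hypothetical, and the weights are the ones from the example in the original post.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the composite-key dot product; not Lucene code. */
final class CompositeKeyDotProduct {

    /** Builds the composite key, e.g. field ID "7" and value "BB" give "7_BB". */
    static String key(String fieldId, String fieldValue) {
        return fieldId + "_" + fieldValue;
    }

    /** Sums queryWeight * docWeight over the composite keys present on both sides. */
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> term : query.entrySet()) {
            Double docWeight = doc.get(term.getKey());
            if (docWeight != null) {
                sum += term.getValue() * docWeight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put(key("1", "AA"), 0.1);
        query.put(key("7", "BB"), 0.2);
        query.put(key("8", "CC"), 0.3);

        Map<String, Double> doc2 = new HashMap<>();
        doc2.put(key("7", "BB"), 0.3);
        doc2.put(key("8", "CC"), 0.5);

        // Computes (0.2 * 0.3) + (0.3 * 0.5) in floating point.
        System.out.println(score(query, doc2));
    }
}
```

Because a key only matches when both the field ID and the value agree, a document that has field 7 with value CC contributes nothing to a query term on field 7 with value BB. One caveat of this model: if a field ID or value can itself contain '_', two different pairs could map to the same key.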

Drawback: lots and lots of distinct field names, which affects memory usage dramatically.

If anyone has better ideas - please share!

-----Original Message-----
From: Alan Woodward [mailto:[hidden email]]
Sent: Wednesday, February 22, 2012 4:00 PM
To: [hidden email]
Subject: Re: Custom lucene scoring - Dot product between field boost and query boost

Hi Yuval,

You can just extend Similarity directly, rather than DefaultSimilarity - that way you don't burn any CPU cycles on TF/IDF calculations.

Alan

On 22 Feb 2012, at 07:17, Yuval Kesten wrote:

> Hi Em,
> 1. Regarding performance: the Similarity class (and my subclass as well) receives the TF, IDF, and sum-of-squared-weights values as inputs and only decides how to factor them. So even though I ignore the values, they are still being computed.
> 2. I have written this code:
>    static {
>        Similarity.setDefault(new MySimilarity());
>    }
> Which means that I am setting the default similarity before doing the indexing and obviously before the searching.
> Thanks!
>
> -----Original Message-----
> From: Em [mailto:[hidden email]]
> Sent: Tuesday, February 21, 2012 6:07 PM
> To: [hidden email]
> Subject: Re: Custom lucene scoring - Dot product between field boost and query boost
>
> Hi Yuval,
>
>> 1. Performances: I am calculating all the TF/IDF stuff and NORMS for
>> nothing...
> You aren't calculating that much, since you declared all those values as constants. What are you worried about?
>
>> 2. The score I get from the TopScoreDocCollector is not the same as I get from the Explanation.
>> Here is part of my code:
> Could you provide us the code where you are setting the Similarity, please?
>
> Kind regards,
> Em