No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?

Ravish Bhagdev
Hi All,

I was wondering whether omitNorms has any effect on the MLT handler at all.

I'm using schema version 1.2 with Solr 1.4 and have defined a couple of
fields that I want to use for MLT lookup. I don't want factors like
field length or TF/IDF to affect the scores. The definitions are below:

<fieldType name="lowercase" class="solr.TextField"
           positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_nonorms" class="solr.TextField"
           positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- and the fields that use the above field types are -->
<field name="PROFILE_TAGS" type="lowercase" indexed="true" stored="true"
       multiValued="true" termVectors="true"/>
<field name="PROFILE_TAGS_TXT" type="text_nonorms" indexed="true"
       stored="true" multiValued="true" termVectors="true"/>

In my solrconfig.xml I have defined the following for my MLT request handler:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
    <str name="mlt.qf">PROFILE_TAGS^10.0 PROFILE_TAGS_TXT^2.0</str>
    <int name="mlt.mindf">1</int>
    <int name="mlt.mintf">1</int>
    <str name="fl">id,score</str>
    <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
  </lst>
</requestHandler>


However, when I run my query as follows:
http://localhost:9090/solr/mlt?fl=*,score&start=0&q=id:4417454.matchRecord&qt=/mlt&fq=targetDB:ConnectMeDB&rows=1000&&debugQuery=on

the debug scoring info shows the following:

<str name="5042172.matchRecord">
0.17156276 = (MATCH) product of:
  1.4296896 = (MATCH) sum of:
    0.24737607 = (MATCH) weight(PROFILE_TAGS_TXT:system^5.0 in 1472), product of:
      0.06376338 = queryWeight(PROFILE_TAGS_TXT:system^5.0), product of:
        5.0 = boost
        3.8795946 = idf(docFreq=538, maxDocs=9598)
        0.0032871156 = queryNorm
      3.8795946 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:system in 1472), product of:
        1.0 = tf(termFreq(PROFILE_TAGS_TXT:system)=1)
        3.8795946 = idf(docFreq=538, maxDocs=9598)
        1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
    0.65193653 = (MATCH) weight(PROFILE_TAGS_TXT:adapt^5.0 in 1472), product of:
      0.10351306 = queryWeight(PROFILE_TAGS_TXT:adapt^5.0), product of:
        5.0 = boost
        6.298109 = idf(docFreq=47, maxDocs=9598)
        0.0032871156 = queryNorm
      6.298109 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:adapt in 1472), product of:
        1.0 = tf(termFreq(PROFILE_TAGS_TXT:adapt)=1)
        6.298109 = idf(docFreq=47, maxDocs=9598)
        1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
    0.530377 = (MATCH) weight(PROFILE_TAGS_TXT:optic^5.0 in 1472), product of:
      0.093365155 = queryWeight(PROFILE_TAGS_TXT:optic^5.0), product of:
        5.0 = boost
        5.6806736 = idf(docFreq=88, maxDocs=9598)
        0.0032871156 = queryNorm
      5.6806736 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:optic in 1472), product of:
        1.0 = tf(termFreq(PROFILE_TAGS_TXT:optic)=1)
        5.6806736 = idf(docFreq=88, maxDocs=9598)
        1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
  0.12 = coord(3/25)
</str>

This seems to suggest that TF/IDF scoring is still being applied to these
fields! Also, does it make any difference whether I specify omitNorms in the
<field> definition or in the <fieldType> definition?
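
To make the explain output above easier to read, here is a minimal Python sketch that recomputes the top-level score from its parts. It assumes the classic Lucene DefaultSimilarity formula idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduces the three idf values shown; the queryNorm, boost, docFreq and coord figures are copied from the output, and the function and variable names are made up for illustration.

```python
import math

# Values copied from the explain output.
QUERY_NORM = 0.0032871156
BOOST = 5.0
MAX_DOCS = 9598
DOC_FREQS = {"system": 538, "adapt": 47, "optic": 88}
COORD = 3 / 25  # 3 of the 25 query terms matched


def idf(doc_freq, max_docs):
    # Assumed formula: classic DefaultSimilarity idf.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(doc_freq, tf=1.0, field_norm=1.0):
    # weight = queryWeight * fieldWeight, as laid out in the explain tree.
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = BOOST * term_idf * QUERY_NORM
    field_weight = tf * term_idf * field_norm
    return query_weight * field_weight


total = COORD * sum(term_score(df) for df in DOC_FREQS.values())
print(total)
```

The result agrees with the reported 0.17156276 to about five decimal places. Note that tf and fieldNorm are both 1.0 for every term, which is what one would expect if omitTermFreqAndPositions and omitNorms are taking effect; the remaining variation comes from idf, which depends only on document frequency and is not controlled by either setting.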

I would appreciate any help with this.

Thanks,
Ravish

Re: No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?

Ravish Bhagdev
Ahh, is this because I have to override DefaultSimilarity to turn off
TF/IDF scoring? But wouldn't that apply to all fields, including general
search on text fields? Is there a way to apply a custom similarity only to
specific field types or fields? Is there no way of turning TF/IDF off
without this?
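
As a rough illustration of what such an override would change, the Python sketch below models only the arithmetic, not the Lucene API (in Lucene the override itself would be a Java subclass of DefaultSimilarity). It compares the assumed classic idf with a hypothetical flat idf that always returns 1.0. tf and fieldNorm are taken as 1.0, and queryNorm is left out because it is the same for every document and so does not change ranking. All names here are made up for the example.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed classic DefaultSimilarity idf.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def flat_idf(doc_freq, max_docs):
    # Hypothetical replacement: every term counts the same, however rare.
    return 1.0


def doc_score(matched_doc_freqs, total_query_terms, idf_fn,
              max_docs=9598, boost=5.0):
    # idf enters twice: once in queryWeight and once in fieldWeight.
    per_term = [boost * idf_fn(df, max_docs) ** 2 for df in matched_doc_freqs]
    coord = len(matched_doc_freqs) / total_query_terms
    return coord * sum(per_term)


rare = [47]     # a document matching one rare term
common = [538]  # a document matching one common term

print(doc_score(rare, 25, classic_idf) > doc_score(common, 25, classic_idf))  # True
print(doc_score(rare, 25, flat_idf) == doc_score(common, 25, flat_idf))       # True
```

With the flat idf, the score depends only on how many query terms matched and on their boosts, which is the behaviour the question is after.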

Thanks,
Ravish

On Mon, May 21, 2012 at 10:24 AM, Ravish Bhagdev <[hidden email]> wrote:

> Hi All,
>
> I was wondering if omitNorms will have any effect on MLT handler at all?
> [...]

Re: No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?

Ravish Bhagdev
I found this:

https://issues.apache.org/jira/browse/LUCENE-2236

So it seems this feature (applying a similarity per field) is not supported
in Solr 1.4 at all. Is there any possible workaround? If not, I'll have to
consider splitting my schema into two, which would be quite a big change :(

- Ravish

On Mon, May 21, 2012 at 11:03 AM, Ravish Bhagdev <[hidden email]> wrote:

> Ahh, this is because I have to override DefaultSimilarity to turn off
> tf/idf scoring?  But this will apply to all the fields and general search
> on text fields as well?  Is there a way to apply custom similarity to
> specific field types or fields only?  Is there no way of turning TF/IDF off
> without this?
>
> Thanks,
> Ravish