Using dismax to find multiple terms across multiple fields


Stephanie Belton
Hello,

I am using Solr to index and search documents in Russian. I have successfully set up the RussianAnalyzer, but found that it eliminates some tokens such as numbers. I am therefore indexing my text fields in two ways: once with a fairly literal version of the text, using something similar to textTight in the example config:

    <fieldtype name="text_literal" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>
 
And I index my fields again using the RussianAnalyzer to cover the Russian stemming and stop words:
    <fieldtype name="text_ru_RU" class="solr.TextField"  >
      <analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/>
    </fieldtype>

I then specify my field names:
   <dynamicField name="*_ru_RU"   type="text_ru_RU" indexed="true" stored="false"/>
   <dynamicField name="*_literal" type="text_literal" indexed="true" stored="false"/>

And use the copyField feature to index them twice:
   <copyField source="title_ru_RU"         dest="title_literal"    />
   <copyField source="location_ru_RU"   dest="location_literal" />
   <copyField source="body_ru_RU"       dest="body_literal"     />

I then specify my own DisMaxRequestHandler in solrconfig.xml:
  <requestHandler name="dismax_ru_RU" class="solr.DisMaxRequestHandler" >
    <lst name="defaults">
     <float name="tie">0.01</float>
     <str name="qf">
        title_literal^1.5 title_ru_RU^1.3 body_literal^1.0 body_ru_RU^0.8 location_literal^0.5 location_ru_RU^0.4  </str>
     <str name="pf">
        title_literal^1.5 title_ru_RU^1.3 body_literal^1.0 body_ru_RU^0.8 location_literal^0.5 location_ru_RU^0.4  </str>
     <str name="mm">
        100%
     </str>
     <int name="ps">100</int>
    </lst>
  </requestHandler>
 
Because I am searching through classified ads, date sorting is more important to me than relevance, so I am sorting by date first and then by score. I expect the system to return all matches for today's ads sorted by relevance, followed by matches for yesterday's ads sorted by relevance, etc.

I would also like the search to only return ads where every single term of the query was found across my 3 fields (title, body, location). I can't seem to get this to work. When I do a search for '1970', it works fine and returns 2 ads containing 1970. If I search for 'Ташкент' I get 3 results, including one with Russian stemming (Ташкента). But when I do a search for '1970 Ташкент' it seems to ignore 1970 and gives me the same results as searching for 'Ташкент' alone. I got it to display the debug info, and 1970 seems to be ignored in the matching:

<lst name="debug">
 <str name="rawquerystring">"1970 Ташкент"</str>
 <str name="querystring">"1970 Ташкент"</str>
 <str name="parsedquery">+DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01) DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01)</str>
 <str name="parsedquery_toString">+(body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01 (body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01</str>
 <lst name="explain">
  <str name="id=€#26;੥,internal_docid=4">
0.7263521 = (MATCH) sum of:
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=4)
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=4)
</str>
  <str name="id=€#26;ી,internal_docid=9">
0.7263521 = (MATCH) sum of:
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=9)
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=9)
</str>
  <str name="id=€#26;੕,internal_docid=2">
0.43162674 = (MATCH) sum of:
  0.21581337 = (MATCH) max plus 0.01 times others of:
    0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
      0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
        0.8 = boost
        4.901973 = idf(docFreq=1)
        0.044906225 = queryNorm
      1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
        1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
        4.901973 = idf(docFreq=1)
        0.25 = fieldNorm(field=body_ru_RU, doc=2)
  0.21581337 = (MATCH) max plus 0.01 times others of:
    0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
      0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
        0.8 = boost
        4.901973 = idf(docFreq=1)
        0.044906225 = queryNorm
      1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
        1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
        4.901973 = idf(docFreq=1)
        0.25 = fieldNorm(field=body_ru_RU, doc=2)
</str>
 </lst>

Apologies for the verbosity; can anyone help me achieve my goal?

Thanks
Stephanie



Re: Using dismax to find multiple terms across multiple fields

Yonik Seeley-2
On 11/30/06, Stephanie Belton <[hidden email]> wrote:
> I am using Solr to index and search documents in Russian. I have successfully set up the RussianAnalyzer but found that it eliminates some tokens such as numbers.

You can get better control (and avoid having numbers removed)
by using TokenFilters instead of analyzers.

You might be able to use the Porter stemmer for Russian (though I don't
know how it compares to the one you are using):

    <filter class="solr.SnowballPorterFilterFactory" language="Russian" />

Here is a portion of the code from RussianAnalyzer.java:
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new RussianLetterTokenizer(reader, charset);
        result = new RussianLowerCaseFilter(result, charset);
        result = new StopFilter(result, stopSet);
        result = new RussianStemFilter(result, charset);
        return result;
    }

You could easily create FilterFactories for these Russian-specific ones,
and then use them just like the other factories included in Solr.

It's probably the RussianLetterTokenizer that is throwing away numbers.
Assuming Russian uses normal whitespace, you might be able to use the
WhitespaceTokenizer instead.
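
Putting those pieces together, a fieldtype along these lines might keep numeric tokens like 1970 while still stemming (just a sketch: the Snowball filter is the stock one mentioned above, `stopwords_ru.txt` is a hypothetical stop-word file, and how closely the Snowball stemmer matches the RussianAnalyzer's output is untested):

```xml
<fieldtype name="text_ru" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- whitespace tokenization keeps numbers like 1970 as tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_ru.txt: hypothetical Russian stop-word list -->
    <filter class="solr.StopFilterFactory" words="stopwords_ru.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
  </analyzer>
</fieldtype>
```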


> I would also like the search to only return ads where every single term of the query was found across my 3 fields (title, body, location). I can't seem to get this to work.  When I do a search for '1970', it works fine and returns 2 ads containing 1970. If I search for 'Ташкент' I get 3 results incl. one with Russian stemming (Ташкента). But when I do a search for '1970 Ташкента' it seems to ignore 1970 and give me the same results as only looking for 'Ташкент'. I got it to display the debug info and 1970 seems to be ignored in the matching:

You are including the russian stemmed fields in the dismax query, and
the analysis of those fields discards numbers, hence 1970 is ignored,
right?  Either querying only the literals, or fixing the stemmed text
to not discard numbers may help (or get you further along at least).


-Yonik



Re: Using dismax to find multiple terms across multiple fields

Chris Hostetter-3

: You are including the russian stemmed fields in the dismax query, and
: the analysis of those fields discards numbers, hence 1970 is ignored,
: right?  Either querying only the literals, or fixing the stemmed text
: to not discard numbers may help (or get you further along at least).

More specifically, you are putting your entire search in quotes, which is
causing it to be treated as a single searchable entity across several
fields -- a single DisjunctionMaxQuery, instead of multiple disjunctions
wrapped in a boolean.  When that quoted chunk of text is analyzed by your
Russian analyzer the 1970 is stripped out, so docs that match just the word
parts in the Russian field are considered matches.

If you don't use quotes around your input, then the dismax parser will ask
each of the analyzers for the various fields to analyze the whitespace-
separated words independently, and you will get one dismax clause for
each, including one clause just for "1970" across each of the fields.

Although it does not appear to be the source of your problem at the
moment, you may also find yourself getting similar "partial matches" if
you don't explicitly set the "mm" param; it determines the minimum number
of clauses that must match for a query to be considered a hit ... if you set it
to "100%" then every part of the input must match something (although the
matches can still come from any of the various qf fields)
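
So, concretely, the fix is to drop the quotes from the q parameter. A request against the handler defined earlier in this thread might look like this (hypothetical host and port; the Cyrillic would be URL-encoded in practice):

```
http://localhost:8983/solr/select?qt=dismax_ru_RU&q=1970+Ташкент&debugQuery=on
```

With mm already at 100% in the handler defaults, both '1970' and 'Ташкент' must then match in at least one of the qf fields.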


-Hoss


RE: Using dismax to find multiple terms across multiple fields

Stephanie Belton-2
In reply to this post by Yonik Seeley-2
Thank you for your message Yonik, that was very helpful. I didn't have much luck with the SnowballPorterFilterFactory, so I wrote my own factory last night, and as you said it gives me much more flexibility. Here it is for anyone who's interested:

package myApp;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianStemFilter;
import org.apache.lucene.analysis.ru.RussianCharsets;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class RussianStemFilterFactory extends BaseTokenFilterFactory {
   public TokenStream create(TokenStream input) {
      // Default to the Unicode charset; KOI8 or CP1251 can be selected
      // via a "charset" argument on the filter element in schema.xml.
      String charsetName = getArgs().get("charset");
      char[] charset = RussianCharsets.UnicodeRussian;
      if (charsetName != null && charsetName.equals("KOI8"))   charset = RussianCharsets.KOI8;
      if (charsetName != null && charsetName.equals("CP1251")) charset = RussianCharsets.CP1251;
      return new RussianStemFilter(input, charset);
   }
}
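
To wire it up, the factory can then be referenced from schema.xml like any stock filter (a sketch; the fieldtype name is illustrative, and the charset argument is optional since the factory above defaults to UnicodeRussian):

```xml
<fieldtype name="text_ru_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- custom factory from the code above; charset may be set to KOI8 or CP1251 -->
    <filter class="myApp.RussianStemFilterFactory"/>
  </analyzer>
</fieldtype>
```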







RE: Using dismax to find multiple terms across multiple fields

Stephanie Belton-2
In reply to this post by Chris Hostetter-3
: more specificly, you are putting your entire search in quotes, which is
: causing it to be treated as a single searchable entity across several
: fields -- a single DisjunctionMaxQuery, instead of multiple disjunctions
: wrapped in a boolean.  When that quoted chunk of text is analyzed by your
: russian analyzer the 1970 is strpped out, so docs that match just the word
: parts in the russian field are considered matches.

Thank you Chris for the explanation, I managed to get this to work now. I am
fine-tuning the settings for dismax and I am not sure about some of the
parameters. In the example config there is a param called 'fl' which is
not documented here:
http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequestHandler.html

<str name="fl">
id,name,price,score
</str>

What is its purpose?

Also, the client should be able to specify a value for the category_id field,
which I use for faceted browsing. I know how to do this using the Lucene
query syntax with the StandardRequestHandler, but how is it done with dismax?




Re: Using dismax to find multiple terms across multiple fields

Yonik Seeley-2
On 12/2/06, Stephanie Belton <[hidden email]> wrote:

> : more specificly, you are putting your entire search in quotes, which is
> : causing it to be treated as a single searchable entity across several
> : fields -- a single DisjunctionMaxQuery, instead of multiple disjunctions
> : wrapped in a boolean.  When that quoted chunk of text is analyzed by your
> : russian analyzer the 1970 is strpped out, so docs that match just the word
> : parts in the russian field are considered matches.
>
> Thank you Chris for the explanation, I managed to get this to work now. I am
> fine tuning the settings for dismax and I am not sure about some of the
> parameters. In the example of config there is a param called 'fl' which is
> not documented here:
> http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequ
> estHandler.html
>
> <str name="fl">
> id,name,price,score
> </str>
>
> What is its purpose?

Which stored fields to return.
It's one of the standard "common" parameters that dismax and the
standard request handler share (it's linked from the dismax wiki):
http://wiki.apache.org/solr/CommonQueryParameters

> Also, the client should be able to specify a value for the category_id field
> which I use for fecetted browsing, I know how to do this using the Lucene
> query syntax with the StandardRequestHandler but how is it done with dismax?

If you are trying to restrict your results to some value of
category_id, use a filter...
fq=category_id:10

If you want categorized counts by category_id, use the built-in
faceted browsing.
http://wiki.apache.org/solr/SimpleFacetParameters
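
Combining the two, a full request might look like this (a hypothetical example; host and port are illustrative, and the handler name dismax_ru_RU comes from the config earlier in the thread):

```
http://localhost:8983/solr/select?qt=dismax_ru_RU&q=Ташкент&fq=category_id:10&facet=true&facet.field=category_id&fl=id,score
```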

-Yonik