edismax parser ignores mm parameter when tokenizer splits tokens (e.g. CJK)

Tom Burton-West-2
We are using the edismax query parser with mm=100%.  However, when a CJK
query (ABC) is tokenized by the CJKBigramFilter into [AB] [BC], it is
searched as a Boolean "OR" query instead of the Boolean AND of [AB] AND [BC]
that we expect with mm=100%.

For example, searching for "Daya Bay" 大亚湾 (which is tokenized to 大亚 亚湾), we
get about 10,000 results.
If we instead manually segment the Chinese characters for Daya Bay and
enter the query ["大亚" "亚湾"], we get about 5,000 results.
(Our default Boolean operator is also "AND".)
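
To make the two interpretations concrete, here is a minimal Python sketch (not Solr code). It assumes plain overlapping character bigrams, which only approximates what the CJKBigramFilter does; the helper names and the three toy documents are hypothetical.

```python
def cjk_bigrams(text):
    """Return overlapping character bigrams, e.g. 'ABC' -> ['AB', 'BC'].

    Assumption: a simplified stand-in for CJKBigramFilter, not its real logic.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(doc_tokens, query_tokens, require_all):
    """Boolean match: AND when require_all is True, otherwise OR."""
    hits = [t in doc_tokens for t in query_tokens]
    return all(hits) if require_all else any(hits)


# Hypothetical toy documents, each already reduced to a set of bigrams.
docs = {
    "d1": {"大亚", "亚湾"},  # contains both bigrams
    "d2": {"大亚"},          # contains only the first bigram
    "d3": {"亚湾"},          # contains only the second bigram
}

query = cjk_bigrams("大亚湾")

# OR semantics (what we observe) versus AND semantics (what mm=100% should give).
or_hits = sorted(d for d, toks in docs.items() if matches(toks, query, False))
and_hits = sorted(d for d, toks in docs.items() if matches(toks, query, True))
```

With OR semantics all three toy documents match, while with AND semantics only the document containing both bigrams matches, which is the kind of gap we see between the two result counts.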

This problem also occurs with non-CJK queries; for example, [two-thirds]
turns into a Boolean OR query ([two] OR [thirds]).

Is there some way to tell the edismax query parser to stick with mm=100%?

Appended below are the debugQuery output for these two queries and an
excerpt from our schema.xml.


Tom

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


Entered as [大亚湾] in Just Full Text

<str name="rawquerystring">
 _query_:"{!edismax qf='ocr^500000 ' pf='' mm='100%' tie='0.9' } 大亚湾"
</str>
<str name="querystring">
 _query_:"{!edismax qf='ocr^500000 ' pf='' mm='100%' tie='0.9' } 大亚湾"
</str>
<str name="parsedquery">
+DisjunctionMaxQuery((((ocr:大亚 ocr:亚湾)^500000.0))~0.9)
</str>
-----


Entered as two phrases ["大亚" "亚湾"] in Just Full Text
We get 4909 hits.  This is what I was expecting with the bigrams above.

<lst name="debug">
<str name="rawquerystring">
 _query_:"{!edismax qf='ocr^500000 ' pf='' mm='100%' tie='0.9' } \"大亚\" \"亚湾\""
</str>
-
<str name="querystring">
 _query_:"{!edismax qf='ocr^500000 ' pf='' mm='100%' tie='0.9' } \"大亚\" \"亚湾\""
</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((ocr:大亚^500000.0)~0.9)
DisjunctionMaxQuery((ocr:亚湾^500000.0)~0.9))~2)
</str>
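
One possible reading of the two parsedquery values: in the first, both bigrams sit inside a single top-level clause, so mm=100% has only one clause to count; in the second, each phrase is its own top-level clause and the ~2 requires both. The Python sketch below models that reading only. It is not how Solr is implemented, the function names are hypothetical, and rounding a percentage mm down is an assumption.

```python
def clause_matches(doc_tokens, clause_terms):
    """A top-level clause matches if any of its inner terms is present (OR)."""
    return any(t in doc_tokens for t in clause_terms)


def query_matches(doc_tokens, clauses, mm_percent=100):
    """Apply mm to the number of top-level clauses only.

    Assumption: a percentage mm is rounded down to a whole clause count.
    """
    needed = len(clauses) * mm_percent // 100
    matched = sum(clause_matches(doc_tokens, c) for c in clauses)
    return matched >= needed


# First query: one top-level clause holding both bigrams.
unsegmented = [["大亚", "亚湾"]]
# Second query: two top-level clauses, one bigram each.
segmented = [["大亚"], ["亚湾"]]

# Hypothetical document containing only the first bigram.
doc = {"大亚"}

unsegmented_hit = query_matches(doc, unsegmented)  # 1 of 1 clauses needed
segmented_hit = query_matches(doc, segmented)      # 2 of 2 clauses needed
```

Under this model the document with only one bigram matches the unsegmented query but not the segmented one, which is consistent with the larger hit count for [大亚湾].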


---
<fieldType name="CJKFullText" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
katakana="false" hangul="false"/>