Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?

Scott Gonyea
Wow, this is probably the most annoying Solr issue I've *ever* dealt
with. First question: How do I debug Dismax, and its query handling?

Issue: When I query against this StrField, I am attempting to do an
*exact* match...  Albeit one that is case-insensitive :).  So, 90%
exact.  It works in a majority of cases.  Indeed, I am telling Solr
that this field is my uniqueKey and it enforces uniqueness
perfectly.  The issue comes about when I try to query a document,
based on a key in this field, and the key I'm using has hyphens
(dashes) in it.  Then I get zero results.  Very frustrating.

The keys will always be a URL.  For example,
"http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry"

Here's my configuration info...  schema.xml (the URL exists twice;
once in the 'id' field, of type 'idstr', for uniqueness, and once in
the 'url' field below. I am querying against the 'id' field):

    <fieldType name="idstr" class="solr.StrField">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="(.*)" group="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- <tokenizer  class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- snip -->
    <field name="id" type="idstr" stored="true" indexed="true" required="true"/>
    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <!-- snip -->
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="AND"/>


Yes, the PatternTokenizerFactory is inefficient for doing what I
wanted above. It was a quick hack while I looked for a proper way to
do this, i.e., match the exact, WHOLE string... but in lower case.

Here's my solrconfig.xml:


<requestHandler name="/rb" class="solr.SearchHandler" default="true"  >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content&#94;1.5 anchor&#94;0.3 title&#94;1.2 mcode&#94;1.0 site_id&#94;1.0 priority&#94;1.0</str>
    <str name="fl"> * </str>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">content title</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.content.hl.fragmenter">regex3</str>
  </lst>
</requestHandler>
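
On the "how do I debug Dismax" question, one option is Solr's standard debugQuery request parameter, which adds a debug section to the response showing how the dismax input was parsed (parsedquery) and which qf fields it was searched against. It is normally passed per request as &debugQuery=true. The fragment below is only a sketch of turning it on by default in a separate handler; the handler name "/rb-debug" and the shortened qf list are illustrative assumptions, not part of the configuration above.

```xml
<!-- Hypothetical debugging handler: the name "/rb-debug" is made up for this sketch -->
<requestHandler name="/rb-debug" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Echo every parameter in effect, including these defaults -->
    <str name="echoParams">all</str>
    <!-- Same effect as appending &debugQuery=true to each request -->
    <str name="debugQuery">true</str>
    <!-- Shortened qf for illustration; use the real field list when comparing results -->
    <str name="qf">content^1.5 title^1.2</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```

Comparing the parsedquery output for a key with hyphens against one without should show what the parser actually did with the input.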


And, finally, when I run that sample URL through the query analyzer...
 here's the output (copied from the HTML)... I appreciate any/all help
anyone can provide.  Seriously.  I'll love you forever :(  :


<h3>Index Analyzer</h3>
<h4>org.apache.solr.analysis.PatternTokenizerFactory   null</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="debugdata">http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
<h4>org.apache.solr.analysis.LowerCaseFilterFactory   {}</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="highlight">http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
<h3>Query Analyzer</h3>
<h4>org.apache.solr.analysis.PatternTokenizerFactory   null</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="debugdata">http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
<h4>org.apache.solr.analysis.LowerCaseFilterFactory   {}</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="debugdata">http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>

Re: Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?

iorixxx
>
>     <fieldType
> name="idstr"   class="solr.StrField">
>       <analyzer>
>         <tokenizer
> class="solr.PatternTokenizerFactory" pattern="(.*)"
> group="1"/>
>           <filter
> class="solr.LowerCaseFilterFactory"/>
>       </analyzer>

This definition is invalid. You cannot use a charFilter, tokenizer, or token filter with solr.StrField.

Interestingly (I just tested), analysis.jsp (1.4.1) displays output as if it's working. But if you look at /schema.jsp you will see that the real indexed values are not lowercased.

You can use this definition instead:

<fieldType name="idstr" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.TrimFilterFactory"/>  
   <filter class="solr.LowerCaseFilterFactory"/>  
 </analyzer>
</fieldType>
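
As a sketch of how this fits the schema from the original post: because the replacement type keeps the name idstr, the existing id field declaration should not need to change, although documents indexed under the old definition would have to be re-indexed before the new analysis takes effect. The comments describe what each stage is generally understood to do, using the URL from the question as the sample value.

```xml
<fieldType name="idstr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emits the entire field value as one token, hyphens and slashes included -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Strips leading and trailing whitespace from that single token -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- "http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry" becomes
         "http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Field declaration as in the original post; it picks up the new type by name -->
<field name="id" type="idstr" stored="true" indexed="true" required="true"/>
```

Only the indexed term is trimmed and lowercased; the stored value returned in results keeps its original case.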




Re: Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?

Scott Gonyea
Wow, that's pretty infuriating.  Thank you for the suggestion.  I
added it to the Wiki, with the hope that if it contains misinformation
then someone will correct it and, consequently, save me from another
one of these experiences :)  (...and to also document that, hey, there
is a tokenizer which treats the entire field as an exact value.)

Will go this route and re-index everything back into Solr...again...sigh.

Scott

On Mon, Oct 4, 2010 at 10:07 AM, Ahmet Arslan <[hidden email]> wrote:

>>
>>     <fieldType
>> name="idstr"   class="solr.StrField">
>>       <analyzer>
>>         <tokenizer
>> class="solr.PatternTokenizerFactory" pattern="(.*)"
>> group="1"/>
>>           <filter
>> class="solr.LowerCaseFilterFactory"/>
>>       </analyzer>
>
> This definition is invalid. You cannot use charfilter/tokenizer/tokenfilter with solr.StrField.
>
> But it is interesting that (i just tested) analysis.jsp (1.4.1) displays as if its working. But if you observe at /schema.jsp you will see that real indexed values are not lowercased.
>
> You can use this definition instead:
>
> <fieldType name="idstr" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>   <tokenizer class="solr.KeywordTokenizerFactory"/>
>   <filter class="solr.TrimFilterFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldType>
>
>
>
>