Odd query result

Charlie Jackson
I've got an odd scenario with a query a user is running. The user is
searching for the term "I-Car". The query hits if the document contains the
term "I-CAR" (all caps) but not if it contains "I-Car". When I put both terms
into the analysis page, the resulting tokens look identical, and my
"I-Car" query tokens match either term.

Here's the definition of the field:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

 

I'm pretty sure this has to do with the settings on the
WordDelimiterFilterFactory, but I must be missing something, because I
don't see anything that would cause the behavior I'm seeing.


Re: Odd query result

Tom Hill-7
When I run it with that fieldType, it seems to work for me. Here's a sample
query output:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">17</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="start">0</str>
  <str name="q">xtext:I-Car</str>
  <str name="version">2.2</str>
  <str name="rows">10</str>
 </lst>
</lst>
<result name="response" numFound="2" start="0">
 <doc>
  <str name="id">ALLCAPS</str>
  <str name="xtext">I-CAR</str>
 </doc>
 <doc>
  <str name="id">CAMEL</str>
  <str name="xtext">I-Car</str>
 </doc>
</result>
</response>


Did I miss something?

Could you show the output with debugQuery=on for the user's failing query?
Assuming I did this right, the next thing I'd look for is a copyField. Is the
user's query really being executed against this field?

Schema.xml could be useful, too.
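
To illustrate the copyField concern, here is a minimal sketch of a schema.xml fragment in which a query could end up running against a field with different analysis than expected. The field names body and body_exact are hypothetical, and the "string" type is assumed to be defined as in the stock example schema; nothing here is taken from the poster's actual schema.

```xml
<!-- Hypothetical fields: "body" uses the "text" fieldType shown earlier -->
<field name="body" type="text" indexed="true" stored="true"/>

<!-- Hypothetical destination field with a different type, so it is NOT
     run through WordDelimiterFilterFactory or LowerCaseFilterFactory -->
<field name="body_exact" type="string" indexed="true" stored="false"/>

<!-- copyField copies the raw input value; the destination field's own
     analysis is then applied to the copy -->
<copyField source="body" dest="body_exact"/>
```

If the user's query targeted a field like body_exact rather than body, "I-Car" and "I-CAR" would no longer be normalized to the same tokens, which is why it is worth confirming which field the query actually hits.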

Tom

On Tue, Apr 20, 2010 at 10:19 AM, Charlie Jackson <[hidden email]> wrote:

> I've got an odd scenario with a query a user's running. The user is
> searching for the term "I-Car". It will hit if the document contains the
> term "I-CAR" (all caps) but not if it's "I-Car".  When I throw the terms
> into the analysis page, the resulting tokens look identical, and my
> "I-Car" tokens hit on either term.

Re: Odd query result

MitchK
This has nothing to do with your problem, since it seems to work when Tom tested it.
However, it looks like you are using the same configuration for the index and query analyzers.
Unless you left something out to avoid confusing us (for example, your own filter implementations), you can simply drop the type="index" and type="query" definitions and keep a single analyzer. That one filter configuration is then applied at both index time and query time, so there is no need to specify two identical ones.

I think this would be easier to maintain in the future :).

Kind regards
- Mitch

The merged definition would look like this:

      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

Re: Odd query result

Tom Hill-7
I agree that, if they are the same, you want to merge them.

In this case, though, I don't think you want them to be the same. In particular,
you usually don't want catenateWords and catenateNumbers enabled both at index
time AND at query time. You generate the permutations on one side or the other,
but you don't need to do it on both. I usually do it at index time.
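
A minimal sketch of that asymmetric setup, assuming the rest of the chain stays exactly as posted: the index analyzer keeps catenateWords="1" and catenateNumbers="1", while the query analyzer switches both off. This is an illustration of the idea, not a tested configuration for this particular index.

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt"
          enablePositionIncrements="true"/>
  <!-- Catenation is off at query time; the catenated forms are already
       in the index from the index-time analyzer -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

With this query analyzer, a query for I-Car should be analyzed into just the parts i and car, which still match documents indexed with the catenating index analyzer.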

Tom

On Tue, Apr 20, 2010 at 11:29 AM, MitchK <[hidden email]> wrote:

>
> It has nothing to do with your problem, since it seems to work when Tom
> tested it.
> However, it seems like you are using the same configurations on query- and
> index-type analyzer.
> --
> View this message in context:
> http://n3.nabble.com/Odd-query-result-tp732958p733095.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

RE: Odd query result

Charlie Jackson
I'll take another look and see whether it makes sense to have the index and
query time parameters the same or different.

As for the initial issue, I think you're right, Tom: it is hitting on
both. I think what threw me off was the highlighting -- in one of my
matching documents, the term "I-CAR" is highlighted, but I think the query
actually hit on the text "ISHIN-I (car", which is also in the document.

The debug output for my query is:

<str name="rawquerystring">ft:I-Car</str>
<str name="querystring">ft:I-Car</str>
<str name="parsedquery">+MultiPhraseQuery(ft:"i (car icar)")</str>
<str name="parsedquery_toString">+ft:"i (car icar)"</str>

Thanks!

-----Original Message-----
From: Tom Hill [mailto:[hidden email]]
Sent: Tuesday, April 20, 2010 2:08 PM
To: [hidden email]
Subject: Re: Odd query result

I agree that, if they are the same, you want to merge them.

In this case, I don't think you want them to be the same. In particular, you
usually don't want to catenateWords and catenateNumbers both index time AND
at query time. You generate the permutations on one, or the other, but you
don't need to do it for both. I usually do it at index time

Tom
