Matching Queries with Wildcards and Numbers

Matching Queries with Wildcards and Numbers

Ellington Kirby
Hi! I am a Solr user having an issue with matches on searches using the
wildcard operators, specifically when the searches include a wildcard
operator with a number. Here is an example.

My query will look like (productTitle:*Sidem2*) and match nothing, when it
should be matching the productTitle Sidem2. However, searching for Sidem
will match the productTitle Sidem2. In addition, I have isolated it to only
fail to match when the productTitle has a number in it: for example, a query
for (productTitle:*Cupx Collapsed*) will correctly match the product Cupx
Collapsed.

I need to use the wildcard operators around the query so that an
auto-complete feature can be used, where if a user stops typing at a
certain point, a search will be executed on their input so far and it will
match the correct product titles.

I have looked all over, through the excellent book Solr In Action by
Grainger and Potter, through Stack Overflow and several blog posts, and have
not found anything on this specific issue. Common advice is to remove the
stemmer, which I have done. I have also added the
ReversedWildcardFilterFactory. Here is a copy of my schema for the specific
fieldType, if that is any help. Please let me know if anyone has any tips or
clues! I am not a very experienced Solr user and would really appreciate any
advice.


  <fieldType name="text_en_splitting" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- in this example, we will only use synonyms at query time
          <filter class="solr.SynonymFilterFactory"
                  synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
          -->
          <!-- Case insensitive stop word removal. -->
          <filter class="solr.StopFilterFactory"
                  ignoreCase="true"
                  words="lang/stopwords_en.txt"/>
          <!-- Concatenate characters and numbers by setting catenateAll
               to 1 - this will avoid problems with alphabetical sort -->
          <filter class="solr.WordDelimiterFilterFactory"
                  generateWordParts="1" generateNumberParts="1" catenateWords="1"
                  catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory"
                  protected="protwords.txt"/>
          <filter class="solr.ReversedWildcardFilterFactory"
                  withOriginal="true"
                  maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
                  maxFractionAsterisk="0"/>
      </analyzer>
      <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory"
                  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory"
                  ignoreCase="true"
                  words="lang/stopwords_en.txt"/>
          <!-- Concatenate characters and numbers by setting catenateAll
               to 1 - this will avoid problems with alphabetical sort -->
          <filter class="solr.WordDelimiterFilterFactory"
                  generateWordParts="1" generateNumberParts="1" catenateWords="0"
                  catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
                  preserveOriginal="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory"
                  protected="protwords.txt"/>
      </analyzer>
  </fieldType>


Thank you in advance!
--From a sincerely puzzled Solr user, Ellington Kirby

Re: Matching Queries with Wildcards and Numbers

Erick Erickson
This one's going to be confusing to explain.....

The ability of filters to operate on wildcarded terms at query time is limited
to some specific filters. If you're going into the code, see the
MultiTermAware-derived filters.

Most generally, only filters that do _not_ produce more than one output token
for a given input token can be MultiTermAware. Gibberish, I know, but bear
with me.

WordDelimiterFilterFactory is _NOT_ MultiTermAware because, you guessed it,
it can produce more than one token per input token at query time.
Specifically, in your example, at index time it'll produce the tokens "Sidem"
and "2".

However, at query time the wildcarded "Sidem2" is not split; the token just
passes through whole. And since that token is not in your index, it's not
found. Hmm, I wonder what the admin/analysis page would show here....

Anyway, you can probably get what you want by changing the index-time
definition of WDFF from catenateAll="0" to catenateAll="1". That will put
Sidem, 2, and Sidem2 in your index. Then, because query-time processing
for wildcards does _not_ break things up, Sidem2 will go through whole at
query time and the doc should be found.

Of course you have to reindex your docs after the change.
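
A minimal sketch of that change, assuming the index analyzer from the
original post is otherwise left exactly as posted. Only catenateAll on the
WordDelimiterFilterFactory differs; the inline comment is added here for
illustration and is not part of the original schema:

```xml
<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <!-- catenateAll="1" (was "0"): besides the parts Sidem and 2,
         also emit the joined token Sidem2 -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.ReversedWildcardFilterFactory"
            withOriginal="true"
            maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
            maxFractionAsterisk="0"/>
</analyzer>
```

After reindexing, the Analysis screen in the Solr admin UI is a reasonable
place to check that the index-side output for Sidem2 now includes the
lowercased token sidem2.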

Trying to allow wildcards for filters at query time that emit multiple
output tokens per input token is an utter and complete disaster.

HTH,
Erick


On Wed, Jun 17, 2015 at 10:56 AM, Ellington Kirby
<[hidden email]> wrote:

> Hi! I am a Solr user having an issue with matches on searches using the
> wildcard operators, specifically when the searches include a wildcard
> operator with a number. Here is an example.
> [...]

Re: Matching Queries with Wildcards and Numbers

rakeshaspl
In reply to this post by Ellington Kirby
Did you find a solution to the above issue?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Matching Queries with Wildcards and Numbers

rakeshaspl
In reply to this post by Ellington Kirby
Hi,
Did you find a solution to the above issue?
Br,
Rakesh




Re: Matching Queries with Wildcards and Numbers

tapan1707
Hello Rakesh,
As pointed out by Erick, changing catenateAll from 0 to 1 should work.
For context: generateWordParts="1" generates tokens for the word parts,
e.g. for i-pad it generates i and pad, and catenateWords="1" adds ipad.
Likewise, generateNumberParts="1" generates tokens for the number parts,
e.g. for 88-77 it generates 88 and 77, and catenateNumbers="1" adds 8877.
With catenateAll="1", Solr also joins word and number parts together, so it
generates the token Sidem2 (the query asked about in the original post).
Also, as Erick already pointed out, one has to reindex the documents so
that the index reflects the change and contains the new tokens.




Re: Matching Queries with Wildcards and Numbers

rakeshaspl
Hi tapan,

Please check below.

*Conf:-*

      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory"/>
      </analyzer>

*Check attached image for issue:-*

<http://lucene.472066.n3.nabble.com/file/t493858/1.png>

<http://lucene.472066.n3.nabble.com/file/t493858/2.png>

When I search with *ec*, the results are OK, and there are many results
starting with *ec 1*, as you can see in the first screenshot. But when I
search using *ec 1*, it returns weird results, as you can see in screenshot 2.

Br,
Rakesh  




Re: Matching Queries with Wildcards and Numbers

tapan1707
I think it should have worked. Could you share the results for both queries
with &debug=true?
Also, what's the result for ec1?


