Re: How to use stopwords, synonyms along with fuzzy match in a SOLR


Erick Erickson
Well, I’d start by adding debug=true; that’ll show you the parsed query as well as why certain documents scored the way they did. But do note that q=junk~ will search against the default text field (the “df” parameter in the request handler definition in solrconfig.xml). Is that what you’re expecting?

Or, I suppose, it’s searching against the fields defined in qf if you’re using (e)dismax as your query parser. Either way, the debug output (the parsed query part) will show what the actual search is.

You should also look at the admin/analysis page. For instance, the way you have the field defined at index time, it’ll break on whitespace only. So “junk.” (with the trailing period) won’t be treated as a stopword, because your stopword entry doesn’t contain the period.

Plus, your EdgeNGramFilterFactory is pretty strange. A min gram size of 1 means single-character prefixes get indexed, so you’re effectively searching for single characters.
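
To make the minGramSize point concrete, here is a rough Python sketch of what leading-edge n-gramming does to one token. It is an illustration only, not Solr's implementation; `edge_ngrams` is a hypothetical helper, and the defaults simply mirror the minGramSize="1" and maxGramSize="50" values from the quoted field type.

```python
def edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the leading-edge n-grams of a token, shortest first."""
    # Grams cannot be longer than the token itself.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# With min_gram=1 the single character "j" becomes an indexed term.
print(edge_ngrams("jack"))              # ['j', 'ja', 'jac', 'jack']
# A larger minimum keeps one- and two-character prefixes out of the index.
print(edge_ngrams("jack", min_gram=3))  # ['jac', 'jack']
```

Every indexed token contributes its one-character prefix, which is why a minimum of 1 tends to produce very broad matches.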

So what I’d do is back off the definition and build it up bit by bit to see if/when you have this problem. But if stopwords are working correctly at index time, “junk” will not be _in_ the index, so it’ll be impossible to find, fuzzy search or not. So you’re making some assumptions that aren’t true, and the analysis page combined with looking at the parsed query should show you quite a lot.
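
The two analysis points above (whitespace tokenizing leaves “junk.” intact, and a removed stopword never reaches the index) can be sketched in a few lines of Python. This is a simplified stand-in for a whitespace tokenizer, lowercase filter and stop filter, not Solr code; `analyze` is a hypothetical helper and the stopword set is an assumed stopwords.txt containing only “junk”.

```python
STOPWORDS = {"junk"}  # assumed contents of stopwords.txt for this illustration


def analyze(text, stopwords=STOPWORDS):
    """Whitespace-tokenize, lowercase, then drop exact stopword matches."""
    tokens = text.split()                  # split on whitespace only
    tokens = [t.lower() for t in tokens]   # lowercase each token
    return [t for t in tokens if t not in stopwords]


# "junk" is removed, so it would never be written to the index.
print(analyze("Junk mail folder"))  # ['mail', 'folder']
# "junk." keeps its period, does not equal the stopword, and survives.
print(analyze("This is junk."))     # ['this', 'is', 'junk.']
```

The admin/analysis page shows the same kind of step-by-step token stream for the real field type.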

Best,
Erick

> On May 8, 2019, at 4:43 PM, bbarani <[hidden email]> wrote:
>
> Hi,
> Is there a way to use stopwords and fuzzy match in a SOLR query?
>
> The query below matches 'jack' too. I added 'junk' to the stopwords (in
> query) to avoid returning results, but it looks like it's not honoring the
> stopwords when using the fuzzy search.
>
> solr/collection1/select?app-qf=title_autoComplete&hl=false&fl=*&group=true&group.limit=-1&group.sort=marketingSequence%20asc&group.field=productId&group.ngroups=true&facet=on&facet.field=categoryFilter&sort=defaultMarketingSequence%20asc&q=junk~
>
>
>    <fieldType name="edgytext" class="solr.TextField">
>        <analyzer type="index">
>            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.PorterStemFilterFactory"/>
>            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>            <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
>            <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="0" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
>            <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50" minGramSize="1"/>
>        </analyzer>
>        <analyzer type="query">
>            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.PorterStemFilterFactory"/>
>            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>            <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
>            <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="0" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
>        </analyzer>
>    </fieldType>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to use stopwords, synonyms along with fuzzy match in a SOLR

bbarani
Thanks for your reply Erick.

I created a simple field type as below for testing and added 'junk' to the stopwords, but it doesn't seem to honor it when using fuzzy search.

[Image: Analysis page output]

Btw, I am using qf along with edismax and pass the value in q (sample query below).

/solr/collection1/select?qf=title_autoComplete&hl=false&fl=productName&defType=edismax&q=junk~&debug=true&mm=100%25&sort=defaultMarketingSequence%20asc&rows=1


    <fieldType name="fuzzyType" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        </analyzer>
    </fieldType>


<doc>
<str name="productName">
 Headphone Jack Adapter Cable
</str>
</doc>
</result>
<lst name="debug">
<str name="rawquerystring">junk~</str>
<str name="querystring">junk~</str>
<str name="parsedquery">
(+DisjunctionMaxQuery((title_autoComplete:junk~2)))/no_coord
</str>
<str name="parsedquery_toString">+(title_autoComplete:junk~2)</str>
<lst name="explain">
<str name="prod8730332!sku8040542">
1.5424817 = sum of:
  1.5424817 = weight(title_autoComplete:jack in 190) [SchemaSimilarity], result of:
    1.5424817 = score(doc=190,freq=1.0 = termFreq=1.0 ), product of:
      0.5 = boost
      3.0849633 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        37.0 = docFreq
        819.0 = docCount
      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.0 = parameter b (norms omitted for field)
</str>
</lst>
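
As a sanity check on the explain output above, the score can be reproduced by hand. The Python sketch below only plugs the reported numbers into the idf formula printed by Solr; the boost and tfNorm values are taken as given from the output rather than derived, and the variable names are illustrative.

```python
import math

# Values copied from the explain output for the matched term "jack".
doc_freq = 37.0
doc_count = 819.0
boost = 0.5     # taken as reported, not derived here
tf_norm = 1.0   # taken as reported, not derived here

# idf as printed: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# The final score is the product of the three factors.
score = boost * idf * tf_norm

print(idf, score)
```

Note that the scored term is title_autoComplete:jack, not junk: the document matched through the fuzzy expansion of the query term.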




Re: How to use stopwords, synonyms along with fuzzy match in a SOLR

Erick Erickson
Ah, I didn’t read thoroughly enough. The problem is that stopwords don’t really count for fuzzy searching. By specifying “junk~” you’re not really searching for “junk” or variants. You’re telling Solr “find any term that is a fuzzy match to ‘junk’”. Under the covers, a search is being made for (jank OR jack OR …), for however many indexed terms are within the edit distance specified for “junk”.

So Solr is behaving as expected. Imagine if it worked as you expect and stopwords were removed before applying the fuzzy logic. Then the complaint would be “Hey, I know I have words in my corpus ('jack' in this case) that should match the fuzzy term 'junk~' but I don’t get any results back”.

Notice that no document with straight “junk” in the text will be returned absent other matching fuzzy terms.
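
The expansion described above can be sketched in Python. This is an illustration of the idea only: it uses plain Levenshtein distance and a tiny made-up term list taken from the sample document, whereas Lucene works against the real term dictionary with its own automaton-based matching. `edit_distance` and `fuzzy_expand` are hypothetical helpers, and the maximum of 2 edits mirrors the junk~2 shown in the parsed query.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_expand(query_term, indexed_terms, max_edits=2):
    """Return the indexed terms within max_edits of the query term."""
    return sorted(t for t in indexed_terms
                  if edit_distance(query_term, t) <= max_edits)


# Terms from the sample document; "junk" itself is absent from the index
# because the stop filter removed it at index time.
indexed_terms = {"headphone", "jack", "adapter", "cable"}
print(fuzzy_expand("junk", indexed_terms))  # ['jack']
```

The query term never passes through the stop filter as a literal term to match; it is only used as the center of the expansion, which is why “jack” (two substitutions away) still matches.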

Best,
Erick

> On May 9, 2019, at 11:17 AM, bbarani <[hidden email]> wrote:
>
>    <fieldType name="fuzzyType" class="solr.TextField">
>        <analyzer type="index">
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>        </analyzer>
>        <analyzer type="query">
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>        </analyzer>
>    </fieldType>