Dismax Minimum Match/Stopwords Bug


Dismax Minimum Match/Stopwords Bug

Jeff Newburn
I have discovered some weirdness with our Minimum Match functionality.
Essentially it comes up with absolutely no results on certain queries.
Basically, searches with 2 words and 1 being "the" don't have a return
result.  From what we can gather the minimum match criteria is making it
such that if there are 2 words then both are required.  Unfortunately, the
stopwords are pulled resulting in "the" being removed and then solr is
requiring 2 words when only 1 exists to match on.  Is there a way around
this?  I really need it to either require only non-stopwords or not filter
out stopwords.  We know stopwords are causing the issue because taking out
the stopwords fixes the problem.  Also, we can change mm setting to 75% and
fix the problem.

Example:
Brand: The North Face
Search: the north (returns no results)
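
For reference, a rough Python sketch of how an mm spec such as 2<-1 or 75% turns into a required clause count. It follows the documented mm syntax (percentages are truncated, and a conditional n<spec only applies when there are more than n clauses) and is an approximation, not Solr's actual code. The function name min_should_match is made up for illustration.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Approximate the number of clauses an mm spec requires (illustrative only)."""
    spec = spec.strip()
    if "<" in spec:
        # Conditional spec(s), e.g. "2<-1": with 2 or fewer clauses, all are required.
        result = clause_count
        for part in spec.split():
            threshold, sub_spec = part.split("<", 1)
            if clause_count <= int(threshold):
                return result
            result = min_should_match(clause_count, sub_spec)
        return result
    if spec.endswith("%"):
        raw = clause_count * int(spec[:-1]) / 100
        # int() truncates toward zero; a negative percentage is subtracted from the total.
        result = clause_count + int(raw) if raw < 0 else int(raw)
    else:
        value = int(spec)
        # A negative integer means "all clauses except this many".
        result = clause_count + value if value < 0 else value
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(clause_count, result))


print(min_should_match(2, "2<-1"))  # 2: both clauses required
print(min_should_match(2, "75%"))   # 1: 1.5 is truncated to 1
```

With two clauses, 2<-1 requires both while 75% requires only one, which would fit the behaviour described above if "the" is still being counted as a clause somewhere.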

Our config is basically:
MM: <str name="mm">2&lt;-1</str>
FieldType:
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>




Re: Dismax Minimum Match/Stopwords Bug

hossman

: I have discovered some weirdness with our Minimum Match functionality.
: Essentially it comes up with absolutely no results on certain queries.
: Basically, searches with 2 words and 1 being "the" don't have a return
: result.  From what we can gather the minimum match criteria is making it
: such that if there are 2 words then both are required.  Unfortunately, the

you haven't mentioned what qf you're using, and you only listed one field
type, which includes stopwords -- but i suspect your qf contains at least
one field that *doesn't* remove stopwords.

this is in fact an unfortunate aspect of the way dismax works --
each "chunk" of text recognized by the query parser is passed to each
analyzer for each field.  Any chunk that produces a query for a field
becomes a DisjunctionMaxQuery, and is included in the "mm" count -- even
if that "chunk" is a stopword in every other field (and produces no query)
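
A small Python sketch of the behaviour described above, under simplifying assumptions: each whitespace-separated chunk is run through every field's analyzer, and any chunk that survives in at least one field becomes one clause that counts toward mm. The field names brand and sku and the stopword list are hypothetical, and build_clauses is a stand-in for illustration, not a Solr API.

```python
STOPWORDS = {"the", "a", "an", "of"}  # hypothetical stopword list


def analyze(chunk, removes_stopwords):
    """Lowercase the chunk; return None if this field's analyzer drops it."""
    term = chunk.lower()
    if removes_stopwords and term in STOPWORDS:
        return None
    return term


def build_clauses(query, fields):
    """fields maps a field name to True if its analyzer removes stopwords.

    Returns one dict per chunk that produced a term in at least one field,
    standing in for one DisjunctionMaxQuery per chunk.
    """
    clauses = []
    for chunk in query.split():
        per_field = {}
        for name, removes_stopwords in fields.items():
            term = analyze(chunk, removes_stopwords)
            if term is not None:
                per_field[name] = term
        if per_field:  # a chunk dropped by every field produces no clause
            clauses.append(per_field)
    return clauses


mixed = {"brand": True, "sku": False}      # sku keeps stopwords
consistent = {"brand": True, "sku": True}  # same stopwords everywhere

print(len(build_clauses("the north", mixed)))       # 2
print(len(build_clauses("the north", consistent)))  # 1
```

In the mixed case "the" still yields a clause (on sku only), so mm sees two clauses even though most fields discard the word.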

so you have to either be consistent with your stopwords across all fields,
or make your mm really small.  searching for "dismax stopwords" turns this
up...

http://www.nabble.com/Re%3A-DisMax-request-handler-doesn%27t-work-with-stopwords--p11016770.html

...if i'm wrong about your situation (some fields in the qf with stopwords
and some fields without) then please post all of the params you are using
(not just mm) and the full parsedquery_tostring from when debugQuery=true
is turned on.




-Hoss

Re: Dismax Minimum Match/Stopwords Bug

Matthew Runo
Would this mean that, for example, if we wanted to search productId  
(long) we'd need to make a field type that had stopwords in it rather  
than simply using (long)?

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[hidden email] - 702-943-7833

On Dec 12, 2008, at 11:56 PM, Chris Hostetter wrote:



Re: Dismax Minimum Match/Stopwords Bug

hossman

: Would this mean that, for example, if we wanted to search productId (long)
: we'd need to make a field type that had stopwords in it rather than simply
: using (long)?

not really ... that's kind of a special usecase.  if someone searches for
a productId that's usually *all* they search for (1 "chunk" of input from
the query parser) so it's mandatory and produces a clause across all
fields.  It doesn't matter if the other fields have stopwords -- even if
the productId happens to be a stop word, that just means it doesn't
produce a clause on those "stop worded" fields, but it will on your
productId field.

The only case where you might get into trouble is if someone searches for
"the 123456" ... now you have two chunks of input, so the mm param
comes into play.  you have no stopwords on your productId field so both
"the" and "123456" produce clauses, but "the" isn't going to be found in
your productId field, and because of stopwords it doesn't exist in the
other fields at all ... so you don't match anything.
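
To make the "the 123456" case concrete, here is a minimal Python sketch with the two clauses written out by hand from the paragraph above. It assumes productId is searched as a plain string field and that a hypothetical name field removes stopwords; the document contents and the matches helper are invented for illustration.

```python
# Clause 1: "the" survives only on productId (the name field drops it as a stopword).
# Clause 2: "123456" survives on both fields.
clauses = [
    {"productId": "the"},
    {"productId": "123456", "name": "123456"},
]

# Indexed terms per field for one hypothetical product. "the" was removed
# from name at index time, so it does not appear there either.
doc = {
    "productId": ["123456"],
    "name": ["north", "face", "jacket"],
}


def clause_matches(clause, doc):
    """A clause matches if its term is found in any of the fields it covers."""
    return any(term in doc.get(field, ()) for field, term in clause.items())


def matches(clauses, doc, mm):
    """The document matches if at least mm clauses match."""
    return sum(clause_matches(c, doc) for c in clauses) >= mm


print(matches(clauses, doc, 2))  # False: the "the" clause can never match
print(matches(clauses, doc, 1))  # True: only the "123456" clause is needed
```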

FWIW: if i remember right if you want to put numeric fields in the qf, i
think you need *all* of them to be numeric and all of your input needs to
be numeric, or you get exceptions from the FieldType (not the dismax
parser) when people search for normal words.   i always copyField
productId into a productId_str field for purposes like this.


-Hoss


Re: Dismax Minimum Match/Stopwords Bug

Matthew Runo
Hmm, that makes sense to me - however I still think that even if we  
have mm set to "2" and we have "the 7449078" it should still match  
7449078 in a productId field (it does not: http://zeta.zappos.com/search?department=&term=the+7449078).
This seems like it works against the way one would reasonably expect
it to - that stopwords shouldn't impact the counts for mm (so, "the  
7449078" would count as 1 term for mm since "the" is a stopword).

Would there be a way around this? Could we possibly get it reworked?  
What would the downside to that be?

We have people asking for "the north" to return results from a brand  
called "the north face" - but it doesn't, and can't, because of this  
mm issue.

Thanks for your time helping us with this issue =)

Matthew Runo
Software Engineer, Zappos.com
[hidden email] - 702-943-7833

On Dec 20, 2008, at 10:45 AM, Chris Hostetter wrote:



Re: Dismax Minimum Match/Stopwords Bug

hossman

: Hmm, that makes sense to me - however I still think that even if we have mm
: set to "2" and we have "the 7449078" it should still match 7449078 in a
: productId field (it does not:
: http://zeta.zappos.com/search?department=&term=the+7449078). This seems like
: it works against the way one would reasonably expect it to - that stopwords
: shouldn't impact the counts for mm (so, "the 7449078" would count as 1 term
: for mm since "the" is a stopword).

this is back to the original "problem"...

"stopwords" is an analyzer concept; "minShouldMatch" is
a BooleanQuery/DisMaxQueryParser concept ... if all of the analyzers for all
of your fields agree on the list of stopwords, then q=the+7449078 will
result in "the" getting thrown out and you'll only have one clause.  but
if one of your fields has an analyzer that says "the" is a valid term, then it's
a valid term and it gets a clause in the query.  if it gets a clause in
the query, then it factors into the minShouldMatch calculation.

in that particular situation i believe the solution you want is to use the
same stopwords on your productId field as you have on your other fields,
so "the" doesn't get a query clause at all ... unless you want
q=the+7449078 to return product#7449078 if and only if it also has "the"
in its productId field.

: We have people asking for "the north" to return results from a brand called
: "the north face" - but it doesn't, and can't, because of this mm issue.

it may not work for you right now, but that doesn't mean it can't :)  ...
i'm not sure why it wouldn't actually.

consider a query like this...
 
   q=the north&qf=manu^2 prodName^1 desc^0.5&pf=...&mm=100%

let's say that "desc" uses stop words, but prodName and manu don't
(because we know we have manufacturer and product names like "the north
face"). we're going to get one DisjunctionMaxQuery for "the" (on the manu
and prodName fields) and one DisjunctionMaxQuery for "north" (on manu,
prodName, and desc) and that's 2 clauses on a BooleanQuery whose
minShouldMatch is going to be 2 (because 100% of 2 is 2)  so
now all products with "the" and "north" in their manufacturer name *OR*
product name will match -- even if it's "the" in manu and "north"
in prodName.  products will even match if the only place they contain
"north" is in the description -- but only if they also contain "the" in
manu or prodName.  if you think "that's silly, why is 'the' required? i
want it to be a stopword!" then the solution is to make it a stopword
*everywhere* (including manu and prodName) ... since it's not a stopword,
it's considered significant, so it needs to match.
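
The manu/prodName/desc scenario can be simulated with a short Python sketch, assuming whitespace tokenization, a single stopword ("the"), and that both clauses are required (minShouldMatch of 2). The two products and the matches function are invented for illustration; this mimics the behaviour described above and is not Solr code.

```python
STOPWORDS = {"the"}  # hypothetical stopword list


def analyze(text, drop_stopwords):
    """Lowercase, split on whitespace, and optionally drop stopwords."""
    return [t for t in text.lower().split()
            if not (drop_stopwords and t in STOPWORDS)]


def matches(query, doc, fields, mm):
    """fields maps a field name to True if its analyzer drops stopwords."""
    clause_count = matched = 0
    for chunk in query.split():
        # One clause per chunk, covering each field whose analyzer keeps it.
        analyzed = {f: analyze(chunk, drop) for f, drop in fields.items()}
        targets = {f: toks[0] for f, toks in analyzed.items() if toks}
        if not targets:
            continue  # a stopword in every field: no clause, not counted
        clause_count += 1
        if any(term in analyze(doc.get(f, ""), fields[f])
               for f, term in targets.items()):
            matched += 1
    # The required count is capped at the number of clauses actually built.
    return clause_count > 0 and matched >= min(mm, clause_count)


jacket = {"manu": "The North Face", "prodName": "Denali Jacket",
          "desc": "warm fleece jacket"}
compass = {"manu": "Acme", "prodName": "Pocket Compass",
           "desc": "points north"}

mixed = {"manu": False, "prodName": False, "desc": True}
all_stop = {"manu": True, "prodName": True, "desc": True}

print(matches("the north", jacket, mixed, 2))      # True
print(matches("the north", compass, mixed, 2))     # False: no "the" in manu/prodName
print(matches("the north", compass, all_stop, 2))  # True: "the" produces no clause
```

With "the" treated as a stopword in every field, only the "north" clause remains, so both products match.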


-Hoss
