AND query not working on stopwords as expected

AND query not working on stopwords as expected

arunr
Solr version 4.2.1

In my schema, I have "text" type defined as follows:
---
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="0" catenateAll="1"
                splitOnCaseChange="1"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>

      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>

    </fieldType>
---

Field "name" is of type "text".

I have another multi-valued int field called "all_class_ids".

Both fields are indexed. I have 'of' in my stopwords.txt file.

I am using the lucene query parser.

This query
q=name:of&rows=0
gives no results as expected.

However, this query:
q=name:of AND all_class_ids:(371)&rows=0
gives results, and the result count is exactly the same as for
q=all_class_ids:(371)&rows=0

This is happening only for stopwords. Why?

Thanks.

Re: AND query not working on stopwords as expected

Erick Erickson
Query parsing is not strict boolean logic; this trips up many people,
even though the operators are named AND, NOT and OR. See:
https://lucidworks.com/blog/why-not-and-or-and-not/
I think what you've really got at the top level is a single MUST
clause, namely all_class_ids:(371).

What is _not_ happening here is a set intersection as it would if the
logic were strictly boolean, and I suspect that expectation is what's
misleading you.

If that's not the case, post the results of adding &debug=query to
your URL; that will help.


Best,
Erick


Re: AND query not working on stopwords as expected

Yonik Seeley
In reply to this post by arunr
On Mon, Feb 16, 2015 at 4:32 PM, Arun Rangarajan
<[hidden email]> wrote:
[...]

> This query
> q=name:of&rows=0
> gives no results as expected.
>
> However, this query:
> q=name:of AND all_class_ids:(371)&rows=0
> gives results and is equal to the same number of results as
> q=all_class_ids:(371)&rows=0
>
> This is happening only for stopwords. Why?

This is more of a full-text search thing.
Removal of stopwords is more like a "don't care, it's not important".
Hence a query for "a plane" should return all documents containing
"plane", ignoring the question of if the document contained an "a"
(which we can't tell since stopwords were removed during indexing).

Now I understand your point about consistency too.  Using the example
above, something like q=name:of should arguably match all documents
(or at least all documents with a "name" field).  It is very odd to
add an additional restriction and end up with more docs.

-Yonik

Re: AND query not working on stopwords as expected

Jack Krupansky-3
In reply to this post by arunr
Specifically, what is happening is that the query parser passes "of" to the
analyzer for the name field, which removes stopwords, including "of",
leaving no term to be queried. A Lucene BooleanQuery with no terms
will match... nothing. But when you add another clause, you have the
combination of an empty term and a specific term, which is equivalent to
using just the specific term. Think of a sequence of terms to be ANDed as a
set: if a term analyzes to no terms, there are no terms to add to the set
of terms to be ANDed.

Diving a little deeper, the "AND" operator of the two terms simply means
that all terms "MUST" be present, but since your first term analyzed to no
terms, only one term is present.

Another example where this could happen is a query such as "$,@. AND 371" -
the "$,@." gets parsed as a term, but then all the punctuation gets removed
by the analyzer, leaving no term.

These days, the recommended practice is to keep stopwords in the index but
remove them at query time unless all of the terms in the query are stop
words. In fact, it would be better to only remove stop words at query time
when they are not at either end of the query. This way, queries such as "to
be or not to be", "vitamin a", and "the office" can still provide
meaningful and precise matches even as stop words are generally ignored.
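
As a rough illustration of the first half of that recommendation (keep stopwords in the index, drop them only at query time), a field type might look something like the sketch below. The name "text_keep_stops" is made up for this example, and the filter chain is deliberately shorter than the one in the original post. Note the limit of this sketch: the query-side StopFilterFactory still removes stopwords unconditionally, so name:of would still analyze to no terms; the "unless all of the terms are stop words" part is not something this configuration provides.

```xml
<!-- Hypothetical field type name; a sketch, not the original poster's schema. -->
<fieldType name="text_keep_stops" class="solr.TextField" positionIncrementGap="100">

  <!-- Index side: no StopFilterFactory, so "of", "the", "a" stay in the index. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>

  <!-- Query side: stopwords are still removed, and removed unconditionally. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>

</fieldType>
```

Because the stopwords are present in the index, a different query-side analysis (or a second field with no stop filter at all) can later match them without reindexing from a stopword-stripped index.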



-- Jack Krupansky


Re: AND query not working on stopwords as expected

Alexandre Rafalovitch
On 16 February 2015 at 19:12, Jack Krupansky <[hidden email]> wrote:
> In fact, it would be better to only remove stop words at query time
> when they are not at either end of the query.

And how is that achieved in Solr? This sounds interesting but
stretches my knowledge of the available filters.

Regards,
   Alex.


Re: AND query not working on stopwords as expected

Jack Krupansky-3
Notice that I said "would be" rather than "is"!

Yeah, Solr is basically broken WRT intelligent stop word handling, but
nobody wants to admit it. edismax does have some limited support for the
case of the query being all stop words, but that doesn't work for more
complex queries with operators and the case of a leading or trailing
stopword. The old Lucid query parser did have better support for queries
with stop words, but that's no longer available in their current product.

-- Jack Krupansky


Re: AND query not working on stopwords as expected

Alexandre Rafalovitch
Well, there are the CommonGrams and CommonGramsQuery filters (e.g.
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/commongrams/CommonGramsQueryFilter.html
). But I haven't seen them in use much.
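
For reference, a minimal sketch of how those two filters might be wired into a field type, assuming the existing stopwords.txt is reused as the common-words list. The field type name "text_commongrams" is hypothetical, and this is untested against 4.2.1. At index time the CommonGrams filter keeps the single words and additionally emits word pairs that include a common word (joined with an underscore, e.g. the_office); at query time the CommonGramsQuery filter prefers those pairs over the single words.

```xml
<!-- Hypothetical field type name; a sketch under the assumptions stated above. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">

  <!-- Index side: emit single words plus common-word pairs such as "the_office". -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

  <!-- Query side: use the pairs where one can be formed, single words otherwise. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

</fieldType>
```

Since nothing is discarded here, a query consisting of a lone stopword should survive as a real term rather than analyzing to nothing, at the cost of a larger index.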

If the description above (about the first/last token) would actually
be useful, it is probably implementable in a similar fashion. But it
would need a bunch of use-cases and/or reference material to check the
benefits against.

Or it could be just a matter of marking the first/last token as keywords.
Well, it would have been if StopFilter actually checked for
*isKeyword()*. It does not seem to.

Regards,
   Alex.