Grouping in Lucene queries giving unexpected results


Michael Peterson
I have a question about the meaning and behavior of grouping in Lucene
queries.

In particular, here is the scenario I am testing. I have indexed 1,000
documents.

|---+-------------------------------------------+---------------|
| # | Query String                              | Result (Hits) |
|---+-------------------------------------------+---------------|
| 1 | *:*                                       |          1000 |
| 2 | host:host_1                               |            46 |
| 3 | location:location_5                       |           100 |
| 4 | host:host_1 AND NOT location:location_5   |            37 |
| 5 | host:host_1 AND (NOT location:location_5) |             0 |
|---+-------------------------------------------+---------------|

I don't understand why the last query returns 0. I would expect queries 4
and 5 to return the same result.

Here's the interpretation based on running it through the Lucene
classic.QueryParser:

|-------------------------------------------+--------------------------------------|
| Query String                              | QueryParser.parse(qry).toString()    |
|-------------------------------------------+--------------------------------------|
| host:host_1 AND NOT location:location_5   | +host:host_1 -location:location_5    |
| host:host_1 AND (NOT location:location_5) | +host:host_1 +(-location:location_5) |
|-------------------------------------------+--------------------------------------|

I'd like some help understanding why I'm getting this unintuitive behavior.

Also, I see that the StandardSyntaxParser generates a different query
string:

|-------------------------------------------+-------------------------------------------------|
| Query String                              | StandardSyntaxParser.parse(qry).toQueryString() |
|-------------------------------------------+-------------------------------------------------|
| host:host_1 AND NOT location:location_5   | host:host_1 AND -location:location_5            |
| host:host_1 AND (NOT location:location_5) | host:host_1 AND ( -location:location_5 )        |
|-------------------------------------------+-------------------------------------------------|

Are these equivalent in Lucene? Should I stop using the classic.QueryParser?

*Details*

I'm using Lucene 5.5.0 with the classic.QueryParser. The query code is:

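    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;  // IndexSearcher, Query, TopDocs
    import org.apache.lucene.store.*;   // Directory, FSDirectory

    // Open the index and build a searcher over it.
    Directory directory = FSDirectory.open(getCurrentDirectory().toPath());
    StandardAnalyzer analyzer = new StandardAnalyzer();
    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    // "ts" is the default field for terms without an explicit field prefix.
    QueryParser parser = new QueryParser("ts", analyzer);
    Query query = parser.parse("host:host_1 AND NOT location:location_5");

    // Fetch up to 1000 hits and print the total hit count.
    int limit = 1000;
    TopDocs hits = searcher.search(query, limit);
    System.out.println("hits.totalHits = " + hits.totalHits);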


Thanks very much for your insights here.

-Michael Peterson

Re: Grouping in Lucene queries giving unexpected results

Trejkaz
On Fri, Feb 17, 2017 at 5:42 AM, Michael Peterson <[hidden email]> wrote:
> I have a question about the meaning and behavior of grouping behavior with
> Lucene queries.

For this query:

    host:host_1 AND (NOT location:location_5)

The right hand side is:

    NOT location:location_5

Which matches nothing, as it has no positive clauses. And, of course,
ANDing that with any other query results in matching nothing.
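
To make that concrete, here is a rough sketch in plain Java of how required
and prohibited clauses combine. It is a toy model over sets of document ids,
not Lucene code: BoolModel and eval are made-up names, the ids are invented,
and optional (SHOULD) clauses are left out for brevity.

```java
import java.util.*;

// Toy model of boolean clause evaluation over sets of document ids.
// Hypothetical names throughout; this does not use the Lucene API.
class BoolModel {
    /** Evaluates required (+) and prohibited (-) clauses. */
    static Set<Integer> eval(List<Set<Integer>> must, List<Set<Integer>> mustNot) {
        // With no positive clause there is nothing to select from,
        // so the result is empty no matter what is prohibited.
        if (must.isEmpty()) {
            return new TreeSet<>();
        }
        Set<Integer> result = new TreeSet<>(must.get(0));
        for (Set<Integer> m : must) {
            result.retainAll(m);      // every required clause must match
        }
        for (Set<Integer> n : mustNot) {
            result.removeAll(n);      // prohibited clauses only remove docs
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> host1 = new TreeSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> location5 = new TreeSet<>(Arrays.asList(3, 4, 5, 6));
        List<Set<Integer>> none = Collections.emptyList();

        // +host:host_1 -location:location_5
        Set<Integer> q4 = eval(Arrays.asList(host1), Arrays.asList(location5));
        // (-location:location_5) evaluated on its own
        Set<Integer> inner = eval(none, Arrays.asList(location5));
        // +host:host_1 +(-location:location_5)
        Set<Integer> q5 = eval(Arrays.asList(host1, inner), none);

        System.out.println(q4);     // [1, 2]
        System.out.println(inner);  // []
        System.out.println(q5);     // []
    }
}
```

The prohibited clause only ever removes documents from what the positive
clauses selected, so a group with nothing positive in it starts and ends
empty.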

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Grouping in Lucene queries giving unexpected results

Erick Erickson
Take a look at this blog post by Hossman:
https://lucidworks.com/2011/12/28/why-not-and-or-and-not/

Lucene query logic is not strict Boolean logic; the article above explains why.

Best,
Erick


Re: Grouping in Lucene queries giving unexpected results

Trejkaz
On Fri, Feb 17, 2017 at 11:14 AM, Erick Erickson
<[hidden email]> wrote:
> Lucene query logic is not strict Boolean logic, the article above explains why.

tl;dr it mostly comes down to scoring and syntax.

The scoring argument will depend on how much you care. (My care for
scoring is pretty close to zero, as I don't care whether the better
results come first, as long as the exact results come back and the
non-results don't.)

For the syntax:

* The article doesn't really address the (-NOT) problem, where
essentially Lucene could insert an implicit *:* when there isn't one,
to make those queries at least get a sane result. You can work around
this by customising the query parser, which is possible for both the
classic one (subclass it and override the method that creates the
BooleanQuery) and the flexible one (add a processor to the pipeline).

* The article strongly encourages using the +/- syntax instead of
AND/OR/NOT, but the astute might notice that AND/OR/NOT is three
operators, whereas +/- is only two, so clearly one of the boolean
clause types does not have a prefix operator, making it literally
impossible to specify some queries using the prefix operators alone.
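
Both points can be sketched with a small stand-in for the clause list a
parser builds. This is plain Java with made-up types (ClauseFixer, Occur,
Clause are hypothetical and only loosely mirror Lucene's clause types); in a
real parser the same check would live in the override or processor mentioned
above.

```java
import java.util.*;

// Toy stand-ins for boolean clauses; hypothetical names, not the Lucene API.
class ClauseFixer {
    enum Occur {
        MUST("+"), SHOULD(""), MUST_NOT("-");  // SHOULD has no prefix operator
        final String prefix;
        Occur(String prefix) { this.prefix = prefix; }
    }

    static final class Clause {
        final Occur occur;
        final String query;
        Clause(Occur occur, String query) {
            this.occur = occur;
            this.query = query;
        }
        @Override public String toString() { return occur.prefix + query; }
    }

    /** Adds a required match-all clause when every clause is prohibited. */
    static List<Clause> fixPureNegative(List<Clause> clauses) {
        boolean allNegative = !clauses.isEmpty();
        for (Clause c : clauses) {
            if (c.occur != Occur.MUST_NOT) {
                allNegative = false;
            }
        }
        if (!allNegative) {
            return clauses;  // at least one positive clause: leave untouched
        }
        List<Clause> fixed = new ArrayList<>();
        fixed.add(new Clause(Occur.MUST, "*:*"));  // the implicit match-all
        fixed.addAll(clauses);
        return fixed;
    }

    /** Renders the clauses using the prefix-operator syntax. */
    static String render(List<Clause> clauses) {
        StringBuilder sb = new StringBuilder();
        for (Clause c : clauses) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Clause> inner = Arrays.asList(
                new Clause(Occur.MUST_NOT, "location:location_5"));
        System.out.println(render(fixPureNegative(inner)));
        // prints: +*:* -location:location_5
    }
}
```

Groups that already contain a positive clause pass through unchanged, so the
rewrite only affects queries that would otherwise match nothing.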

TX


Re: Grouping in Lucene queries giving unexpected results

Michael Peterson
Thanks everyone.

For our use case in Rocana Search, we don't use scoring at all. We always
sort by a timestamp field present in every Document, so for us Lucene query
logic is always truly boolean - we only want exact matches using boolean
logic like you would get from a database query.

That being said, I can see now why +/- operators are useful when wanting
"should" vs. "must" for scoring.

Trejkaz - thanks for the deeper explanation. We will, in fact, modify naked
"NOT x" queries (where x might be a complex clause) to be

(*:* AND NOT x)

as that is exactly the interpretation we want.
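
As a sanity check on the numbers from my first message, here is a small
set-based sketch in plain Java (no Lucene involved; TableCheck and its
helpers are made-up names). The document ids are arbitrary. Only the counts
come from the table: 1000 documents, 46 for host_1, 100 for location_5, and
9 in both, since 46 - 37 = 9.

```java
import java.util.*;

// Toy set model of the hit counts; hypothetical names, not the Lucene API.
class TableCheck {
    /** Returns the document ids in [from, to). */
    static Set<Integer> range(int from, int to) {
        Set<Integer> s = new HashSet<>();
        for (int i = from; i < to; i++) {
            s.add(i);
        }
        return s;
    }

    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new HashSet<>(a);
        r.retainAll(b);
        return r;
    }

    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new HashSet<>(a);
        r.removeAll(b);
        return r;
    }

    public static void main(String[] args) {
        Set<Integer> all = range(0, 1000);        // *:*                 1000 docs
        Set<Integer> host1 = range(0, 46);        // host:host_1           46 docs
        Set<Integer> location5 = range(37, 137);  // location:location_5  100 docs, 9 shared

        // (NOT x) as Lucene evaluates it: no positive clause, so no matches.
        Set<Integer> nakedNot = new HashSet<>();
        // (*:* AND NOT x): everything except x.
        Set<Integer> wrappedNot = andNot(all, location5);

        System.out.println(andNot(host1, location5).size());  // query 4: 37
        System.out.println(and(host1, nakedNot).size());      // query 5: 0
        System.out.println(and(host1, wrappedNot).size());    // query 5 rewritten: 37
    }
}
```

With the rewrite, query 5 gives the same count as query 4.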

-Michael Peterson

https://www.rocana.com/

