operator precedence and confusing result

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

operator precedence and confusing result

Jenny Brown
I have Lucene running against a fairly large set of documents, and a
user interface for full text querying.  All queries mentioned below go
against the document full text.

I have one query from a user that is giving me bizarre behavior.  Its
structure is:

A)   medicine AND (cat OR dog OR horse OR fish)     =  10,000 results
B)   medicine AND cat OR dog OR horse OR fish        = 4,000 results
C)  (medicine AND cat) OR dog OR horse OR fish     = 90,000 results
(which tells me this isn't how B is interpretted)


This is backwards from what I would expect, as I would guess that case
B would bind (medicine AND cat) OR dog... etc., thus resulting in
-more- hits than A, rather than less.  Can someone explain what Lucene
is doing for case B?

I'm using StandardAnalyzer and a typically tokenized document text
field, and BooleanQuery.  Thanks for ideas and help.


Jenny Brown
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

hossman

: A)   medicine AND (cat OR dog OR horse OR fish)     =  10,000 results
: B)   medicine AND cat OR dog OR horse OR fish        = 4,000 results
: C)  (medicine AND cat) OR dog OR horse OR fish     = 90,000 results
: (which tells me this isn't how B is interpretted)

it's important to rememebr that Lucene isn't a boolean matching system --
the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
keywords are just syntactic sugar that attempt to apply underlying
semantics as binary opperators.

if you look at the toString output of queries you can see what the parser
has done to try and apply your use of AND/OR to the underlying query
structures.  here's an example of what the toString for those queries
looks like (my analyzer does some stemming, but the structure of hte
queries is going to be the same regardless of the analyzer)


A) +text:medicin +(text:cat text:dog text:hors text:fish)
B) +text:medicin +text:cat text:dog text:hors text:fish
C) (+text:medicin +text:cat) text:dog text:hors text:fish

...Does the discrepency in the number of results make sense now?


-Hoss

Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Mike Klaas

On 8-Mar-09, at 3:46 PM, Chris Hostetter wrote:
>
> it's important to rememebr that Lucene isn't a boolean matching  
> system --
> the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
> keywords are just syntactic sugar that attempt to apply underlying
> semantics as binary opperators.

Has anyone (but me) considered ripping out this misleading (and imho  
not terribly useful) syntactic sugar in Lucene 3.0?  It seems that the  
boolean operator support is little but an impediment to understanding  
the Lucene query model.  Analogies that only occasionally accord with  
a person's intuition are among the most dangerous.

There could always be a query parser mode or module that is devoted to  
parsing boolean operators correctly.

-Mike
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Jenny Brown
I use the boolean logic heavily in a production app, because it's the
grammar that my users understand (and they put together complex
boolean queries in other apps too).  Also, we're not using relevance
ranking.  A document either "matches the query" and gets returned, or
"doesn't match" and doesn't get returned.  We only want yes/no
answers.

I haven't had time to really figure out what the earlier commenter
meant with the + operators syntax conversion.  I still thought it
would have meant the same thing as the query I had posted, ie, article
has to match all terms in the AND clauses, and at least one of the
terms in the OR list.  I guess I'm still missing what his explanation
was trying to demonstrate.

Anyway, just a note to say that boolean matching is important to me
and my users; it'd be good if it worked the way it looks like it
would.  If it doesn't, I need to understand better what the current
limitations are.

I'll be out of town and away from net for 4 days, starting tomorrow,
so I'll have to continue conversation once I'm back.  Thanks for the
help and comments thus far.

Jenny Brown


On Wed, Mar 11, 2009 at 8:58 PM, Mike Klaas <[hidden email]> wrote:

>
> On 8-Mar-09, at 3:46 PM, Chris Hostetter wrote:
>>
>> it's important to rememebr that Lucene isn't a boolean matching system --
>> the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
>> keywords are just syntactic sugar that attempt to apply underlying
>> semantics as binary opperators.
>
> Has anyone (but me) considered ripping out this misleading (and imho not
> terribly useful) syntactic sugar in Lucene 3.0?  It seems that the boolean
> operator support is little but an impediment to understanding the Lucene
> query model.  Analogies that only occasionally accord with a person's
> intuition are among the most dangerous.
>
> There could always be a query parser mode or module that is devoted to
> parsing boolean operators correctly.
>
> -Mike
>
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Mike Klaas

On 11-Mar-09, at 7:13 PM, Jenny Brown wrote:

> I use the boolean logic heavily in a production app, because it's the
> grammar that my users understand (and they put together complex
> boolean queries in other apps too).  Also, we're not using relevance
> ranking.  A document either "matches the query" and gets returned, or
> "doesn't match" and doesn't get returned.  We only want yes/no
> answers.
>
> I haven't had time to really figure out what the earlier commenter
> meant with the + operators syntax conversion.  I still thought it
> would have meant the same thing as the query I had posted, ie, article
> has to match all terms in the AND clauses, and at least one of the
> terms in the OR list.  I guess I'm still missing what his explanation
> was trying to demonstrate.
>
> Anyway, just a note to say that boolean matching is important to me
> and my users; it'd be good if it worked the way it looks like it
> would.  If it doesn't, I need to understand better what the current
> limitations are.

Well, this is precisely why I am suggesting that we remove it (in some  
future version of Lucene).  Lucene doesn't have a hierarchical boolean  
query model that works like people "expect", and bugs filed that  
report discrepancies between the way boolean operators work and  
intuition are rejected.  We are left with something that is convenient  
if you understand how it works, but if that is so, there is no reason  
that translation into the alternate syntax can't be used.

Lucene's query model is based on REQUIRED, OPTIONAL, and EXCLUDED  
clauses.  A clause with no annotation is always OPTIONAL, and doesn't  
affect matching unless there are only OPTIONAL clauses on that level.  
brackets () create a subclause (note that this is OPTIONAL by  
default!).  AND terms are translated into REQUIRED clauses, AND NOT's  
are translated into EXCLUDED clauses.  Require clauses are annotated  
with +'s

A AND B OR C OR D OR E OR F
-> +A +B C D E F
-> find documents that match clause A and clause B (other clauses  
don't affect matching)

C OR D OR E OR F
-> C D E F
-> find documents matching at least one of these clauses

A AND (B OR C OR D OR E OR F)
-> +A +(B C D E F)
-> find documents that match A, and match one of B, C, D, E, or F

(A AND B) OR C OR D OR E OR F
-> (+A +B) C D E F
-> find documents that match at least one of C, D, E, F, or both of A  
and B

The key takeaway: once you have an AND in a grouped set of clauses,  
the OR are completely irrelevant for matching.

-Mike


Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

hossman

to elaborate a bit on Mike's comments...

: version of Lucene).  Lucene doesn't have a hierarchical boolean query model

...it has a relevancy based query model.  documents don't just
match(true|false) ... if they match, they have a score.

: A AND B OR C OR D OR E OR F
: -> +A +B C D E F
: -> find documents that match clause A and clause B (other clauses don't affect
: matching)

...but the other caluses do affect scoring, the more of those clauses that
match, the higher the score.

-Hoss

Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Ted Dunning
In reply to this post by Mike Klaas
Actually, the nesting of BooleanQueries with required, optional and excluded
clauses are equivalent in power to (), AND, OR and AND-NOT.  Any expression
in one form can be expressed in the other.  Moreover, the Lucene form allows
very nice intepretation with relevance scoring that is difficult with the
normal Boolean expressions.

The only thing you need to have is an accurate translation.  No need to
remove anything.


On Fri, Mar 13, 2009 at 10:51 AM, Mike Klaas <[hidden email]> wrote:

> Anyway, just a note to say that boolean matching is important to me
>> and my users; it'd be good if it worked the way it looks like it
>> would.  If it doesn't, I need to understand better what the current
>> limitations are.
>>
>
> Well, this is precisely why I am suggesting that we remove it (in some
> future version of Lucene).  Lucene doesn't have a hierarchical boolean query
> model that works like people "expect", and bugs filed that report
> discrepancies between the way boolean operators work and intuition are
> rejected.  We are left with something that is convenient if you understand
> how it works, but if that is so, there is no reason that translation into
> the alternate syntax can't be used.
>
> Lucene's query model is based on REQUIRED, OPTIONAL, and EXCLUDED clauses.
>  A clause with no annotation is always OPTIONAL, and doesn't affect matching
> unless there are only OPTIONAL clauses on that level.  brackets () create a
> subclause (note that this is OPTIONAL by default!).  AND terms are
> translated into REQUIRED clauses, AND NOT's are translated into EXCLUDED
> clauses.  Require clauses are annotated with +'s





--
Ted Dunning, CTO
DeepDyve
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Jenny Brown
In reply to this post by hossman
Ok, after Mike Klaas's explanation of how it's translated, and a few
minutes of really studying your examples, this makes sense to me now.
And the number of results makes sense.  Now my challenge is how to
explain this to my users - oi, what a difference, and subtleties.

Thanks for the help.  I do have one remaining question.  How do
proximity searches fit into this?
I often have users performing a search like:

"music arts"~10 OR "band festival" OR "rock concert"

and then reporting back to me that they'll see results containing just
the word band, or just the word concert, that don't match an entire
phrase.  Am I misunderstanding phrase search here?  Is the proximity
search changing the behavior of the whole query?

Jenny Brown



On Sun, Mar 8, 2009 at 5:46 PM, Chris Hostetter
<[hidden email]> wrote:

>
> : A)   medicine AND (cat OR dog OR horse OR fish)     =  10,000 results
> : B)   medicine AND cat OR dog OR horse OR fish        = 4,000 results
> : C)  (medicine AND cat) OR dog OR horse OR fish     = 90,000 results
> : (which tells me this isn't how B is interpretted)
>
> it's important to rememebr that Lucene isn't a boolean matching system --
> the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
> keywords are just syntactic sugar that attempt to apply underlying
> semantics as binary opperators.
>
> if you look at the toString output of queries you can see what the parser
> has done to try and apply your use of AND/OR to the underlying query
> structures.  here's an example of what the toString for those queries
> looks like (my analyzer does some stemming, but the structure of hte
> queries is going to be the same regardless of the analyzer)
>
>
> A) +text:medicin +(text:cat text:dog text:hors text:fish)
> B) +text:medicin +text:cat text:dog text:hors text:fish
> C) (+text:medicin +text:cat) text:dog text:hors text:fish
>
> ...Does the discrepency in the number of results make sense now?
>
>
> -Hoss
>
>
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Ted Dunning
Jenny,

The best thing to do is to use Luke to explain how the query parser is
reading your query.

See http://luke.sourceforge.net/ or http://www.getopt.org/luke/ for a
download and  web start link.

And, no, the proximity query should not match if both words are not
present.  But users are notoriously unreliable witnesses relative to this
sort of thing.

On Mon, Mar 16, 2009 at 12:55 PM, Jenny Brown <[hidden email]> wrote:

> I often have users performing a search like:
>
> "music arts"~10 OR "band festival" OR "rock concert"
>
> and then reporting back to me that they'll see results containing just
> the word band, or just the word concert, that don't match an entire
> phrase.  Am I misunderstanding phrase search here?  Is the proximity
> search changing the behavior of the whole query?
>
>
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

Mike Klaas
In reply to this post by Ted Dunning

On 13-Mar-09, at 11:33 AM, Ted Dunning wrote:

> Actually, the nesting of BooleanQueries with required, optional and  
> excluded
> clauses are equivalent in power to (), AND, OR and AND-NOT.  Any  
> expression
> in one form can be expressed in the other.  Moreover, the Lucene  
> form allows
> very nice intepretation with relevance scoring that is difficult  
> with the
> normal Boolean expressions.
>
> The only thing you need to have is an accurate translation.  No need  
> to
> remove anything.

Yes, but until that work is done (and it probably belongs in a  
separate queryparser module, imo), we're left with the current  
situation, which doesn't work but seductively gives the appearance of  
working.

Anyway, the point seems moot given the volumous silence my suggestion  
has incited.

-Mike
Reply | Threaded
Open this post in threaded view
|

Re: operator precedence and confusing result

hossman

: Yes, but until that work is done (and it probably belongs in a separate
: queryparser module, imo), we're left with the current situation, which doesn't
: work but seductively gives the appearance of working.
:
: Anyway, the point seems moot given the volumous silence my suggestion has
: incited.

In general i agree with you Mike: the world would be a lot simpler if we
removed the AND/NOT/OR keywords from the query parser ... but the hitch is
that when used in a particular way (i don't want to say "correctly"
because that's a subjective evaluation, so instead i'll say
"unambiguously") they can be extremely effective for people.

   ((A AND B) OR (C AND D)) AND X

...works the way *everyone* would exepect it to work, and is much simpler
for some people to udnerstand then...

   +((+A +B) (+C +D)) +X

Personally: I'd be very happy if there were a lot more QueryParsers, and
we had one that supported the +/- syntax and a seperate one that supported
the AND/OR/NOT syntax ...  in fact: i think there's a nice long thread
(that i really need to find time to read) on java-dev about how the IBM
folks may have solved that problem for everyone.


-Hoss