QueryParser Is Badly Broken


Renaud Waldura-3
I'm developing an application used by scientists -- people who have a pretty
good idea of what logic is -- and they were shocked to find out that no two
of these queries return the same results:

1- banana AND apple OR orange
2- banana AND (apple OR orange)
3- (banana AND apple) OR orange

I'd expect (1) to be either (2) or (3), but it turns out it's parsed as
"+banana apple orange". I was rather, uh, dismayed by this find, as it
doesn't seem to make sense.

I just spent half a day reading up on the various ways QueryParser is
broken, by going through the bugs and the mailing-list archives. And I'm
still unable to come to a conclusion. Here's where I'm at:

    a- queries which mix boolean operators require strict parenthesizing to
work right

    b- "+" isn't shorthand for "AND"; using it with "AND"/"OR"/"NOT" and the
default operator "" rarely does what you expect

    c- the stock QueryParser doesn't work well in these cases

    d- there's a new PrecedenceQueryParser at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous that
solves *some* of the issues but creates others

    e- there is a non-Lucene effort to create a query parser with a
different syntax at http://famestalker.com/devwiki/

While we are also developing a query-building UI, users must be able to
enter text queries as well. What do other folks do? I mean, this is pretty
bad. I can hardly go back to my scientists and tell them Lucene is unable to
handle 2 boolean operators, that they should parenthesize everything by
hand. I mean, that's just cheesy.

--Renaud



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: QueryParser Is Badly Broken

Mark Miller-3
There is also the Surround Query Parser in contrib by the way...I would bet
that Paul will tell you that it does not have these issues. I can't wait to
see the replies on this one...I didn't realize that the QueryParser had
these problems and am a bit skeptical...unfortunately I am away from home
and cannot check it out.

On another note...http://famestalker.com/devwik/ will be done soon...I only
have not gotten around to finishing the final touches because there did not
appear to be a lot of initial interest (and what there was has waned
drastically) and I am not ready to use it myself yet. It does correctly
handle order of operations, however, and as far as I know is the only parser
to handle arbitrary nesting and mixing of boolean and proximity queries.
(perhaps surround does as well...I would be really interested to know, but I
assume that it handles only the base cases, i.e. not "(car & basket) within 2
of (horse & carriage within 3 of car)". Of course who really cares about such
queries, but hey ;)

You'll get better advice from others more experienced, but my bet is that
Paul's surround parser is top notch and correctly does what you want.

- Mark

Re: QueryParser Is Badly Broken

Daniel Noll-3
In reply to this post by Renaud Waldura-3
Renaud Waldura wrote:
> While we are also developing a query-building UI, users must be able to
> enter text queries as well. What do other folks do? I mean, this is
> pretty bad. I can hardly go back to my scientists and tell them Lucene
> is unable to handle 2 boolean operators, that they should parenthesize
> everything by hand. I mean, that's just cheesy.

What I'm doing for our application is to advise users to put in the
parentheses.  It's cheesy, but safe.
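
If users are asked to parenthesize, the application can at least check for it before a query reaches the parser. The following is a minimal Python sketch, not part of Lucene: the function name `mixes_operators_without_parens` is hypothetical, and it assumes queries contain no quoted phrases or escaped parentheses.

```python
import re

def mixes_operators_without_parens(query):
    """Return True if AND and OR both appear at the same parenthesis depth."""
    seen = [set()]                       # one set of operators per open group
    for token in re.findall(r"\(|\)|[^\s()]+", query):
        if token == "(":
            seen.append(set())           # entering a new group
        elif token == ")":
            if len(seen) > 1:
                seen.pop()               # leaving the current group
        elif token in ("AND", "OR"):
            seen[-1].add(token)
            if len(seen[-1]) == 2:       # both operators in one group
                return True
    return False

print(mixes_operators_without_parens("banana AND apple OR orange"))    # True
print(mixes_operators_without_parens("banana AND (apple OR orange)"))  # False
print(mixes_operators_without_parens("(banana AND apple) OR orange"))  # False
```

A UI could use a check like this to reject or flag the ambiguous form and ask the user to add parentheses.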

What I would do if the application were relatively new is to advise
against using AND/OR/NOT, and using + and - instead since it seems that
syntax is relatively reliable.  Actually I would probably consider
removing the AND/OR/NOT syntax from the query parser itself.

Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902


Re: QueryParser Is Badly Broken

Erik Hatcher
In reply to this post by Renaud Waldura-3

On Oct 12, 2006, at 7:11 PM, Renaud Waldura wrote:

> I'm developing an application used by scientists -- people who have
> a pretty good idea of what logic is -- and they were shocked to
> find out that no two of these queries return the same results:
>
> 1- banana AND apple OR orange
> 2- banana AND (apple OR orange)
> 3- (banana AND apple) OR orange
>
> I'd expect (1) to be either (2) or (3), but it turns out it's  
> parsed as "+banana apple orange". I was rather, uh, dismayed by  
> this find, as it doesn't seem to make sense.

It's not news to the die-hard Luceners that QueryParser is mangled.
It's a kitchen-sink syntax with more bells and whistles than most
applications need.  I've yet to come across a project that has used
QueryParser as-is, not because it's "broken", but because every
application has been unique in how queries are expressed by users.

>    a- queries which mix boolean operators require strict  
> parenthesizing to work right
>
>    b- "+" isn't shorthand for "AND"; using it with "AND"/"OR"/"NOT"  
> and the default operator "" rarely does what you expect

AND/OR are oddly named in terms of how they map to the underlying  
BooleanQuery they create.  AND really means to make both clauses  
MUST, and OR means to make them SHOULD.  And, as you've painfully  
experienced, the precedence is not "logical".
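
The flags-versus-precedence point can be made concrete with a toy model. This Python sketch is an assumption for illustration and is not Lucene's code: operators only adjust the MUST/SHOULD flag of neighbouring terms in one flat clause list, and no grouping is ever built. The names `flag_clauses`, `render` and the `default_and` switch are hypothetical.

```python
def flag_clauses(query, default_and=False):
    """Flag each term of an unparenthesized query as MUST ("+") or SHOULD ("")."""
    clauses = []      # flat list of [flag, term]; there is no tree, hence no precedence
    conj = None       # operator seen just before the current term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
            continue
        if clauses and conj == "AND":
            clauses[-1][0] = "+"          # AND turns the previous clause into MUST
        if clauses and conj == "OR" and default_and:
            clauses[-1][0] = ""           # OR turns the previous clause into SHOULD
        if default_and:
            required = conj != "OR"       # terms are required unless introduced by OR
        else:
            required = conj == "AND"      # terms are optional unless introduced by AND
        clauses.append(["+" if required else "", token])
        conj = None
    return [(flag, term) for flag, term in clauses]

def render(clauses):
    return " ".join(flag + term for flag, term in clauses)

print(render(flag_clauses("banana AND apple OR orange", default_and=True)))
# prints: +banana apple orange
print(render(flag_clauses("banana AND apple OR orange", default_and=False)))
# prints: +banana +apple orange
```

With a default operator of AND, the model yields the same string reported at the start of this thread. Neither output corresponds to grouping (2) or (3), because a flat list of flagged clauses cannot express either grouping.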

>    c- the stock QueryParser doesn't work well in these cases
>
>    d- there's a new PrecedenceQueryParser at
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous
> that solves *some* of the issues but creates others

What issues does PQP create?  Perhaps we can get those fixed and  
replace QueryParser with it.

> While we are also developing a query-building UI, users must be  
> able to enter text queries as well. What do other folks do? I mean,  
> this is pretty bad. I can hardly go back to my scientists and tell  
> them Lucene is unable to handle 2 boolean operators, that they  
> should parenthesize everything by hand. I mean, that's just cheesy.

It really boils down to user interface, from my perspective.  Do the  
users need to type in all of that kind of logic?  Or could they be  
presented with a simpler syntax with just +/- in front of terms to  
indicate MUST/NOT (and SHOULD with no prefix)?  Perhaps they could be  
presented with two text boxes, one for required terms, and another  
for optional terms (and maybe another for prohibited terms)?
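
As a rough illustration of the text-box idea, here is a minimal Python sketch that combines three input boxes into one +/- prefixed query string. The function name `build_query` and its parameters are hypothetical, and the sketch assumes plain whitespace-separated terms with no escaping of special query characters.

```python
def build_query(required="", optional="", prohibited=""):
    """Combine three free-text boxes into a single +/- prefixed query string."""
    parts = []
    parts += ["+" + term for term in required.split()]    # MUST terms
    parts += optional.split()                             # SHOULD terms, no prefix
    parts += ["-" + term for term in prohibited.split()]  # MUST NOT terms
    return " ".join(parts)

print(build_query(required="banana", optional="apple orange", prohibited="kiwi"))
# prints: +banana apple orange -kiwi
```

Because the user never types AND/OR, the precedence question does not arise in this kind of interface.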

We are all certainly very open to improving QueryParser, or PQP.

        Erik




Re: QueryParser Is Badly Broken

Paul Elschot
In reply to this post by Mark Miller-3
On Friday 13 October 2006 01:55, Mark Miller wrote:
> There is also the Surround Query Parser in contrib by the way...I would bet
> that Paul will tell you that it does not have these issues. I can't wait to

Indeed.

> see the replies on this one...I didn't realize that the QueryParser had
> these problems and am a bit skeptical...unfortunately I am away from home
> and cannot check it out.

Surround is not perfect, either. One of the disadvantages of surround
is that it does not map to PhraseQuery.

> On another note...http://famestalker.com/devwik/ will be done soon...I only

The url gives a not found 404 error here.

> have not gotten around to finishing the final touches because there did not
> appear to be a lot of initial interest (and what there was has waned
> drastically) and I am not ready to use it myself yet. It does correctly
> handle order of operations, however, and as far as I know is the only parser
> to handle arbitrary nesting and mixing of boolean and proximity queries.
> (perhaps surround does as well...I would be really interested to know, but I
> assume that it handles only the base cases, i.e. not "(car & basket) within 2
> of (horse & carriage within 3 of car)". Of course who really cares about such
> queries, but hey ;)

Surround maps proximity queries to SpanNearQuery, and that only allows
OR'ing in its operands. Surround does not map to SpanNotQuery,
but it will parse nested proximity queries and generate a nested
SpanNearQuery.
 

> > While we are also developing a query-building UI, users must be able to
> > enter text queries as well. What do other folks do? I mean, this is pretty
> > bad. I can hardly go back to my scientists and tell them Lucene is unable
> > to
> > handle 2 boolean operators, that they should parenthesize everything by
> > hand. I mean, that's just cheesy.

For a query building UI it might be better to output queries in XML form
to a Lucene server, see contrib/xml-query-parser .

Regards,
Paul Elschot


Re: QueryParser Is Badly Broken

Mark Miller-3
> > On another note...http://famestalker.com/devwik/ will be done soon...I only
>
> The url gives a not found 404 error here.


Due to a typo on my part:

 http://famestalker.com/devwiki/




Re: QueryParser Is Badly Broken

Renaud Waldura-3
In reply to this post by Renaud Waldura-3
I realize my statement of dread may be news to some; here are my references.

QueryParser not handling queries containing AND and OR
http://issues.apache.org/jira/browse/LUCENE-167

Query Parser flags clauses with explicit OR as required when followed by
explicit AND
http://issues.apache.org/jira/browse/LUCENE-218

TERM1 OR NOT TERM2 does not perform as expected (single negated queries
don't work)
http://issues.apache.org/jira/browse/LUCENE-666

Don't mix operators "+", "-" with "AND", "NOT", etc.
http://issues.apache.org/jira/browse/LUCENE-72

Very interesting thread at:
http://marc.theaimsgroup.com/?l=lucene-user&m=107096388328864&w=2


"an expression without parenthesis, when interpreted, assumes terms on
either side of an AND clause are compulsory terms, and any terms on either
side of an OR clause are optional. However, if you combine AND and OR in an
expression, the optional terms have no effect because the others are
compulsory."
http://marc.theaimsgroup.com/?l=lucene-user&m=107107383315532&w=2

All open query parser issues, 19 total.
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&pid=12310110&sorter/order=DESC&sorter/field=priority&resolution=-1&component=12310234
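
The quoted remark about optional terms having no effect can be checked with a small matching model. This Python sketch is a simplification that ignores scoring; the `matches` helper and the four sample documents are made up for illustration. A document matches a flat clause list if it contains every required term and no prohibited term; optional terms decide matching only when there are no required terms.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Boolean matching for a flat clause list, ignoring scoring."""
    if any(term in doc_terms for term in must_not):
        return False
    if must:
        return all(term in doc_terms for term in must)   # optional terms cannot change this
    return any(term in doc_terms for term in should)

# Hypothetical sample documents, each reduced to its set of terms.
docs = {
    "d1": {"banana"},
    "d2": {"banana", "apple"},
    "d3": {"orange"},
    "d4": {"banana", "orange"},
}

# "+banana apple orange": the flat form the parser produced for query (1)
flat = {name for name, terms in docs.items()
        if matches(terms, must=["banana"], should=["apple", "orange"])}
# (2) banana AND (apple OR orange)
q2 = {name for name, terms in docs.items()
      if "banana" in terms and ("apple" in terms or "orange" in terms)}
# (3) (banana AND apple) OR orange
q3 = {name for name, terms in docs.items()
      if ("banana" in terms and "apple" in terms) or "orange" in terms}

print(sorted(flat))  # ['d1', 'd2', 'd4']
print(sorted(q2))    # ['d2', 'd4']
print(sorted(q3))    # ['d2', 'd3', 'd4']
```

In this model the flat form matches exactly the documents containing "banana", so the three queries select three different sets. The optional terms would only influence ranking.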
--Renaud




Re: QueryParser Is Badly Broken

Paul Elschot
In reply to this post by Mark Miller-3
Mark,

you wrote:
> > On another note...http://famestalker.com
> >
...
>
>  http://famestalker.com/devwiki/

Could you explain how "Paragraph/Sentence Proximity Searching"
is implemented in Qsol?

Regards,
Paul Elschot


Re: QueryParser Is Badly Broken

Mark Miller-3
In a way that certainly needs more testing (I haven't had the time), but here
is the gist:

I modified SpanNotQuery to allow a certain number of span crossings,
making it something of a WithinSpanQuery. So instead of only being able to
say "find 'something' and 'something else', and don't let the match span a
paragraph marker", you can say "find this, and let it span up to 3 paragraph
markers". I then made a special standard analyzer that uses a standard
sentence-recognizer regex to inject sentence marker tokens. Paragraphs seem
harder to detect, so right now the analyzer just looks for the paragraph
symbol...perhaps a double newline would be better, though. I still have not
worked out the best paragraph/sentence marker tokens to put in the index, or
the best way to mark paragraphs in the input text. I would also like a
paragraph marker to work as a sentence marker too, so that the two do not
need to be doubled up.
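
The marker-token idea can be sketched outside Lucene. The Python below is an illustration under stated assumptions, not Qsol's implementation: the marker token `<s>`, the naive sentence-splitting regex, and the function names are all hypothetical. It injects a marker after each sentence and then tests whether two terms occur with at most a given number of markers between them.

```python
import re

SENT = "<s>"   # hypothetical sentence marker token injected into the token stream

def tokenize_with_markers(text):
    """Lowercase word tokens, with a marker token appended after each sentence."""
    tokens = []
    # Naive sentence split: '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"\w+", sentence.lower())
        if words:
            tokens.extend(words)
            tokens.append(SENT)
    return tokens

def within_sentences(tokens, a, b, max_crossings):
    """True if some occurrence of a and of b has at most max_crossings markers between them."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == b]
    for i in positions_a:
        for j in positions_b:
            lo, hi = sorted((i, j))
            if tokens[lo:hi].count(SENT) <= max_crossings:
                return True
    return False

tokens = tokenize_with_markers("The horse pulled a carriage. It stopped. A car passed.")
print(within_sentences(tokens, "horse", "carriage", 0))  # True: same sentence
print(within_sentences(tokens, "horse", "car", 1))       # False: two markers in between
print(within_sentences(tokens, "horse", "car", 2))       # True
```

With `max_crossings=0` this behaves like the "do not span a marker" case; raising the limit gives the "within N sentences" behaviour described above. Paragraph markers could be handled the same way with a second marker token.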


- Mark
