Range queries get misinterpreted when parsed twice via the "Standard" parsers

Michael Peterson
Hello,

At Rocana we have a search system that builds a Lucene query on a front-end
(web) system and sends the query string to a backend system. The query typed
in by the user on the front end is first parsed (for rewriting and adding
hidden clauses), turned back into a Lucene query string, and sent over the
network to the backend, where it is parsed again into a Query object for
searching with the IndexSearcher.

We are using Lucene 5.5.0.

We've hit a problem with range queries in this model: a range query of the
form

ts:[1000 TO 2000]

when run through the StandardSyntaxParser and back out as a string gets
changed to

[ts:1000 ts:2000]

That would be fine, except that when this alternative range syntax is fed
back into either the StandardSyntaxParser or the StandardQueryParser, the
parser misinterprets it and attaches the default field to it.

Here's code to illustrate:

  String query = "ts:[1000 TO 2000] AND foo";
  String defaultField = "text";

  // First pass: parse to a QueryNode tree, then serialize it back to a string.
  StandardSyntaxParser p = new StandardSyntaxParser();
  QueryNode queryTree = p.parse(query, defaultField);
  String queryStringFromTree =
      queryTree.toQueryString(new EscapeQuerySyntaxImpl()).toString();

  // Second pass: build Query objects from the original and round-tripped strings.
  StandardQueryParser qp = new StandardQueryParser(IndexUtil.getAnalyzer());
  org.apache.lucene.search.Query queryFromOrig = qp.parse(query, defaultField);
  org.apache.lucene.search.Query queryFromTree =
      qp.parse(queryStringFromTree, defaultField);

  System.out.println("queryStringFromTree    : " + queryStringFromTree);
  System.out.println("Orig query parsed      : " + queryFromOrig);
  System.out.println("From Tree query parsed : " + queryFromTree);

which prints:

  queryStringFromTree    : [ts:1000 ts:2000] AND text:foo
  Orig query parsed      : +ts:[1000 TO 2000] +text:foo
  From Tree query parsed : +text:[ts:1000 TO ts:2000] +text:foo

What do you recommend to handle this issue?


Thank you,
Michael Peterson

http://www.rocana.com

Re: Range queries get misinterpreted when parsed twice via the "Standard" parsers

Erick Erickson
There has never been a guarantee that going back and forth between a
parsed query and its string representation is idempotent, so this
isn't supported.

Best,
Erick

In reply to Michael Peterson's message of Thu, Mar 9, 2017 at 5:58 AM.


Re: Range queries get misinterpreted when parsed twice via the "Standard" parsers

Trejkaz
On Fri, 10 Mar 2017 at 01:19, Erick Erickson wrote:

> There has never been a guarantee that going back and forth between a
> parsed query and its string representation is idempotent, so this
> isn't supported.


Maybe delete the toQueryString method...

There is a fundamental design problem with it anyway, in that it produces a
syntax which isn't necessarily the one you parsed in the first place. We
ended up making a whole family of QuerySyntaxFormatter for all node classes
and had it produce exactly the syntax we consider the cleanest. (Still not
what the user typed in, but aiming to be better when the two differ.)

Although in this case, it does seem like it could have moved the field
outside the brackets to avoid this problem...
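
The idea of moving the field outside the brackets can also be approximated at
the string level, after toQueryString has run and before the second parse. The
following is a minimal sketch, not a Lucene API: RangeSyntaxFixer and
toClassicRange are hypothetical names, and the regular expression assumes
simple unquoted bounds without whitespace or escaped characters, and that
exclusive bounds are written with curly braces.

```java
import java.util.regex.Pattern;

/** Hypothetical string-level stopgap; not part of Lucene. */
final class RangeSyntaxFixer {

    // Matches "[field:lower field:upper]" (or the "{...}" variant) where both
    // bounds carry the same field name; the \2 backreference enforces that.
    private static final Pattern BOUNDS_WITH_FIELD = Pattern.compile(
        "([\\[{])\\s*([A-Za-z_][\\w.]*):([^\\s\\]}]+)\\s+\\2:([^\\s\\]}]+)\\s*([\\]}])");

    /** Rewrites every matching range to the classic "field:[lower TO upper]" form. */
    static String toClassicRange(String queryString) {
        // $2 = field, $1 = opening bracket, $3 = lower, $4 = upper, $5 = closing bracket
        return BOUNDS_WITH_FIELD.matcher(queryString).replaceAll("$2:$1$3 TO $4$5");
    }

    public static void main(String[] args) {
        // prints: ts:[1000 TO 2000] AND text:foo
        System.out.println(toClassicRange("[ts:1000 ts:2000] AND text:foo"));
    }
}
```

Applied to the round-tripped string from the original post, this yields
ts:[1000 TO 2000] AND text:foo. A regular expression cannot see quoting or
escaping, so a fix inside the node's own serialization is likely more robust.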

TX

Re: Range queries get misinterpreted when parsed twice via the "Standard" parsers

Michael Peterson
Everyone - thanks for the feedback.

Trejkaz,

I agree. The [ts:X ts:Y] range syntax seems odd at best and broken at
worst. If the field name for the range has to be the same for both the
lower and upper bound, why put it there twice inside the brackets? In
addition, a user cannot type that syntax and have it work, so why use it at
all, even "internally"?

In any case, the solution we settled on was to override the toQueryString
method for the various RangeQueryNode implementations in a
QueryNodeProcessorImpl class that will spit out the "classic" range syntax,
ts:[X TO Y].
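
The shape of such an override can be sketched as follows. The classes here
(FieldBound, RangeNode, ClassicRangeNode) are simplified stand-ins invented for
this illustration, not Lucene's RangeQueryNode hierarchy. In Lucene itself the
method takes an EscapeQuerySyntax argument, as in the code in the first post,
and the bound text would need escaping, which this sketch omits.

```java
/** Stand-in for a field/value bound; hypothetical, not a Lucene class. */
class FieldBound {
    final String field;
    final String text;

    FieldBound(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Renders the bound together with its field, e.g. "ts:1000".
    String toQueryString() {
        return field + ":" + text;
    }
}

/** Stand-in range node that reproduces the "[ts:1000 ts:2000]" output. */
class RangeNode {
    final FieldBound lower;
    final FieldBound upper;
    final boolean lowerInclusive;
    final boolean upperInclusive;

    RangeNode(FieldBound lower, FieldBound upper,
              boolean lowerInclusive, boolean upperInclusive) {
        this.lower = lower;
        this.upper = upper;
        this.lowerInclusive = lowerInclusive;
        this.upperInclusive = upperInclusive;
    }

    // Each bound prints its own field, so the field ends up inside the brackets.
    String toQueryString() {
        return (lowerInclusive ? "[" : "{")
            + lower.toQueryString() + " " + upper.toQueryString()
            + (upperInclusive ? "]" : "}");
    }
}

/** Override that emits the classic, re-parseable form "ts:[1000 TO 2000]". */
class ClassicRangeNode extends RangeNode {
    ClassicRangeNode(FieldBound lower, FieldBound upper,
                     boolean lowerInclusive, boolean upperInclusive) {
        super(lower, upper, lowerInclusive, upperInclusive);
    }

    @Override
    String toQueryString() {
        // Field once, outside the brackets; bounds contribute only their text.
        return lower.field + ":"
            + (lowerInclusive ? "[" : "{")
            + lower.text + " TO " + upper.text
            + (upperInclusive ? "]" : "}");
    }
}

class RangeDemo {
    public static void main(String[] args) {
        FieldBound lo = new FieldBound("ts", "1000");
        FieldBound hi = new FieldBound("ts", "2000");
        System.out.println(new RangeNode(lo, hi, true, true).toQueryString());        // [ts:1000 ts:2000]
        System.out.println(new ClassicRangeNode(lo, hi, true, true).toQueryString()); // ts:[1000 TO 2000]
    }
}
```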

That works great, but it seems odd to make users of Lucene implement it
themselves. I would think it could/should be built in, either to the
existing RangeQueryNodes or as a standard QueryNodeProcessorImpl in
Lucene that does this translation for users who need to modify the query
tree.

Feedback welcome if I'm missing something.

-Michael

In reply to Trejkaz's message of Thu, Mar 9, 2017 at 7:06 PM.

Re: Range queries get misinterpreted when parsed twice via the "Standard" parsers

Michael McCandless-2
Why don't we fix this in Lucene? It sounds like your fix (overriding
toQueryString for the range query nodes) is contained? Could you open an
issue and add a patch?

I agree it's silly to produce [ts:X ts:Y] syntax.

Mike McCandless

http://blog.mikemccandless.com

In reply to Michael Peterson's message of Thu, Mar 9, 2017 at 8:59 PM.