When does QueryParser create PhraseQueries?

duiduder
Hi all,

When I search with Luke (version 0.7.1, Lucene version 2.2.0) in an arbitrary
field, the QueryParser creates a PhraseQuery when I type in
               termA/termB      (no "...")
In the documentation on the Lucene website, I only find the syntax
              "termA termB"
for creating phrase queries.

Did I make a mistake? Can I configure the QueryParser so that it simply tokenizes between
termA and termB and builds a simple BooleanQuery, as it does in the case of a whitespace
delimiter?


regards

Christian

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: When does QueryParser create PhraseQueries?

Daniel Noll-3-2
On Tuesday 26 February 2008 01:05:27 [hidden email] wrote:

> Hi all,
>
> When I search with Luke (version 0.7.1, Lucene version 2.2.0) in an
> arbitrary field, the QueryParser creates a PhraseQuery when I type in
>                termA/termB      (no "...")
> In the documentation on the Lucene website, I only find the syntax
>               "termA termB"
> for creating phrase queries.
>
> Did I make a mistake? Can I configure the QueryParser so that it simply
> tokenizes between termA and termB and builds a simple BooleanQuery, as it
> does in the case of a whitespace delimiter?

You'll find they both go through getFieldQuery() as-is.  The default
implementation of that runs the string through the analyser; if it happens to
return more than one token then it will create a PhraseQuery instead of a
TermQuery.

If you subclass QueryParser then you can override this method and modify it to
do whatever evil trick you want to do.
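
The rule described above can be sketched without Lucene on the classpath. This is only a simplified model, not Lucene's actual code: the class FieldQueryModel and its analyze() method are hypothetical stand-ins (the analyser here just splits on anything that is not a letter or digit), and the returned strings merely name the kind of query the real getFieldQuery() would build.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the rule described above; this is not Lucene code.
class FieldQueryModel {

    // Hypothetical stand-in for an analyser: splits the text on every
    // run of characters that are neither letters nor digits.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Returns a description of the query type instead of a real Query object.
    static String getFieldQuery(String field, String queryText) {
        List<String> tokens = analyze(queryText);
        if (tokens.isEmpty()) {
            return "no query";
        }
        if (tokens.size() == 1) {
            return "TermQuery(" + field + ":" + tokens.get(0) + ")";
        }
        // More than one token from the analyser: a phrase, quotes or not.
        return "PhraseQuery(" + field + ":\"" + String.join(" ", tokens) + "\")";
    }

    public static void main(String[] args) {
        System.out.println(getFieldQuery("contents", "termA"));
        System.out.println(getFieldQuery("contents", "termA/termB"));
    }
}
```

In this model the unquoted input termA/termB ends up as a phrase simply because the stand-in analyser returns two tokens for it.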

Daniel

Re: When does QueryParser create PhraseQueries?

duiduder
Daniel, thank you very much for the hint!

I stepped through the code and tried some scenarios.

When I type in, with a whitespace delimiter,
       termA termB
this results in two invocations of getFieldQuery, one for each term.

When I type
       termA/termB
or
      "termA termB"
this results in one invocation of getFieldQuery for the whole string.

Why does the parser treat termA/termB differently from termA termB?
The Analyzer tokenizes at both delimiters.


Christian


Re: When does QueryParser create PhraseQueries?

duiduder
In reply to this post by Daniel Noll-3-2
So, I stepped through the QueryParser code further, and I have now
found the source of this behaviour: the QueryParserTokenManager


    import java.io.StringReader;
    import org.apache.lucene.queryParser.FastCharStream;
    import org.apache.lucene.queryParser.QueryParserTokenManager;
    import org.apache.lucene.queryParser.Token;

    public class TokenManagerDemo {

        // Prints every token produced by the grammar-level tokenizer,
        // stopping at the end-of-input token (whose image is empty).
        static void printTokens(String strQuery) {
            QueryParserTokenManager tokenManager =
                new QueryParserTokenManager(new FastCharStream(new StringReader(strQuery)));

            for (Token next = tokenManager.getNextToken(); !next.toString().equals(""); next = tokenManager.getNextToken())
                System.out.println("'" + next + "'");
        }

        public static void main(String[] args) {
            System.out.println("This one returns the whole String:");
            printTokens("home/reuschling");

            System.out.println("This returns the two tokenized Strings 'home' and 'reuschling':");
            printTokens("home reuschling");
        }
    }

It looks like this is really hard-coded behaviour, and not Analyzer-specific.

I want to search for directories and have them tokenized, e.g. /home/reuschling - this does not seem
to be possible with the current QueryParser.

| If you subclass QueryParser then you can override this method and modify it to
| do whatever evil trick you want to do.

Overriding getFieldQuery() will not work because I can't distinguish between "home/reuschling", which should
trigger a PhraseQuery, and home/reuschling without quotes, which should trigger a BooleanQuery... I will look
for a better place for this :)

regards

Christian Reuschling


Re: When does QueryParser create PhraseQueries?

Daniel Noll-3-2
On Wednesday 27 February 2008 00:50:04 [hidden email] wrote:
> It looks like this is really hard-coded behaviour, and not Analyzer-specific.

The whitespace part is coded into QueryParser.jj, yes.  So are the quotes
and : and other query-specific things.

> I want to search for directories and have them tokenized, e.g.
> /home/reuschling - this does not seem to be possible with the current
> QueryParser.

That's possible by changing the analyser.  For instance StandardAnalyzer will
tokenise that as two terms, but WhitespaceAnalyzer will tokenise it as one.
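
As a rough illustration of why the analyser choice matters for a path, the sketch below imitates the two splitting strategies in plain Java. Both methods are hypothetical stand-ins, not the Lucene classes: whitespaceStyle() only splits on whitespace, and letterDigitStyle() splits on anything that is not a letter or digit, which is a much cruder rule than the real StandardAnalyzer applies. Only the behaviour on a path-like string is modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of two tokenizing strategies; not Lucene code.
class AnalyserSplitModel {

    // Splits on the given delimiter pattern and drops empty pieces.
    static List<String> splitNonEmpty(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split(delimiterRegex)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumed to behave like a whitespace-only tokenizer.
    static List<String> whitespaceStyle(String text) {
        return splitNonEmpty(text, "\\s+");
    }

    // Assumed to behave like a tokenizer that treats '/' as a separator.
    static List<String> letterDigitStyle(String text) {
        return splitNonEmpty(text, "[^\\p{L}\\p{N}]+");
    }

    public static void main(String[] args) {
        System.out.println(whitespaceStyle("/home/reuschling"));
        System.out.println(letterDigitStyle("/home/reuschling"));
    }
}
```

The first call keeps the path as a single token, the second yields two, which is the difference that decides between a TermQuery and a PhraseQuery further down the line.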

> | If you subclass QueryParser then you can override this method and modify
> | it to do whatever evil trick you want to do.
>
> Overriding getFieldQuery() will not work because I can't distinguish between
> "home/reuschling", which should trigger a PhraseQuery, and home/reuschling
> without quotes, which should trigger a BooleanQuery... I will look for
> a better place for this :)

That much is true.  Likewise, there is no difference between quoting "cat" and
typing cat without quotes.

You could possibly override the parse(String) method and mangle the string in
some way so that you know.  So if the user enters /a/b it could pass
down /a/b, but if they enter "/a/b" it could pass down "SOMETHING/a/b", and
you then detect the SOMETHING in getFieldQuery.  Just have to make sure the
something isn't tokenised out by the analyser.
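
The string-mangling idea could look roughly like the sketch below. It is plain Java with no Lucene dependency, and everything in it is an assumption made for illustration: the class QuoteMarker, the marker text QPHRASEMARK, and the choice to detect the marker on the raw text handed to getFieldQuery before it reaches the analyser. Escaped quotes inside a phrase are not handled.

```java
import java.util.regex.Pattern;

// Sketch of marking quoted phrases before parsing; not Lucene code.
class QuoteMarker {

    // Hypothetical marker; pick something users would never type.
    static final String MARKER = "QPHRASEMARK";

    // Matches a double-quoted section without embedded quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // What an overridden parse(String) might do before delegating:
    // "/a/b" in quotes becomes "QPHRASEMARK/a/b" in quotes.
    static String mark(String rawQuery) {
        return QUOTED.matcher(rawQuery).replaceAll("\"" + MARKER + "$1\"");
    }

    // What an overridden getFieldQuery might check on its queryText.
    static boolean wasQuoted(String queryText) {
        return queryText.startsWith(MARKER);
    }

    // Removes the marker again before the text is analysed.
    static String strip(String queryText) {
        return wasQuoted(queryText) ? queryText.substring(MARKER.length()) : queryText;
    }

    public static void main(String[] args) {
        System.out.println(mark("/a/b"));
        System.out.println(mark("\"/a/b\""));
        System.out.println(strip(MARKER + "/a/b"));
    }
}
```

Checking the marker before analysis sidesteps the concern about the analyser dropping it; if the check is done on the analysed tokens instead, the marker has to be something the analyser keeps.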

Or you could clone QueryParser.jj itself and modify it to call different
methods for the two situations.

Daniel

Re: When does QueryParser create PhraseQueries?

duiduder
Thanks a lot for your help Daniel, I have found a solution :)

The 'token' field is public inside QueryParser, and from
'token.image' you can read the original string, including the quotes.

Thus, I can distinguish between the two situations, and simply
return a BooleanQuery when there are no quotes.
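
The decision described here can be modelled in plain Java as follows. This is a sketch under stated assumptions rather than the code actually used: the class QuoteAwareModel is hypothetical, its rawImage parameter plays the role of token.image (the term exactly as typed, quotes included), analyze() is a crude stand-in for the analyser, and the returned strings only name the query type.

```java
import java.util.ArrayList;
import java.util.List;

// Model of a quote-aware getFieldQuery decision; not Lucene code.
class QuoteAwareModel {

    // Hypothetical analyser stand-in: splits on non-letter, non-digit runs.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // rawImage stands in for token.image, i.e. the text as the user typed it.
    static String getFieldQuery(String field, String rawImage) {
        boolean quoted = rawImage.length() >= 2
                && rawImage.startsWith("\"") && rawImage.endsWith("\"");
        String text = quoted ? rawImage.substring(1, rawImage.length() - 1) : rawImage;

        List<String> tokens = analyze(text);
        if (tokens.isEmpty()) {
            return "no query";
        }
        if (tokens.size() == 1) {
            return "TermQuery(" + field + ":" + tokens.get(0) + ")";
        }
        if (quoted) {
            // Quotes were typed: keep the phrase semantics.
            return "PhraseQuery(" + field + ":\"" + String.join(" ", tokens) + "\")";
        }
        // No quotes: one clause per token instead of a phrase.
        List<String> clauses = new ArrayList<>();
        for (String token : tokens) {
            clauses.add(field + ":" + token);
        }
        return "BooleanQuery(" + String.join(" ", clauses) + ")";
    }

    public static void main(String[] args) {
        System.out.println(getFieldQuery("path", "home/reuschling"));
        System.out.println(getFieldQuery("path", "\"home/reuschling\""));
    }
}
```

How the individual clauses are combined (required or optional) is left open in this model.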

best regards


Christian

