[jira] Created: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count


[jira] Created: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
queryparser shouldn't generate phrasequeries based on term count
----------------------------------------------------------------

                 Key: LUCENE-2458
                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
             Project: Lucene - Java
          Issue Type: Bug
          Components: QueryParser
            Reporter: Robert Muir
            Priority: Critical


The current method in the queryparser to generate phrasequeries is wrong:

The Query Syntax documentation (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
{noformat}
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
{noformat}

But as we know, this isn't actually true.

Instead the terms are first divided on whitespace, then the analyzer term count is used as some sort of "heuristic" to determine if it's a phrase query or not.
This assumption is a disaster for languages that don't use whitespace separation: CJK, compounding European languages like German, Finnish, etc. It also makes it difficult for people to use n-gram analysis techniques. In these cases you get bad relevance (MAP improves nearly *10x* if you use a PositionFilter at query-time to "turn this off" for Chinese).

Even for English, this undocumented behavior is bad. Perhaps in some cases it's being abused as some heuristic to "second guess" the tokenizer and piece back things it shouldn't have split, but for large collections, doing things like generating phrasequeries because StandardTokenizer split a compound on a dash can cause serious performance problems. Instead people should analyze their text with the appropriate methods, and QueryParser should only generate phrase queries when the syntax asks for one.

The PositionFilter in contrib can be seen as a workaround, but it's pretty obscure and people are not familiar with it. The result is we have bad out-of-box behavior for many languages, and bad performance for others on some inputs.

I propose instead that we change the grammar to actually look for double quotes to determine when to generate a phrase query, consistent with the documentation.
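
To make the difference concrete, here is a minimal sketch in plain Java. It is not Lucene code: `toyAnalyze` is a hypothetical stand-in for an analyzer that splits a chunk on whitespace and dashes, and the class and method names are invented for illustration. It only contrasts the two decision rules described above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two rules discussed in this issue. NOT Lucene code:
// the class, the methods, and the splitting "analyzer" are assumptions.
final class PhraseHeuristicSketch {

    // Hypothetical analyzer: splits one chunk on whitespace and dashes.
    static List<String> toyAnalyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String t : chunk.split("[\\s-]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Behavior as described in the report: more than one token from a
    // single chunk yields a phrase query; the quotes play no part.
    static boolean currentRuleMakesPhrase(String chunk, boolean quoted) {
        return toyAnalyze(chunk).size() > 1;
    }

    // Proposed behavior: only the double-quote syntax yields a phrase query.
    static boolean proposedRuleMakesPhrase(String chunk, boolean quoted) {
        return quoted;
    }

    public static void main(String[] args) {
        System.out.println(currentRuleMakesPhrase("wi-fi", false));       // true
        System.out.println(proposedRuleMakesPhrase("wi-fi", false));      // false
        System.out.println(proposedRuleMakesPhrase("hello dolly", true)); // true
    }
}
```

Under the proposed rule the analyzer's token count no longer matters, so an unquoted chunk that the analyzer happens to split never becomes a phrase query by accident.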


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866341#action_12866341 ]

Hoss Man commented on LUCENE-2458:
----------------------------------

Robert: do you have a specific suggestion for what QueryParser should do if a single "chunk" of input causes the Analyzer to produce multiple tokens that are not at the same position (i.e., the current case where QueryParser produces a PhraseQuery even if there are no quotes)?

For example, if the query parser is asked to parse...
{code}fieldName:A-Field-Value{code}
...and the Analyzer produces three tokens...
 * A (at position 0)
 * Field (at position 1)
 * Value (at position 2)

...what should the resulting Query object be?


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866353#action_12866353 ]

Robert Muir commented on LUCENE-2458:
-------------------------------------

bq. ...what should the resulting Query object be?

A BooleanQuery formed with the default operator.
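
As a rough illustration of that answer, the sketch below renders the three tokens from the `A-Field-Value` example as a Boolean query under each default operator, next to the phrase query reported as the current result. It is plain Java, not QueryParser's code or output: the class name, the `Operator` enum, and the string format are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: renders tokens as a readable query string.
// The names and the output format are assumptions, not Lucene's API.
final class DefaultOperatorSketch {

    enum Operator { OR, AND }

    // With AND every clause is marked required ("+"); with OR none is.
    static String booleanQuery(String field, List<String> tokens, Operator defaultOp) {
        String prefix = (defaultOp == Operator.AND) ? "+" : "";
        return tokens.stream()
                .map(t -> prefix + field + ":" + t)
                .collect(Collectors.joining(" "));
    }

    // The phrase form: all tokens inside one pair of double quotes.
    static String phraseQuery(String field, List<String> tokens) {
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A", "Field", "Value");
        // fieldName:A fieldName:Field fieldName:Value
        System.out.println(booleanQuery("fieldName", tokens, Operator.OR));
        // +fieldName:A +fieldName:Field +fieldName:Value
        System.out.println(booleanQuery("fieldName", tokens, Operator.AND));
        // fieldName:"A Field Value"
        System.out.println(phraseQuery("fieldName", tokens));
    }
}
```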



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866363#action_12866363 ]

Hoss Man commented on LUCENE-2458:
----------------------------------

bq. a Boolean Query formed with the default operator.

That seems like equally bad default behavior -- lots of existing TokenFilters produce chains of tokens for situations where the user creating the query string clearly intended to be searching for a single "word" and has no idea that, as an implementation detail, multiple tokens were produced under the covers (e.g. WordDelimiterFilter, Ngrams, etc...)

I haven't thought this through very well, but perhaps this is an area where (the new) Token Attributes could be used to instruct QueryParser as to the intent behind a stream of multiple tokens?  A new Attribute could be used on each token to convey when that token should be combined with the previous token, and in what way: as a phrase, as a conjunction or as a disjunction.  (This could still be orthogonal to the position, which would indicate slop/span type information like it does currently.)

Stock Analysis components that produce multiple tokens could be modified to add this attribute fairly easily (it should be a relatively static value for any component that currently "splits" tokens) and QueryParser could have an option controlling what to do if it encounters a token w/o this attribute (perhaps even two options: one for quoted input chunks and one for unquoted input chunks).

That way the default could still work in a back compatible way, but people using languages that don't use whitespace separation *and* are using older (or custom) analyzers that don't know about this attribute could set a simple query parser property to force this behavior.

Would that make sense? (asks the man who only vaguely understands Token Attributes at this point)
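
The suggestion above could be sketched roughly as follows. Everything in this block is hypothetical: Lucene has no such attribute, and `JoinHint`, `HintedToken`, and `resolve` are names invented here to show how a per-token hint and two parser-level fallbacks might interact.

```java
// Sketch of the per-token hint idea. Everything here is hypothetical:
// Lucene has no such attribute, and all names are invented for illustration.
final class JoinHintSketch {

    // How a token should be combined with the previous token.
    enum JoinHint { PHRASE, CONJUNCTION, DISJUNCTION }

    // A token plus an optional hint; null models an older or custom
    // analysis component that does not set the attribute.
    static final class HintedToken {
        final String text;
        final JoinHint hint;

        HintedToken(String text, JoinHint hint) {
            this.text = text;
            this.hint = hint;
        }
    }

    // Use the token's own hint when present; otherwise fall back to a
    // parser-level option, one for quoted and one for unquoted chunks.
    static JoinHint resolve(HintedToken token, boolean quoted,
                            JoinHint quotedDefault, JoinHint unquotedDefault) {
        if (token.hint != null) {
            return token.hint;
        }
        return quoted ? quotedDefault : unquotedDefault;
    }

    public static void main(String[] args) {
        // A splitting filter that knows about the hint asks for a phrase.
        HintedToken fi = new HintedToken("Fi", JoinHint.PHRASE);
        System.out.println(resolve(fi, false, JoinHint.PHRASE, JoinHint.DISJUNCTION));

        // An older analyzer sets no hint, so the parser option decides.
        HintedToken plain = new HintedToken("Value", null);
        System.out.println(resolve(plain, false, JoinHint.PHRASE, JoinHint.DISJUNCTION));
    }
}
```

Passing `PHRASE` as the unquoted fallback would reproduce today's behavior, while `DISJUNCTION` or `CONJUNCTION` would give users of non-whitespace languages the opt-out described above.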


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866368#action_12866368 ]

Robert Muir commented on LUCENE-2458:
-------------------------------------

bq. That seems like equally bad default behavior

Do you have measurements to support this? Because mine show it's 10x better to use this operator for Chinese :)

bq. I haven't thought this through very well, but perhaps this is an area where (the new) Token Attributes

I disagree. Instead the queryparser should only form phrasequeries when you use double quotes, just like the documentation says.


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866374#action_12866374 ]

Robert Muir commented on LUCENE-2458:
-------------------------------------

By the way, Hoss Man, you said it best yourself:

{quote}
lots of existing TokenFilters produce chains of tokens for situations where the user creating the query string clearly intended to be searching for a single "word" and has no idea that as an implementation detail multiple tokens were produced under the covers (e.g. WordDelimiterFilter, Ngrams, etc...)
{quote}

"User clearly intended" is wrong. WordDelimiterFilter will break Tibetan text in a similar manner (it uses no spaces between words), yet no user "clearly intended" to form phrase queries.

Users clearly intend to form phrase queries only when they use the phrase query operator. That's how the query parser is documented to work, and it's a bug that it doesn't work that way.


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866528#action_12866528 ]

Michael McCandless commented on LUCENE-2458:
--------------------------------------------

This is sneaky behavior on QueryParser's part!  I didn't realize it did this.

What are some real use-cases where this is "good"?  WordDelimiterFilter seems like a good example (eg, Wi-Fi -> Wi Fi).

It sounds like it's a very bad default for non-whitespace languages.

It seems like we should make it controllable, switch it under Version, and change the default going forward to not do this?

bq. Token Attributes could be used to instruct QueryParser as to the intent behind a stream of multiple tokens?

This seems like a good idea (since we seem to have real-world cases where it's very useful and others where it's very bad)?  Could/should it be per-analyzer?  (ie, WDF would always do this but, say, ICUAnalyzer would never).  Or, per-token created?


RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Itamar Syn-Hershko-2
In reply to this post by JIRA jira@apache.org
The QueryParser also fails to correctly parse Hebrew acronyms; although not
being an integral part of the current discussion, I thought this would be
the best place to bring that up.

Hebrew acronyms are written as letters with a single double-quote char
within, for example: MNK"L (Hebrew for CEO). That double-quote char usually
comes in the second-to-last position of the word, but in some cases it can
come earlier (MNK"LIT). Since the QP expects a pair of double-quotes
enclosing a phrase, an exception will be thrown if such a word is
passed to it, or an incorrect phrase query will be produced if two acronyms
are used together in a query string. Not sure which is worse.

Perhaps while you're at it you could make sure to only create a phrase query
if a quote is followed by a space - hence is definitely at the end of a
word, and not just assume it to be equivalent to a white space?

Although there's no good open Hebrew analyzer for Lucene yet hence no
motivation for this to be fixed, I'm working on one as we speak and
hopefully will have something to show in the next few weeks/days. It would
be nice to have at least this issue closed within the Lucene core code.

Thanks,

Itamar Syn-Hershko



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866557#action_12866557 ]

Robert Muir commented on LUCENE-2458:
-------------------------------------

{quote}
What are some real use-cases where this is "good"? WordDelimiterFilter seems like a good example (eg, Wi-Fi -> Wi Fi).
It sounds like it's a very bad default for non-whitespace languages.
{quote}

It's a horrible bug! And to boot, I don't think it helps English much as a default either.
Here's a comparison on an English test collection (Telegraph collection with StandardAnalyzer + Porter):

||measure||T||TD||TDN||
|% of queries affected|6%|14%|32%|
|positionfilter improvement|+1.704%|+0.213%|+0.805%|

So, turning it off certainly doesn't hurt (I won't try to argue that this small "improvement" by turning it off means anything).
For Chinese, it's a 10x improvement on TREC5/TREC6: obviously the bug is horrible there because it's generating phrase queries all the time.

{quote}
This seems like a good idea (since we seem to have real-world cases where it's very useful and others where it's very bad)? Could/should it be per-analyzer? (ie, WDF would always do this but, say, ICUAnalyzer would never). Or, per-token created?
{quote}

I am strongly opposed to this. My Tibetan example with WDF above is an easy counterexample.
I haven't seen any measured real-world example where this helps, and subjectively saying "I like this bug" isn't convincing me.

We don't need to push "what should be a phrase query" onto analysis; it can't know from Unicode properties, etc., what the user wanted.
We don't need to put hairy logic into things like StandardTokenizer to determine whether "the user wanted a phrase query" in certain contexts.

Instead we should just do what the documentation says, and only issue phrase queries when the user asks for one!!!!!!



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Robert Muir
In reply to this post by Itamar Syn-Hershko-2
On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko <[hidden email]> wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms; although not
> being an integral part of the current discussion, I thought this would be
> the best place to bring that up.
>

Just as I don't think Analysis should do QueryParsing, I don't think
QueryParsing should do Analysis either.
Similar problems to this exist in other languages (I have to escape :
for some, because lucene wants to interpret it as a field name).

But this can be easily remedied on the application side: it's
documented and understood that the double-quote is a special
character, and there is an escape mechanism so you can escape the ones
you think are acronyms.
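
A minimal sketch of that application-side remedy, in plain Java with no Lucene dependency: it backslash-escapes a double quote only when there is a letter on both sides of it, as in MNK"L. The class name, the method name, and the "letter on both sides" rule are assumptions made for illustration. The rule is a heuristic, and an application would need to decide for itself which quotes belong to acronyms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Application-side sketch: escape quotes that look like part of an acronym
// before handing the string to the query parser. The heuristic and all
// names here are assumptions, not part of Lucene.
final class AcronymQuoteEscaper {

    // A double quote with a letter immediately before and after it.
    private static final Pattern INNER_QUOTE =
            Pattern.compile("(?<=\\p{L})\"(?=\\p{L})");

    // Replaces each such quote with a backslash followed by the quote.
    static String escapeInnerQuotes(String userInput) {
        return INNER_QUOTE.matcher(userInput)
                .replaceAll(Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        System.out.println(escapeInnerQuotes("MNK\"L"));          // MNK\"L
        System.out.println(escapeInnerQuotes("\"hello dolly\"")); // "hello dolly"
    }
}
```

Quotes that open or close a phrase have whitespace or the end of the string on one side, so they are left alone and still mark a phrase.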

This issue is about a buggy implementation: it's not documented
and only internal to how the queryparser determines what is a phrase
query or not (and, contrary to what you would believe from the
documentation, the choice of whether or not to make a PhraseQuery is
not based on syntax one bit!)

--
Robert Muir
[hidden email]


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Mark Miller-3
On 5/12/10 9:25 AM, Robert Muir wrote:
>(and, contrary to what you would believe from the
> documentation, the choice of whether or not to make a PhraseQuery is
> not based on syntax one bit!)
>

That's a major exaggeration - quoting text plays a large role in whether
or not you will get a phrase query.


--
- Mark

http://www.lucidimagination.com


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Robert Muir
On Wed, May 12, 2010 at 11:16 AM, Mark Miller <[hidden email]> wrote:
>
> That's a major exaggeration - quoting text plays a large role in whether or
> not you will get a phrase query.
>

No, it has nothing to do with it in the implementation. Quoting only
"escapes the whitespace"; the quotes themselves are then discarded.
This is clear from looking at the grammar.

The logic that then determines whether you get a phrase query is the
huge mess of code in getFieldQuery, and it's not based on the double
quotes at all.

For example, a list of Chinese or Thai words gets a phrase query, only
because they don't use whitespace between words.
But a similar list of English words gets a boolean query.
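
To make the behavior described above concrete, here is a minimal, self-contained Java sketch. It is not Lucene's actual getFieldQuery code: the class TermCountHeuristic, the toy analyze() method (which treats any non-alphanumeric character as a separator and emits each Han character as its own term), and the string labels it returns are all hypothetical stand-ins. It only illustrates that the phrase/boolean decision follows from how many terms the analyzer produces per whitespace-delimited chunk, not from the query syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene code.
class TermCountHeuristic {

    // Toy analyzer (an assumption): letters and digits form terms, any other
    // character separates them, and each Han character is its own term.
    static List<String> analyze(String chunk) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chunk.length(); ) {
            int cp = chunk.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || !Character.isLetterOrDigit(cp)) {
                if (current.length() > 0) {
                    terms.add(current.toString());
                    current.setLength(0);
                }
                if (han) {
                    terms.add(new String(Character.toChars(cp)));
                }
            } else {
                current.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Mimics the heuristic under discussion: one whitespace-delimited chunk
    // becomes a phrase as soon as the analyzer returns more than one term.
    static String classifyChunk(String chunk) {
        List<String> terms = analyze(chunk);
        if (terms.isEmpty()) return "EMPTY";
        return terms.size() == 1 ? "TERM " + terms.get(0) : "PHRASE " + terms;
    }

    // First stage: divide the query string on whitespace. The per-chunk
    // results stand for the clauses of the resulting boolean query.
    static List<String> parse(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            if (!chunk.isEmpty()) clauses.add(classifyChunk(chunk));
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("hello dolly"));
        System.out.println(parse("wi-fi"));
        System.out.println(parse("\u4e2d\u56fd\u4eba\u6c11"));
    }
}
```

Under these assumptions, the unquoted query hello dolly yields two separate term clauses, while wi-fi and an unspaced run of four Han characters each yield a single phrase, even though no double quotes appear anywhere.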

--
Robert Muir
[hidden email]


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866595#action_12866595 ]

Marvin Humphrey commented on LUCENE-2458:
-----------------------------------------

I have mixed feelings about this for English.  It's a weakness of our engine
that we do not take position of terms within a query string into account.  At
times I've tried to modify the scoring hierarchy to improve the situation, but
I gave up because it was too difficult.  This behavior of QueryParser is a
sneaky way of getting around that limitation: it takes stuff which should
almost certainly be treated as a phrase query and turns it into one.  It's
the one place where we actually exploit position data within the query string.

Mike's "wi-fi" example, though, wouldn't suffer that badly.  The terms "wi"
and "fi" are unlikely to occur much outside the context of 'wi-fi/wi fi/wifi'.
And treating "wi-fi" as a phrase still won't conflate results with "wifi" as
it would ideally.  

The example I would use doesn't typically apply to Lucene.  Lucene's
StandardAnalyzer tokenizes URLs as wholes, but KinoSearch's analogous analyzer
breaks them up into individual components.  As described in another recent
thread, this allows a search for 'example.com' to match a document which
contains the URL 'http://www.example.com/index.html'.  It would suck if all of
a sudden a search for 'example.com' started matching every document that
contained 'com'.

You could, and theoretically should, address this problem with sophisticated
analysis.  But it does make it harder to write a good Analyzer.  You make it
more important to solve what Yonik calls the 'e space mail' problem by making
it worse.


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Mark Miller-3
In reply to this post by Robert Muir
On 5/12/10 11:24 AM, Robert Muir wrote:

> On Wed, May 12, 2010 at 11:16 AM, Mark Miller <[hidden email]> wrote:
>>
>> That's a major exaggeration - quoting text plays a large role in whether or
>> not you will get a phrase query.
>>
>
> No, it has nothing to do with it in the implementation. Quoting only
> "escapes the whitespace"; the quotes themselves are then discarded.
> This is clear from looking at the grammar.
>
> The logic that then determines whether you get a phrase query is the
> huge mess of code in getFieldQuery, and it's not based on the double
> quotes at all.
>
> For example, a list of Chinese or Thai words gets a phrase query, only
> because they don't use whitespace between words.
> But a similar list of English words gets a boolean query.
>
>

Quotes play a part, or quoting something would simply not create a
phrase query - quoting something ensures that it hits the analyzer as
one chunk, rather than getting meta parsed by the grammar and fed to the
analyzer a token at a time. This ensures that multiple tokens hit the
funky logic to create a phrase query. The grammar specifically looks for
quoted chunks.
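
A rough sketch of the first stage described here: the grammar hands the analyzer either one whitespace-delimited word at a time or one quoted chunk as a whole. The class and method names below are hypothetical, and the real grammar (operators, field names, escaping) is far richer; this only shows why quoted text reaches the analyzer in a single piece while the quote characters themselves disappear.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real QueryParser grammar handles
// operators, field names, and escaping, none of which is modeled here.
class ChunkSplitter {

    // Splits a query string into the chunks that would each be passed to the
    // analyzer: bare words are split on whitespace, quoted text stays whole.
    static List<String> chunks(String query) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : query.toCharArray()) {
            if (c == '"') {
                // The quote ends the pending chunk and is itself discarded.
                flush(current, result);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, result);
            } else {
                current.append(c);
            }
        }
        flush(current, result);
        return result;
    }

    // Moves any pending characters into the result list as one chunk.
    private static void flush(StringBuilder current, List<String> result) {
        if (current.length() > 0) {
            result.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(chunks("hello dolly"));
        System.out.println(chunks("\"hello dolly\" album"));
    }
}
```

In this model the unquoted query produces two chunks, and the quoted one produces the single chunk hello dolly followed by album. Whether a chunk then becomes a phrase query is decided later, by the term count the analyzer returns for it.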

--
- Mark

http://www.lucidimagination.com


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866603#action_12866603 ]

Marvin Humphrey commented on LUCENE-2458:
-----------------------------------------

> Because they show it's 10x better to use this operator for Chinese

Another way to achieve this 10x improvement is to change how QP performs its
first stage of tokenization, as you and I discussed at ApacheCon Oakland.

Right now QP splits on whitespace.  If that behavior were customizable, e.g.
via a "splitter" Analyzer, then individual Han characters would get submitted
to getFieldQuery() -- and thus getFieldQuery() would no longer turn long
strings of Han characters into a PhraseQuery.  It seems wrong to continue to
push entire query strings from non-whitespace-delimited languages down into
getFieldQuery().
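
One way to picture this proposal is to treat the first-stage split as a pluggable function. The sketch below is only an illustration under that assumption: PluggableSplit, WHITESPACE, and HAN_AWARE are made-up names, not part of any Lucene API, and a real "splitter" Analyzer would presumably be more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a pluggable first-stage splitter; these names
// are not part of any Lucene API.
class PluggableSplit {

    // Current behavior: divide the query string on whitespace only.
    static final Function<String, List<String>> WHITESPACE = query -> {
        List<String> chunks = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            if (!chunk.isEmpty()) chunks.add(chunk);
        }
        return chunks;
    };

    // Alternative: additionally emit every Han character as its own chunk, so
    // the per-chunk query builder never sees a long run of Han characters.
    static final Function<String, List<String>> HAN_AWARE = query -> {
        List<String> chunks = new ArrayList<>();
        for (String word : WHITESPACE.apply(query)) {
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < word.length(); ) {
                int cp = word.codePointAt(i);
                i += Character.charCount(cp);
                if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                    if (current.length() > 0) {
                        chunks.add(current.toString());
                        current.setLength(0);
                    }
                    chunks.add(new String(Character.toChars(cp)));
                } else {
                    current.appendCodePoint(cp);
                }
            }
            if (current.length() > 0) chunks.add(current.toString());
        }
        return chunks;
    };

    public static void main(String[] args) {
        String query = "\u4e2d\u56fd\u4eba\u6c11 wi-fi";
        System.out.println(WHITESPACE.apply(query));
        System.out.println(HAN_AWARE.apply(query));
    }
}
```

With the whitespace splitter the four Han characters arrive as one chunk; with the Han-aware splitter they arrive as four single-character chunks, while wi-fi is passed through unchanged in both cases.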


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866648#action_12866648 ]

Robert Muir commented on LUCENE-2458:
-------------------------------------

{quote}
As described in another recent
thread, this allows a search for 'example.com' to match a document which
contains the URL 'http://www.example.com/index.html'. It would suck if all of
a sudden a search for 'example.com' started matching every document that
contained 'com'.
{quote}

You could solve this with better analysis, for example by recognizing the full URL and decomposing it into its parts (forming n-grams of them).
This would be more performant than the current "English hacking" anyway.
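
As a rough sketch of what such a decomposition could look like, the code below takes the host of a URL and emits every contiguous run of its dot-separated labels. The class and method names are hypothetical, this is not an existing Lucene tokenizer or filter, and it assumes the query side would also keep example.com together as a single term.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not an existing Lucene tokenizer or filter.
class UrlHostNGrams {

    // Returns every contiguous run of host labels, joined with dots, so that
    // "example.com" is available as a single term of its own.
    static List<String> hostNGrams(String url) {
        String host = URI.create(url).getHost();
        List<String> grams = new ArrayList<>();
        if (host == null) return grams;
        String[] labels = host.split("\\.");
        for (int start = 0; start < labels.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int end = start; end < labels.length; end++) {
                if (end > start) gram.append('.');
                gram.append(labels[end]);
                grams.add(gram.toString());
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(hostNGrams("http://www.example.com/index.html"));
    }
}
```

For the URL quoted above, the emitted terms include example.com as one unit, so a single-term lookup can match it without any phrase query being built from separate example and com terms.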

I'm honestly having a tough time seeing where to proceed on this issue.

Lucene's queryparsing is completely broken for several languages due to this bug, and such language-specific hacking (heuristically forming phrase queries based on things that people subjectively feel help for English) really doesn't belong in core Lucene, but instead elsewhere, perhaps in some special optional pass to the control query parser.

The queryparser really should be language-independent and work well on average; this would fix it for several languages.

However, given the *huge* English bias I see here, I have a tough time seeing what concrete direction (e.g. code) I can work on to try to fix it. I feel such work would only be rejected, since so many people seem opposed to simplifying the query parser and removing this language-specific hack.

If someone brings up an issue with the query parser (for instance, I brought up several language-specific problems at ApacheCon), then people are quick to say that this doesn't belong in the queryparser, but should be dealt with as a special case. Why isn't English treated this way too? I don't consider this bias towards English "at all costs", including preventing languages such as Chinese from working at all, very fair; I think it's a really ugly stance for Lucene to take.




[jira] Issue Comment Edited: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866648#action_12866648 ]

Robert Muir edited comment on LUCENE-2458 at 5/12/10 1:40 PM:
--------------------------------------------------------------

edit: s/control/contrib. I apologize for the typo.

{quote}
As described in another recent
thread, this allows a search for 'example.com' to match a document which
contains the URL 'http://www.example.com/index.html'. It would suck if all of
a sudden a search for 'example.com' started matching every document that
contained 'com'.
{quote}

You could solve this with better analysis, for example by recognizing the full URL and decomposing it into its parts (forming n-grams of them).
This would be more performant than the current "English hacking" anyway.

I'm honestly having a tough time seeing where to proceed on this issue.

Lucene's queryparsing is completely broken for several languages due to this bug, and such language-specific hacking (heuristically forming phrase queries based on things that people subjectively feel help for English) really doesn't belong in core Lucene, but instead elsewhere, perhaps in some special optional pass to the contrib query parser.

The queryparser really should be language-independent and work well on average; this would fix it for several languages.

However, given the *huge* English bias I see here, I have a tough time seeing what concrete direction (e.g. code) I can work on to try to fix it. I feel such work would only be rejected, since so many people seem opposed to simplifying the query parser and removing this language-specific hack.

If someone brings up an issue with the query parser (for instance, I brought up several language-specific problems at ApacheCon), then people are quick to say that this doesn't belong in the queryparser, but should be dealt with as a special case. Why isn't English treated this way too? I don't consider this bias towards English "at all costs", including preventing languages such as Chinese from working at all, very fair; I think it's a really ugly stance for Lucene to take.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866665#action_12866665 ]

Ivan Provalov commented on LUCENE-2458:
---------------------------------------

Robert has asked me to post our test results on the Chinese Collection. We used the following data collection from TREC:

http://trec.nist.gov/data/qrels_noneng/index.html
qrels.trec6.29-54.chinese.gz
qrels.1-28.chinese.gz

http://trec.nist.gov/data/topics_noneng
TREC-6 Chinese topics (.gz)
TREC-5 Chinese topics (.gz)

Mandarin Data Collection
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52

Analyzer Name     Plain analyzers   Added PositionFilter (only at query time)
ChineseAnalyzer   0.028             0.264
CJKAnalyzer       0.027             0.284
SmartChinese      0.027             0.265
IKAnalyzer        0.028             0.259

(Note: IKAnalyzer has its own IKQueryParser which yields 0.084 for the average precision)
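
The "nearly 10x" figure from the issue description can be checked against these numbers. The sketch below simply divides the score with PositionFilter by the plain score for each analyzer; the class and method names are made up for illustration, and the input values are copied from the results above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Recomputes the relative improvement from the figures reported above.
class PositionFilterGain {

    // Ratio of the score with PositionFilter to the score without it.
    static double gain(double plain, double withFilter) {
        return withFilter / plain;
    }

    public static void main(String[] args) {
        // analyzer -> {plain score, score with PositionFilter at query time}
        Map<String, double[]> results = new LinkedHashMap<>();
        results.put("ChineseAnalyzer", new double[] {0.028, 0.264});
        results.put("CJKAnalyzer", new double[] {0.027, 0.284});
        results.put("SmartChinese", new double[] {0.027, 0.265});
        results.put("IKAnalyzer", new double[] {0.028, 0.259});
        for (Map.Entry<String, double[]> e : results.entrySet()) {
            double[] r = e.getValue();
            System.out.printf("%-16s %.1fx%n", e.getKey(), gain(r[0], r[1]));
        }
    }
}
```

All four ratios fall between roughly 9x and 10.5x, which is consistent with the "nearly 10x" claim.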

Thanks,

Ivan Provalov


[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866693#action_12866693 ]

Marvin Humphrey commented on LUCENE-2458:
-----------------------------------------

> I'm honestly having a tough time seeing where to proceed on this issue.

Change the initial split on whitespace to be customizable.  Override the
splitting behavior for non-whitespace-delimited languages and feed
getFieldQuery() smaller chunks.

That solves your problem without removing behavior most people believe to be
helpful.  Insisting on that orthogonal change is what is holding things up.



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866695#action_12866695 ]

Robert Muir commented on LUCENE-2458:
-------------------------------------

{quote}
Change the initial split on whitespace to be customizable. Override the
splitting behavior for non-whitespace-delimited languages and feed
getFieldQuery() smaller chunks.
{quote}

Whitespace doesn't separate words in the majority of the world's languages, including English.

The responsibility should instead be on English to do its language-specific processing, not on everyone else to dodge it.
