Keyword fields, Porter stemming, and QueryParser


Keyword fields, Porter stemming, and QueryParser

Dmitry Goldenberg
I'm having a problem with keyword fields and how they're treated by QueryParser.
 
At indexing time, I index my documents, as follows:
 
 Content - tokenized, indexed field (the default field)
 DocType - not tokenized, indexed, stored field
 ....... - other fields
 
The analyzer I use utilizes Porter stemming.
 
At searching time, a query may come from the application's front end, as follows:
 
 +Content:"content model" AND DocType:xls
 
I use the same analyzer as the one used for indexing:
 
      QueryParser parser = new QueryParser("Content", getAnalyzer());
      Query query = parser.parse(strQuery); // strQuery is +Content:"content model" AND DocType:xls
 
When this Query is built, the second clause gets represented in it as DocType:xl -
I guess the 's' gets dropped due to the stemming.
 
What I have in the index is actually DocType:xls.  So, the query does not bring back the expected results.
 
Has anyone run into this issue? How do I work around it?
 
Thanks,
- Dmitry

Re: Keyword fields, Porter stemming, and QueryParser

davekor@gmail.com
If reindexing doesn't take too much time and effort, you can reindex
using the PerFieldAnalyzerWrapper to have different analyzers for each
field.
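
A minimal sketch of the idea, in plain Java rather than against Lucene itself: the analyzer is picked by field name, so the default field is stemmed while DocType passes through untouched. The class PerFieldAnalysisSketch, its methods, and the crude trailing-'s' rule standing in for Porter stemming are hypothetical names made up for illustration. In Lucene the corresponding piece would be a PerFieldAnalyzerWrapper with a non-tokenizing analyzer registered for DocType, and the same wrapper would presumably have to be passed to QueryParser as well.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-field analysis; this is NOT the Lucene API.
final class PerFieldAnalysisSketch {

    // Crude stand-in for a stemmer (assumption): lowercases, splits on
    // whitespace, and drops one trailing 's' from tokens longer than two chars.
    static List<String> stemmingAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.length() > 2 && word.endsWith("s")) {
                word = word.substring(0, word.length() - 1);
            }
            tokens.add(word);
        }
        return tokens;
    }

    // Keyword-style analyzer: the whole value is kept as one unmodified token.
    static List<String> keywordAnalyzer(String text) {
        return Collections.singletonList(text);
    }

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers a special analyzer for one field name.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Uses the field's own analyzer if one is registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch wrapper =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::stemmingAnalyzer);
        wrapper.addAnalyzer("DocType", PerFieldAnalysisSketch::keywordAnalyzer);

        System.out.println(wrapper.analyze("Content", "content models")); // [content, model]
        System.out.println(wrapper.analyze("DocType", "xls"));            // [xls]
        System.out.println(stemmingAnalyzer("xls"));                      // [xl]
    }
}
```

The last line of main reproduces the reported symptom: without a per-field exception, the stemming path turns xls into xl.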

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Keyword fields, Porter stemming, and QueryParser

Dmitry Goldenberg
Dave,
 
Thanks for the pointer.  The Wrapper worked marvellously!  This was exactly the situation - wanting to treat the standard fields and keyword fields differently as far as stemming is concerned (no stemming for the latter).
 
- Dmitry

________________________________

From: Dave Kor [mailto:[hidden email]]
Sent: Tue 1/24/2006 5:05 PM
To: [hidden email]
Subject: Re: Keyword fields, Porter stemming, and QueryParser



If reindexing doesn't take too much time and effort, you can reindex
using the PerFieldAnalyzerWrapper to have different analyzers for each
field.


QueryParser behaviour ..

sergiu gordea
  Hi all,

 I built a wrong query string "word1,word2,word3" instead of "word1
word2 word3"
therefore I got a wrong query:  field:"word1 word2 word3" instead of  
field:word1 field:word2  field:word3.

 Is this expected behaviour?
 I used the Standard analyzer, which is probably why the commas were replaced
with spaces.
Indeed, there were no spaces between the words, just commas.

 Is this a bug? Does it make sense to indicate this situation through a
Parse Exception?

 Best,

  Sergiu


Re: QueryParser behaviour ..

Chris Hostetter-3

:  I built a wrong query string "word1,word2,word3" instead of "word1
: word2 word3"
: therefore I got a wrong query:  field:"word1 word2 word3" instead of
: field:word1 field:word2  field:word3.
:
:  Is this expected behaviour?
:  I used the Standard analyzer, which is probably why the commas were replaced
: with spaces.

the commas weren't replaced ... your analyzer split on them and threw
them away.

the key to understanding why that resulted in a phrase query instead of
three term queries is that QueryParser doesn't treat comma as a special
character, so it saw the string word1,word2,word3 and gave it to your
analyzer.  Since your analyzer gave back several tokens QueryParser built
a phrase query out of it.

likewise, in the case of "word1 word2 word3" the quotes *are* a special
character to QueryParser which tells it that it should *not* split on the
spaces between the quotes and hand the individual words to the analyzer.
instead it hands the whole thing to the analyzer as one big string again.


:  Is this a bug? Does it make sense to indicate this situation through a
: Parse Exception?

a parse error should really only come up when the query parser sees a
character that it does consider special, but sees it in a place that
doesn't make sense (or doesn't see one in a place it needs one).  in this
case QP is perfectly happy to let you query for a word that contains a
comma -- it's your analyzer that's putting its foot down and saying that
can't be in a word.


-Hoss



Re: QueryParser behaviour ..

sergiu gordea
Chris Hostetter wrote:

>:  I built a wrong query string "word1,word2,word3" instead of "word1
>: word2 word3"
>: therefore I got a wrong query:  field:"word1 word2 word3" instead of
>: field:word1 field:word2  field:word3.
>:
>:  Is this expected behaviour?
>:  I used the Standard analyzer, which is probably why the commas were replaced
>: with spaces.
>
>the commas weren't replaced ... your analyzer split on them and threw
>them away.
>
>the key to understanding why that resulted in a phrase query instead of
>three term queries is that QueryParser doesn't treat comma as a special
>character, so it saw the string word1,word2,word3 and gave it to your
>analyzer.  Since your analyzer gave back several tokens QueryParser built
>a phrase query out of it.
>  
>
This is exactly my question: why does QueryParser create a PhraseQuery
when it gets several tokens from the analyzer,
and not a BooleanQuery?

>likewise, in the case of "word1 word2 word3" the quotes *are* a special
>character to QueryParser which tells it that it should *not* split on the
>spaces between the quotes and hand the individual words to the analyzer.
>instead it hands the whole thing to the analyzer as one big string again.
>
>  
>
That was not the situation: the string had no quotes (String
searchString =  "word1,word2,word3"; )
I only kept the Java quotes to delimit the string.

>:  Is this a bug? Does it make sense to indicate this situation through a
>: Parse Exception?
>
>a parse error should really only come up when the query parser sees a
>character that it does consider special, but sees it in a place that
>doesn't make sense (or doesn't see one in a place it needs one).  in this
>case QP is perfectly happy to let you query for a word that contains a
>comma -- it's your analyzer that's putting its foot down and saying that
>can't be in a word.
>  
>
OK, so this is not a case for a ParseException. But should situations like this
(a change from TermQuery to PhraseQuery)
be indicated in the log files? That would help developers debug their
code more easily.

 Best,

 Sergiu


Re: QueryParser behaviour ..

Chris Hostetter-3

: >the key to understanding why that resulted in a phrase query instead of
: >three term queries is that QueryParser doesn't treat comma as a special
: >character, so it saw the string word1,word2,word3 and gave it to your
: >analyzer.  Since your analyzer gave back several tokens QueryParser built
: >a phrase query out of it.
: >
: This is exactly my question: why does QueryParser create a PhraseQuery
: when it gets several tokens from the analyzer,
: and not a BooleanQuery?

Because if it did that, there would be no way to write phrase queries :)

QueryParser only returns a BooleanQuery when *it* can tell you have
several clauses.  For each "chunk" of text that it thinks of as one
continuous piece of text (either because it doesn't contain whitespaces or
because it has quotes around it) it gives it to the analyzer, if the
analyzer says there are multiple Terms there then QueryParser makes a
PhraseQuery out of it.   or in a nutshell:
   1) if the Parser detects multiple terms, it makes a boolean query
   2) if the Analyzer detects multiple terms, it makes a phrase query
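
The two rules above can be sketched with a toy model in plain Java. This is not Lucene's QueryParser: the class ParserVsAnalyzerSketch, its chunking regex, and the "split on anything that is not a letter or digit" analyzer are all simplifying assumptions chosen only to show which stage decides what.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of the parser/analyzer division of labour; NOT Lucene code.
final class ParserVsAnalyzerSketch {

    // "Parser" step: a chunk is either a quoted string or a run of non-whitespace.
    private static final Pattern CHUNK = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // "Analyzer" step (assumption): lowercase, then split on every character
    // that is not a letter or a digit, discarding the separators.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String token : chunk.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Renders the query shape the two rules produce for a single field.
    static String describe(String field, String queryText) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CHUNK.matcher(queryText);
        while (m.find()) {
            String chunk = m.group(1) != null ? m.group(1) : m.group(2);
            List<String> tokens = analyze(chunk);
            if (tokens.isEmpty()) {
                continue; // the analyzer threw everything away
            }
            if (tokens.size() == 1) {
                // one token from the analyzer: a plain term clause
                clauses.add(field + ":" + tokens.get(0));
            } else {
                // rule 2: several tokens from the analyzer become a phrase
                clauses.add(field + ":\"" + String.join(" ", tokens) + "\"");
            }
        }
        // rule 1: several chunks from the parser become a boolean query
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(describe("field", "word1,word2,word3"));
        // field:"word1 word2 word3"
        System.out.println(describe("field", "word1 word2 word3"));
        // field:word1 field:word2 field:word3
    }
}
```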

if you don't like this behavior, it can all be circumvented by overriding
getFieldQuery().  you don't even have to deal with the analyzer if you
don't want to.  just call super.getFieldQuery() and if you get back a
PhraseQuery take it apart and build TermQueries wrapped in a boolean
query.
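
A rough sketch of that "take it apart" step, using stand-in types so that it runs without Lucene. The nested TermQuery, PhraseQuery and BooleanQuery classes below only borrow the names and are hypothetical simplifications, and splitPhrase plays the role that the overriding getFieldQuery() would play after calling super.getFieldQuery().

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in query types for illustration; these are NOT Lucene's classes.
final class PhraseToBooleanSketch {

    interface Query {}

    static final class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) { this.field = field; this.text = text; }
        @Override public String toString() { return field + ":" + text; }
    }

    static final class PhraseQuery implements Query {
        final String field;
        final List<String> terms;
        PhraseQuery(String field, List<String> terms) { this.field = field; this.terms = terms; }
        @Override public String toString() {
            return field + ":\"" + String.join(" ", terms) + "\"";
        }
    }

    static final class BooleanQuery implements Query {
        final List<Query> clauses = new ArrayList<>();
        @Override public String toString() {
            List<String> parts = new ArrayList<>();
            for (Query clause : clauses) {
                parts.add(clause.toString());
            }
            return String.join(" ", parts);
        }
    }

    // Takes whatever the default logic built; a phrase is rebuilt as one
    // term clause per word, anything else is returned unchanged.
    static Query splitPhrase(Query fromSuper) {
        if (!(fromSuper instanceof PhraseQuery)) {
            return fromSuper;
        }
        PhraseQuery phrase = (PhraseQuery) fromSuper;
        BooleanQuery bool = new BooleanQuery();
        for (String term : phrase.terms) {
            bool.clauses.add(new TermQuery(phrase.field, term));
        }
        return bool;
    }
}
```

Note that this sketch splits every phrase it is handed, including one the user quoted on purpose; telling the two cases apart is left out here.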




-Hoss



Re: QueryParser behaviour ..

sergiu gordea
Chris Hostetter wrote:

>: This is exactly my question: why does QueryParser create a PhraseQuery
>: when it gets several tokens from the analyzer,
>: and not a BooleanQuery?
>
>Because if it did that, there would be no way to write phrase queries :)
>  
>
I'm not very sure about this ...

>QueryParser only returns a BooleanQuery when *it* can tell you have
>several clauses.  For each "chunk" of text that it thinks of as one
>continuous piece of text (either because it doesn't contain whitespaces or
>  
>
Wouldn't it be better to let the analyzer decide whether there is a continuous
piece of text,
and to build PhraseQueries only when a quote sign is found?

>because it has quotes around it) it gives it to the analyzer, if the
>analyzer says there are multiple Terms there then QueryParser makes a
>PhraseQuery out of it.   or in a nutshell:
>   1) if the Parser detects multiple terms, it makes a boolean query
>   2) if the Analyzer detects multiple terms, it makes a phrase query
>  
>
This is related to my comment above. From the user's point of view I
think it will make sense to
build a phrase query only when the quotes are found in the search string.

I think there are arguments for and against "unifying" the behaviour.
I would be happy if the QueryParser didn't create phrase queries unless I
explicitly asked for them.

Does someone have a different opinion?

>if you don't like this behavior, it can all be circumvented by overriding
>getFieldQuery().  you don't even have to deal with the analyzer if you
>don't want to.  just call super.getFieldQuery() and if you get back a
>PhraseQuery take it apart and build TermQueries wrapped in a boolean
>query.
>  
>
Well, there is always a workaround.  It is obvious that
searching for word1,word2,word3 was a
silly mistake, but I needed an hour to find out why a PhraseQuery was
created when no quotes existed in the query string.

So ... my opinion is that what I suggest will improve the usability of
Lucene, and I hope that the Lucene developers share
my opinion.

 Best,

 Sergiu


Re: QueryParser behaviour ..

Yonik Seeley
> From the user's point of view I think it will make sense to
> build a phrase query only when the quotes are found in the search string.

You make an interesting point Sergiu.  Your proposal would increase
the expressive power of the QueryParser by allowing the construction
of either phrase queries or boolean queries when multiple tokens are
produced by analysis.

The main downside is that it's not backward compatible, and without
quotes (and hence phrase queries) many older queries will produce
worse results.  I also think that a majority of the time, when
multiple tokens are produced, you do want a phrase search (or at least
a sloppy one).

Of course, the backward compatible thing can be fixed via a flag on
the query parser that defaults to the old behavior.
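
As a sketch of what such a flag might look like (purely hypothetical: PhraseFlagSketch and setPhraseOnlyWhenQuoted are invented names for illustration, not an existing QueryParser option), the decision reduces to one extra condition that defaults to the old behavior:

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical illustration of the proposed switch; NOT an existing Lucene option.
final class PhraseFlagSketch {

    // false keeps the existing behavior: multiple analyzer tokens form a phrase.
    private boolean phraseOnlyWhenQuoted = false;

    void setPhraseOnlyWhenQuoted(boolean value) {
        this.phraseOnlyWhenQuoted = value;
    }

    // tokens: what the analyzer returned for one chunk of query text.
    // quoted: whether the user wrote that chunk inside quotes.
    String build(String field, List<String> tokens, boolean quoted) {
        boolean asPhrase = tokens.size() > 1 && (quoted || !phraseOnlyWhenQuoted);
        if (asPhrase) {
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // otherwise emit one term clause per token (a boolean query)
        StringJoiner clauses = new StringJoiner(" ");
        for (String token : tokens) {
            clauses.add(field + ":" + token);
        }
        return clauses.toString();
    }
}
```

With the flag left at its default, build("field", tokens, false) for three tokens still yields a phrase; only after setPhraseOnlyWhenQuoted(true) does the unquoted case fall back to separate term clauses.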

-Yonik


Re: QueryParser behaviour ..

sergiu gordea
Yonik Seeley wrote:

>>From the user's point of view I think it will make sense to
>>build a phrase query only when the quotes are found in the search string.
>>    
>>
>
>You make an interesting point Sergiu.  Your proposal would increase
>the expressive power of the QueryParser by allowing the construction
>of either phrase queries or boolean queries when multiple tokens are
>produced by analysis.
>
>The main downside is that it's not backward compatible, and without
>quotes (and hence phrase queries) many older queries will produce
>worse results.  I also think that a majority of the time, when
>multiple tokens are produced, you do want a phrase search (or at least
>a sloppy one).
>
>Of course, the backward compatible thing can be fixed via a flag on
>the query parser that defaults to the old behavior.
>  
>
You are right, it can be a property of QueryParser similar to the AND/OR
behaviour.
This will also solve backward compatibility ... and will implement the
behaviour I expect as well.

 Best,

 Sergiu
