Phrase search using quotes -- special Tokenizer

Phrase search using quotes -- special Tokenizer

Philip Brown
Hi,

After running some tests using the StandardAnalyzer, and getting 0 results from the search, I believe I need a special Tokenizer/Analyzer.  Does anybody have something that parses like the following:

- doesn't parse apart phrases (in quotes)
- doesn't parse/separate hyphenated or underscored words
other normal stuff like
- parses on whitespace
- removes periods in acronyms
- lowercases everything (even in quotes? -- maybe)

I basically have a set of terms, some of which are multi-worded phrases, but none should ever be broken apart -- not when adding the documents, not when querying the search results, etc.  I'm creating the field in the documents as UN_TOKENIZED and using a StandardAnalyzer and basic Query object to get the results.  Any suggestions and/or existing code that I could re-use to fit this purpose?

Thanks.
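The requirements above can be prototyped outside Lucene with a small regex-based tokenizer. This is a rough sketch only (the class and method names are invented for illustration, and the acronym handling is cruder than a real filter would be), not the eventual Analyzer. One thing worth noting: a field indexed UN_TOKENIZED is stored as a single verbatim term, so a query that goes through an analyzer will generally not match it, which may explain the 0 results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseTokenizerSketch {
    // A quoted phrase, or a run of letters/digits/underscores/hyphens/dots.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[\\w.-]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group();
            if (t.startsWith("\"") && t.endsWith("\"")) {
                // keep the quoted phrase together, drop the quotes
                t = t.substring(1, t.length() - 1);
            } else {
                // remove periods in acronyms like U.S.A.
                // (crude: this also eats dots in hostnames)
                t = t.replace(".", "");
            }
            tokens.add(t.toLowerCase());   // lowercase everything
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("U.S.A. foo-bar \"Hello Dolly\" my_term"));
        // prints: [usa, foo-bar, hello dolly, my_term]
    }
}
```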

Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
Here is what I would do. Pull the StandardAnalyzer out of Lucene.
Modify StandardAnalyzer.jj. This is a JavaCC file. In it, there is some
regex that defines tokens for parsing. Now try some steps similar to
this: add '_' and '-' to the definition of a letter. Add a new token
type that eats quoted phrases...look at QueryParser.jj for an example,
probably about half way down the file <QUOTED>. Now run JavaCC on
StandardAnalyzer.jj. Search the mailing list when you find out that a
ParseException is screwing up compilation (I really wish someone would
update that for the latest JavaCC, if indeed that is the problem. It's
really annoying, and excluding it from compilation doesn't seem to fix
it anymore).
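To see what adding '_' and '-' to the letter definition buys you, here is the effect reproduced with plain java.util.regex character classes (illustration only; the class name is invented, and the real change goes in the .jj file, but the effect is the same as widening a word-character class):

```java
public class LetterClassDemo {
    // Word characters roughly as the default tokenizer sees them:
    // '-' and '_' are separators, so they split tokens apart.
    static String[] splitPlain(String text) {
        return text.split("[^a-zA-Z0-9]+");
    }

    // With '-' and '_' added to the letter definition, hyphenated and
    // underscored words come through as single tokens.
    static String[] splitExtended(String text) {
        return text.split("[^a-zA-Z0-9_-]+");
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(splitPlain("foo-bar baz_qux")));
        // prints: [foo, bar, baz, qux]
        System.out.println(java.util.Arrays.toString(splitExtended("foo-bar baz_qux")));
        // prints: [foo-bar, baz_qux]
    }
}
```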

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Phrase search using quotes -- special Tokenizer

Philip Brown
Do you mean StandardTokenizer.jj (org.apache.lucene.analysis.standard)?  I'm not seeing StandardAnalyzer.jj in the Lucene source download.
                                                                                                   

Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
Yes. StandardTokenizer. Sorry about that...my brain is schizo.
StandardTokenizer.jj in the StandardAnalyzer package.

- Mark



Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
In reply to this post by Philip Brown
So this will recognize anything in quotes as a single token, and '_' and
'-' will not break up words. There may be some repercussions for the NUM
token, but nothing I'd worry about. Maybe you want to use Unicode for '-'
and '_' as well...I wouldn't worry about it myself.

- Mark


TOKEN : {                      // token patterns

  // basic word: a sequence of digits & letters
  <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >

| <QUOTED:     "\"" (~["\""])+ "\"">

  // internal apostrophes: O'Reilly, you're, O'Reilly's
  // use a post-filter to remove possessives
| <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >

  // acronyms: U.S.A., I.B.M., etc.
  // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

  // company names like AT&T and Excite@Home.
| <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

  // email addresses
| <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
(("."|"-") <ALPHANUM>)+ >

  // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
| <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        )
  >
| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT:                      // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
  >

| < #ALPHA: (<LETTER>)+>
| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "-", "_"
      ]
  >
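For readers unfamiliar with JavaCC notation, the <QUOTED> production above corresponds to an ordinary Java regex: a double quote, one or more non-quote characters, a closing double quote. A small demo (the class and method names are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokenDemo {
    // JavaCC: <QUOTED: "\"" (~["\""])+ "\"">
    static final Pattern QUOTED = Pattern.compile("\"[^\"]+\"");

    // Returns the first quoted span (quotes included), or null if none.
    static String firstQuoted(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstQuoted("term1 \"hello dolly\" term2"));
        // prints: "hello dolly"
    }
}
```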



Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
In reply to this post by Philip Brown
One more tip...if you would like to be able to search phrases without
putting in the quotes, you must strip them with the analyzer. In
StandardFilter (in the standard analyzer code) add this:

 private static final String QUOTED_TYPE = tokenImage[QUOTED];
- you'll see where to put that

and you'll see where to put this:

else if (type == QUOTED_TYPE)  {
     return new
org.apache.lucene.analysis.Token(text.STRIPMYQUOTESOFFBABY(), t.startOffset(),
t.endOffset(), type);
}

text is a string so don't take my pseudo code literally.
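Stripped of the Lucene plumbing, the quote-removal step amounts to something like this (a self-contained sketch; stripQuotes stands in for the STRIPMYQUOTESOFFBABY placeholder above):

```java
public class QuoteStripper {
    // Remove one surrounding pair of double quotes, if present;
    // anything without a matched pair is returned unchanged.
    static String stripQuotes(String text) {
        if (text.length() >= 2 && text.startsWith("\"") && text.endsWith("\"")) {
            return text.substring(1, text.length() - 1);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"hello dolly\""));
        // prints: hello dolly
    }
}
```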

- Mark



Re: Phrase search using quotes -- special Tokenizer

Philip Brown
Thanks, but I don't "think" I need that.  But I'm curious: how will it know it's a phrase if it's not enclosed in quotes?  Won't all its terms be treated separately then?

Philip


Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
That is a good point. I was just thinking that it would be a pain for
searchers to have to include the quotes when searching, but I guess
there is little way around it. The best you could do is have an option
that specified a quoted search...and you might as well make that option
be to put the word in quotes :) I don't claim to know much, just trying
to help. On the other hand, stripping the quotes would save you two
characters per quoted term in the index...ha...

- Mark




Re: Phrase search using quotes -- special Tokenizer

Philip Brown
In reply to this post by Mark Miller-3
Well, I tried that, and it still doesn't seem to work.  I would be happy to zip up the new files so you can see what I'm using -- maybe you can get it to work.  The first time, I tried building the documents without quotes surrounding each phrase.  Then I retried, enclosing every phrase within double quotes.  Neither seemed to work.  When constructing the query string for the search, I always added the double quotes (otherwise it'd think it was multiple terms).  (I didn't even test the underscore and hyphenated terms.)  I thought Lucene was (sort of by default) set up to search quoted phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A Phrase is a group of words surrounded by double quotes such as "hello dolly".  So this should be easy, right?  I must be missing something stupid.

Thanks,

Philip


Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
Sorry to hear you're having trouble. You indeed need the double quotes in
the source text. You will also need them in the query string. Make sure they
are in both places. My machine is hosed right now or I would do it for you
real quick. My guess is that I forgot to mention...not only do you need to
add the <QUOTED> definition to the TOKEN section, but below that you will
find the grammar...you need to add <QUOTED> to the grammar. If you look at how
<NUM> and <APOSTROPHE> are done you will probably see what you should do. If
not, my machine should be back up tomorrow...

- Mark


Re: Phrase search using quotes -- special Tokenizer

Philip Brown
Added the <QUOTED> to the other section, reran JavaCC, and imported the new files...but I still get the same result -- no results.  (Quotes are in the source text and the query string.)  Anything else I might be missing?

Philip


Re: Phrase search using quotes -- special Tokenizer

Philip Brown
In reply to this post by Mark Miller-3
Interesting...I just ran a test where I put double quotes around everything (including single keywords) in the source text, then ran searches for a known keyword with and without double quotes -- it doesn't find it either time.


Re: Phrase search using quotes -- special Tokenizer

Erick Erickson
OK, I've gotta ask. Have you examined your index with Luke to see if what
you *think* is in the index actually *is*???

Erick

On 9/1/06, Philip Brown <[hidden email]> wrote:

>
>
> Interesting...just ran a test where I put double quotes around everything
> (including single keywords) of source text and then ran searches for a
> known
> keyword with and without double quotes -- doesn't find either time.
>
>
> Mark Miller-5 wrote:
> >
> > Sorry to hear you're having trouble. You indeed need the double quotes
> in
> > the source text. You will also need them in the query string. Make sure
> > they
> > are in both places. My machine is hosed right now or I would do it for
> you
> > real quick. My guess is that I forgot to mention...no only do you need
> to
> > add the <QUOTED> definiton to the TOKEN section, but below that you will
> > find the grammer...you need to add <QUOTED> to the grammer. If you look
> > how
> > <NUM> and <APOSTROPHE> are done you will prob see what you should do. If
> > not, my machine should be back up tomarrow...
> >
> > - Mark
> >
> > On 9/1/06, Philip Brown <[hidden email]> wrote:
> >>
> >>
> >> Well, I tried that, and it doesn't seem to work still.  I would be happy
> >> to zip up the new files, so you can see what I'm using -- maybe you can
> >> get it to work.  The first time, I tried building the documents without
> >> quotes surrounding each phrase.  Then, I retried by enclosing every phrase
> >> within double quotes.  Neither seemed to work.  When constructing the
> >> query string for the search, I always added the double quotes (otherwise,
> >> it'd think it was multiple terms).  (I didn't even test the underscore and
> >> hyphenated terms.)  I thought Lucene was (sort of by default) set up to
> >> search quoted phrases.  From http://lucene.apache.org/java/docs/api/index.html
> >> --> A Phrase is a group of words surrounded by double quotes such as "hello
> >> dolly".  So, this should be easy, right?  I must be missing something
> >> stupid.
> >>
> >> Thanks,
> >>
> >> Philip
> >>
> >>
> >> Mark Miller-5 wrote:
> >> >
> >> > So this will recognize anything in quotes as a single token, and '_' and
> >> > '-' will not break up words. There may be some repercussions for the NUM
> >> > token but nothing I'd worry about. Maybe you want to use Unicode for '-'
> >> > and '_' as well...I wouldn't worry about it myself.
> >> >
> >> > - Mark
> >> >
> >> >
> >> > TOKEN : {                      // token patterns
> >> >
> >> >   // basic word: a sequence of digits & letters
> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> >> >
> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
> >> >
> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
> >> >   // use a post-filter to remove possessives
> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> >> >
> >> >   // acronyms: U.S.A., I.B.M., etc.
> >> >   // use a post-filter to remove dots
> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> >> >
> >> >   // company names like AT&T and Excite@Home.
> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> >> >
> >> >   // email addresses
> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
> >> > (("."|"-") <ALPHANUM>)+ >
> >> >
> >> >   // hostname
> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> >> >
> >> >   // floating point, serial, model numbers, ip addresses, etc.
> >> >   // every other segment must have at least one digit
> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >> >         )
> >> >   >
> >> > | <#P: ("_"|"-"|"/"|"."|",") >
> >> > | <#HAS_DIGIT:                      // at least one digit
> >> >     (<LETTER>|<DIGIT>)*
> >> >     <DIGIT>
> >> >     (<LETTER>|<DIGIT>)*
> >> >   >
> >> >
> >> > | < #ALPHA: (<LETTER>)+>
> >> > | < #LETTER:                      // unicode letters
> >> >       [
> >> >        "\u0041"-"\u005a",
> >> >        "\u0061"-"\u007a",
> >> >        "\u00c0"-"\u00d6",
> >> >        "\u00d8"-"\u00f6",
> >> >        "\u00f8"-"\u00ff",
> >> >        "\u0100"-"\u1fff",
> >> >        "-", "_"
> >> >       ]
> >> >   >
> >> >
Re: Phrase search using quotes -- special Tokenizer

Philip Brown
No, I've never used Luke.  Is there an easy way to examine my RAMDirectory index?  I can create the index with no quoted keywords, and when I search for a keyword, I get back the expected results (just can't search for a phrase that has whitespace in it).  If I create the index with phrases in quotes, then when I search for anything in double quotes, I get back nothing.  If I create the index with everything in quotes, then when I search for anything by the keyword field, I get nothing, regardless of whether I use quotes in the query string or not.  (I can get results back by searching on other fields.)  What do you think?

Philip
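For what it's worth, the mismatch Philip describes can be sketched in a few lines of plain Python (purely illustrative -- this is not Lucene code, and the helper names are made up): a field indexed UN_TOKENIZED stores the whole phrase as one exact term, while a query run through a StandardAnalyzer-style analyzer is broken into per-word tokens, so the two sides never agree and the search returns nothing.

```python
# Conceptual sketch (plain Python, not Lucene) of the zero-hits symptom:
# the field is indexed UN_TOKENIZED (the whole phrase is one term), but the
# analyzed query produces per-word tokens, so nothing lines up.

def index_untokenized(values):
    """Index each value as a single exact term, like an UN_TOKENIZED field."""
    inverted = {}
    for doc_id, value in enumerate(values):
        inverted.setdefault(value, set()).add(doc_id)
    return inverted

def analyze(query):
    """Stand-in for a StandardAnalyzer: lowercase and split on whitespace."""
    return query.lower().split()

index = index_untokenized(["hello dolly", "foo"])

# The analyzed query terms never equal the stored term "hello dolly":
terms = analyze("hello dolly")                            # ['hello', 'dolly']
hits = set().union(*(index.get(t, set()) for t in terms))
print(terms, hits)                                        # no hits at all

# Looking the phrase up as one exact, unanalyzed term does match:
print(index.get("hello dolly"))                           # {0}
```

Dumping the index terms (with Luke or otherwise) would show exactly this: a single term `hello dolly` that no analyzed query token ever equals.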

Erick Erickson wrote
OK, I've gotta ask. Have you examined your index with Luke to see if what
you *think* is in the index actually *is*???

Erick

Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
I am out of ideas. If I'm feeling perky I'll build you one in the morning.

> No, I've never used Luke.  Is there an easy way to examine my RAMDirectory
> index?  I can create the index with no quoted keywords, and when I search
> for a keyword, I get back the expected results (just can't search for a
> phrase that has whitespace in it).  If I create the index with phrases in
> quotes, then when I search for anything in double quotes, I get back
> nothing.  If I create the index with everything in quotes, then when I
> search for anything by the keyword field, I get nothing, regardless of
> whether I use quotes in the query string or not.  (I can get results back by
> searching on other fields.)  What do you think?
>
> Philip

Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
In reply to this post by Philip Brown
Well Philip...bad news. I should have thought of this before...I think
the query parser is the problem. You are tokenizing "all in the quotes" to
one token...but when QueryParser sees that, it doesn't matter what
analyzer you use: it's going to see the quotes and strip them right off.
Then it passes what's inside the quotes to be analyzed. Now what's
inside the quotes will be missing stop words, have no quotes, etc.
Sorry I hadn't thought of this. You're getting deeper...modify the
query parser? It might only take passing in the entire quoted token in
QueryParser.jj
>   q = getFieldQuery(field, term.image.substring(1,
> term.image.length()-1), s);
just try term.image.

Then QueryParser's analyzer would get the whole quoted hunk of goodness.
Think it out a bit though...I haven't yet.

- Mark
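A rough sketch of the failure mode Mark describes, in plain Python rather than the real JavaCC/QueryParser code (the function names here are invented for illustration): the custom tokenizer indexes the quoted phrase with its quotes attached as one term, but the stock QueryParser strips the quotes before handing the contents to the analyzer, so the query term can never equal the indexed term; passing the whole token through, as his `term.image` suggestion does, makes the two sides agree.

```python
import re

# Not Lucene code -- a toy model of the quote-stripping mismatch.

def custom_tokenize(text):
    """Mimic the <QUOTED> rule: a quoted span becomes one token, quotes kept."""
    return re.findall(r'"[^"]+"|\S+', text.lower())

indexed_terms = custom_tokenize('"hello dolly" other stuff')
# -> ['"hello dolly"', 'other', 'stuff']

def query_parser_default(query):
    """Stock behavior: strip the quotes, then analyze what's inside."""
    inner = query[1:-1] if query.startswith('"') and query.endswith('"') else query
    return custom_tokenize(inner)

def query_parser_patched(query):
    """Mark's term.image idea: hand the whole quoted token to the analyzer."""
    return custom_tokenize(query)

print(query_parser_default('"hello dolly"'))   # quotes gone, phrase split into
                                               # words -- can't match the index
print(query_parser_patched('"hello dolly"'))   # one quoted token -- matches
```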



Re: Phrase search using quotes -- special Tokenizer

Erick Erickson
In reply to this post by Philip Brown
Well, you're flying blind. Is the behavior rooted in the indexing or
querying? Since you can't answer that, you're reduced to trying random
things hoping that one of them works. A little like voodoo. I've wasted
faaaaarrrrrr too much time trying to solve what I was *sure* was the problem
only to find it was somewhere else (the last place I look, of course) <G>...

Using Luke on a RAMDir. No, I don't know how to, but it should be a simple
thing to write the index to an FSDir at the same time you create your RAMDir
and use Luke then. This is debugging, after all.

I'd be really, really, really reluctant to modify the query parser and/or
the tokenizer, since whenever I've been tempted it's usually because I don't
understand the tools already provided. Then I have to maintain my custom
code. Which sucks. Although it sure feels more productive to hack a bunch of
code and get something that works 90% of the time and then spend weeks making
the other 10% work than it does to take two days to find the 3 lines you
*really* need <G>.

Have you thought of a PatternAnalyzer? It takes a regular expression as the
tokenizer and  (from the Javadoc)
<<< Efficient Lucene analyzer/tokenizer that preferably operates on a String
rather than a Reader, that can flexibly separate text into terms via a
regular expression Pattern (with behaviour identical to String.split(String)),
and that combines the functionality of LetterTokenizer, LowerCaseTokenizer,
WhitespaceTokenizer, StopFilter into a single efficient multi-purpose
class. >>>

One word of caution, the regular expression consists of expressions that
*break* tokens, not expressions that *form* words, which threw me at first.
Just like the doc says, like String.split <G>.... This is in 2.0, although I
*believe* it's also in the contrib section of 1.9 (or in the regular API,
I forget).

Best
Erick
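Erick's caution about split-style patterns can be illustrated with plain Python's re module rather than PatternAnalyzer itself: the regular expression describes what *breaks* tokens (the separators), exactly like String.split, not what a token looks like. Note that with a whitespace-only separator, hyphenated and underscored words survive intact, which is what Philip wants.

```python
import re

text = "foo_bar baz-qux quux"

# split-style (the PatternAnalyzer convention): the regex matches separators
print(re.split(r"\s+", text))        # tokens kept whole, '-' and '_' intact

# match-style (the intuition that trips people up): here the regex would have
# to describe the tokens themselves instead of the gaps between them
print(re.findall(r"[\w-]+", text))   # same tokens, opposite kind of pattern
```

Same output both ways, but the two regexes mean opposite things -- hand a match-style pattern to a split-style API and you tokenize on the words instead of the whitespace.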

On 9/1/06, Philip Brown <[hidden email]> wrote:

>
>
> No, I've never used Luke.  Is there an easy way to examine my RAMDirectory
> index?  I can create the index with no quoted keywords, and when I search
> for a keyword, I get back the expected results (just can't search for a
> phrase that has whitespace in it).  If I create the index with phrases in
> quotes, then when I search for anything in double quotes, I get back
> nothing.  If I create the index with everything in quotes, then when I
> search for anything by the keyword field, I get nothing, regardless of
> whether I use quotes in the query string or not.  (I can get results back
> by
> searching on other fields.)  What do you think?
>
> Philip
>
>
> Erick Erickson wrote:
> >
> > OK, I've gotta ask. Have you examined your index with Luke to see if
> what
> > you *think* is in the index actually *is*???
> >
> > Erick
> >
> > On 9/1/06, Philip Brown <[hidden email]> wrote:
> >>
> >>
> >> Interesting...just ran a test where I put double quotes around
> everything
> >> (including single keywords) of source text and then ran searches for a
> >> known
> >> keyword with and without double quotes -- doesn't find either time.
> >>
> >>
> >> Mark Miller-5 wrote:
> >> >
> >> > Sorry to hear you're having trouble. You indeed need the double
> quotes
> >> in
> >> > the source text. You will also need them in the query string. Make
> sure
> >> > they
> >> > are in both places. My machine is hosed right now or I would do it
> for
> >> you
> >> > real quick. My guess is that I forgot to mention...no only do you
> need
> >> to
> >> > add the <QUOTED> definiton to the TOKEN section, but below that you
> >> will
> >> > find the grammer...you need to add <QUOTED> to the grammer. If you
> look
> >> > how
> >> > <NUM> and <APOSTROPHE> are done you will prob see what you should do.
> >> If
> >> > not, my machine should be back up tomarrow...
> >> >
> >> > - Mark
> >> >
> >> > On 9/1/06, Philip Brown <[hidden email]> wrote:
> >> >>
> >> >>
> >> >> Well, I tried that, and it doesn't seem to work still.  I would be
> >> happy
> >> >> to
> >> >> zip up the new files, so you can see what I'm using -- maybe you can
> >> get
> >> >> it
> >> >> to work.  The first time, I tried building the documents without
> >> quotes
> >> >> surrounding each phrase.  Then, I retried by enclosing every phrase
> >> >> within
> >> >> double quotes.  Neither seemed to work.  When constructing the query
> >> >> string
> >> >> for the search, I always added the double quotes (otherwise, it'd
> >> think
> >> >> it
> >> >> was multiple terms).  (I didn't even test the underscore and
> >> hyphenated
> >> >> terms.)  I thought Lucene was (sort of by default) set up to search
> >> >> quoted
> >> >> phrases.  From http://lucene.apache.org/java/docs/api/index.html -->
> A
> >> >> Phrase is a group of words surrounded by double quotes such as
> "hello
> >> >> dolly".  So, this should be easy, right?  I must be missing
> something
> >> >> stupid.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Philip
> >> >>
> >> >>
> >> >> Mark Miller-5 wrote:
> >> >> >
> >> >> > So this will recognize anything in quotes as a single token and
> '_'
> >> and
> >> >> > '-' will not break up words. There may be some repercussions for
> the
> >> >> NUM
> >> >> > token but nothing I'd worry about. maybe you want to use Unicode
> for
> >> >> '-'
> >> >> > and '_' as well...I wouldn't worry about it myself.
> >> >> >
> >> >> > - Mark
> >> >> >
> >> >> >
> >> >> > TOKEN : {                      // token patterns
> >> >> >
> >> >> >   // basic word: a sequence of digits & letters
> >> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> >> >> >
> >> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
> >> >> >
> >> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
> >> >> >   // use a post-filter to remove possesives
> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> >> >> >
> >> >> >   // acronyms: U.S.A., I.B.M., etc.
> >> >> >   // use a post-filter to remove dots
> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> >> >> >
> >> >> >   // company names like AT&T and Excite@Home.
> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> >> >> >
> >> >> >   // email addresses
> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
> >> >> > (("."|"-") <ALPHANUM>)+ >
> >> >> >
> >> >> >   // hostname
> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> >> >> >
> >> >> >   // floating point, serial, model numbers, ip addresses, etc.
> >> >> >   // every other segment must have at least one digit
> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
> >> <HAS_DIGIT>)+
> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
> >> <ALPHANUM>)+
> >> >> >         )
> >> >> >   >
> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
> >> >> > | <#HAS_DIGIT:                      // at least one digit
> >> >> >     (<LETTER>|<DIGIT>)*
> >> >> >     <DIGIT>
> >> >> >     (<LETTER>|<DIGIT>)*
> >> >> >   >
> >> >> >
> >> >> > | < #ALPHA: (<LETTER>)+>
> >> >> > | < #LETTER:                      // unicode letters
> >> >> >       [
> >> >> >        "\u0041"-"\u005a",
> >> >> >        "\u0061"-"\u007a",
> >> >> >        "\u00c0"-"\u00d6",
> >> >> >        "\u00d8"-"\u00f6",
> >> >> >        "\u00f8"-"\u00ff",
> >> >> >        "\u0100"-"\u1fff",
> >> >> >        "-", "_"
> >> >> >       ]
> >> >> >   >
> >> >> >
> >> >> >

Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
I think if he wants to use the queryparser to parse his search strings
that he has no choice but to modify it. It will eat any pair of quotes
going through it no matter what analyzer is used.

- Mark

> Well, you're flying blind. Is the behavior rooted in the indexing or
> querying? Since you can't answer that, you're reduced to trying random
> things hoping that one of them works. A little like voodoo. I've wasted
> faaaaarrrrrr too much time trying to solve what I was *sure* was the
> problem
> only to find it was somewhere else (the last place I look, of course)
> <G>...
>
> Using Luke on a RAMDir. No, I don't know how to, but it should be a
> simple
> thing to write the index to an FSDir at the same time you create your
> RAMDir
> and use Luke then. This is debugging, after all.
>
> I'd be really, really, really reluctant to modify the query parser and/or
> the tokenizer, since whenever I've been tempted it's usually because I
> don't
> understand the tools already provided. Then I have to maintain my custom
> code. Which sucks. Although it sure feels more productive to hack a
> bunch of
> code and get something that works 90% of the time, then spend weeks
> making
> the other 10% work than taking two days to find the 3 lines you *really*
> need <G>.
>
> Have you thought of a PatternAnalyzer? It takes a regular expression
> as the
> tokenizer and  (from the Javadoc)
> <<< Efficient Lucene analyzer/tokenizer that preferably operates on a
> String
> rather than a
> Reader<http://java.sun.com/j2se/1.4/docs/api/java/io/Reader.html>,
> that can flexibly separate text into terms via a regular expression
> Pattern<http://java.sun.com/j2se/1.4/docs/api/java/util/regex/Pattern.html>(with
>
> behaviour identical to
> String.split(String)<http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html#split%28java.lang.String%29>),
>
> and that combines the functionality of
> LetterTokenizer<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LetterTokenizer.html>,
>
> LowerCaseTokenizer<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LowerCaseTokenizer.html>,
>
> WhitespaceTokenizer<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/WhitespaceTokenizer.html>,
>
> StopFilter<file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/StopFilter.html>into
>
> a single efficient multi-purpose class.>>>
>
> One word of caution, the regular expression consists of expressions that
> *break* tokens, not expressions that *form* words, which threw me at
> first.
> Just like the doc says, like splitstring <G>.... This is in 2.0,
> although I
> *believe* it's also in the contrib section of 1.9 (or is in the
> regular API,
> I forget).
>
> Best
> Erick
>
> On 9/1/06, Philip Brown <[hidden email]> wrote:
>>
>>
>> No, I've never used Luke.  Is there an easy way to examine my
>> RAMDirectory
>> index?  I can create the index with no quoted keywords, and when I
>> search
>> for a keyword, I get back the expected results (just can't search for a
>> phrase that has whitespace in it).  If I create the index with
>> phrases in
>> quotes, then when I search for anything in double quotes, I get back
>> nothing.  If I create the index with everything in quotes, then when I
>> search for anything by the keyword field, I get nothing, regardless of
>> whether I use quotes in the query string or not.  (I can get results
>> back
>> by
>> searching on other fields.)  What do you think?
>>
>> Philip
>>
>>
>> Erick Erickson wrote:
>> >
>> > OK, I've gotta ask. Have you examined your index with Luke to see if
>> > what you *think* is in the index actually *is*???
>> >
>> > Erick
>> >
>> > On 9/1/06, Philip Brown <[hidden email]> wrote:
>> >>
>> >>
>> >> Interesting...just ran a test where I put double quotes around
>> >> everything (including single keywords) of source text and then ran
>> >> searches for a known keyword with and without double quotes --
>> >> doesn't find either time.
>> >>
>> >>
>> >> Mark Miller-5 wrote:
>> >> >
>> >> > Sorry to hear you're having trouble. You indeed need the double
>> >> > quotes in the source text. You will also need them in the query
>> >> > string. Make sure they are in both places. My machine is hosed right
>> >> > now or I would do it for you real quick. My guess is that I forgot
>> >> > to mention...not only do you need to add the <QUOTED> definition to
>> >> > the TOKEN section, but below that you will find the grammar...you
>> >> > need to add <QUOTED> to the grammar. If you look at how <NUM> and
>> >> > <APOSTROPHE> are done you will probably see what you should do. If
>> >> > not, my machine should be back up tomorrow...
>> >> >
>> >> > - Mark
>> >> >
>> >> > On 9/1/06, Philip Brown <[hidden email]> wrote:
>> >> >>
>> >> >>
>> >> >> Well, I tried that, and it doesn't seem to work still.  I would be
>> >> >> happy to zip up the new files, so you can see what I'm using --
>> >> >> maybe you can get it to work.  The first time, I tried building the
>> >> >> documents without quotes surrounding each phrase.  Then, I retried
>> >> >> by enclosing every phrase within double quotes.  Neither seemed to
>> >> >> work.  When constructing the query string for the search, I always
>> >> >> added the double quotes (otherwise, it'd think it was multiple
>> >> >> terms).  (I didn't even test the underscore and hyphenated terms.)
>> >> >> I thought Lucene was (sort of by default) set up to search quoted
>> >> >> phrases.  From http://lucene.apache.org/java/docs/api/index.html -->
>> >> >> A Phrase is a group of words surrounded by double quotes such as
>> >> >> "hello dolly".  So, this should be easy, right?  I must be missing
>> >> >> something stupid.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Philip

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Phrase search using quotes -- special Tokenizer

Philip Brown
I tend to agree with Mark.  I tried a query like so...

   TermQuery query = new TermQuery(new Term("keywordField", "phrase test"));
   IndexSearcher searcher= new IndexSearcher(activeIdx);
   Hits hits = searcher.search(query);

And this produced the expected results.  When building the index, I did NOT enclose the keywords in quotes -- just added as UN_TOKENIZED.

Philip
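Philip's result can be illustrated without any Lucene classes at all. The sketch below is a toy stand-in, not Lucene API (all names are made up): an UN_TOKENIZED field stores the whole value as one exact term, so an analyzed query that splits "phrase test" into two tokens can never hit it, while an exact-string lookup can.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExactMatchDemo {
    // Stand-in for an analyzed query: does any individual whitespace token match?
    static boolean analyzedHit(Set<String> indexedTerms, String query) {
        for (String token : query.split("\\s+")) {
            if (indexedTerms.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Stand-in for a TermQuery: the query must equal the stored term exactly.
    static boolean exactHit(Set<String> indexedTerms, String query) {
        return indexedTerms.contains(query);
    }

    public static void main(String[] args) {
        // An UN_TOKENIZED field keeps "phrase test" as a single term.
        Set<String> index = new HashSet<>(Arrays.asList("phrase test"));
        System.out.println(analyzedHit(index, "phrase test")); // false
        System.out.println(exactHit(index, "phrase test"));    // true
    }
}
```

This is also why quoting the source text buys nothing for an UN_TOKENIZED field: the exact stored string, whitespace and all, is already the term.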

Mark Miller-5 wrote
I think if he wants to use the queryparser to parse his search strings
that he has no choice but to modify it. It will eat any pair of quotes
going through it no matter what analyzer is used.

- Mark

Re: Phrase search using quotes -- special Tokenizer

Erick Erickson
Disclaimer: Of course I'm not as familiar with your problem space as you
are, so I may be way out in left field, but...

I *still* think you're making waaaaaay too much work for yourself and need
to examine your assumptions.

0> But when you index something UN_TOKENIZED as in your example, I don't
think you'll  find the words "phrase" and "test" if you just search for them
individually.

1> "doesn't parse apart phrases". Why do you care? The "usual" way of
handling this is to just go ahead and parse them apart, then submit your
query with quotes embedded. Would that serve your needs? If not, I bet you
could make a cool regex that would do this for you and use a
PatternAnalyzer. If neither of those work, instead of hacking Lucene code,
make your own tokenizer by overriding one of the Lucene tokenizers. See
Lucene in Action for example, the section on synonyms as I remember.....
It'll be waaaaay less work in the long run than having to try to stay in
synch by hacking the Lucene code. Especially for the next poor soul who has
to maintain it........

2> doesn't parse/separate underscores and hyphens. Why not use a
PatternAnalyzer? You can make it do most anything you want with a clever
regex.

3> the other thing that might be producing unexpected results is that the
default is OR for QueryParser.......

Since no amount of documentation replaces a program, I've included one
illustrating what I'm talking about. It's less likely we'll talk past each
other this way. All it may *really* demonstrate is that I don't understand
what you're trying to do at all, but at least we'll know <G>...

Notice in particular that it finds the hyphenated words, but doesn't find
their individual parts. Also, quoted phrases are found, even though the
stop words aren't in the index (examined via Luke). I used Eclipse to
compile/run it......

Best
Erick

package scratch;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class Scratch {

    private Analyzer analyzer = null;

    private Analyzer getAnalyzer() {
        if (analyzer == null) {
            // Break on whitespace or anything that is NOT hyphen, underscore
            // or letter.
            Pattern pat = Pattern.compile("(\\s|[^-_a-zA-Z])");
            analyzer = new PatternAnalyzer(pat, true,
                    StopFilter.makeStopSet(STOP_WORDS));
        }
        return analyzer;

    }

    private void makeTestIndex() throws Exception {
        IndexWriter writer = new IndexWriter("C:/mydir", getAnalyzer(), true);
        String text = ("this is the test text hyphenated-Iamhyphenated underscored_IaMunDerScoRed");
        Document doc = new Document();
        doc.add(new Field("test", text, Field.Store.YES,
                        Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.close();
    }

    private void doSearch(String query, int expectedHits) throws Exception {
        try {
            QueryParser qp = new QueryParser("test", getAnalyzer());
            qp.enable_tracing();
            IndexSearcher srch = new IndexSearcher("C:/mydir");
            Query tmp = qp.parse(query);
            // Uncomment to see parsed form of query
            // System.out.println("Parsed form is '" + tmp.toString() + "'");
            Hits hits = srch.search(tmp);

            String msg = "";

            if (hits.length() == expectedHits) {
                msg = "Test passed ";
            } else {
                msg = "************TEST FAILED************ ";
            }
            System.out.println(msg + "Expected "
                    + Integer.toString(expectedHits) + " hits, got "
                    + Integer.toString(hits.length()) + " hits");

        } catch (IOException e) {
            System.out.println("Caught IOException");
            e.printStackTrace();
        }
    }

    private void doSearchPhrase(String phrase, int expectedHits) throws Exception {
        doSearch("\"" + phrase + "\"", expectedHits);
    }
    public static void main(String[] args) {
        try {
            Scratch scratch = new Scratch();
            scratch.getAnalyzer();
            scratch.makeTestIndex();
            scratch.doSearch("underscored_iamunderscored", 1);
            scratch.doSearch("underscored iamunderscored", 0);
            scratch.doSearch("hyphenated-iamhyphenated", 1);
            scratch.doSearch("iamunderscored", 0);
            scratch.doSearch("underscored", 0);

            scratch.doSearchPhrase("this is the test text", 1);
            scratch.doSearchPhrase("text with hyphenated-iamhyphenated", 1);
            scratch.doSearchPhrase("text with underscored_iamunderscored", 0);


        } catch (Exception e) {
            System.err.println(e.getMessage());
        }
    }

    protected static final String[] STOP_WORDS = { "a", "an", "and", "are",
            "as", "at", "b", "be", "but", "by", "c", "d", "e", "f", "for",
            "if", "g", "h", "i", "in", "into", "is", "it", "j", "k", "l", "m",
            "n", "no", "not", "o", "of", "on", "or", "p", "q", "r", "s",
            "such", "t", "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "u", "v", "w", "was", "will", "with", "x",
            "y", "z" };

}
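Erick's caution that the PatternAnalyzer regex describes token *separators* (String.split semantics), not the tokens themselves, can be verified with plain java.util.regex and no Lucene. This sketch reuses the Scratch program's pattern but skips the stop-word filtering that PatternAnalyzer also applies:

```java
import java.util.ArrayList;
import java.util.List;

public class BreakPatternDemo {
    // Same break pattern as in Scratch: a separator is whitespace or any
    // character that is NOT a hyphen, underscore, or ASCII letter.
    static final String BREAK = "(\\s|[^-_a-zA-Z])";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split(BREAK)) {
            if (!t.isEmpty()) {              // split can leave empty strings behind
                tokens.add(t.toLowerCase()); // mirrors PatternAnalyzer's toLowerCase flag
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hyphenated and underscored words survive as single tokens.
        System.out.println(tokenize("hyphenated-Iamhyphenated underscored_IaMunDerScoRed"));
    }
}
```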



On 9/2/06, Philip Brown <[hidden email]> wrote:

>
>
> I tend to agree with Mark.  I tried a query as so...
>
>    TermQuery query = new TermQuery(new Term("keywordField", "phrase
> test"));
>    IndexSearcher searcher= new IndexSearcher(activeIdx);
>    Hits hits = searcher.search(query);
>
> And this produced the expected results.  When building the index, I did
> NOT
> enclose the keywords in quotes -- just added as UN_TOKENIZED.
>
> Philip
>
>
> Mark Miller-5 wrote:
> >
> > I think if he wants to use the queryparser to parse his search strings
> > that he has no choice but to modify it. It will eat any pair of quotes
> > going through it no matter what analyzer is used.
> >
> > - Mark
> >> Well, you're flying blind. Is the behavior rooted in the indexing or
> >> querying? Since you can't answer that, you're reduced to trying random
> >> things hoping that one of them works. A little like voodoo. I've wasted
> >> faaaaarrrrrr too much time trying to solve what I was *sure* was the
> >> problem
> >> only to find it was somewhere else (the last place I look, of course)
> >> <G>...
> >>
> >> Using Luke on a RAMDir. No, I don't know how to, but it should be a
> >> simple
> >> thing to write the index to an FSDir at the same time you create your
> >> RAMDir
> >> and use Luke then. This is debugging, after all.
> >>
> >> I'd be really, really, really reluctant to modify the query parser
> and/or
> >> the tokenizer, since whenever I've been tempted it's usually because I
> >> don't
> >> understand the tools already provided. Then I have to maintain my
> custom
> >> code. Which sucks. Although it sure feels more productive to hack a
> >> bunch of
> >> code and get something that works 90% of the time, then spend weeks
> >> making
> >> the other 10% work than taking two days to find the 3 lines you
> *really*
> >> need <G>.
> >>
> >> Have you thought of a PatternAnalyzer? It takes a regular expression
> >> as the tokenizer and (from the Javadoc):
> >> <<< Efficient Lucene analyzer/tokenizer that preferably operates on a
> >> String rather than a Reader
> >> <http://java.sun.com/j2se/1.4/docs/api/java/io/Reader.html>, that can
> >> flexibly separate text into terms via a regular expression Pattern
> >> <http://java.sun.com/j2se/1.4/docs/api/java/util/regex/Pattern.html>
> >> (with behaviour identical to String.split(String)
> >> <http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html#split%28java.lang.String%29>),
> >> and that combines the functionality of LetterTokenizer
> >> <file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LetterTokenizer.html>,
> >> LowerCaseTokenizer
> >> <file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LowerCaseTokenizer.html>,
> >> WhitespaceTokenizer
> >> <file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/WhitespaceTokenizer.html>,
> >> and StopFilter
> >> <file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/StopFilter.html>
> >> into a single efficient multi-purpose class. >>>
> >>
> >> One word of caution: the regular expression consists of expressions
> >> that *break* tokens, not expressions that *form* words, which threw me
> >> at first. Just like the doc says, it works like String.split <G>....
> >> This is in 2.0, although I *believe* it's also in the contrib section
> >> of 1.9 (or is in the regular API, I forget).
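[Editor's note: the split-style semantics described above can be seen with the JDK alone, no Lucene required. The class name and the pattern below are illustrative, not taken from PatternAnalyzer itself; the point is only that the regex names the *separators*, so anything it does not match survives as a whole token.]

```java
import java.util.Arrays;

public class SplitSemanticsDemo {
    public static void main(String[] args) {
        // The regex names what BREAKS tokens (whitespace and commas),
        // not what forms them -- the same semantics as String.split.
        // Hyphens are NOT in the pattern, so "baz-qux" stays whole.
        String[] tokens = "foo bar, baz-qux".split("[\\s,]+");
        System.out.println(Arrays.toString(tokens)); // [foo, bar, baz-qux]
    }
}
```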
> >>
> >> Best
> >> Erick
> >>
> >> On 9/1/06, Philip Brown <[hidden email]> wrote:
> >>>
> >>>
> >>> No, I've never used Luke.  Is there an easy way to examine my
> >>> RAMDirectory index?  I can create the index with no quoted keywords,
> >>> and when I search for a keyword, I get back the expected results (just
> >>> can't search for a phrase that has whitespace in it).  If I create the
> >>> index with phrases in quotes, then when I search for anything in
> >>> double quotes, I get back nothing.  If I create the index with
> >>> everything in quotes, then when I search for anything by the keyword
> >>> field, I get nothing, regardless of whether I use quotes in the query
> >>> string or not.  (I can get results back by searching on other fields.)
> >>> What do you think?
> >>>
> >>> Philip
> >>>
> >>>
> >>> Erick Erickson wrote:
> >>> >
> >>> > OK, I've gotta ask. Have you examined your index with Luke to see if
> >>> what
> >>> > you *think* is in the index actually *is*???
> >>> >
> >>> > Erick
> >>> >
> >>> > On 9/1/06, Philip Brown <[hidden email]> wrote:
> >>> >>
> >>> >>
> >>> >> Interesting...just ran a test where I put double quotes around
> >>> >> everything (including single keywords) of source text and then ran
> >>> >> searches for a known keyword with and without double quotes --
> >>> >> doesn't find either time.
> >>> >>
> >>> >>
> >>> >> Mark Miller-5 wrote:
> >>> >> >
> >>> >> > Sorry to hear you're having trouble. You indeed need the double
> >>> >> > quotes in the source text. You will also need them in the query
> >>> >> > string. Make sure they are in both places. My machine is hosed
> >>> >> > right now or I would do it for you real quick. My guess is that I
> >>> >> > forgot to mention...not only do you need to add the <QUOTED>
> >>> >> > definition to the TOKEN section, but below that you will find the
> >>> >> > grammar...you need to add <QUOTED> to the grammar. If you look at
> >>> >> > how <NUM> and <APOSTROPHE> are done you will probably see what
> >>> >> > you should do. If not, my machine should be back up tomorrow...
> >>> >> >
> >>> >> > - Mark
> >>> >> >
> >>> >> > On 9/1/06, Philip Brown <[hidden email]> wrote:
> >>> >> >>
> >>> >> >>
> >>> >> >> Well, I tried that, and it doesn't seem to work still.  I would
> >>> >> >> be happy to zip up the new files, so you can see what I'm using
> >>> >> >> -- maybe you can get it to work.  The first time, I tried
> >>> >> >> building the documents without quotes surrounding each phrase.
> >>> >> >> Then, I retried by enclosing every phrase within double quotes.
> >>> >> >> Neither seemed to work.  When constructing the query string for
> >>> >> >> the search, I always added the double quotes (otherwise, it'd
> >>> >> >> think it was multiple terms).  (I didn't even test the
> >>> >> >> underscore and hyphenated terms.)  I thought Lucene was (sort of
> >>> >> >> by default) set up to search quoted phrases.  From
> >>> >> >> http://lucene.apache.org/java/docs/api/index.html --> A Phrase
> >>> >> >> is a group of words surrounded by double quotes such as "hello
> >>> >> >> dolly".  So, this should be easy, right?  I must be missing
> >>> >> >> something stupid.
> >>> >> >>
> >>> >> >> Thanks,
> >>> >> >>
> >>> >> >> Philip
> >>> >> >>
> >>> >> >>
> >>> >> >> Mark Miller-5 wrote:
> >>> >> >> >
> >>> >> >> > So this will recognize anything in quotes as a single token,
> >>> >> >> > and '_' and '-' will not break up words. There may be some
> >>> >> >> > repercussions for the NUM token, but nothing I'd worry about.
> >>> >> >> > Maybe you want to use Unicode for '-' and '_' as well...I
> >>> >> >> > wouldn't worry about it myself.
> >>> >> >> >
> >>> >> >> > - Mark
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > TOKEN : {                      // token patterns
> >>> >> >> >
> >>> >> >> >   // basic word: a sequence of digits & letters
> >>> >> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> >>> >> >> >
> >>> >> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
> >>> >> >> >
> >>> >> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
> >>> >> >> >   // use a post-filter to remove possessives
> >>> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> >>> >> >> >
> >>> >> >> >   // acronyms: U.S.A., I.B.M., etc.
> >>> >> >> >   // use a post-filter to remove dots
> >>> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> >>> >> >> >
> >>> >> >> >   // company names like AT&T and Excite@Home.
> >>> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> >>> >> >> >
> >>> >> >> >   // email addresses
> >>> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@"
> <ALPHANUM>
> >>> >> >> > (("."|"-") <ALPHANUM>)+ >
> >>> >> >> >
> >>> >> >> >   // hostname
> >>> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> >>> >> >> >
> >>> >> >> >   // floating point, serial, model numbers, ip addresses, etc.
> >>> >> >> >   // every other segment must have at least one digit
> >>> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> >>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
> >>> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >>> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >>> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
> >>> >> <HAS_DIGIT>)+
> >>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
> >>> >> <ALPHANUM>)+
> >>> >> >> >         )
> >>> >> >> >   >
> >>> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
> >>> >> >> > | <#HAS_DIGIT:                      // at least one digit
> >>> >> >> >     (<LETTER>|<DIGIT>)*
> >>> >> >> >     <DIGIT>
> >>> >> >> >     (<LETTER>|<DIGIT>)*
> >>> >> >> >   >
> >>> >> >> >
> >>> >> >> > | < #ALPHA: (<LETTER>)+>
> >>> >> >> > | < #LETTER:                      // unicode letters
> >>> >> >> >       [
> >>> >> >> >        "\u0041"-"\u005a",
> >>> >> >> >        "\u0061"-"\u007a",
> >>> >> >> >        "\u00c0"-"\u00d6",
> >>> >> >> >        "\u00d8"-"\u00f6",
> >>> >> >> >        "\u00f8"-"\u00ff",
> >>> >> >> >        "\u0100"-"\u1fff",
> >>> >> >> >        "-", "_"
> >>> >> >> >       ]
> >>> >> >> >   >
> >>> >> >> >
> >>> >> >> >
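[Editor's note: as a quick sanity check of what the <QUOTED> production above buys you, the same behaviour can be sketched with plain java.util.regex. This is a stdlib approximation, not the actual JavaCC-generated StandardTokenizer; the class name and pattern are illustrative. Quoted runs come out as a single token, '-' and '_' no longer split words, and everything is lowercased, quotes included.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokenizerSketch {
    // First alternative mirrors <QUOTED>: "\"" (~["\""])+ "\"" --
    // a double quote, one or more non-quote chars, a closing quote.
    // Second alternative keeps letters, digits, '_' (via \w) and '-'
    // together as one word.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[\\w-]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Lowercase everything, even inside quotes.
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
            tokenize("find \"Hello Dolly\" foo_bar multi-word"));
        // prints [find, "hello dolly", foo_bar, multi-word]
    }
}
```

Note that this only solves half the thread's problem: QueryParser will still strip the quotes from the query string, so the query side needs the same treatment (or a TermQuery built by hand).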
> >>> >> >> >
> >>> >> >> > ---------------------------------------------------------------------
> >>> >> >> > To unsubscribe, e-mail: [hidden email]
> >>> >> >> > For additional commands, e-mail: [hidden email]
> >>> >> >> >
> >>> >> >> >
> >>> >> >> >
> >>> >> >>
> >>> >> >> --
> >>> >> >> View this message in context:
> >>> >> >> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
> >>> >> >> Sent from the Lucene - Java Users forum at Nabble.com.
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >
> >>> >> >
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> >
>
>
>
>
>