StandardAnalyzer question


StandardAnalyzer question

Ngo, Anh (ISS Southfield)
Hello,

The Lucene 2.0.0 StandardAnalyzer treats "_" (underscore) as a token
separator.  Is there a way to make StandardAnalyzer not split on "_" or
on any other given character?

I'd like to keep all the features that StandardAnalyzer has, but modify
it a bit for my needs. How do I control which characters it splits on?

Ex: Test_test1_test2 is my data
StandardAnalyzer: Test test1 test2 my data
I'd like to have:  Test_test1_test2 my data


Please help.


Thanks,


Anh Ngo


-----Original Message-----
From: Chris Hostetter [mailto:[hidden email]]
Sent: Wednesday, July 19, 2006 12:25 PM
To: [hidden email]
Subject: Re: BooleanQuery question


: If I search with boolQuery, Lucene doesn't find anything.
: If I modify the query by hand from "+(-(FILE:abstract.htm))
: +(PATH:/bssrs)" to "-(FILE:abstract.htm) +(PATH:/bssrs)", Lucene finds
: the correct list of documents.
:
: Does anybody know why?

you can't have a boolean query containing only MUST_NOT clauses (which is
what (-(FILE:abstract.htm)) is).  it matches no documents, so the mandatory
qualification on it causes the query to fail for all docs.
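
To make the point above concrete, here is a toy set-based model in plain
Java. It is only a sketch and not Lucene's BooleanQuery implementation: the
class MustNotSketch, its methods, and the document IDs are all made up for
illustration. It assumes a clause group matches the intersection of its MUST
sets minus its MUST_NOT sets, so a group with no MUST set has nothing to
subtract from.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: NOT Lucene code. All names and doc IDs are hypothetical.
final class MustNotSketch {
    // Docs matching PATH:/bssrs and FILE:abstract.htm in a made-up index.
    static final Set<Integer> PATH = new TreeSet<>(Arrays.asList(1, 2, 3));
    static final Set<Integer> FILE = new TreeSet<>(Arrays.asList(2));

    // A clause group matches the intersection of its MUST sets minus every
    // MUST_NOT set. With no MUST set there is nothing to subtract from,
    // so a purely negative group matches no documents.
    static Set<Integer> evaluate(List<Set<Integer>> must, List<Set<Integer>> mustNot) {
        if (must.isEmpty()) {
            return new TreeSet<>();
        }
        Set<Integer> result = new TreeSet<>(must.get(0));
        for (Set<Integer> m : must) {
            result.retainAll(m);
        }
        for (Set<Integer> n : mustNot) {
            result.removeAll(n);
        }
        return result;
    }

    // Models +(-(FILE:abstract.htm)) +(PATH:/bssrs)
    static Set<Integer> nested() {
        Set<Integer> inner =
            evaluate(Collections.<Set<Integer>>emptyList(), Arrays.asList(FILE));
        return evaluate(Arrays.asList(inner, PATH),
                        Collections.<Set<Integer>>emptyList());
    }

    // Models -(FILE:abstract.htm) +(PATH:/bssrs)
    static Set<Integer> flat() {
        return evaluate(Arrays.asList(PATH), Arrays.asList(FILE));
    }

    public static void main(String[] args) {
        System.out.println("nested: " + nested()); // nested: []
        System.out.println("flat:   " + flat());   // flat:   [1, 3]
    }
}
```

In this model the nested form is empty because the inner group is empty and
is then required, while the flat form subtracts the FILE match from the PATH
matches.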


:
: Thanks in advance,
:
: Nicolas
:
:
:
:
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [hidden email]
: For additional commands, e-mail: [hidden email]
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




Re: StandardAnalyzer question

Daniel Naber-5
On Friday 21 July 2006 16:16, Ngo, Anh (ISS Southfield) wrote:

> The Lucene 2.0.0 StandardAnalyzer treats "_" (underscore) as a token
> separator.  Is there a way to make StandardAnalyzer not split on "_" or
> on any other given character?

You need to add "_" to the #LETTER definition in StandardTokenizer.jj, then
rebuild StandardTokenizer.java using the appropriate ant task.
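
As a rough illustration of what widening the letter class does, here is a
plain-Java sketch. It is not the JavaCC-generated StandardTokenizer and it
ignores everything else that tokenizer does (numbers, e-mail addresses, host
names, stop words, lowercasing). The class LetterClassSketch and its tokenize
method are hypothetical names; the only idea shown is that a character which
is not in the "letter" class ends the current token.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a tokenizer's letter class. Hypothetical code,
// not part of Lucene.
final class LetterClassSketch {
    // Splits text into runs of token characters: letters, digits, and any
    // character listed in extraLetters. Everything else is a separator.
    static List<String> tokenize(String text, String extraLetters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean tokenChar =
                Character.isLetterOrDigit(c) || extraLetters.indexOf(c) >= 0;
            if (tokenChar) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Test_test1_test2 is my data";
        // Underscore is a separator: [Test, test1, test2, is, my, data]
        System.out.println(tokenize(text, ""));
        // Underscore counts as a letter: [Test_test1_test2, is, my, data]
        System.out.println(tokenize(text, "_"));
    }
}
```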

Regards
 Daniel

--
http://www.danielnaber.de


RE: StandardAnalyzer question

Ngo, Anh (ISS Southfield)
In reply to this post by Ngo, Anh (ISS Southfield)

What is the #LETTER definition in StandardTokenizer.jj?


I saw:

| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT:  // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
  >


Should I remove "_" from #P and recompile the source code?

Sincerely,


Anh Ngo


Re: StandardAnalyzer question

Mark Miller-3
I do not believe so. If you look above you will see that #P is only used
when looking for a NUM: a host IP, a phone number, etc. You would be removing
the ability to recognize a "_" while picking those tokens out. It would
still be parsed when tokenizing an EMAIL as well. I don't think this is the
behavior you want.

- Mark


RE: StandardAnalyzer question

Ngo, Anh (ISS Southfield)
In reply to this post by Ngo, Anh (ISS Southfield)

Hello Mark,


Please show me how to add "-" to the #LETTER definition.


Thanks,


Anh Ngo


Re: StandardAnalyzer question

Mark Miller-3
In reply to this post by Mark Miller-3
I take it back. That is probably exactly what you want. Watch out if you're
not compiling all of Lucene: you need to avoid a ParserException using ant if
you try to just extract the Standard Analyzer package (the recommended
approach).



Re: StandardAnalyzer question

Mark Miller-3
In reply to this post by Ngo, Anh (ISS Southfield)
| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff"
      ]
  >

becomes

| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "\u002d"
      ]
  >


Re: StandardAnalyzer question

Doron Cohen
"\u002d" would add "-".
The original request was for "_", which is "\u005f".
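
One way to double-check which escape belongs to which character before
editing the grammar is a few lines of plain Java. This is only a convenience
sketch; the class CodePointSketch and its escape method are hypothetical
helpers, not part of Lucene or JavaCC.

```java
// Hypothetical helper: prints a character as a four-digit hex escape,
// in the same form used in the #LETTER character list.
final class CodePointSketch {
    static String escape(char c) {
        // The doubled backslash keeps the compiler from reading this
        // format string as a Unicode escape.
        return String.format("\\u%04x", (int) c);
    }

    public static void main(String[] args) {
        System.out.println("hyphen:     " + escape('-'));
        System.out.println("underscore: " + escape('_'));
    }
}
```

The hyphen comes out as code point 002d and the underscore as 005f, which
matches the correction above.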

RE: StandardAnalyzer question

Ngo, Anh (ISS Southfield)
In reply to this post by Ngo, Anh (ISS Southfield)
I tried it and recompiled the whole package, but it did not work.

My #LETTER is:

| < #LETTER:  // unicode letters
      [
       "\u0041"-"\u005a",
       "\u005f",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff"
      ]
  >

Or:

| < #LETTER:  // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "\u005f"
      ]
  >

Please help.



Anh Ngo


Re: StandardAnalyzer question

Mark Miller-3
What failed? Any error messages? Do you have JavaCC? Any info? Psychic
powers, don't fail me now...


-mark


RE: StandardAnalyzer question

Ngo, Anh (ISS Southfield)
In reply to this post by Ngo, Anh (ISS Southfield)

It works now.

Thank you very much.

I had forgotten to run javacc on StandardTokenizer.jj.


Sincerely,



Anh Ngo


