StandardAnalyzer functionality change


kiwi clive
Hi all,

Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0 and I see that StandardAnalyzer has changed its behaviour, particularly when tokenizing email addresses. From reading the forums, I understand that StandardAnalyzer was renamed to ClassicAnalyzer. Is this the case?


If I pass the string '[hidden email]' through these analyzers, I get the following tokens:

Using StandardAnalyzer(Version.LUCENE_23):  -->  [hidden email] (one token)

Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com    (two tokens)
Using ClassicAnalyzer(Version.LUCENE_36):     -->  [hidden email]  (one token)

StandardAnalyzer is normally a good compromise as a default analyzer, but its failure to keep an email address intact makes it less fit for purpose than it used to be. Is this a bug or is it by design? If by design, what is the reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?

Thanks,
Clive

Re: StandardAnalyzer functionality change

Jack Krupansky-2
Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
technical term is "Unicode text segmentation"), period. As the javadoc says,
"As of Lucene version 3.1, this class implements the Word Break rules from
the Unicode Text Segmentation algorithm, as specified in Unicode Standard
Annex #29." That is a "standard".

See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

-- Jack Krupansky
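
For a feel of what segmenting on word boundaries does to an address, here is a minimal JDK-only sketch. It uses java.text.BreakIterator, whose word rules are similar in spirit to UAX #29 but are not the same as the grammar behind Lucene's StandardTokenizer, so treat whatever it prints as an illustration only. The class name WordBreakSketch and the sample address are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Rough JDK-only illustration of word-boundary segmentation; this is not Lucene's tokenizer. */
class WordBreakSketch {

    /** Splits text at word boundaries and keeps only segments that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Whitespace and punctuation-only segments are not words, so skip them.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder address (an assumption); print the segments to inspect them.
        System.out.println(words("user@domain.com"));
    }
}
```

The point of the sketch is only that a pure word-boundary segmenter has no notion of "email address" as a unit; anything that keeps an address whole needs extra rules on top.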


Re: StandardAnalyzer functionality change

Ian Lea
If you want email addresses, UAX29URLEmailAnalyzer is another alternative.


--
Ian.



Re: StandardAnalyzer functionality change

kiwi clive
Thanks for the responses, chaps. Very informative and much appreciated :-)


SortField.STRING

Carlos de Luna Saenz
I am migrating code from Lucene 3 to Lucene 4, and I have the following code that I don't know how to change:
 
hits = searcher.search(queryGlobal, searcher.maxDoc(),
                    new Sort(new SortField(ordenarPor, SortField.STRING)));
 
I already changed searcher.maxDoc() to indxr.maxDoc(), but SortField.STRING no longer exists. What should I use instead?
In the same code I have a class called SpanishAnalyzer:
public class SpanishAnalyzer extends Analyzer {
    public static final String[] SPANISH_STOP_WORDS = { "." };

    private Set<Object> stopTable = new HashSet<Object>();

    private Set<Object> exclTable = new HashSet<Object>();

    public SpanishAnalyzer() {
        stopTable = StopFilter.makeStopSet(Version.LUCENE_40, SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(Version version) {
        stopTable = StopFilter.makeStopSet(version, SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String[] stopWords) {
        stopTable = StopFilter.makeStopSet(Version.LUCENE_40, stopWords);
    }

    public SpanishAnalyzer(File stopWords) throws IOException {
        stopTable = new HashSet(WordlistLoader.getWordSet(new FileReader(stopWords), Version.LUCENE_40));
    }

//    @Override
//    public TokenStream tokenStream(String fieldName, Reader reader) {
//        return new LowerCaseFilter(Version.LUCENE_40, new ASCIIFoldingFilter(
//                new StopFilter(Version.LUCENE_40,
//                               new StandardTokenizer(Version.LUCENE_40, reader),
//                               stopTable)));
//    }

    @Override
    protected TokenStreamComponents createComponents(String string, Reader reader) {
        throw new UnsupportedOperationException("Not supported yet.");
    }
}

The problem is that I can no longer override tokenStream; I now have to implement the createComponents method, and I am not sure what I am supposed to do there. Thanks in advance for help with both problems.
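
The createComponents part of this question is not answered later in the thread, so here is a minimal sketch of how the commented-out tokenStream chain might be ported, assuming the Lucene 4.0 analysis API (lucene-core plus lucene-analyzers-common on the classpath). The class name SpanishAnalyzerSketch and its constructor are invented for illustration; the filter order is copied from the original code (stop words, then ASCII folding, then lowercasing).

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

/** Hypothetical port of the poster's analyzer to the Lucene 4.0 createComponents style. */
class SpanishAnalyzerSketch extends Analyzer {
    private final Version matchVersion;
    private final CharArraySet stopTable;

    SpanishAnalyzerSketch(Version matchVersion, String... stopWords) {
        this.matchVersion = matchVersion;
        // makeStopSet builds a case-sensitive set from the given words.
        this.stopTable = StopFilter.makeStopSet(matchVersion, stopWords);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The tokenizer is the source; each filter wraps the stream before it.
        Tokenizer source = new StandardTokenizer(matchVersion, reader);
        TokenStream result = new StopFilter(matchVersion, source, stopTable);
        result = new ASCIIFoldingFilter(result);
        result = new LowerCaseFilter(matchVersion, result);
        // Hand back both ends of the chain so Lucene can reuse them across documents.
        return new TokenStreamComponents(source, result);
    }
}
```

Because the stop filter runs before lowercasing in this order, stop words only match in the exact case given. If that is not intended, moving LowerCaseFilter directly after the tokenizer is one option.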

Re: SortField.STRING

Ian Lea
SortField.Type.STRING maybe?

Can't help with the other question.  It's generally best to send one
question per message.  Looking at the source code might help.


--
Ian.
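
To make the SortField suggestion concrete, here is a small sketch assuming the Lucene 4.0 search API, where the sort type constants live in the SortField.Type enum. The class and method names are hypothetical, and the limit is clamped to at least 1 on the assumption that a requested hit count of zero is rejected.

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

/** Hypothetical helper showing the Lucene 4.0 spelling of a string sort. */
class SortedSearchSketch {

    /** Lucene 3.x wrote SortField.STRING; in 4.0 the constant is SortField.Type.STRING. */
    static Sort stringSort(String fieldName) {
        return new Sort(new SortField(fieldName, SortField.Type.STRING));
    }

    /** Returns every matching document, ordered by the given string field. */
    static TopDocs searchSorted(IndexSearcher searcher, Query query, String fieldName)
            throws IOException {
        // maxDoc() is read from the underlying reader; clamp so an empty index still works.
        int limit = Math.max(1, searcher.getIndexReader().maxDoc());
        return searcher.search(query, limit, stringSort(fieldName));
    }
}
```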



Re: StandardAnalyzer functionality change

Jack Krupansky-2
In reply to this post by kiwi clive
I didn't explicitly say it, but ClassicAnalyzer does do exactly what you
want it to do - work break plus email and URL, or StandardAnalyzer plus
email and URL.

-- Jack Krupansky


Re: StandardAnalyzer functionality change

Jack Krupansky-2
s/work break/word break/

-- Jack Krupansky


Re: SortField.STRING

Carlos de Luna Saenz
In reply to this post by Ian Lea
Thanks... that's it... sorry to disturb you with something that simple.


Re: StandardAnalyzer functionality change

sarowe
In reply to this post by Jack Krupansky-2
Small correction: UAX29URLEmailAnalyzer = StandardAnalyzer + URL + Email. (Full support for URLs with the file:, ftp:, and http/s: protocols; full email support.)

ClassicAnalyzer is a different beast altogether.  First of all, it doesn't implement Unicode segmentation - it has a non-standard tokenizer that works okay for some English text.  It does recognize some (maybe most?) email addresses, but not all of them (e.g. the '+' character, a valid username char in email addresses, is not supported).  It does not recognize URLs, but rather domain names, aka hostnames.

Steve
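
One way to check these differences on your own data is to print the tokens each analyzer produces. The sketch below assumes Lucene 3.6 on the classpath (the version discussed in this thread); the class name AnalyzerComparison, the field name and the sample addresses are placeholders rather than anything from the original posts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.UAX29URLEmailAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/** Hypothetical helper that dumps the terms an analyzer emits for a piece of text. */
class AnalyzerComparison {

    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<String>();
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        try {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        } finally {
            ts.close();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder inputs (assumptions): a plain address and one with '+' in the user part.
        String[] samples = { "user@domain.com", "user+tag@domain.com" };
        Analyzer[] analyzers = {
            new StandardAnalyzer(Version.LUCENE_36),
            new ClassicAnalyzer(Version.LUCENE_36),
            new UAX29URLEmailAnalyzer(Version.LUCENE_36)
        };
        for (String sample : samples) {
            for (Analyzer analyzer : analyzers) {
                System.out.println(analyzer.getClass().getSimpleName()
                        + " on " + sample + " -> " + tokens(analyzer, sample));
            }
        }
    }
}
```

Running it against a sample of real field values before switching analyzers should show quickly whether ClassicAnalyzer's partial email support is enough or whether UAX29URLEmailAnalyzer is needed.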


Re: StandardAnalyzer functionality change

kiwi clive
I did some tests and found that, for our needs, ClassicAnalyzer was better (backwards compatible). Our analyzer uses different tokenizers on certain fields but falls back (or used to fall back) to StandardAnalyzer by default. ClassicAnalyzer will meet our needs, but I see we should move on to a newer implementation, such as the email-specific analyzer, going forward.

Thanks for the clarification.