[PATCH] Bug on BrazilianAnalyzer

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] Bug on BrazilianAnalyzer

Rafael Almeida
Following is the patch for what I think is a bug on the
BrazilianAnalyzer. The default stopwords list is all in lowercase, so
it will only work if the LowerCaseFilter comes first of if the
StopWordFilter is set to ignore case.

Since the LowerCaseFilter is instantiated anyway I just changed its
order. If there's some problem with that order, then please consider
setting StopWordFilter to ignore case.

Index: BrazilianAnalyzer.java
===================================================================
--- BrazilianAnalyzer.java (revision 718407)
+++ BrazilianAnalyzer.java (working copy)
@@ -131,10 +131,9 @@
  public final TokenStream tokenStream(String fieldName, Reader
reader) { TokenStream result = new StandardTokenizer( reader );
  result = new StandardFilter( result );
+ result = new LowerCaseFilter( result );
  result = new StopFilter( result, stoptable );
  result = new BrazilianStemFilter( result, excltable );
- // Convert to lowercase after stemming!
- result = new LowerCaseFilter( result );
  return result;
  }
 }

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Bug on BrazilianAnalyzer

Adriano Crestani
Hi Rafael,

What is your scenario?

Maybe it was defined this way so it do not filter uppercased stop words. Like, for example, the downcased word "se" is a stopword, but the uppercased "SE" stands for "Sergipe", a brazilian state, so it should not be filtered.

Best Regards,
Adriano Crestani

On Mon, Nov 17, 2008 at 3:39 PM, Rafael Cunha de Almeida <[hidden email]> wrote:
Following is the patch for what I think is a bug on the
BrazilianAnalyzer. The default stopwords list is all in lowercase, so
it will only work if the LowerCaseFilter comes first of if the
StopWordFilter is set to ignore case.

Since the LowerCaseFilter is instantiated anyway I just changed its
order. If there's some problem with that order, then please consider
setting StopWordFilter to ignore case.

Index: BrazilianAnalyzer.java
===================================================================
--- BrazilianAnalyzer.java      (revision 718407)
+++ BrazilianAnalyzer.java      (working copy)
@@ -131,10 +131,9 @@
       public final TokenStream tokenStream(String fieldName, Reader
reader) { TokenStream result = new StandardTokenizer( reader );
               result = new StandardFilter( result );
+               result = new LowerCaseFilter( result );
               result = new StopFilter( result, stoptable );
               result = new BrazilianStemFilter( result, excltable );
-               // Convert to lowercase after stemming!
-               result = new LowerCaseFilter( result );
               return result;
       }
 }

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Bug on BrazilianAnalyzer

Rafael Almeida
On Mon, 17 Nov 2008 19:58:47 -0800
"Adriano Crestani" <[hidden email]> wrote:

> Hi Rafael,
>
> What is your scenario?
>
> Maybe it was defined this way so it do not filter uppercased stop words.
> Like, for example, the downcased word "se" is a stopword, but the uppercased
> "SE" stands for "Sergipe", a brazilian state, so it should not be filtered.

Suppose you are right, but passing it through the LowerCaseFilter can
be useful too, specially if you don't care much about those corner
cases (the GermanAnalyzer, for instance, passes through
LowerCaseFilter first). The class being final doesn't allow to inherit
from it and make the changes if one needs to, which is unfortunate :-(.

I would like to see a change in this whole stemmer's and language
analyzer's API in order to make it more flexible and extensible. The
way it is you have to use them in that predeterminaded way.

It would be nice if there was only one StemFilter, a Stemmer interface
and all Stemmers were subclasses of that. Then, the StemFilter should
get its Stemmer as a constructor parameter. I see no reason for
BrazilianAnalyzer to be public.

Are you interested in those kind of changes? Do you agree with them?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Bug on BrazilianAnalyzer

Rafael Almeida
On Fri, 21 Nov 2008 16:46:30 -0200
Rafael Cunha de Almeida <[hidden email]> wrote:

> On Mon, 17 Nov 2008 19:58:47 -0800
> "Adriano Crestani" <[hidden email]> wrote:
>
> > Hi Rafael,
> >
> > What is your scenario?
> >
> > Maybe it was defined this way so it do not filter uppercased stop words.
> > Like, for example, the downcased word "se" is a stopword, but the uppercased
> > "SE" stands for "Sergipe", a brazilian state, so it should not be filtered.
>
> Suppose you are right, but passing it through the LowerCaseFilter can
> be useful too, specially if you don't care much about those corner
> cases (the GermanAnalyzer, for instance, passes through
> LowerCaseFilter first). The class being final doesn't allow to inherit
> from it and make the changes if one needs to, which is unfortunate :-(.
>
> I would like to see a change in this whole stemmer's and language
> analyzer's API in order to make it more flexible and extensible. The
> way it is you have to use them in that predeterminaded way.
>
> It would be nice if there was only one StemFilter, a Stemmer interface
> and all Stemmers were subclasses of that. Then, the StemFilter should
> get its Stemmer as a constructor parameter. I see no reason for
> BrazilianAnalyzer to be public.

To be final, sorry. I was a bit tired when I wrote all that.

> Are you interested in those kind of changes? Do you agree with them?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Bug on BrazilianAnalyzer

Adriano Crestani
Hi Rafael,

I kind of agree with you. Practically all the StemFilters have the same logic, they might be combined into only one class. All StemFilters seems to have a setStemmer already, we could keep that and also allow to pass the stemmer as a constructor paramenter, like you said. I think you can create a JIRA and  submit a patch for that, let's see what the lucene member will think about it  :)

Now, about the BrazilianAnalyzer being final, it's probably only because they wanted to increase the runtime performance, as long as final classes are faster once the JVM does not need to check for subclassing.

Best Regards,
Adriano Crestani Campos

On Fri, Nov 21, 2008 at 2:15 PM, Rafael Cunha de Almeida <[hidden email]> wrote:
On Fri, 21 Nov 2008 16:46:30 -0200
Rafael Cunha de Almeida <[hidden email]> wrote:

> On Mon, 17 Nov 2008 19:58:47 -0800
> "Adriano Crestani" <[hidden email]> wrote:
>
> > Hi Rafael,
> >
> > What is your scenario?
> >
> > Maybe it was defined this way so it do not filter uppercased stop words.
> > Like, for example, the downcased word "se" is a stopword, but the uppercased
> > "SE" stands for "Sergipe", a brazilian state, so it should not be filtered.
>
> Suppose you are right, but passing it through the LowerCaseFilter can
> be useful too, specially if you don't care much about those corner
> cases (the GermanAnalyzer, for instance, passes through
> LowerCaseFilter first). The class being final doesn't allow to inherit
> from it and make the changes if one needs to, which is unfortunate :-(.
>
> I would like to see a change in this whole stemmer's and language
> analyzer's API in order to make it more flexible and extensible. The
> way it is you have to use them in that predeterminaded way.
>
> It would be nice if there was only one StemFilter, a Stemmer interface
> and all Stemmers were subclasses of that. Then, the StemFilter should
> get its Stemmer as a constructor parameter. I see no reason for
> BrazilianAnalyzer to be public.

To be final, sorry. I was a bit tired when I wrote all that.

> Are you interested in those kind of changes? Do you agree with them?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Bug on BrazilianAnalyzer

Michael McCandless-2

Rafael,

Could you work these changes into a patch,  add a test case, and open  
a Jira issue?  Maybe first make the simple fixes (removing final,  
moving LowerCaseFilter up in the chain), and then as a 2nd issue this  
deeper refactoring of all StemFilters?  Thanks.

I agree the original issue (LowerCaseFilter coming after StopFilter)  
is a bug, though does the BrazilianStemFilter mind if all tokens  
coming it are now lowercased (I would assume not)?

Mike

Adriano Crestani wrote:

> Hi Rafael,
>
> I kind of agree with you. Practically all the StemFilters have the  
> same logic, they might be combined into only one class. All  
> StemFilters seems to have a setStemmer already, we could keep that  
> and also allow to pass the stemmer as a constructor paramenter, like  
> you said. I think you can create a JIRA and  submit a patch for  
> that, let's see what the lucene member will think about it  :)
>
> Now, about the BrazilianAnalyzer being final, it's probably only  
> because they wanted to increase the runtime performance, as long as  
> final classes are faster once the JVM does not need to check for  
> subclassing.
>
> Best Regards,
> Adriano Crestani Campos
>
> On Fri, Nov 21, 2008 at 2:15 PM, Rafael Cunha de Almeida <[hidden email]
> > wrote:
> On Fri, 21 Nov 2008 16:46:30 -0200
> Rafael Cunha de Almeida <[hidden email]> wrote:
>
> > On Mon, 17 Nov 2008 19:58:47 -0800
> > "Adriano Crestani" <[hidden email]> wrote:
> >
> > > Hi Rafael,
> > >
> > > What is your scenario?
> > >
> > > Maybe it was defined this way so it do not filter uppercased  
> stop words.
> > > Like, for example, the downcased word "se" is a stopword, but  
> the uppercased
> > > "SE" stands for "Sergipe", a brazilian state, so it should not  
> be filtered.
> >
> > Suppose you are right, but passing it through the LowerCaseFilter  
> can
> > be useful too, specially if you don't care much about those corner
> > cases (the GermanAnalyzer, for instance, passes through
> > LowerCaseFilter first). The class being final doesn't allow to  
> inherit
> > from it and make the changes if one needs to, which is  
> unfortunate :-(.
> >
> > I would like to see a change in this whole stemmer's and language
> > analyzer's API in order to make it more flexible and extensible. The
> > way it is you have to use them in that predeterminaded way.
> >
> > It would be nice if there was only one StemFilter, a Stemmer  
> interface
> > and all Stemmers were subclasses of that. Then, the StemFilter  
> should
> > get its Stemmer as a constructor parameter. I see no reason for
> > BrazilianAnalyzer to be public.
>
> To be final, sorry. I was a bit tired when I wrote all that.
>
> > Are you interested in those kind of changes? Do you agree with them?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]