Potential bug in StandardTokenizerImpl

Shai Erera
Hi

This question was asked on the users mailing list, but I think it's a bug,
so I'll describe it here:

The following code should print the output of the StandardAnalyzer:

        // Lucene 2.x API: TokenStream.next() returns null at the end of the stream.
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }

If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
(which is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the
output is (wwwabccom,0,12,type=<ACRONYM>).

I think the behavior in the second case is incorrect for several reasons:
1. It recognizes the string incorrectly (no argument there).
2. It effectively prevents you from putting a URL at the end of a sentence,
which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, is of the form
A.B.C. and not ABC.DEF.
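
A likely explanation for the second output is JFlex's longest-match rule: when several rules match at the same position, the rule that consumes the most input wins. The sketch below imitates that with java.util.regex. The two patterns are assumptions standing in for the grammar's HOST and ACRONYM rules (with ALPHA taken to mean one or more letters); they are not copied from StandardTokenizerImpl.jflex, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of longest-match rule selection; not Lucene code.
class LongestMatchSketch {
    // Assumed stand-in for HOST: alphanumeric runs separated by dots.
    static final Pattern HOST = Pattern.compile("[\\p{L}\\p{N}]+(\\.[\\p{L}\\p{N}]+)+");
    // Assumed stand-in for ACRONYM, with ALPHA read as one or more letters.
    static final Pattern ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Length of the prefix of text matched by p, or -1 if p does not match at position 0.
    static int prefixLength(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.lookingAt() ? m.end() : -1;
    }

    // Returns the type whose rule consumes the most input; ties go to HOST in this sketch.
    static String classify(String text) {
        int host = prefixLength(HOST, text);
        int acronym = prefixLength(ACRONYM, text);
        if (host < 0 && acronym < 0) {
            return "NONE";
        }
        return acronym > host ? "<ACRONYM>" : "<HOST>";
    }

    public static void main(String[] args) {
        System.out.println(classify("www.abc.com"));  // HOST consumes 11 chars, ACRONYM only 8
        System.out.println(classify("www.abc.com.")); // ACRONYM consumes 12 chars, HOST only 11
    }
}
```

With the trailing dot, the acronym-shaped pattern covers all 12 characters while the host-shaped pattern stops at 11, which would account for the 0,12 offsets in the reported token.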

I looked at StandardTokenizerImpl.jflex and I think the problem comes from
this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+

Notice how the comment describes acronyms as U.S.A., I.B.M. and not
anything else. I believe that changing the definition to
ACRONYM    =  {LETTER} "." ({LETTER} ".")+
would solve the problem.
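
The difference between the two definitions can be sketched with ordinary Java regular expressions. This assumes ALPHA expands to one or more letters and LETTER to a single letter, which is what the reported behavior suggests; the patterns and the class name below are illustrative and are not taken from the grammar file.

```java
import java.util.regex.Pattern;

// Illustrative comparison of the current and proposed ACRONYM shapes; not Lucene code.
class AcronymDefinitionSketch {
    // Current shape (assumed): letter runs separated by dots, ending in a dot.
    static final Pattern CURRENT = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Proposed shape (assumed): single letters separated by dots, ending in a dot.
    static final Pattern PROPOSED = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean matchesCurrent(String s) {
        return CURRENT.matcher(s).matches();
    }

    static boolean matchesProposed(String s) {
        return PROPOSED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Print how each sample is treated by the two shapes.
        for (String s : new String[] {"U.S.A.", "I.B.M.", "www.abc.com."}) {
            System.out.println(s + " current=" + matchesCurrent(s)
                    + " proposed=" + matchesProposed(s));
        }
    }
}
```

Under these assumptions both shapes accept U.S.A. and I.B.M., while only the current shape accepts www.abc.com. as an acronym.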

What do you think? Am I wrong?

Shai Erera

Re: Potential bug in StandardTokenizerImpl

hossman

: If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>)
: (which is correct in my opinion).
: However, if you pass "www.abc.com." (notice the extra '.' at the end), the
: output is (wwwabccom,0,12,type=<ACRONYM>).

see also...
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

one hitch with potentially changing this now is that it would break
some searches in applications that have existing indexes built using
previous versions.



-Hoss



Re: Potential bug in StandardTokenizerImpl

Shai Erera
I understand it would change the behavior of existing search solutions,
however the current behavior is just wrong. An ACRONYM cannot be ABC.DEF. If
you look up acronym in Wikipedia, you find only examples like I.B.M. / U.S.A.,
or NATO, IBM, USA, but nothing of the form StandardAnalyzer currently
recognizes.

There are several ways to make this change:
1. Create a new analyzer that fixes the problem - that way, applications
that are fine with the current behavior don't have to use it, while those
that want the correct behavior can. This is not my favorite solution, but I
think it would be preferable to simply not fixing it.
2. Fix it in the new version (2.3) and specifically mention that in the
release notes. Aren't there releases where applications need to re-build the
index because of fundamental changes?

Am I the only one who thinks that?

BTW, I changed the definition in the jflex file and recompiled using jflex
and it indeed solved the problem. It now recognizes www.abc.com. and
www.abc.com as hosts. I can attach the 'patch' files if you'd like to
compare.

In reply to Chris Hostetter's post of Nov 27, 2007 9:07 AM.


--
Regards,

Shai Erera

Re: Potential bug in StandardTokenizerImpl

Eugenio Martinez
In reply to this post by Shai Erera

I am the one who raised the question about the Acronym - Host detection anomaly in the StandardAnalyzer class.

Thanks to Shai Erera for bringing the discussion to the developers' list. I am surprised by Chris Hostetter's response, as this issue was discussed by Erik Hatcher on November 22, 2005. I am now working through Hatcher's superb book, Lucene in Action, trying to work around this issue, but I can't believe it hasn't been fixed yet.

As I explained on the users' list, I've found that indexing fails to include certain emails and words that are present in the log files when I run an IndexWriter over a huge directory of logs. While trying to isolate this bug, I came across the acronym interpretation issue. There may be more hidden anomalies in StandardAnalyzer's behavior under such a huge load.

At this point I can say the behavior is deterministic: I can reproduce it across subsequent index and search calls, and it occurs with the same words and emails every time. Could it be a side effect of document vectorization, since the logs are not natural language? Since Lucene computes whether a token conveys relevant information (as the vector space model states), could Lucene have decided that these tokens are not relevant? All of this assuming it works correctly, of course...

Any ideas about this, or have you heard of anything similar?

Thanks and regards.

Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

TFN: 981 173 206            FAX: 981 173 223

VIDEOCONFERENCIA: 981 173 596


Re: Potential bug in StandardTokenizerImpl

Grant Ingersoll-2
In reply to this post by Shai Erera
Yes, please open a JIRA issue and submit your patches.

I wonder if there is any way to deprecate functionality in a JFlex
grammar?  That is, is there any way we can communicate to people that
both will be supported through 2.9 and then only the correct way will be
supported in 3.x?

-Grant

In reply to Shai Erera's post of Nov 27, 2007, 2:18 AM.

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Potential bug in StandardTokenizerImpl

Shai Erera
Ok

I opened https://issues.apache.org/jira/browse/LUCENE-1068 and attached the
patch files.
I don't know if and how you can deprecate a JFlex grammar though.

In reply to Grant Ingersoll's post of Nov 27, 2007 1:43 PM.


--
Regards,

Shai Erera

Re: Potential bug in StandardTokenizerImpl

hossman
In reply to this post by Eugenio Martinez

: Thanks to Shai Erera for traslating the discussion into the developers'
: list. I am surprised about Chris Hostetter's response, as this issue was

to clarify: i'm not saying that the current behavior is ideal, or even
correct -- i'm saying the current behavior is the current behavior, and
changing it could easily break existing indexes -- something that the
Lucene upgrade contract does not allow...

http://wiki.apache.org/lucene-java/BackwardsCompatibility

specifically: if someone built an index with 2.2, that index needs to work
when queried by an app running 2.3 .. if we change the StandardTokenizer
to treat this differently, that won't work.

In some cases, being backwards compatible is more important than being
"correct" ... i'm not 100% certain that this is one of those cases, i'm
just pointing out that there is more to this issue than just a one line
patch to some code.
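
The compatibility concern can be pictured with a toy index: terms written by the old tokenization stay in the index, while queries are analyzed by whichever version the application runs. The sketch uses a plain Set in place of a real index. The two term-producing methods are hypothetical stand-ins: the old term is the output reported earlier in the thread, and the new term is an assumption about what the patched tokenizer would emit for a host followed by a dot.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of an index/query analyzer mismatch; not Lucene code.
class AnalyzerMismatchSketch {
    // Old behavior reported in the thread: "www.abc.com." becomes the term "wwwabccom".
    static String oldTerm(String text) {
        return text.endsWith(".") ? text.replace(".", "") : text;
    }

    // Assumed patched behavior: the trailing dot is dropped and the host is kept intact.
    static String newTerm(String text) {
        return text.endsWith(".") ? text.substring(0, text.length() - 1) : text;
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>();
        index.add(oldTerm("www.abc.com.")); // document indexed with the old analyzer

        // The same text analyzed at query time by each version.
        System.out.println(index.contains(oldTerm("www.abc.com."))); // old analyzer: term found
        System.out.println(index.contains(newTerm("www.abc.com."))); // new analyzer: term missed
    }
}
```

The document is still in the index, but a query analyzed the new way produces a term that was never written, so the search silently returns nothing until the index is rebuilt.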


-Hoss



Re: Potential bug in StandardTokenizerImpl

Shai Erera
I agree that being backward compatible is important. But ... I also work at
a company that delivers search solutions to many customers. Sometimes,
customers are told that a specific fix will require them to rebuild
their indexes. Customers can then choose whether to install the fix or not.
However, from your statement I gather that once Lucene 3.0 is
out, we won't have to be backward compatible, and that fix can go into that
release ... if I'm right, then someone can mark that issue for 3.0 and not
2.3 (I'm not sure I have the permissions to do so).

Isn't there a way to include a fix that you can choose whether to install or
not? For example, I may want to download 2.3 (when it's out) and apply this
patch only. I'm sure there's a way to do it. If there is, we could publish
this as official in 3.0 and as an optional patch for 2.3 (I fixed it only in
jflex, but can easily produce a patch for the .jj file, so it will fix
version 2.2 as well).

My only concern is that this patch will get lost if we don't mark it for any
release ...

Shai

In reply to Chris Hostetter's post of Nov 28, 2007 9:18 PM.



--
Regards,

Shai Erera

Re: Potential bug in StandardTokenizerImpl

Grant Ingersoll-2
Yeah, one of the things that I am not thrilled about in our model is that
it essentially means we can only make these kinds of changes on 3.0-dev
(i.e. before releasing 3.0). That is not a big deal in theory, but as
evidenced by Hoss's history on this particular item, it has been
around for a long time.  So, either we need to get better about
marking things for that window right before a major release and fixing
them at that time or we need some other way of addressing it.

Is there no way in JFlex to set a flag that defaults to the current
way, and if set does the proper thing?  And then we could
deprecate the old way?
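
One way to picture such a flag, independent of what JFlex itself supports: keep both rule shapes and pick between them with a boolean that defaults to the existing behavior. Everything below is a hypothetical sketch in plain Java. The class, the flag name, and the regex stand-ins for the grammar rules are assumptions, and no such option is claimed to exist in JFlex or Lucene.

```java
import java.util.regex.Pattern;

// Hypothetical flag-controlled acronym check; not a JFlex or Lucene API.
class FlaggedAcronymSketch {
    // Assumed stand-in for the existing rule: letter runs separated by dots.
    private static final Pattern LEGACY = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Assumed stand-in for the corrected rule: single letters separated by dots.
    private static final Pattern STRICT = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    private final boolean strictAcronyms;

    /** Defaults to the legacy behavior so that existing indexes keep matching. */
    @Deprecated
    FlaggedAcronymSketch() {
        this(false);
    }

    FlaggedAcronymSketch(boolean strictAcronyms) {
        this.strictAcronyms = strictAcronyms;
    }

    // Applies whichever rule shape the flag selects.
    boolean isAcronym(String token) {
        Pattern p = strictAcronyms ? STRICT : LEGACY;
        return p.matcher(token).matches();
    }

    public static void main(String[] args) {
        FlaggedAcronymSketch legacy = new FlaggedAcronymSketch();
        FlaggedAcronymSketch strict = new FlaggedAcronymSketch(true);
        System.out.println(legacy.isAcronym("www.abc.com.")); // legacy rule accepts it
        System.out.println(strict.isAcronym("www.abc.com.")); // strict rule rejects it
        System.out.println(strict.isAcronym("U.S.A."));       // strict rule still accepts real acronyms
    }
}
```

Marking the no-argument constructor as deprecated is one way to signal that the legacy default is on its way out, while applications with existing indexes keep working until they opt in.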

-Grant

In reply to Shai Erera's post of Nov 29, 2007, 12:57 AM.
