[jira] Created: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)
Invalid behavior of StandardTokenizerImpl
-----------------------------------------

                 Key: LUCENE-1068
                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
            Reporter: Shai Erera


The following code prints the output of StandardAnalyzer:

        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }

If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).

I think the behavior in the second case is incorrect for several reasons:
1. It classifies the string incorrectly (there should be no argument about that).
2. It effectively prevents you from putting a URL at the end of a sentence, which is perfectly legal.
3. An ACRONYM, at least to the best of my understanding, has the form A.B.C. and not ABC.DEF.

I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+

Notice that the comment describes acronyms as U.S.A. and I.B.M., that is, single letters separated by dots, and nothing else. I changed the definition to
ACRONYM    =  {LETTER} "." ({LETTER} ".")+
and it solved the problem.
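
To illustrate why the one-line change matters, here is a minimal sketch that approximates the two definitions with java.util.regex. It is not the JFlex-generated scanner. It assumes that ALPHA expands to one or more letters and LETTER to exactly one letter, and it uses \p{L} as a stand-in for the grammar's letter ranges; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Regex approximation of the old and new ACRONYM macros (illustration only).
class AcronymPatternSketch {
    // Assumption: {ALPHA} is one or more letters, so "www." and "abc." both qualify.
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Assumption: {LETTER} is exactly one letter, so only forms like "I.B.M." qualify.
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Print how each definition treats the inputs discussed in this issue.
        for (String s : new String[] {"www.abc.com.", "www.abc.com", "I.B.M."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions the old pattern accepts both "www.abc.com." and "I.B.M.", while the new one accepts only "I.B.M.", which is why a host name followed by a period is no longer swallowed by the ACRONYM rule.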

This was also reported here:
http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: standardTokenizerImpl.patch

This is the result of recompiling the fixed JFlex file. I'm not sure how useful this patch is, but I'm attaching it anyway.


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: standardTokenizerImpl.jflex.patch

This fixes the JFlex definition file. The change simply replaces:
ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
with
ACRONYM    =  {LETTER} "." ({LETTER} ".")+


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-2.patch

I've found a way to do it (I think):
I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs and fixed the current ACRONYM to identify proper ones.
I also marked ACRONYM_DEP as deprecated.
I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM, in case there are applications that count on the Token type.

I wrote this short test to verify that it works:
        public static void main(String[] args) throws Exception {
                parse("www.abc.com.");
                parse("www.abc.com");
                parse("I.B.M.");
        }

        public static void parse(String text) throws Exception {
                Analyzer analyzer = new StandardAnalyzer();
                TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
                Token t;
                while ((t = ts.next()) != null) {
                        System.out.println(t);
                }
        }
And the output is:
(www.abc.com.,0,12,type=<HOST>)
(www.abc.com,0,11,type=<HOST>)
(ibm,0,6,type=<ACRONYM>)


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-3.patch

The previous patch I posted was incorrect, since it would still break existing applications. The current patch does the following:
1. Introduces a new type, ACRONYM_DEP, which is deprecated and recognizes the old ACRONYM format.
2. Fixes ACRONYM to recognize {LETTER} "." ({LETTER} ".")+.
3. Adds a public member, replaceDepAcronym, to StandardTokenizer and StandardAnalyzer. An application can set it to choose whether the deprecated acronym format is treated as ACRONYM or HOST. The default behavior, if it is not set, is to recognize the old ACRONYM as HOST.

This is how it should be used:
        public static void main(String[] args) throws Exception {
                parse("www.abc.com.", false);
                parse("www.abc.com.", true);
                parse("www.abc.com", true);
                parse("I.B.M.", true);
        }

        public static void parse(String text, boolean replaceDepAcronym) throws Exception {
                StandardAnalyzer analyzer = new StandardAnalyzer();
                analyzer.replaceDepAcronym = replaceDepAcronym;
                TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
                Token t;
                while ((t = ts.next()) != null) {
                        System.out.println(t);
                }
        }
And here is the output:
(wwwabccom,0,12,type=<ACRONYM>)
(www.abc.com.,0,12,type=<HOST>)
(www.abc.com,0,11,type=<HOST>)
(ibm,0,6,type=<ACRONYM>)

The member is marked deprecated so we can remove it in the next release. Applications that want the new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to set it explicitly, and then remove that setting in the next major release.
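
The type selection described above can be sketched as a toy model. This is not the code in the patch: the regular expressions only approximate the grammar (the HOST pattern in particular is a simplification assumed for illustration), the class and method names are hypothetical, and the dot stripping done by the post-filter is not modeled. The flag semantics follow the sample output above, where passing true yields HOST and passing false yields the old ACRONYM result.

```java
import java.util.regex.Pattern;

// Toy model of how a token type could be chosen once ACRONYM_DEP exists.
class DepAcronymSketch {
    // New, strict acronym form: single letters separated by dots (I.B.M.).
    static final Pattern ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");
    // Deprecated form: letter runs separated by dots, with a trailing dot.
    static final Pattern ACRONYM_DEP = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Assumption: a simplified host form, letter runs joined by dots.
    static final Pattern HOST = Pattern.compile("\\p{L}+(\\.\\p{L}+)+");

    static String resolveType(String text, boolean replaceDepAcronym) {
        if (ACRONYM.matcher(text).matches()) {
            return "<ACRONYM>";
        }
        if (ACRONYM_DEP.matcher(text).matches()) {
            // The flag decides how the deprecated match is reported.
            return replaceDepAcronym ? "<HOST>" : "<ACRONYM>";
        }
        if (HOST.matcher(text).matches()) {
            return "<HOST>";
        }
        return "<OTHER>"; // placeholder for everything this sketch ignores
    }

    public static void main(String[] args) {
        System.out.println(resolveType("www.abc.com.", false));
        System.out.println(resolveType("www.abc.com.", true));
        System.out.println(resolveType("www.abc.com", true));
        System.out.println(resolveType("I.B.M.", true));
    }
}
```

The strict ACRONYM check runs first so that a proper acronym such as I.B.M. is never affected by the flag; only inputs that match the deprecated form alone change type.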

I think that solves it. How should I proceed?


[jira] Assigned: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-1068:
---------------------------------------

    Assignee: Grant Ingersoll


[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550192 ]

Grant Ingersoll commented on LUCENE-1068:
-----------------------------------------

Hi Shai,

Thanks for the patch.  Can you please add unit tests in TestStandardAnalyzer?  

Also, if you run svn diff in the Lucene directory, it will generate a patch with relative paths that doesn't need to be modified (your patch has references to D:/ etc.).




[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550196 ]

Shai Erera commented on LUCENE-1068:
------------------------------------

Hi Grant,

I used Eclipse to generate the patch (right-click on
org.apache.lucene.analysis.standard, select Team and Create Patch). How do I
run svn diff? Can I do it from inside Eclipse or should I install SVN
cmd-line tools?




--
Regards,

Shai Erera



[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202 ]

Grant Ingersoll commented on LUCENE-1068:
-----------------------------------------

Hmmm, maybe there is a way in Eclipse to make the path relative to the  
working directory?  Otherwise, from the command line in the Lucene  
directory:  svn diff > StandardTokenizer-4.patch

-Grant



--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizer-test-4.patch
                StandardTokenizer-java-4.patch

Code files under the java and test packages. These patches should be applied under "src".


Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Shai Erera
Hi

I attached two patch files (for "java" and "test"). Due to a problem in my
checkout project in Eclipse, I don't have them under "src".
I also added a test and modified two tests in TestStandardAnalyzer.

On Dec 10, 2007 11:44 PM, Grant Ingersoll (JIRA) <[hidden email]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550202]
>
> Grant Ingersoll commented on LUCENE-1068:
> -----------------------------------------
>
> Hmmm, maybe there is a way in Eclipse to make the path relative to the
> working directory?  Otherwise, from the command line in the Lucene
> directory:  svn diff > StandardTokenizer-4.patch
>
> -Grant
>
>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>


--
Regards,

Shai Erera

[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550948 ]

Michael Busch commented on LUCENE-1068:
---------------------------------------

{quote}
The member is marked deprecated so we can remove it in the next release. Applications that would like the new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it, and in the next major release remove it.
{quote}

Doesn't this mean it is an API change if we make the new behavior the default? Apps that upgrade will see the new behavior unless they set replaceDepAcronym.

To be fully backwards compatible I think this patch should use the old behavior as default. Then in 3.0 we can make the new behavior the default.


Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Grant Ingersoll-2

On Dec 12, 2007, at 7:24 AM, Michael Busch (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550948 ]
>
> Michael Busch commented on LUCENE-1068:
> ---------------------------------------
>
> {quote}
> The member is marked deprecated so we can remove it in the next  
> release. Applications that would like the new behavior need to do
> nothing, and therefore will not be impacted once we remove that  
> member. Applications that want the old behavior need to explicitly  
> set it and in the next major release remove it.
> {quote}
>
> Doesn't this mean it is an API change if we make the new behavior  
> the default? Apps that upgrade will see the new behavior unless they
> set replaceDepAcronym.
>
> To be fully backwards compatible I think this patch should use the  
> old behavior as default. Then in 3.0 we can make the new behavior  
> the default.

+1


Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Shai Erera
Hi

Assuming "+1" means I agree (forgive me for the lack of familiarity with the
jargon), I'll make a new patch shortly.

On Dec 12, 2007 3:14 PM, Grant Ingersoll <[hidden email]> wrote:

> +1


--
Regards,

Shai Erera

Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Michael Busch
Shai Erera wrote:
> Hi
>
> Assuming "+1" means I agree (forgive me for the lack of familiarity with the
> jargon), I'll make a new patch shortly.
>

Yes it does ;). OK, please provide a new patch, then we can get it into 2.3.

-Michael


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1068:
----------------------------------

    Fix Version/s: 2.3
         Priority: Minor  (was: Major)
    Lucene Fields: [Patch Available]  (was: [Patch Available, New])


[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-5.patch

Changed the default behavior to match the current behavior. Applications that want to use the new definitions of HOST and ACRONYM should set StandardAnalyzer.replaceDepAcronym = true.
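
As a rough illustration of why the two ACRONYM definitions from the issue description behave differently, the sketch below approximates them with java.util.regex. The patterns are simplified ASCII-only stand-ins, assuming {ALPHA} matches a run of letters and {LETTER} a single letter; the class and method names are made up for this example and are not part of Lucene.

```java
import java.util.regex.Pattern;

public class AcronymPatternDemo {
    // Stand-in for: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    // Assumption: {ALPHA} is one or more letters (ASCII only here).
    public static final Pattern OLD_ACRONYM =
            Pattern.compile("[A-Za-z]+\\.([A-Za-z]+\\.)+");

    // Stand-in for: ACRONYM = {LETTER} "." ({LETTER} ".")+
    // Assumption: {LETTER} is exactly one letter (ASCII only here).
    public static final Pattern NEW_ACRONYM =
            Pattern.compile("[A-Za-z]\\.([A-Za-z]\\.)+");

    public static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    public static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : samples) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

With these stand-ins, "www.abc.com." matches only the old pattern, while "U.S.A." matches both, which is in line with the behavior reported in the issue.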


Re: [jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Shai Erera
In reply to this post by Michael Busch
Done

Thanks,
Shai

On Dec 12, 2007 3:59 PM, Michael Busch <[hidden email]> wrote:

> Shai Erera wrote:
> > Hi
> >
> > Assuming "+1" means I agree (forgive me for the lack of familiarity with the
> > jargon), I'll make a new patch shortly.
> >
>
> Yes it does ;). OK, please provide a new patch, then we can get it into
> 2.3.
>
> -Michael


--
Regards,

Shai Erera

[jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1068:
------------------------------------

    Attachment: LUCENE-1068.patch

Applied patch.  Updated some documentation.  Changed it to use a private boolean along with getters and setters, plus added some new constructors.  All of these should be deprecated and marked as being removed in 3.x.
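
A minimal sketch of the shape described here, assuming a private flag that defaults to the old behavior, with a deprecated getter, setter, and constructor. SketchAnalyzer and its member names are hypothetical; this is not the code in LUCENE-1068.patch.

```java
public class AcronymFlagSketch {

    /** Hypothetical analyzer-like class, used only to show the API shape. */
    public static class SketchAnalyzer {
        // false keeps the pre-patch behavior, so upgrading changes nothing.
        private boolean replaceDepAcronym = false;

        /** Existing callers keep using this constructor and see no change. */
        public SketchAnalyzer() {
        }

        /**
         * Opt-in constructor.
         * @deprecated intended to go away once only the new behavior remains.
         */
        @Deprecated
        public SketchAnalyzer(boolean replaceDepAcronym) {
            this.replaceDepAcronym = replaceDepAcronym;
        }

        /** @deprecated intended to go away together with the flag. */
        @Deprecated
        public boolean isReplaceDepAcronym() {
            return replaceDepAcronym;
        }

        /** @deprecated intended to go away together with the flag. */
        @Deprecated
        public void setReplaceDepAcronym(boolean replaceDepAcronym) {
            this.replaceDepAcronym = replaceDepAcronym;
        }
    }

    public static void main(String[] args) {
        SketchAnalyzer legacy = new SketchAnalyzer();      // old behavior by default
        SketchAnalyzer optedIn = new SketchAnalyzer(true); // explicitly asks for new behavior
        System.out.println("legacy flag: " + legacy.isReplaceDepAcronym());
        System.out.println("opted-in flag: " + optedIn.isReplaceDepAcronym());
    }
}
```

Keeping the field private means the flag can later be dropped by removing the deprecated members, without callers depending on a public static field.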

I will apply the patch tomorrow or Friday unless I hear objections.


[jira] Commented: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12553978 ]

Grant Ingersoll commented on LUCENE-1068:
-----------------------------------------

StandardTokenizer also incorrectly marks numbers as HOST.

For example, on line 108 of TestStandardAnalyzer, the type of 21.35 is HOST when I think it should be NUM.
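
One plausible way such a misclassification can arise is that a host-style pattern which allows digits in its dot-separated parts also accepts plain decimals. The thread does not show the HOST or NUM definitions, so the two patterns below are hypothetical, simplified ASCII stand-ins written with java.util.regex, and the class name is made up for this sketch.

```java
import java.util.regex.Pattern;

public class HostVersusNumSketch {
    // Hypothetical host-like pattern: alphanumeric parts separated by dots.
    public static final Pattern HOST_LIKE =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");

    // Hypothetical decimal pattern: digits, a dot, digits.
    public static final Pattern DECIMAL =
            Pattern.compile("[0-9]+\\.[0-9]+");

    public static boolean looksLikeHost(String s) {
        return HOST_LIKE.matcher(s).matches();
    }

    public static boolean looksLikeDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"21.35", "www.abc.com"}) {
            System.out.println(s + " host-like=" + looksLikeHost(s)
                    + " decimal=" + looksLikeDecimal(s));
        }
    }
}
```

Under these assumed patterns "21.35" satisfies both, so which type it receives would depend on how the grammar resolves the overlap.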
