[jira] Created: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)
Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default
-------------------------------------------------------------------

                 Key: LUCENE-1151
                 URL: https://issues.apache.org/jira/browse/LUCENE-1151
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.4


Coming out of the discussion around back compatibility, it seems best to default StandardAnalyzer to properly fix LUCENE-1068, while preserving the ability to get the back-compatible behavior in the rare event that it's desired.

This just means changing replaceInvalidAcronym = false to = true, and adding a clear entry to CHANGES.txt noting that this very slight non-back-compatible change took place.

Spinoff from here:

    http://www.gossamer-threads.com/lists/lucene/java-dev/57517#57517

I'll commit that change in a day or two.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Updated: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1151:
---------------------------------------

    Attachment: LUCENE-1151.patch

Attached a patch that fixes the original bug (LUCENE-1068) by default, but offers a system property and a static method to keep the backwards-compatible, yet buggy, behavior.

I'll commit in a day or two.
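
The patch itself is not reproduced in this thread. As a rough sketch of the mechanism described above (fixed behavior by default, with a system property and a static method to get the old behavior back), the defaulting logic might look something like the following. The class name, the property name, and the method names are placeholders chosen for illustration and are not necessarily those used in the patch.

```java
class AcronymDefaultSketch {
    // Placeholder property name, used only for this illustration.
    static final String PROP = "example.replaceInvalidAcronym";

    // Resolved once at class-load time, as a static initializer would do.
    private static boolean defaultReplaceInvalidAcronym =
            resolveDefault(System.getProperty(PROP));

    // Default to true (the fixed behavior) unless the property is set
    // to something other than "true".
    public static boolean resolveDefault(String propertyValue) {
        return propertyValue == null || propertyValue.equals("true");
    }

    public static boolean getDefaultReplaceInvalidAcronym() {
        return defaultReplaceInvalidAcronym;
    }

    // Programmatic opt-out for callers that still need the old token types.
    public static void setDefaultReplaceInvalidAcronym(boolean value) {
        defaultReplaceInvalidAcronym = value;
    }

    public static void main(String[] args) {
        System.out.println("default: " + getDefaultReplaceInvalidAcronym());
        setDefaultReplaceInvalidAcronym(false);
        System.out.println("after opt-out: " + getDefaultReplaceInvalidAcronym());
    }
}
```

In this sketch, starting the JVM with -Dexample.replaceInvalidAcronym=false would select the old behavior without any code change.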

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563548#action_12563548 ]

Grant Ingersoll commented on LUCENE-1151:
-----------------------------------------

Not necessarily related, but can you think of a way that we can keep WikipediaTokenizer and StandardTokenizer in sync for these kinds of things?  I guess I need to go look in JFlex to see if there is a way to do inheritance.  Essentially, I want the WikipediaTokenizer to be StandardTokenizer plus handling of the Wiki syntax.

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563576#action_12563576 ]

Michael McCandless commented on LUCENE-1151:
--------------------------------------------

Very good question ... I don't know.  It would be awesome (and, amazing) if JFlex enabled some kind of inheritance.

Since WikipediaTokenizer doesn't have the backwards-compatibility requirement of StandardTokenizer, you can presumably just fix ACRONYM in WikipediaTokenizer to not match host names (i.e., hardwire it to "true")?

Re: [jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Grant Ingersoll-2

On Jan 29, 2008, at 12:10 PM, Michael McCandless (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563576#action_12563576 ]
>
> Michael McCandless commented on LUCENE-1151:
> --------------------------------------------
>
> Very good question ... I don't know.  It would be awesome (and,  
> amazing) if JFlex enabled some kind of inheritance.

I asked on the JFlex user list (http://sourceforge.net/mailarchive/forum.php?forum_name=jflex-users), but I don't see it in the docs anywhere.

>
>
> Since WikipediaTokenizer doesn't have the backwards compat  
> requirement of StandardTokenizer, you can presumably just fix  
> ACRONYM in WikipediaTokenizer to not match host names (ie hardwire  
> to "true")?

Yes.


[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563625#action_12563625 ]

Grant Ingersoll commented on LUCENE-1151:
-----------------------------------------

Here's the thread on JFlex for completeness, not that it affects this patch: http://sourceforge.net/mailarchive/forum.php?thread_name=272037D7-6EA1-4D19-902F-B425A5309C2A%40apache.org&forum_name=jflex-users

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564045#action_12564045 ]

Jörg Prante commented on LUCENE-1151:
-------------------------------------

Hi Grant,

Have you looked at the JFlex %implements and %extends directives?

I have used %implements successfully for inheritance when building my parsers, where the tokens are all constants in an interface generated not by JFlex but by a parser generator.

For example:

%%
%class ECQLLexer
%implements ECQLTokens
%unicode
%integer
%eofval{
    return 0;
%eofval}
%line
%column

I am quite sure %extends could also be used to build a tokenizer family.

See http://jflex.de/manual.html
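
For what it is worth, the JFlex manual describes %extends as setting the superclass of the generated scanner class and %implements as adding interfaces to it, so what a tokenizer family would share this way is Java members (constants, helper methods), not lexical rules. A minimal sketch of the resulting class shape, with a hand-written class standing in for the generated scanner; every name below is hypothetical:

```java
class TokenizerFamilySketch {
    // Token type constants shared by the family, analogous to the
    // interface named in an %implements directive.
    interface TokenTypes {
        int ALPHANUM = 0;
        int ACRONYM = 1;
        int HOST = 2;
    }

    // Common base class, analogous to the class named in an %extends directive.
    static abstract class TokenizerBase implements TokenTypes {
        // Helper that the action code of any scanner in the family could call.
        public static String typeName(int type) {
            switch (type) {
                case ALPHANUM: return "<ALPHANUM>";
                case ACRONYM:  return "<ACRONYM>";
                case HOST:     return "<HOST>";
                default:       return "<UNKNOWN>";
            }
        }
    }

    // Stand-in for a class JFlex would generate from a spec that names
    // TokenizerBase in %extends. Real generated code would also contain
    // the scanner tables and the rule actions.
    static class WikiScanner extends TokenizerBase {
        public String describe(int type) {
            return "wiki:" + typeName(type);
        }
    }

    public static void main(String[] args) {
        System.out.println(new WikiScanner().describe(TokenTypes.HOST));
    }
}
```

The rules themselves (for example the ACRONYM and HOST macros) would still have to be kept in sync between the two grammar files by hand or by some preprocessing step.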

[jira] Resolved: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1151.
----------------------------------------

    Resolution: Fixed

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628568#action_12628568 ]

Mark Lassau commented on LUCENE-1151:
-------------------------------------

Michael,
Great work. I am glad we are moving to have the bug fixed by default, rather than the other way around.

Please indulge me a couple of small nitpicks before I get to my main point in another comment:
* Your comment above the static initializer is not correct:
{noformat}
  // Default to false (fixed the bug), unless the system prop is set
{noformat}
should read:
{noformat}
  // Default to true (fixed the bug), unless the system prop is set
{noformat}
* The re-use of the variable a in TestStandardAnalyzer.testDomainNames() does not really guarantee that you are testing the default behaviour of StandardAnalyzer.
I would recommend resetting a in setUp(), or explicitly constructing it in the test method.

Given that the code is "temporary" until v3.0, feel free to ignore me ;)
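
To illustrate the second nitpick, here is a small stand-alone sketch of why a shared fixture can end up testing something other than the default. It does not use the real test class or JUnit; the analyzer class is a stand-in, and it assumes (hypothetically) that some other test reassigns the shared field to a non-default analyzer.

```java
class FreshAnalyzerSketch {
    // Stand-in for the static default discussed in this issue.
    static boolean defaultReplaceInvalidAcronym = true;

    // Stand-in analyzer that captures its setting at construction time.
    static final class SketchAnalyzer {
        final boolean replaceInvalidAcronym;

        SketchAnalyzer() {
            this(defaultReplaceInvalidAcronym);
        }

        SketchAnalyzer(boolean replaceInvalidAcronym) {
            this.replaceInvalidAcronym = replaceInvalidAcronym;
        }
    }

    // Shared fixture, like a test-class field used by several test methods.
    static SketchAnalyzer a = new SketchAnalyzer();

    // Assumed earlier test: swaps in a non-default analyzer and never restores it.
    static void testOldBehavior() {
        a = new SketchAnalyzer(false);
    }

    // Reusing the shared field: what gets tested depends on test order.
    static boolean domainNamesWithSharedFixture() {
        return a.replaceInvalidAcronym;
    }

    // Constructing locally: always exercises the current default.
    static boolean domainNamesWithFreshAnalyzer() {
        return new SketchAnalyzer().replaceInvalidAcronym;
    }

    public static void main(String[] args) {
        testOldBehavior();
        System.out.println("shared fixture uses fix: " + domainNamesWithSharedFixture());
        System.out.println("fresh analyzer uses fix: " + domainNamesWithFreshAnalyzer());
    }
}
```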

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628573#action_12628573 ]

Mark Lassau commented on LUCENE-1151:
-------------------------------------

I love the solution you have come up with, but would suggest that it be moved to StandardTokenizer instead of StandardAnalyzer.
StandardTokenizer is the class with the actual problem. Fixing it there would mean that everyone who uses StandardTokenizer gets the fix by default, not just users of StandardAnalyzer.

For instance, see LUCENE-1373, where most of the contrib Analyzers still suffer the buggy behavior with no workaround available.
I think that moving your "defaulting logic" to the tokenizer would fix all these Analyzers in one fell swoop.
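
As a rough sketch of what I mean (stand-in classes only, not the real Lucene ones; all names here are hypothetical): if the tokenizer's own no-argument construction consults the default, then any analyzer that builds the tokenizer that way picks up the fix without being changed itself.

```java
class TokenizerDefaultSketch {
    // The default lives with the tokenizer rather than with one analyzer.
    static boolean defaultReplaceInvalidAcronym = true;

    static class SketchTokenizer {
        final boolean replaceInvalidAcronym;

        // Plain construction picks up the default.
        SketchTokenizer() {
            this(defaultReplaceInvalidAcronym);
        }

        // Explicit override remains available.
        SketchTokenizer(boolean replaceInvalidAcronym) {
            this.replaceInvalidAcronym = replaceInvalidAcronym;
        }
    }

    // Stand-in for the standard analyzer.
    static class SketchStandardAnalyzer {
        SketchTokenizer tokenStream() {
            return new SketchTokenizer();
        }
    }

    // Stand-in for a contrib analyzer that knows nothing about the flag.
    static class SketchContribAnalyzer {
        SketchTokenizer tokenStream() {
            return new SketchTokenizer();
        }
    }

    public static void main(String[] args) {
        System.out.println("standard: " + new SketchStandardAnalyzer().tokenStream().replaceInvalidAcronym);
        System.out.println("contrib:  " + new SketchContribAnalyzer().tokenStream().replaceInvalidAcronym);
    }
}
```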

I would provide suggested patches, but I am just about to go on holidays for 3 weeks. Is there a planned release date for v2.3.3 or v2.4?

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628590#action_12628590 ]

Mark Lassau commented on LUCENE-1151:
-------------------------------------

Added a patch to LUCENE-1373 which moves the logic introduced here from StandardAnalyzer to StandardTokenizer.

[jira] Commented: (LUCENE-1151) Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628857#action_12628857 ]

Michael McCandless commented on LUCENE-1151:
--------------------------------------------

> Please indulge me a couple of small nitpicks before I get to my main point in another comment

Thanks for catching these Mark -- I'll commit a fix shortly.
