[jira] Created: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

[jira] Created: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)
Fix SnowballAnalyzer casing behavior for Turkish Language
---------------------------------------------------------

                 Key: LUCENE-2117
                 URL: https://issues.apache.org/jira/browse/LUCENE-2117
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/*
    Affects Versions: 3.0
            Reporter: Simon Willnauer
            Priority: Minor
             Fix For: 3.1


LUCENE-2102 added a new TokenFilter that handles Turkish's unique casing behavior correctly. We should fix the casing behavior in SnowballAnalyzer too, since it supports a TurkishStemmer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2117:
--------------------------------

    Attachment: LUCENE-2117.patch

Patch for this bug:
* For Turkish, when Version >= 3.1, SnowballAnalyzer uses TurkishLowerCaseFilter instead of the standard LowerCaseFilter.
* Adds a javadoc note to SnowballFilter stating that it expects lowercased text to work (and that in the Turkish case you must use the special filter).
* Adds a contrib/analyzers dependency to contrib/snowball (perhaps not the best, but what is the other option?).
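
To illustrate why a Turkish-specific lowercase step matters, here is a minimal sketch that uses only the JDK rather than Lucene. The class name TurkishCasingDemo and its helper methods are made up for illustration, and locale-insensitive lowercasing is used only as a rough stand-in for a generic lowercase filter; this is not the code in the patch.

```java
import java.util.Locale;

public class TurkishCasingDemo {
    // Locale-insensitive lowercasing: 'I' always becomes the dotted 'i'.
    public static String lowerGeneric(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish-aware lowercasing: 'I' becomes dotless 'ı' (U+0131),
    // and dotted capital 'İ' (U+0130) becomes plain 'i'.
    public static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr-TR"));
    }

    public static void main(String[] args) {
        // "IŞIK" (Turkish for "light"), written with escapes to avoid
        // source-encoding problems.
        String word = "I\u015EIK";
        System.out.println(lowerGeneric(word)); // dotted i: not a Turkish word
        System.out.println(lowerTurkish(word)); // dotless ı: the expected form
    }
}
```

A stemmer that expects correctly lowercased Turkish input would see the wrong letters in the first case, which is presumably why the analyzer needs the Turkish-specific filter in front of it.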



[jira] Assigned: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-2117:
---------------------------------------

    Assignee: Simon Willnauer


[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786515#action_12786515 ]

Simon Willnauer commented on LUCENE-2117:
-----------------------------------------

Robert, the patch looks almost good. You should also change the pom.xml.template to reflect the new dependency. I'm still thinking about moving snowball into analyzers as analyzers/snowball; would that make sense?

Somewhat unrelated but still ugly:
{code}
      Class<?> stemClass = Class.forName("org.tartarus.snowball.ext." + name + "Stemmer");
{code}
Looking through the patch, I see this "name" parameter, which is used to load a stemmer via reflection. We should really define a factory interface that creates the stemmer and get rid of the reflection code.
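
A rough sketch of what such a factory could look like, under stated assumptions: the names Stemmer, StemmerFactory, Registry, and the toy stemmer that only strips a trailing "s" are all hypothetical and are not Lucene or Snowball APIs. The point is only that a name-to-factory lookup can replace the Class.forName call.

```java
import java.util.HashMap;
import java.util.Map;

public class StemmerFactoryDemo {
    /** Hypothetical stand-in for a Snowball stemmer. */
    public interface Stemmer {
        String stem(String term);
    }

    /** Hypothetical factory: creates a stemmer without reflection. */
    public interface StemmerFactory {
        Stemmer create();
    }

    /** Maps a language name to its factory, replacing Class.forName lookups. */
    public static final class Registry {
        private final Map<String, StemmerFactory> factories = new HashMap<>();

        public void register(String name, StemmerFactory factory) {
            factories.put(name, factory);
        }

        public Stemmer create(String name) {
            StemmerFactory factory = factories.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown stemmer: " + name);
            }
            return factory.create();
        }
    }

    /** Builds a registry with one toy stemmer that strips a trailing "s". */
    public static Registry defaultRegistry() {
        Registry registry = new Registry();
        registry.register("Toy", () -> term ->
                term.endsWith("s") ? term.substring(0, term.length() - 1) : term);
        return registry;
    }

    public static void main(String[] args) {
        Stemmer stemmer = defaultRegistry().create("Toy");
        System.out.println(stemmer.stem("cats"));
    }
}
```

With this shape, an unknown name fails with a clear IllegalArgumentException at lookup time instead of a ClassNotFoundException from reflection, and the set of supported stemmers is visible in one place.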


[jira] Updated: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2117:
--------------------------------

    Attachment: LUCENE-2117.patch

This patch includes the update to pom.xml.template.


[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786518#action_12786518 ]

Robert Muir commented on LUCENE-2117:
-------------------------------------

bq. I'm still thinking about moving snowball into analyzers as analyzers/snowball; would that make sense?

We have to do something about the duplication (LUCENE-2055). There I have suggested we upload the snowball stoplists (which are nice) so that we can get rid of some hand-coded Java functionality. It is silly to have the exact same Russian stemmer in two different places in contrib, etc.

Then we have open issues like LUCENE-559...


[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786549#action_12786549 ]

Simon Willnauer commented on LUCENE-2117:
-----------------------------------------

Robert, the patch looks good and all tests pass.
I plan to commit this later tomorrow if nobody objects.


[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786834#action_12786834 ]

Simon Willnauer commented on LUCENE-2117:
-----------------------------------------

I will commit shortly if nobody objects


[jira] Resolved: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer resolved LUCENE-2117.
-------------------------------------

    Resolution: Fixed

Committed in revision 888787.

Thanks, Robert.
