[jira] Created: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0


Markus Jelsma (Jira)
fix LowerCaseFilter for unicode 4.0
-----------------------------------

                 Key: LUCENE-2069
                 URL: https://issues.apache.org/jira/browse/LUCENE-2069
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Robert Muir
            Priority: Minor
             Fix For: 3.1
         Attachments: LUCENE-2069.patch

Lowercase supplementary characters correctly.

This only fixes the filter; the LowerCaseTokenizer is part of a more complex issue (CharTokenizer).
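
To illustrate what "lowercasing supplementary characters correctly" means, here is a minimal sketch. It is not the attached patch; the class and method names are made up for illustration. It compares lowercasing a buffer one UTF-16 char at a time with lowercasing it one code point at a time, using only standard java.lang.Character methods. It assumes the lowercase form of a code point takes as many chars as the original.

```java
public class SupplementaryLowerCase {

    /**
     * Char-at-a-time lowercasing. Surrogate halves have no case mapping,
     * so supplementary characters pass through unchanged.
     */
    static void lowerPerChar(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
    }

    /**
     * Code-point-aware lowercasing: read a whole code point, lowercase it,
     * and write it back in place. Assumes (for this sketch) that the
     * lowercase form occupies as many chars as the original.
     */
    static void lowerPerCodePoint(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int lowered = Character.toLowerCase(Character.codePointAt(buffer, i, length));
            i += Character.toChars(lowered, buffer, i);
        }
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I (a surrogate pair), then 'A'
        char[] perChar = "\uD801\uDC00A".toCharArray();
        char[] perCodePoint = perChar.clone();

        lowerPerChar(perChar, perChar.length);
        lowerPerCodePoint(perCodePoint, perCodePoint.length);

        // Only the 'A' changes in the char-at-a-time version.
        System.out.println(new String(perChar).equals("\uD801\uDC00a"));
        // The code point version also maps U+10400 to U+10428.
        System.out.println(new String(perCodePoint).equals("\uD801\uDC28a"));
    }
}
```

Both checks in main print true: the char-based loop leaves the Deseret letter uppercase, while the code point loop lowercases it.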


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Attachment: LUCENE-2069.patch

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778401#action_12778401 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, if you have a moment maybe you can review this one for me?

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778504#action_12778504 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

Robert, I assume you did use those weird chars in the test on purpose - I wonder if there are some "real" codepoints that we could use in the test?

The code looks good to me; this is the way to go for char lowercasing with Unicode 4.0.

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778508#action_12778508 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, those "weird" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778509#action_12778509 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

we might need a changes.txt entry here too?!

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778510#action_12778510 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, yes see LUCENE-1689.
This is my question of the day: how do we handle this? It is really a backwards break in a way, but honestly a bugfix, because we should have supported Unicode 4.0 in Lucene 3.0, since that's the Unicode version of Java 5.

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778514#action_12778514 ]

Uwe Schindler commented on LUCENE-2069:
---------------------------------------

we can change it whenever we want, we must only supply a matchVersion switch....

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778515#action_12778515 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Uwe, we can use matchVersion for all of this, this is true, and I will help.

but see my comment on LUCENE-1689 (since i feel it affects all the issues), it will result in a lot of code complexity. Just a warning.

[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Attachment: LUCENE-2069.patch

here is a patch that supports the old broken behavior also via Version.

[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Attachment: LUCENE-2069.patch

forgot javadocs describing what the version does, sorry.


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779156#action_12779156 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

if you want my vote, it is that we treat issues like this as bugs and not do all this Version stuff.

i supplied this patch (22KB versus 2KB) to show how even the smallest issue creates more complexity.
Also, read the javadocs for what Version does, it reads just like a bug:
* As of 3.1, supplementary characters are properly lowercased.

I mean, honestly, it's not like we provided a back compat mechanism for 3.0,
where this behavior changed for lots of contrib code that uses String-based methods, such as String.toLowerCase (they return different results on JRE5 than on JRE4).

but we can go either way, doesn't matter to me.

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779160#action_12779160 ]

Mark Miller commented on LUCENE-2069:
-------------------------------------

But we try and maintain index back compatibility with bugs too? We don't want terms to be lost in an index.

But it depends as always - if something has long been a problem and broken, then perhaps it doesn't make sense to bend over backwards about it now.  We just have to look at everything, put the priority on making life best for users while balancing somewhat with dev/maintenance headaches and come to a consensus - easy ! :)

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779164#action_12779164 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues! :)

doesn't matter to me, I just present both alternatives! all i want is for us to make a decision.


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779319#action_12779319 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

bq. Simon, those "weird" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!
That's what I guessed :D otherwise it would not work :). I was just wondering if there are some more expressive ones out there.

bq. Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues!
+1

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779499#action_12779499 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

bq. But we try and maintain index back compatibility with bugs too?

Mark, you are right. The Version description says this: Match settings and bugs in Lucene's 3.0 release.
I guess we should at least try, I think we can do it.


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782393#action_12782393 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

btw. this also works for CharArraySet - that way we can easily implement it with Version without duplicating any code. Readable, clean and compatible.

I will update the CharArraySet patch once I get comments on this.

[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2069:
------------------------------------

    Attachment: LUCENE-2069.patch

I revised the patch and fixed some issues:
- replaced real characters in tests
- extended tests to boundaries
- removed "code duplication" in LowerCaseFilter

The latter is the most important issue. I figured that if we implement a factory with the basic codePointAt method based on a version, we can implement most of the algorithms / methods just by obtaining the version-specific instance of CharacterUtils (a new class I introduced). What this class does is pretty simple: if version >= 3.1, it delegates to the corresponding Character method, while for earlier versions it converts a char to a codepoint without checking for high surrogates. Once we have done this conversion, we can simply use all the Character.*(int) methods as they are.
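
The factory idea described above could look roughly like the following sketch. This is not the code from the attached patch. The MatchVersion enum, the CharUtils class, and all method names here are hypothetical stand-ins, and only the java.lang.Character calls are real API. The point is that one lowercasing loop serves both behaviors, and the version decides how a code point is read.

```java
public class CharacterUtilsSketch {

    /** Hypothetical stand-in for Lucene's version constants. */
    enum MatchVersion { LUCENE_30, LUCENE_31 }

    /** Hypothetical version-specific code point reader. */
    static abstract class CharUtils {
        abstract int codePointAt(char[] chars, int offset, int limit);

        /** Factory: pick the implementation that matches the requested version. */
        static CharUtils getInstance(MatchVersion version) {
            return version.compareTo(MatchVersion.LUCENE_31) >= 0 ? UNICODE4 : LEGACY;
        }

        /** 3.1 and later: delegate to Character, which decodes surrogate pairs. */
        private static final CharUtils UNICODE4 = new CharUtils() {
            @Override
            int codePointAt(char[] chars, int offset, int limit) {
                return Character.codePointAt(chars, offset, limit);
            }
        };

        /** Before 3.1: treat every char as its own code point, ignoring surrogates. */
        private static final CharUtils LEGACY = new CharUtils() {
            @Override
            int codePointAt(char[] chars, int offset, int limit) {
                return chars[offset];
            }
        };
    }

    /** One loop for both behaviors; only the CharUtils instance differs. */
    static void toLowerCase(char[] buffer, int length, MatchVersion version) {
        CharUtils utils = CharUtils.getInstance(version);
        for (int i = 0; i < length; ) {
            int codePoint = utils.codePointAt(buffer, i, length);
            // toChars returns the number of chars written (1 or 2).
            i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
        }
    }

    public static void main(String[] args) {
        // U+10400 (surrogate pair) followed by 'A'
        char[] legacy = "\uD801\uDC00A".toCharArray();
        char[] current = legacy.clone();

        toLowerCase(legacy, legacy.length, MatchVersion.LUCENE_30);
        toLowerCase(current, current.length, MatchVersion.LUCENE_31);

        System.out.println(new String(legacy).equals("\uD801\uDC00a"));
        System.out.println(new String(current).equals("\uD801\uDC28a"));
    }
}
```

Both checks print true. With the legacy instance each surrogate half is read as a standalone value, which Character.toLowerCase(int) leaves unchanged, so the old behavior is preserved without a second copy of the loop.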



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782432#action_12782432 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Hi Simon, this is a cool idea!

I need to think this through, can you think of other places (non-lowercasing) where we could use this?
Even if we can only use it there, I think it might still be a good idea to keep things simple.

I do think we should mark the class deprecated and only used for lucene back compat purposes if we decide to use it.


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782439#action_12782439 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, i took a quick look at contrib analyzers, for example.
This utility class could make back compat easier for a lot of the code, i.e. unicode block calculations in the CJK code, greek diacritic/lowercase folding in the greek code, ...
I think we should go this route.

