[jira] Updated: (LUCENE-1799) Unicode compression

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[jira] Updated: (LUCENE-1799) Unicode compression

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1799:

    Attachment: LUCENE-1799.patch

A new patch that completely separates the BOCU factory from the implementation (which moves to common/miscellaneous). This has the following advantages:

- You can use any Charset to encode your terms. The javadocs should only note, that the byte[] order should be correct for range queries to work
- Theoretically you could remove the BOCU classes at all, one that wants to use, can simply get the Charset from ICUs factory and pass it to the AttributeFactory. The convenience class is still useful, especially if we can later natively implement the encoding without NIO (when patent issues are solved...)
- The test for the CustomCharsetTermAttributeFactory uses UTF-8 as charset and verifies that the created BytesRefs have the same format like a BytesRef created using the UnicodeUtils.
- The test also checks that encoding errors are bubbled up as RuntimeExceptions


- docs
- handling of encoding errors configureable (replace with replacement char?)
- If you want your complete index e.g. in ISO-8859-1, there should be also convenience methods that take CharSequences/char[] in the factory/attribute to quickly convert strings to BytesRefs like UnicodeUtil does - by this its possible to create TermQueries directly using e.g. ISO-8859-1 encoding.

> Unicode compression
> -------------------
>                 Key: LUCENE-1799
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.4.1
>            Reporter: DM Smith
>            Priority: Minor
>         Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
> In lucene-1793, there is the off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a generally useful feature.
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provide it's own implementation a freely avIlable, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used.
> An advantage of these methods is that they work on the whole of Unicode. If that is not needed an encoding such as iso8859-1 (or whatever covers the input) could be used.    

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]