[jira] Commented: (LUCENE-1799) Unicode compression


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893312#action_12893312 ]

Michael McCandless commented on LUCENE-1799:
--------------------------------------------

The char[] -> byte[] encode time is a minuscule part of indexing time.  And, in turn, indexing time is far less important than impact on search performance.  So... let's focus on the search performance here.

Most queries are unaffected by the term encoding; it's only AutomatonQuery (= fuzzy, regexp, wildcard) that does a fair amount of decoding...

Net/net BOCU1 sounds like an awesome win over UTF8.

> Unicode compression
> -------------------
>
>                 Key: LUCENE-1799
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Store
>    Affects Versions: 2.4.1
>            Reporter: DM Smith
>            Priority: Minor
>         Attachments: Benchmark.java, LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch
>
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a generally useful feature.
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used.
> An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.
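
To make the size argument in the description concrete, here is a minimal sketch that compares how many bytes one term occupies under a few encodings. It is written in Python for brevity (Lucene itself is Java), the sample terms and the `encoded_sizes` helper are made up for illustration, and it uses only codecs from the Python standard library. BOCU-1 and SCSU are not among those codecs, so they are not measured here; KOI8-R merely stands in for "a single-byte encoding that covers the input".

```python
# Sketch: byte length of a term under several encodings.
# Hypothetical helper and sample terms; not taken from the issue or its patches.

def encoded_sizes(term, encodings=("utf-8", "utf-16-le", "koi8_r", "iso8859_1")):
    """Return {encoding: byte length}, or None where the term cannot be encoded."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(term.encode(name))
        except UnicodeEncodeError:
            # The encoding does not cover every character of the term.
            sizes[name] = None
    return sizes

russian = encoded_sizes("поиск")  # five Cyrillic letters
latin = encoded_sizes("café")     # four Latin-1 characters

for label, sizes in (("поиск", russian), ("café", latin)):
    print(label, sizes)
```

The Cyrillic term takes two bytes per letter in UTF-8 but one in KOI8-R, and ISO-8859-1 cannot represent it at all. That is the trade-off the description points at: a single-byte encoding is compact only for the scripts it covers, whereas a Unicode compression scheme aims for compactness across the whole of Unicode.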

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

