[jira] [Updated] (LUCENE-5081) Compress doc ID sets

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-5081) Compress doc ID sets

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5081:
---------------------------------

    Attachment: LUCENE-5081.patch

Here is an implementation of a compressed doc ID set. Although it is immutable, only supports sequential access and requires doc IDs to be provided in order at building time, it supports fast iteration (nextDoc), skipping (advance), union and intersection. The detailed format is a bit complex (see javadocs), but the rough idea is to store large sequences of null bytes as a VInt and sequences of non-null bytes as-is. This implementation can compress data as soon as it can find more than 2 consecutive null bytes. Moreover, even incompressible sets only require a few bytes more than FixedBitSet.

I ran a quick benchmark to measure the size (as reported by RamUsageEstimator) depending on the percentage of bits set on a bit set containing 2<sup>23</sup> elements (FixedBitSet requires 1MB) as well as the time required to iterate over all document IDs compared to FixedBitSet:
||Percentage of bits set||Size||Iteration time/FixedBitSet iteration time||
|0.001%|360 bytes|0.01|
|0.01%|2.8KB|0.10|
|0.1%|23.8 KB|0.38|
|1%|187.7 KB|0.80|
|10%|864 KB|1.3|
|50%|1 MB|2.5|
|90%|1 MB|2.3|
|100%|1 MB|1.7|

Even in the worst case, memory usage exceeds the memory usage of FixedBitSet by a few bytes, and iteration is 2.5 times slower.

The patch includes the set implementation but it is not used anywhere yet. I was thinking about using it automatically instead of FixedBitSet in CachingWrapperFilter but it looks like ToParentBlockJoinQuery expects to get a FixedBitSet from the cache.
               

> Compress doc ID sets
> --------------------
>
>                 Key: LUCENE-5081
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5081
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-5081.patch
>
>
> Our filters use bit sets a lot to store document IDs. However, it is likely that most of them are sparse hence easily compressible. Having efficient compressed sets would allow for caching more data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]