[jira] [Commented] (LUCENE-5052) bitset codec for off heap filters



Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937947#comment-13937947 ]

Mikhail Khludnev commented on LUCENE-5052:

bq. it'd be better if the postings format wrapped another postings format, and then only used the bitset when the docFreq was high enough

There are two orthogonal concerns:
* the particular format: let's generalize "bitset format" to "no-tf format", and use WAH8 or Elias-Fano with off-heap access (TODO). Thus, it also works for sparse postings;
* the API: how can a consumer express the intention to use the "no-tf" format? E.g. TermFilter, or TermsEnum.docs() with a special flag.
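
To make the sparse-postings point concrete, here is a rough size comparison between a plain bitset and Elias-Fano. It is only a sketch: the class and method names are hypothetical and not part of Lucene, and the Elias-Fano figure is the usual approximation (low bits plus a unary-coded upper part), not a measurement of any real implementation.

```java
/** Rough size estimates for a "no-tf" postings list; hypothetical helper, not Lucene API. */
final class NoTfSizeEstimate {

    /** A plain bitset spends one bit per document in the segment. */
    static long bitsetBits(long maxDoc) {
        return maxDoc;
    }

    /**
     * Approximate bits needed by Elias-Fano for docFreq sorted doc IDs in [0, maxDoc).
     * Each ID is split into lowBits low bits stored verbatim and a high part stored in unary.
     */
    static long eliasFanoBits(long maxDoc, long docFreq) {
        if (docFreq <= 0) {
            return 0;
        }
        long ratio = maxDoc / docFreq;
        // floor(log2(maxDoc / docFreq)), or 0 when the list is dense
        int lowBits = ratio <= 1 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        long lower = docFreq * lowBits;                   // low bits, stored as-is
        long upper = docFreq + (maxDoc >>> lowBits) + 1;  // unary-coded high parts
        return lower + upper;
    }
}
```

Under these assumptions, a term matching 1,000 of 1,000,000 documents needs about 11,954 bits with Elias-Fano against 1,000,000 bits for the bitset, while a term matching half of the documents is cheaper as a bitset.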

I'd like to clarify the use case for this issue (the issue summary might need to be improved). It targets Solr's fq, or even Heliosearch's GC-lightness. I suppose that the user can decide which fields to index with the "no-tf" format; these are "string" fields. Then the user requests filtering on these fields, and no scoring is needed, for sure.

Hence, I don't think that conditional triggering is a good choice; however, I don't know how to do it. I might not understand well how the pulsing codec is used (the implementation idea is clear, though); can you point me to its usage?


> bitset codec for off heap filters
> ---------------------------------
>                 Key: LUCENE-5052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5052
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Mikhail Khludnev
>              Labels: features
>             Fix For: 5.0
>         Attachments: LUCENE-5052.patch, bitsetcodec.zip, bitsetcodec.zip
> Colleagues,
> When we filter, we don't care about any of the scoring factors (i.e. norms, positions, tf), but it should be fast. The obvious way to handle this is to decode the postings list and cache it in heap (CachingWrapperFilter, Solr's DocSet). Both consuming heap and decoding are expensive.
> Let's write a postings list as a bitset if df is greater than the segment's maxdocs/8 (what about skiplists? and overall performance?).
> Besides the codec implementation, the trickiest part to me is designing the API for this. How can we let the app know that a term query doesn't need to be cached in heap, but can be held as an mmapped bitset?
> WDYT?  
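
The maxdocs/8 threshold from the description can be read as a simple break-even: a bitset costs maxDoc/8 bytes regardless of df, so if we assume a byte-aligned postings encoding that spends at least one byte per document, the bitset is no larger once df exceeds maxDoc/8. A minimal sketch of that decision, using java.util.BitSet as an on-heap stand-in for the mmapped bitset (the class and method names are hypothetical, not taken from the attached patch):

```java
import java.util.BitSet;

/** Sketch of the df > maxDoc/8 rule; hypothetical names, not the patch's code. */
final class FilterPostingsSketch {

    /** Bytes a bitset over the whole segment would take, independent of docFreq. */
    static long bitsetBytes(int maxDoc) {
        return (maxDoc + 7L) / 8;
    }

    /** True when the term is dense enough that a bitset is the smaller representation. */
    static boolean useBitset(int docFreq, int maxDoc) {
        return docFreq > maxDoc / 8;
    }

    /** Builds the bitset form of a postings list from its sorted doc IDs. */
    static BitSet toBitset(int[] sortedDocIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : sortedDocIds) {
            bits.set(doc);
        }
        return bits;
    }
}
```

For a segment of 1,000 documents the bitset takes 125 bytes, so under the one-byte-per-document assumption a term with df = 200 would be written as a bitset and a term with df = 100 would not.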
