[jira] [Created] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

classic Classic list List threaded Threaded
129 messages Options
1234 ... 7
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
Mark Harwood created LUCENE-4069:
------------------------------------

             Summary: Segment-level Bloom filters for a 2 x speed up on rare term searches
                 Key: LUCENE-4069
                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/index
    Affects Versions: 3.6
            Reporter: Mark Harwood
            Priority: Minor
             Fix For: 3.6.1


An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.

Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: MHBloomFilterOn3.6Branch.patch

Initial patch
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: MHBloomFilterOn3.6Branch.patch
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: PrimaryKey40PerformanceTestSrc.zip
                BloomFilterCodec40.patch
   

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: BloomFilterCodec40.patch
                PrimaryKey40PerformanceTestSrc.zip

I've ported this Bloom Filtering code to work as a 4.0 Codec now.
I see a 35% improvement over standard Codecs on random lookups on a warmed index.

I also notice that the PulsingCodec is no longer faster than standard Codec - is this news to people as I thought it was supposed to be the way forward?

My test rig (adapted from Mike's original primary key test rig here http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html) is attached as a zip.
The new BloomFilteringCodec is also attached here as a patch.

Searches against plain text fields also look to be faster (using AOL500k queries searching Wikipedia English) but obviously that particular test rig is harder to include as an attachment here.

I can open a seperate JIRA issue for this 4.0 version of the code if that makes more sense.


               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment:     (was: BloomFilterCodec40.patch)
   

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment:     (was: PrimaryKey40PerformanceTestSrc.zip)
   

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283574#comment-13283574 ]

Michael McCandless commented on LUCENE-4069:
--------------------------------------------

bq. I see a 35% improvement over standard Codecs on random lookups on a warmed index.

Impressive!  This is for primary key lookups?

It looks like the primary keys are GUID-like right?  (Ie randomly generated).  I wonder if they had some structure instead (eg '%09d' % (id++)) how the results would look...

bq. I also notice that the PulsingCodec is no longer faster than standard Codec - is this news to people as I thought it was supposed to be the way forward?

That's baffling to me: it should only save seeks vs Lucene40 codec, so on a cold index you should see substantial gains, and on a warm index I'd still expect some gains.  Not sure what's up...

bq. I can open a seperate JIRA issue for this 4.0 version of the code if that makes more sense.

I think it's fine to do it here?  Really 3.6.x is only for bug fixes now ... so I think we should commit this to trunk.

I wonder if you can wrap any other PostingsFormat (ie instead of hard-coding to Lucene40PostingsFormat)?  This way users can wrap any PF they have w/ the bloom filter...

Can you use FixedBitSet instead of OpenBitSet?  Or is there a reason to use OpenBitSet here...?
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283583#comment-13283583 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

My current focus is speeding up primary key lookups but this may have applications outside of that (Zipf tells us there is a lot of low frequency stuff in free text).

Following the principle of the best IO is no IO the Bloom Filter helps us quickly understand which segments to even bother looking in. That has to be a win overall.

I started trying to write this Codec as a wrapper for any other Codec (it simply listens to a stream of terms and stores a bitset of recorded hashes in a ".blm" file). However that was trickier than I expected - I'd need to encode a special entry in my blm files just to know the name of the delegated codec I needed to instantiate at read-time because Lucene's normal Codec-instantiation logic would be looking for "BloomCodec" and I'd have to discover the delegate that was used to write all of the non-blm files.

Not looked at FixedBitSet but I imagine that could be used instead.
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283615#comment-13283615 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

Update- I've discovered this Bloom Filter Codec currently has a bug where it doesn't handle indexes with >1 field.
It's probably all tangled up in the "PerField..." codec logic so I need to do some more digging.
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment:     (was: BloomFilterCodec40.patch)
   

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: BloomFilterCodec40.patch

Fixed the issue with >1 field in an index.
Tests on random lookups on Wikipedia titles (unique keys) now show a 3 x speed up for a Bloom-filtered index over standard 4.0 Codec for fully warmed indexes.
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284520#comment-13284520 ]

Simon Willnauer commented on LUCENE-4069:
-----------------------------------------

hey mark, this looks very interesting. I wonder if that also helps indexing in terms of applying deletes. did you test that at all or is that something are looking into as well? For deletes this could be very handy since bloom filters should help a lot when a given key is likely to only in one segment.

Regarding the code, I think BloomFilter should be like Pulsing and only be a PostingsFormat. You should make the BloomFilterPostingsFormat abstract and pass in the "wrapped" postings format and delegate independent of the field. PostingsFormat is per field and is resolved by the codec. An actual implementation should subclass the abstract BloomFilterPostingsFormat that way you have a fixed PostingsFormat per "delegate" and don't need to deal with "which field got which delegate" bla bla.

We should not have a Bloomfilter codec but rather swap in the bloomfilter postings format in test via random codec. I also wonder if we can extract a "bloomfilter" class into utils that builds an BloomFilter this could be useful elsewhere.


               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284536#comment-13284536 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq.  I wonder if that also helps indexing in terms of applying deletes. did you test that

I've not looked into that particularly but I imagine this may be relevant.

Thanks for the tips for making the patch more generic. I'll get on it tomorrow and changed to FixedBitSet while I'm at it.

bq. I also wonder if we can extract a "bloomfilter" class into utils

There's some reusable stuff in this patch for downsizing the Bitset (according to desired saturation levels) having accumulated a stream of values.
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284545#comment-13284545 ]

Simon Willnauer commented on LUCENE-4069:
-----------------------------------------

bq. Thanks for the tips for making the patch more generic. I'll get on it tomorrow and changed to FixedBitSet while I'm at it.
YW! A couple of things regarding your patch. I think you should write a header for the bloomfilter file using CodecUtils.writeHeader() to be consistent. Once you are ready you should also look at the other *PostingsFormat impls how we document the file formats there. We try to have versioned file formats instead of the monolithic one on the website. I saw some minor things like

{code}
finally {
  if (output != null)  {
    output.close();
  }
}
{code}

you can use IOUtils.close*() for that which handles some exception cases etc.

I briefly looked at your reuse code for TermsEnum and it seems it has the same issues as pulsing has (sovled). When you get a TermsEnum that is infact a Bloomfilter wrapper but wraps another codec underneath you keep on switching back an forth creating new delegated TermsEnum instances. You can look at PulsingPostingsReader and in particular at PulsingEnumAttributeImpl how we solved that in Pulsing.
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284562#comment-13284562 ]

Michael McCandless commented on LUCENE-4069:
--------------------------------------------

It looks like MurmurHash.java is missing from the patch?

Also, with LUCENE-4055 landing it looks like the patch no longer compiles (sorry!)...
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment:     (was: BloomFilterCodec40.patch)
   

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: BloomFilterCodec40.patch

Updated to work with trunk.
* Changed to use FixedBitSet
* Is now a PostingsFormat abstract base class
* Added missing MurmurHash class

TODOs
* Move Bloom filter logic to common utils classes
* Use Service Providers for pluggable choice of hash algos?
* Expose settings for memory/saturation
               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

          Description:
An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.

Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no 3.6 API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

  was:
An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.

Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

    Affects Version/s: 4.0
        Fix Version/s: 4.0
   

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285054#comment-13285054 ]

Simon Willnauer commented on LUCENE-4069:
-----------------------------------------

hey mark,

I think you should really not provide any field handling at all. this should be done in user code ie. a general codes/postings format per field. What would be very useful here is integration into RandomCodec so we can randomly swap it in. Same is true for the hash algos & memory settings, I think they should be part of the abstract class ctor and if a user wants to use its own algo they should write their own codec ie. subclass. this stuff is very expert and apps like solr can handle this on top.

I am still worried about the TermsEnum reuse code, are you planning to look into this?


               

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Comment Edited] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285054#comment-13285054 ]

Simon Willnauer edited comment on LUCENE-4069 at 5/29/12 7:45 PM:
------------------------------------------------------------------

hey mark,

I think you should really not provide any field handling at all. this should be done in user code ie. a general codes/postings format per field. What would be very useful here is integration into RandomCodec so we can randomly swap it in. Same is true for the hash algos & memory settings, I think they should be part of the abstract class ctor and if a user wants to use its own algo they should write their own codec ie. subclass. this stuff is very expert and apps like solr can handle this on top.

I am still worried about the TermsEnum reuse code, are you planning to look into this?

you should also add license headers to the files you are adding, note that we don't do javadoc license headers anymore so they should just be block comments.

               
      was (Author: simonw):
    hey mark,

I think you should really not provide any field handling at all. this should be done in user code ie. a general codes/postings format per field. What would be very useful here is integration into RandomCodec so we can randomly swap it in. Same is true for the hash algos & memory settings, I think they should be part of the abstract class ctor and if a user wants to use its own algo they should write their own codec ie. subclass. this stuff is very expert and apps like solr can handle this on top.

I am still worried about the TermsEnum reuse code, are you planning to look into this?


                 

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

1234 ... 7