[jira] [Updated] (LUCENE-2312) Search on IndexWriter's RAM Buffer


Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2312:

    Attachment: LUCENE-2312.patch

This is a revised version of the LUCENE-2312 patch.  The following are miscellaneous notes on the patch and on what remains before it can be committed.

Feel free to review the approach taken, e.g., we work around the non-realtime structures by using array copies (the arrays can be pooled at some point).
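
As a rough illustration of the array-copy approach (this is not code from the patch), the sketch below keeps a writer-owned int[] and hands each new reader a copy, either freshly allocated or copied into a caller-supplied array that could come from a pool. The class and method names (PostingsSnapshotSketch, addOccurrence, snapshot, snapshotInto) are hypothetical, and the plain synchronized methods are a simplifying assumption rather than the locking used in the realtime branch.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: a writer-owned int[] (standing in for an array such as
 * per-term frequencies) that is copied for each new reader. Names are
 * illustrative only and are not taken from the patch.
 */
final class PostingsSnapshotSketch {
    private int[] termFreqs = new int[4]; // mutated by the indexing thread
    private int numTerms = 0;             // number of valid entries in termFreqs

    /** Indexing side: record one more occurrence of termId, growing the array as needed. */
    synchronized void addOccurrence(int termId) {
        if (termId >= termFreqs.length) {
            termFreqs = Arrays.copyOf(termFreqs, Math.max(termId + 1, termFreqs.length * 2));
        }
        termFreqs[termId]++;
        numTerms = Math.max(numTerms, termId + 1);
    }

    /** Reader side, variant 1: allocate a fresh point-in-time copy per reader. */
    synchronized int[] snapshot() {
        return Arrays.copyOf(termFreqs, numTerms);
    }

    /**
     * Reader side, variant 2: copy into a caller-supplied (possibly pooled) array,
     * allocating only when it is missing or too small. Entries at index size()
     * and beyond in a reused array are stale and must be ignored by the caller.
     */
    synchronized int[] snapshotInto(int[] reuse) {
        int[] target = (reuse != null && reuse.length >= numTerms) ? reuse : new int[numTerms];
        System.arraycopy(termFreqs, 0, target, 0, numTerms);
        return target;
    }

    /** Number of valid entries at the time of the call. */
    synchronized int size() {
        return numTerms;
    }
}
```

A reader that holds a snapshot keeps seeing the values as of its creation, while indexing continues to mutate the writer-owned array. An array passed to snapshotInto must no longer be in use by another reader, which is the part a pool would have to guarantee.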

* A copy of FreqProxPostingsArray.termFreqs is made per new reader.  That array can be pooled.  This is no different from the deleted docs BitVector, which is created anew per segment for any deletes that have occurred.

* The FreqProxPostingsArray arrays freqUptosRT, proxUptosRT, lastDocIDsRT, and lastDocFreqsRT are copied into per new reader (as opposed to instantiating an entirely new array for each new reader).  This is a slight optimization in object allocation.

* For deleting, a DWPT is wrapped in an abstract class that exposes the necessary methods from segment info, so that deletes may be applied to the RT RAM reader.  The deleting is still performed in BufferedDeletesStream.  BitVectors are cloned as well.  There is room for improvement, e.g., pooling the BV byte[]s.

* Documents (FieldsWriter) and term vectors are flushed on each get reader call, so that reading will be able to load the data.  We will need to test whether this performs well.  Since we are not creating new files, this approach may well be efficient.

* We need to measure the cost of the native system array copy.  It could very well be fast enough.

* Full posting functionality should be working, including payloads.

* Field caching may be implemented as a new field cache that is growable and allows locked replacement of the underlying array.

* The string-to-string-ordinal comparison caches need to be figured out.  The RAM readers cannot maintain a sorted terms index the way statically sized segments do.

* When a field cache value is first being created, it needs to obtain the indexing lock on the DWPT.  Otherwise documents will continue to be indexed and new values created while the array misses those new values.  The downside is that indexing stops while the array is initially being created.  This can probably be solved at some point by only locking during the creation of the field cache array, and then notifying the DWPT of the new array.  New values would then accumulate into the array from the point of the max doc of the reader the values creator is working from.

* The terms dictionary is a ConcurrentSkipListMap.  We can periodically convert it into a sorted (by term) int[] with an FST on top.
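
To illustrate why a ConcurrentSkipListMap suits the terms dictionary, here is a minimal sketch (not taken from the patch) of a term-to-ID map that stays sorted while terms are being added, so a reader can enumerate a term range without a separate sort step. The class and method names are hypothetical, String keys are an assumption made for brevity (so ordering here is Java's String order, not necessarily the order Lucene uses for terms), and addTerm assumes a single indexing thread per dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical sketch of a RAM terms dictionary that is always sorted.
 * Names are illustrative only and are not taken from the patch.
 */
final class SortedTermsSketch {
    // term -> term ID; the skip list keeps keys in sorted order at all times
    private final ConcurrentSkipListMap<String, Integer> terms = new ConcurrentSkipListMap<>();
    private int nextId = 0; // touched only by the single indexing thread (assumption)

    /** Indexing side: return the existing ID for a term or assign the next one. */
    int addTerm(String term) {
        Integer existing = terms.get(term);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        terms.put(term, id);
        return id;
    }

    /** Search side: all terms starting with prefix, in sorted order. */
    List<String> termsWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        // tailMap gives a live view starting at the first key >= prefix
        for (String t : terms.tailMap(prefix, true).keySet()) {
            if (!t.startsWith(prefix)) {
                break; // sorted order: no later key can match the prefix
            }
            out.add(t);
        }
        return out;
    }
}
```

Iteration over the skip list is weakly consistent, so a search may or may not observe terms added while it is running. A reader that needs a strict point-in-time view would still have to filter by something like its max doc.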

Have fun reviewing! :)

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: Realtime Branch
>         Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
